]> sourceware.org Git - newlib-cygwin.git/blame - winsup/cygwin/DevNotes
* DevNotes: Add entry cgf-000013.
[newlib-cygwin.git] / winsup / cygwin / DevNotes
CommitLineData
962f9a2c
CF
12012-07-21 cgf-000013
2
3These changes reflect a revamp of the "wait for signal" functionality
4which has existed in Cygwin through several signal massages.
5
6We now create a signal event only when a thread is waiting for a signal
7and arm it only for that thread. The "set_thread_waiting" function is
8used to establish the event and set it in a location referencable by
9the caller.
10
11I still do not handle all of the race conditions. What happens when
12a signal comes in just after a WF?O succeeds for some other purpose? I
13suspect that it will arm the next WF?O call and the subsequent call to
14call_signal_handler could cause a function to get an EINTR when possibly
15it shouldn't have.
16
17I haven't yet checked all of the test cases for the URL listed in the
18previous entry.
19
20Baby steps.
21
4ae63783
CF
222012-06-12 cgf-000012
23
24These changes are the preliminary for redoing the way threads wait for
25signals. The problems are shown by the test case mentioned here:
26
27http://cygwin.com/ml/cygwin/2012-05/msg00434.html
28
29I've known that the signal handling in threads wasn't quite right for
30some time. I lost all of my thread signal tests in the great "rm -r"
31debacle of a few years ago and have been less than enthusiastic about
8bec43b3 32redoing everything (I had PCTS tests and everything). But it really is
4ae63783
CF
33time to redo this signal handling to make it more like it is supposed to
34be.
35
36This change should not introduce any new behavior. Things should
37continue to behave as before. The major differences are a change in the
38arguments to cancelable_wait and cygwait now uses cancelable_wait and,
39so, the returns from cygwait now mirror cancelable_wait.
40
41The next change will consolidate cygwait and cancelable_wait into one
42cygwait function.
43
3143cb7c
CF
442012-06-02 cgf-000011
45
46The refcnt handling was tricky to get right but I had convinced myself
47that the refcnt's were always incremented/decremented under a lock.
48Corinna's 2012-05-23 change to refcnt exposed a potential problem with
49dup handling where the fdtab could be updated while not locked.
50
51That should be fixed by this change but, on closer examination, it seems
aa01a03c 52like there are many places where it is possible for the refcnt to be
3143cb7c
CF
53updated while the fdtab is not locked since the default for
54cygheap_fdget is to not lock the fdtab (and that should be the default -
55you can't have read holding a lock).
56
57Since refcnt was only ever called with 1 or -1, I broke it up into two
58functions but kept the Interlocked* operation. Incrementing a variable
59should not be as racy as adding an arbitrary number to it but we have
60InterlockedIncrement/InterlockedDecrement for a reason so I kept the
61Interlocked operation here.
62
63In the meantime, I'll be mulling over whether the refcnt operations are
64actually safe as they are. Maybe just ensuring that they are atomically
65updated is enough since they control the destruction of an fh. If I got
66the ordering right with incrementing and decrementing then that should
67be adequate.
68
45b61a88
CF
692012-06-02 cgf-000010
70
71<1.7.16>
72- Fix emacs problem which exposed an issue with Cygwin's select() function.
73 If a signal arrives while select is blocking and the program longjmps
74 out of the signal handler then threads and memory may be left hanging.
75 Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
76</1.7.16>
77
78This was try #4 or #5 to get select() signal handling working right.
79It's still not there but it should now at least not leak memory or
80threads.
81
82I mucked with the interface between cygwin_select and select_stuff::wait
83so that the "new" loop in select_stuff::wait() was essentially moved
84into the caller. cygwin_select now uses various enum states to decide
85what to do. It builds the select linked list at the beginning of the
86loop, allowing wait() to tear everything down and restart. This is
87necessary before calling a signal handler because the signal handler may
88longjmp away.
89
90I initially had this all coded up to use a special signal_cleanup
91callback which could be called when a longjmp is called in a signal
92handler. And cygwin_select() set up and tore down this callback. Once
93I got everything compiling it, of course, dawned on me that just because
94you call a longjmp in a signal handler it doesn't mean that you are
95jumping *out* of the signal handler. So, if the signal handler invokes
96the callback and returns it will be very bad for select(). Hence, this
97slower, but hopefully more correct implementation.
98
99(I still wonder if some sort of signal cleanup callback might still
100be useful in the future)
101
102TODO: I need to do an audit of other places where this problem could be
103occurring.
104
105As alluded to above, select's signal handling is still not right. It
106still acts as if it could call a signal handler from something other
107than the main thread but, AFAICT, from my STC, this doesn't seem to be
108the case. It might be worthwhile to extend cygwait to just magically
109figure this out and not even bother using w4[0] for scenarios like this.
110
fe66a97a
CF
1112012-05-16 cgf-000009
112
113<1.7.16>
114- Fix broken console mouse handling. Reported here:
115 http://cygwin.com/ml/cygwin/2012-05/msg00360.html
116</1.7.16>
117
118I did a cvs annotate on smallprint.cc and see that the code to translate
119%characters > 127 to 0x notation was in the 1.1 revision. Then I
120checked the smallprint.c predecessor. It was in the 1.1 version of that
121program too, which means that this odd change has probably been around
122since <= 2000.
123
124Since __small_sprintf is supposed to emulate sprintf, I got rid of the
125special case handling. This may affect fhandler_socket::bind. If so, we
126should work around this problem there rather than keeping this strange
127hack in __small_printf.
128
bd8afa5e
CF
1292012-05-14 cgf-000008
130
131<1.7.16>
132- Fix hang when zero bytes are written to a pty using
133 Windows WriteFile or equivalent. Fixes:
134 http://cygwin.com/ml/cygwin/2012-05/msg00323.html
135</1.7.16>
136
137cgf-000002, as usual, fixed one thing while breaking another. See
138Larry's predicament in: http://goo.gl/oGEr2 .
139
140The problem is that zero byte writes to the pty pipe caused the dread
141end-of-the-world-as-we-know-it problem reported on the mailing list
142where ReadFile reads zero bytes even though there is still more to read
143on the pipe. This is because that change caused a 'record' to be read
144and a record can be zero bytes.
145
146I was never really keen about using a throwaway buffer just to get a
147count of the number of characters available to be read in the pty pipe.
148On closer reading of the documentation for PeekNamedPipe it seemed like
149the sixth argument to PeekNamedPipe should return what I needed without
150using a buffer. And, amazingly, it did, except that the problem still
151remained - a zero byte message still screwed things up.
152
153So, we now detect the case where there is zero bytes available as a
154message but there are bytes available in the pipe. In that scenario,
155return the bytes available in the pipe rather than the message length of
156zero. This could conceivably cause problems with pty pipe handling in
157this scenario but since the only way this scenario could possibly happen
158is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
159ok with that.
160
3de7be4c
CF
1612012-05-14 cgf-000007
162
163<1.7.16>
164- Fix invocation of strace from a cygwin process. Fixes:
165 http://cygwin.com/ml/cygwin/2012-05/msg00292.html
166</1.7.16>
167
168The change in cgf-000004 introduced a problem for processes which load
169cygwin1.dll dynamically. strace.exe is the most prominent example of
170this.
171
172Since the parent handle is now closed for "non-Cygwin" processes, when
173strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
174and child_info_spawn::handle_spawn couldn't use retrieve information
175from the parent. This eventually led to a strace_printf error due to an
176attempt to dereference an unavailable cygheap. Probably have to fix
177this someday. You shouldn't use the cygheap while attempting to print
178an error about the inavailability of said cygheap.
179
180This was fixed by saving the parent pid in child_info_spawn and calling
181OpenProcess for the parent pid and using that handle iff a process is
182dynamically loaded.
183
1f994848
CF
1842012-05-12 cgf-000006
185
186<1.7.16>
187- Fix hang when calling pthread_testcancel in a canceled thread.
188 Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
189</1.7.16>
190
191This should fix the first part of the reported problem in the above
192message. The cancel seemed to actually be working but, the fprintf
193eventually ended up calling pthread_testcancel. Since we'd gotten here
194via a cancel, it tried to recursively call the cancel handler causing a
195recursive loop.
196
348b56b5
CF
1972012-05-12 cgf-000005
198
199<1.7.16>
200- Fix pipe creation problem which manifested as a problem creating a
201fifo. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
202</1.7.16>
203
204My change on 2012-04-28 introduced a problem with fifos. The passed
205in name was overwritten. This was because I wasn't properly keeping
206track of the length of the generated pipe name when there was a
207name passed in to fhandler_pipe::create.
208
209There was also another problem in fhandler_pipe::create. Since fifos
210use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
211PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
212as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong. So,
213fifo creation attempted to add the pid of a pipe to the name which is
214wrong for fifos.
215
dfd5d5be
CF
2162012-05-08 cgf-000004
217
61c33dbf
CF
218The change for cgf-000003 introduced a new problem:
219http://cygwin.com/ml/cygwin/2012-05/msg00154.html
220http://cygwin.com/ml/cygwin/2012-05/msg00157.html
dfd5d5be
CF
221
222Since a handle associated with the parent is no longer being duplicated
223into a non-cygwin "execed child", Windows is free to reuse the pid of
224the parent when the parent exits. However, since we *did* duplicate a
225handle pointing to the pid's shared memory area into the "execed child",
226the shared memory for the pid was still active.
227
228Since the shared memory was still available, if a new process reuses the
229previous pid, Cygwin would detect that the shared memory was not created
230and had a "PID_REAPED" flag. That was considered an error, and, so, it
231would set procinfo to NULL and pinfo::thisproc would die since this
232situation is not supposed to occur.
233
234I fixed this in two ways:
235
2361) If a shared memory region has a PID_REAPED flag then zero it and
237reuse it. This should be safe since you are not really supposed to be
238querying the shared memory region for anything after PID_REAPED has been
239set.
240
2412) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
242child for exec.
243
244It seems like 2) is a common theme and an audit of all of the handles
245that are being passed to non-cygwin children is in order for 1.7.16.
246
247The other minor modification that was made in this change was to add the
248pid of the failing process to fork error output. This helps slightly
249when looking at strace output, even though in this case it was easy to
250find what was failing by looking for '^---' when running the "stv"
251strace dumper. That found the offending exception quickly.
252
2532012-05-07 cgf-000003
06bd0ef2
CF
254
255<1.7.15>
256Don't make Cygwin wait for all children of a non-cygwin child program.
257Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
258 http://cygwin.com/ml/cygwin/2012-05/msg00075.html
259</1.7.15>
260
261This problem is due to a recent change which added some robustness and
262speed to Cygwin's exec/spawn handling by not trying to force inheritance
263every time a process is started. See ChangeLog entries starting on
2642012-03-20, and multiple on 2012-03-21.
265
266Making the handle inheritable meant that, as usual, there were problems
267with non-Cygwin processes. When Cygwin "execs" a non-Cygwin process N,
268all of its N + 1, N + 2, ... children will also inherit the handle.
269That means that Cygwin will wait until all subprocesses have exited
270before it returns.
271
272I was willing to make this a restriction of starting non-Cygwin
273processes but the problem with allowing that is that it can cause the
274creation of a "limbo" pid when N exits and N + 1 and friends are still
275around. In this scenario, Cygwin dutifully notices that process N has
276died and sets the exit code to indicate that but N's parent will wait on
277rd_proc_pipe and will only return when every N + ... windows process
278has exited.
279
280The removal of cygheap::pid_handle was not related to the initial
281problem that I set out to fix. The change came from the realization
282that we were duping the current process handle into the child twice and
283only needed to do it once. The current process handle is used by exec
284to keep the Windows pid "alive" so that it will not be reused. So, now
285we just close parent in child_info_spawn::handle_spawn iff we're not
286execing.
287
288In debugging this it bothered me that 'ps' identified a nonactive pid as
289active. Part of the reason for this was the 'parent' handle in
290child_info was opened in non-Cygwin processes, keeping the pid alive.
291That has been kluged around (more changes after 1.7.15) but that didn't
292fix the problem. On further investigation, this seems to be caused by
293the fact that the shared memory region pid handles were still being
294passed to non-cygwin children, keeping the pid alive in a limbo-like
295fashion. This was easily fixed by having pinfo::init() consider a
51180c08
CF
296memory region with PID_REAPED as not available. A more robust fix
297should be considered for 1.7.15+ where these handles are not passed
298to non-cygwin processes.
06bd0ef2
CF
299
300This fixed the problem where a pid showed up in the list after a user
301does something like: "bash$ cmd /c start notepad" but, for some reason,
302it does not fix the problem where "bash$ setsid cmd /c start notepad".
303That bears investigation after 1.7.15 is released but it is not a
304regression and so is not a blocker for the release.
305
fb9d6318
CF
3062012-05-03 cgf-000002
307
308<1.7.15>
309Fix problem where too much input was attempted to be read from a
310pty slave. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
311</1.7.15>
312
313My change on 2012/04/05 reintroduced the problem first described by:
314http://cygwin.com/ml/cygwin/2011-10/threads.html#00445
315
316The problem then was, IIRC, due to the fact that bytes sent to the pty
317pipe were not written as records. Changing pipe to PIPE_TYPE_MESSAGE in
318pipe.cc fixed the problem since writing lines to one side of the pipe
319caused exactly that the number of characters to be read on the other
320even if there were more characters in the pipe.
321
322To debug this, I first replaced fhandler_tty.cc with the 1.258,
3232012/04/05 version. The test case started working when I did that.
324
325So, then, I replaced individual functions, one at a time, in
326fhandler_tty.cc with their previous versions. I'd expected this to be a
327problem with fhandler_pty_master::process_slave_output since that had
328seen the most changes but was surprised to see that the culprit was
329fhandler_pty_slave::read().
330
331The reason was that I really needed the bytes_available() function to
332return the number of bytes which would be read in the next operation
333rather than the number of bytes available in the pipe. That's because
334there may be a number of lines available to be read but the number of
335bytes which will be read by ReadFile should reflect the mode of the pty
336and, if there is a line to read, only the number of bytes in the line
337should be seen as available for the next read.
338
339Having bytes_available() return the number of bytes which would be read
340seemed to fix the problem but it could subtly change the behavior of
341other callers of this function. However, I actually think this is
342probably a good thing since they probably should have been seeing the
343line behavior.
344
b79c0094
CF
3452012-05-02 cgf-000001
346
0386e4bf
CF
347<1.7.15>
348Fix problem setting parent pid to 1 when process with children execs
349itself. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
350</1.7.15>
b79c0094 351
0386e4bf
CF
352Investigating this problem with strace showed that ssh-agent was
353checking the parent pid and getting a 1 when it shouldn't have. Other
354stuff looked ok so I chose to consider this a smoking gun.
b79c0094
CF
355
356Going back to the version that the OP said did not have the problem, I
357worked forward until I found where the problem first occurred -
358somewhere around 2012-03-19. And, indeed, the getppid call returned the
359correct value in the working version. That means that this stopped
360working when I redid the way the process pipe was inherited around
361this time period.
362
363It isn't clear why (and I suspect I may have to debug this further at
61c33dbf 364some point) this hasn't always been a problem but I made the obvious fix.
b79c0094
CF
365We shouldn't have been setting ppid = 1 when we're about to pass off to
366an execed process.
367
368As I was writing this, I realized that it was necessary to add some
369additional checks. Just checking for "have_execed" isn't enough. If
370we've execed a non-cygwin process then it won't know how to deal with
371any inherited children. So, always set ppid = 1 if we've execed a
372non-cygwin process.
This page took 0.064482 seconds and 5 git commands to generate.