]>
Commit | Line | Data |
---|---|---|
962f9a2c CF |
1 | 2012-07-21 cgf-000013 |
2 | ||
3 | These changes reflect a revamp of the "wait for signal" functionality | |
4 | which has existed in Cygwin through several signal massages. | |
5 | ||
6 | We now create a signal event only when a thread is waiting for a signal | |
7 | and arm it only for that thread. The "set_thread_waiting" function is | |
8 | used to establish the event and set it in a location referencable by | |
9 | the caller. | |
10 | ||
11 | I still do not handle all of the race conditions. What happens when | |
12 | a signal comes in just after a WF?O succeeds for some other purpose? I | |
13 | suspect that it will arm the next WF?O call and the subsequent call to | |
14 | call_signal_handler could cause a function to get an EINTR when possibly | |
15 | it shouldn't have. | |
16 | ||
17 | I haven't yet checked all of the test cases for the URL listed in the | |
18 | previous entry. | |
19 | ||
20 | Baby steps. | |
21 | ||
4ae63783 CF |
22 | 2012-06-12 cgf-000012 |
23 | ||
24 | These changes are the preliminary for redoing the way threads wait for | |
25 | signals. The problems are shown by the test case mentioned here: | |
26 | ||
27 | http://cygwin.com/ml/cygwin/2012-05/msg00434.html | |
28 | ||
29 | I've known that the signal handling in threads wasn't quite right for | |
30 | some time. I lost all of my thread signal tests in the great "rm -r" | |
31 | debacle of a few years ago and have been less than enthusiastic about | |
8bec43b3 | 32 | redoing everything (I had PCTS tests and everything). But it really is |
4ae63783 CF |
33 | time to redo this signal handling to make it more like it is supposed to |
34 | be. | |
35 | ||
36 | This change should not introduce any new behavior. Things should | |
37 | continue to behave as before. The major differences are a change in the | |
38 | arguments to cancelable_wait and cygwait now uses cancelable_wait and, | |
39 | so, the returns from cygwait now mirror cancelable_wait. | |
40 | ||
41 | The next change will consolidate cygwait and cancelable_wait into one | |
42 | cygwait function. | |
43 | ||
3143cb7c CF |
44 | 2012-06-02 cgf-000011 |
45 | ||
46 | The refcnt handling was tricky to get right but I had convinced myself | |
47 | that the refcnt's were always incremented/decremented under a lock. | |
48 | Corinna's 2012-05-23 change to refcnt exposed a potential problem with | |
49 | dup handling where the fdtab could be updated while not locked. | |
50 | ||
51 | That should be fixed by this change but, on closer examination, it seems | |
aa01a03c | 52 | like there are many places where it is possible for the refcnt to be |
3143cb7c CF |
53 | updated while the fdtab is not locked since the default for |
54 | cygheap_fdget is to not lock the fdtab (and that should be the default - | |
55 | you can't have read holding a lock). | |
56 | ||
57 | Since refcnt was only ever called with 1 or -1, I broke it up into two | |
58 | functions but kept the Interlocked* operation. Incrementing a variable | |
59 | should not be as racy as adding an arbitrary number to it but we have | |
60 | InterlockedIncrement/InterlockedDecrement for a reason so I kept the | |
61 | Interlocked operation here. | |
62 | ||
63 | In the meantime, I'll be mulling over whether the refcnt operations are | |
64 | actually safe as they are. Maybe just ensuring that they are atomically | |
65 | updated is enough since they control the destruction of an fh. If I got | |
66 | the ordering right with incrementing and decrementing then that should | |
67 | be adequate. | |
68 | ||
45b61a88 CF |
69 | 2012-06-02 cgf-000010 |
70 | ||
71 | <1.7.16> | |
72 | - Fix emacs problem which exposed an issue with Cygwin's select() function. | |
73 | If a signal arrives while select is blocking and the program longjmps | |
74 | out of the signal handler then threads and memory may be left hanging. | |
75 | Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275 | |
76 | </1.7.16> | |
77 | ||
78 | This was try #4 or #5 to get select() signal handling working right. | |
79 | It's still not there but it should now at least not leak memory or | |
80 | threads. | |
81 | ||
82 | I mucked with the interface between cygwin_select and select_stuff::wait | |
83 | so that the "new" loop in select_stuff::wait() was essentially moved | |
84 | into the caller. cygwin_select now uses various enum states to decide | |
85 | what to do. It builds the select linked list at the beginning of the | |
86 | loop, allowing wait() to tear everything down and restart. This is | |
87 | necessary before calling a signal handler because the signal handler may | |
88 | longjmp away. | |
89 | ||
90 | I initially had this all coded up to use a special signal_cleanup | |
91 | callback which could be called when a longjmp is called in a signal | |
92 | handler. And cygwin_select() set up and tore down this callback. Once | |
93 | I got everything compiling it, of course, dawned on me that just because | |
94 | you call a longjmp in a signal handler it doesn't mean that you are | |
95 | jumping *out* of the signal handler. So, if the signal handler invokes | |
96 | the callback and returns it will be very bad for select(). Hence, this | |
97 | slower, but hopefully more correct implementation. | |
98 | ||
99 | (I still wonder if some sort of signal cleanup callback might still | |
100 | be useful in the future) | |
101 | ||
102 | TODO: I need to do an audit of other places where this problem could be | |
103 | occurring. | |
104 | ||
105 | As alluded to above, select's signal handling is still not right. It | |
106 | still acts as if it could call a signal handler from something other | |
107 | than the main thread but, AFAICT, from my STC, this doesn't seem to be | |
108 | the case. It might be worthwhile to extend cygwait to just magically | |
109 | figure this out and not even bother using w4[0] for scenarios like this. | |
110 | ||
fe66a97a CF |
111 | 2012-05-16 cgf-000009 |
112 | ||
113 | <1.7.16> | |
114 | - Fix broken console mouse handling. Reported here: | |
115 | http://cygwin.com/ml/cygwin/2012-05/msg00360.html | |
116 | </1.7.16> | |
117 | ||
118 | I did a cvs annotate on smallprint.cc and see that the code to translate | |
119 | %characters > 127 to 0x notation was in the 1.1 revision. Then I | |
120 | checked the smallprint.c predecessor. It was in the 1.1 version of that | |
121 | program too, which means that this odd change has probably been around | |
122 | since <= 2000. | |
123 | ||
124 | Since __small_sprintf is supposed to emulate sprintf, I got rid of the | |
125 | special case handling. This may affect fhandler_socket::bind. If so, we | |
126 | should work around this problem there rather than keeping this strange | |
127 | hack in __small_printf. | |
128 | ||
bd8afa5e CF |
129 | 2012-05-14 cgf-000008 |
130 | ||
131 | <1.7.16> | |
132 | - Fix hang when zero bytes are written to a pty using | |
133 | Windows WriteFile or equivalent. Fixes: | |
134 | http://cygwin.com/ml/cygwin/2012-05/msg00323.html | |
135 | </1.7.16> | |
136 | ||
137 | cgf-000002, as usual, fixed one thing while breaking another. See | |
138 | Larry's predicament in: http://goo.gl/oGEr2 . | |
139 | ||
140 | The problem is that zero byte writes to the pty pipe caused the dread | |
141 | end-of-the-world-as-we-know-it problem reported on the mailing list | |
142 | where ReadFile reads zero bytes even though there is still more to read | |
143 | on the pipe. This is because that change caused a 'record' to be read | |
144 | and a record can be zero bytes. | |
145 | ||
146 | I was never really keen about using a throwaway buffer just to get a | |
147 | count of the number of characters available to be read in the pty pipe. | |
148 | On closer reading of the documentation for PeekNamedPipe it seemed like | |
149 | the sixth argument to PeekNamedPipe should return what I needed without | |
150 | using a buffer. And, amazingly, it did, except that the problem still | |
151 | remained - a zero byte message still screwed things up. | |
152 | ||
153 | So, we now detect the case where there is zero bytes available as a | |
154 | message but there are bytes available in the pipe. In that scenario, | |
155 | return the bytes available in the pipe rather than the message length of | |
156 | zero. This could conceivably cause problems with pty pipe handling in | |
157 | this scenario but since the only way this scenario could possibly happen | |
158 | is when someone is writing zero bytes using WriteFile to a pty pipe, I'm | |
159 | ok with that. | |
160 | ||
3de7be4c CF |
161 | 2012-05-14 cgf-000007 |
162 | ||
163 | <1.7.16> | |
164 | - Fix invocation of strace from a cygwin process. Fixes: | |
165 | http://cygwin.com/ml/cygwin/2012-05/msg00292.html | |
166 | </1.7.16> | |
167 | ||
168 | The change in cgf-000004 introduced a problem for processes which load | |
169 | cygwin1.dll dynamically. strace.exe is the most prominent example of | |
170 | this. | |
171 | ||
172 | Since the parent handle is now closed for "non-Cygwin" processes, when | |
173 | strace.exe tried to dynamically load cygwin1.dll, the handle was invalid | |
174 | and child_info_spawn::handle_spawn couldn't use retrieve information | |
175 | from the parent. This eventually led to a strace_printf error due to an | |
176 | attempt to dereference an unavailable cygheap. Probably have to fix | |
177 | this someday. You shouldn't use the cygheap while attempting to print | |
178 | an error about the inavailability of said cygheap. | |
179 | ||
180 | This was fixed by saving the parent pid in child_info_spawn and calling | |
181 | OpenProcess for the parent pid and using that handle iff a process is | |
182 | dynamically loaded. | |
183 | ||
1f994848 CF |
184 | 2012-05-12 cgf-000006 |
185 | ||
186 | <1.7.16> | |
187 | - Fix hang when calling pthread_testcancel in a canceled thread. | |
188 | Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html | |
189 | </1.7.16> | |
190 | ||
191 | This should fix the first part of the reported problem in the above | |
192 | message. The cancel seemed to actually be working but, the fprintf | |
193 | eventually ended up calling pthread_testcancel. Since we'd gotten here | |
194 | via a cancel, it tried to recursively call the cancel handler causing a | |
195 | recursive loop. | |
196 | ||
348b56b5 CF |
197 | 2012-05-12 cgf-000005 |
198 | ||
199 | <1.7.16> | |
200 | - Fix pipe creation problem which manifested as a problem creating a | |
201 | fifo. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html | |
202 | </1.7.16> | |
203 | ||
204 | My change on 2012-04-28 introduced a problem with fifos. The passed | |
205 | in name was overwritten. This was because I wasn't properly keeping | |
206 | track of the length of the generated pipe name when there was a | |
207 | name passed in to fhandler_pipe::create. | |
208 | ||
209 | There was also another problem in fhandler_pipe::create. Since fifos | |
210 | use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of | |
211 | PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND | |
212 | as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong. So, | |
213 | fifo creation attempted to add the pid of a pipe to the name which is | |
214 | wrong for fifos. | |
215 | ||
dfd5d5be CF |
216 | 2012-05-08 cgf-000004 |
217 | ||
61c33dbf CF |
218 | The change for cgf-000003 introduced a new problem: |
219 | http://cygwin.com/ml/cygwin/2012-05/msg00154.html | |
220 | http://cygwin.com/ml/cygwin/2012-05/msg00157.html | |
dfd5d5be CF |
221 | |
222 | Since a handle associated with the parent is no longer being duplicated | |
223 | into a non-cygwin "execed child", Windows is free to reuse the pid of | |
224 | the parent when the parent exits. However, since we *did* duplicate a | |
225 | handle pointing to the pid's shared memory area into the "execed child", | |
226 | the shared memory for the pid was still active. | |
227 | ||
228 | Since the shared memory was still available, if a new process reuses the | |
229 | previous pid, Cygwin would detect that the shared memory was not created | |
230 | and had a "PID_REAPED" flag. That was considered an error, and, so, it | |
231 | would set procinfo to NULL and pinfo::thisproc would die since this | |
232 | situation is not supposed to occur. | |
233 | ||
234 | I fixed this in two ways: | |
235 | ||
236 | 1) If a shared memory region has a PID_REAPED flag then zero it and | |
237 | reuse it. This should be safe since you are not really supposed to be | |
238 | querying the shared memory region for anything after PID_REAPED has been | |
239 | set. | |
240 | ||
241 | 2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin | |
242 | child for exec. | |
243 | ||
244 | It seems like 2) is a common theme and an audit of all of the handles | |
245 | that are being passed to non-cygwin children is in order for 1.7.16. | |
246 | ||
247 | The other minor modification that was made in this change was to add the | |
248 | pid of the failing process to fork error output. This helps slightly | |
249 | when looking at strace output, even though in this case it was easy to | |
250 | find what was failing by looking for '^---' when running the "stv" | |
251 | strace dumper. That found the offending exception quickly. | |
252 | ||
253 | 2012-05-07 cgf-000003 | |
06bd0ef2 CF |
254 | |
255 | <1.7.15> | |
256 | Don't make Cygwin wait for all children of a non-cygwin child program. | |
257 | Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html, | |
258 | http://cygwin.com/ml/cygwin/2012-05/msg00075.html | |
259 | </1.7.15> | |
260 | ||
261 | This problem is due to a recent change which added some robustness and | |
262 | speed to Cygwin's exec/spawn handling by not trying to force inheritance | |
263 | every time a process is started. See ChangeLog entries starting on | |
264 | 2012-03-20, and multiple on 2012-03-21. | |
265 | ||
266 | Making the handle inheritable meant that, as usual, there were problems | |
267 | with non-Cygwin processes. When Cygwin "execs" a non-Cygwin process N, | |
268 | all of its N + 1, N + 2, ... children will also inherit the handle. | |
269 | That means that Cygwin will wait until all subprocesses have exited | |
270 | before it returns. | |
271 | ||
272 | I was willing to make this a restriction of starting non-Cygwin | |
273 | processes but the problem with allowing that is that it can cause the | |
274 | creation of a "limbo" pid when N exits and N + 1 and friends are still | |
275 | around. In this scenario, Cygwin dutifully notices that process N has | |
276 | died and sets the exit code to indicate that but N's parent will wait on | |
277 | rd_proc_pipe and will only return when every N + ... windows process | |
278 | has exited. | |
279 | ||
280 | The removal of cygheap::pid_handle was not related to the initial | |
281 | problem that I set out to fix. The change came from the realization | |
282 | that we were duping the current process handle into the child twice and | |
283 | only needed to do it once. The current process handle is used by exec | |
284 | to keep the Windows pid "alive" so that it will not be reused. So, now | |
285 | we just close parent in child_info_spawn::handle_spawn iff we're not | |
286 | execing. | |
287 | ||
288 | In debugging this it bothered me that 'ps' identified a nonactive pid as | |
289 | active. Part of the reason for this was the 'parent' handle in | |
290 | child_info was opened in non-Cygwin processes, keeping the pid alive. | |
291 | That has been kluged around (more changes after 1.7.15) but that didn't | |
292 | fix the problem. On further investigation, this seems to be caused by | |
293 | the fact that the shared memory region pid handles were still being | |
294 | passed to non-cygwin children, keeping the pid alive in a limbo-like | |
295 | fashion. This was easily fixed by having pinfo::init() consider a | |
51180c08 CF |
296 | memory region with PID_REAPED as not available. A more robust fix |
297 | should be considered for 1.7.15+ where these handles are not passed | |
298 | to non-cygwin processes. | |
06bd0ef2 CF |
299 | |
300 | This fixed the problem where a pid showed up in the list after a user | |
301 | does something like: "bash$ cmd /c start notepad" but, for some reason, | |
302 | it does not fix the problem where "bash$ setsid cmd /c start notepad". | |
303 | That bears investigation after 1.7.15 is released but it is not a | |
304 | regression and so is not a blocker for the release. | |
305 | ||
fb9d6318 CF |
306 | 2012-05-03 cgf-000002 |
307 | ||
308 | <1.7.15> | |
309 | Fix problem where too much input was attempted to be read from a | |
310 | pty slave. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html | |
311 | </1.7.15> | |
312 | ||
313 | My change on 2012/04/05 reintroduced the problem first described by: | |
314 | http://cygwin.com/ml/cygwin/2011-10/threads.html#00445 | |
315 | ||
316 | The problem then was, IIRC, due to the fact that bytes sent to the pty | |
317 | pipe were not written as records. Changing pipe to PIPE_TYPE_MESSAGE in | |
318 | pipe.cc fixed the problem since writing lines to one side of the pipe | |
319 | caused exactly that the number of characters to be read on the other | |
320 | even if there were more characters in the pipe. | |
321 | ||
322 | To debug this, I first replaced fhandler_tty.cc with the 1.258, | |
323 | 2012/04/05 version. The test case started working when I did that. | |
324 | ||
325 | So, then, I replaced individual functions, one at a time, in | |
326 | fhandler_tty.cc with their previous versions. I'd expected this to be a | |
327 | problem with fhandler_pty_master::process_slave_output since that had | |
328 | seen the most changes but was surprised to see that the culprit was | |
329 | fhandler_pty_slave::read(). | |
330 | ||
331 | The reason was that I really needed the bytes_available() function to | |
332 | return the number of bytes which would be read in the next operation | |
333 | rather than the number of bytes available in the pipe. That's because | |
334 | there may be a number of lines available to be read but the number of | |
335 | bytes which will be read by ReadFile should reflect the mode of the pty | |
336 | and, if there is a line to read, only the number of bytes in the line | |
337 | should be seen as available for the next read. | |
338 | ||
339 | Having bytes_available() return the number of bytes which would be read | |
340 | seemed to fix the problem but it could subtly change the behavior of | |
341 | other callers of this function. However, I actually think this is | |
342 | probably a good thing since they probably should have been seeing the | |
343 | line behavior. | |
344 | ||
b79c0094 CF |
345 | 2012-05-02 cgf-000001 |
346 | ||
0386e4bf CF |
347 | <1.7.15> |
348 | Fix problem setting parent pid to 1 when process with children execs | |
349 | itself. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html | |
350 | </1.7.15> | |
b79c0094 | 351 | |
0386e4bf CF |
352 | Investigating this problem with strace showed that ssh-agent was |
353 | checking the parent pid and getting a 1 when it shouldn't have. Other | |
354 | stuff looked ok so I chose to consider this a smoking gun. | |
b79c0094 CF |
355 | |
356 | Going back to the version that the OP said did not have the problem, I | |
357 | worked forward until I found where the problem first occurred - | |
358 | somewhere around 2012-03-19. And, indeed, the getppid call returned the | |
359 | correct value in the working version. That means that this stopped | |
360 | working when I redid the way the process pipe was inherited around | |
361 | this time period. | |
362 | ||
363 | It isn't clear why (and I suspect I may have to debug this further at | |
61c33dbf | 364 | some point) this hasn't always been a problem but I made the obvious fix. |
b79c0094 CF |
365 | We shouldn't have been setting ppid = 1 when we're about to pass off to |
366 | an execed process. | |
367 | ||
368 | As I was writing this, I realized that it was necessary to add some | |
369 | additional checks. Just checking for "have_execed" isn't enough. If | |
370 | we've execed a non-cygwin process then it won't know how to deal with | |
371 | any inherited children. So, always set ppid = 1 if we've execed a | |
372 | non-cygwin process. |