[newlib-cygwin.git] / winsup / cygwin / DevNotes

2012-07-21  cgf-000013

These changes reflect a revamp of the "wait for signal" functionality
which has existed in Cygwin through several signal massages.

We now create a signal event only when a thread is waiting for a signal
and arm it only for that thread.  The "set_thread_waiting" function is
used to establish the event and set it in a location referencable by
the caller.

I still do not handle all of the race conditions.  What happens when
a signal comes in just after a WF?O succeeds for some other purpose?  I
suspect that it will arm the next WF?O call and the subsequent call to
call_signal_handler could cause a function to get an EINTR when possibly
it shouldn't have.

I haven't yet checked all of the test cases for the URL listed in the
previous entry.

Baby steps.

2012-06-12  cgf-000012

These changes are the preliminary for redoing the way threads wait for
signals.  The problems are shown by the test case mentioned here:

http://cygwin.com/ml/cygwin/2012-05/msg00434.html

I've known that the signal handling in threads wasn't quite right for
some time.  I lost all of my thread signal tests in the great "rm -r"
debacle of a few years ago and have been less than enthusiastic about
redoing everything (I had PCTS tests and everything).  But it really is
time to redo this signal handling to make it more like it is supposed to
be.

This change should not introduce any new behavior.  Things should
continue to behave as before.  The major differences are a change in the
arguments to cancelable_wait and cygwait now uses cancelable_wait and,
so, the returns from cygwait now mirror cancelable_wait.

The next change will consolidate cygwait and cancelable_wait into one
cygwait function.

2012-06-02  cgf-000011

The refcnt handling was tricky to get right but I had convinced myself
that the refcnt's were always incremented/decremented under a lock.
Corinna's 2012-05-23 change to refcnt exposed a potential problem with
dup handling where the fdtab could be updated while not locked.

That should be fixed by this change but, on closer examination, it seems
like there are many places where it is possible for the refcnt to be
updated while the fdtab is not locked since the default for
cygheap_fdget is to not lock the fdtab (and that should be the default -
you can't have read holding a lock).

Since refcnt was only ever called with 1 or -1, I broke it up into two
functions but kept the Interlocked* operation.  Incrementing a variable
should not be as racy as adding an arbitrary number to it but we have
InterlockedIncrement/InterlockedDecrement for a reason so I kept the
Interlocked operation here.

In the meantime, I'll be mulling over whether the refcnt operations are
actually safe as they are.  Maybe just ensuring that they are atomically
updated is enough since they control the destruction of an fh.  If I got
the ordering right with incrementing and decrementing then that should
be adequate.

2012-06-02  cgf-000010

<1.7.16>
- Fix emacs problem which exposed an issue with Cygwin's select() function.
  If a signal arrives while select is blocking and the program longjmps
  out of the signal handler then threads and memory may be left hanging.
  Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
</1.7.16>

This was try #4 or #5 to get select() signal handling working right.
It's still not there but it should now at least not leak memory or
threads.

I mucked with the interface between cygwin_select and select_stuff::wait
so that the "new" loop in select_stuff::wait() was essentially moved
into the caller.  cygwin_select now uses various enum states to decide
what to do.  It builds the select linked list at the beginning of the
loop, allowing wait() to tear everything down and restart.  This is
necessary before calling a signal handler because the signal handler may
longjmp away.

I initially had this all coded up to use a special signal_cleanup
callback which could be called when a longjmp is called in a signal
handler.  And cygwin_select() set up and tore down this callback.  Once
I got everything compiling it, of course, dawned on me that just because
you call a longjmp in a signal handler it doesn't mean that you are
jumping *out* of the signal handler.  So, if the signal handler invokes
the callback and returns it will be very bad for select().  Hence, this
slower, but hopefully more correct implementation.

(I still wonder if some sort of signal cleanup callback might still
be useful in the future)

TODO: I need to do an audit of other places where this problem could be
occurring.

As alluded to above, select's signal handling is still not right.  It
still acts as if it could call a signal handler from something other
than the main thread but, AFAICT, from my STC, this doesn't seem to be
the case.  It might be worthwhile to extend cygwait to just magically
figure this out and not even bother using w4[0] for scenarios like this.

2012-05-16  cgf-000009

<1.7.16>
- Fix broken console mouse handling.  Reported here:
  http://cygwin.com/ml/cygwin/2012-05/msg00360.html
</1.7.16>

I did a cvs annotate on smallprint.cc and see that the code to translate
%characters > 127 to 0x notation was in the 1.1 revision.  Then I
checked the smallprint.c predecessor.  It was in the 1.1 version of that
program too, which means that this odd change has probably been around
since <= 2000.

Since __small_sprintf is supposed to emulate sprintf, I got rid of the
special case handling.  This may affect fhandler_socket::bind.  If so, we
should work around this problem there rather than keeping this strange
hack in __small_printf.

2012-05-14  cgf-000008

<1.7.16>
- Fix hang when zero bytes are written to a pty using
  Windows WriteFile or equivalent.  Fixes:
  http://cygwin.com/ml/cygwin/2012-05/msg00323.html
</1.7.16>

cgf-000002, as usual, fixed one thing while breaking another.  See
Larry's predicament in: http://goo.gl/oGEr2 .

The problem is that zero byte writes to the pty pipe caused the dread
end-of-the-world-as-we-know-it problem reported on the mailing list
where ReadFile reads zero bytes even though there is still more to read
on the pipe.  This is because that change caused a 'record' to be read
and a record can be zero bytes.

I was never really keen about using a throwaway buffer just to get a
count of the number of characters available to be read in the pty pipe.
On closer reading of the documentation for PeekNamedPipe it seemed like
the sixth argument to PeekNamedPipe should return what I needed without
using a buffer.  And, amazingly, it did, except that the problem still
remained - a zero byte message still screwed things up.

So, we now detect the case where there is zero bytes available as a
message but there are bytes available in the pipe.  In that scenario,
return the bytes available in the pipe rather than the message length of
zero.  This could conceivably cause problems with pty pipe handling in
this scenario but since the only way this scenario could possibly happen
is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
ok with that.

2012-05-14  cgf-000007

<1.7.16>
- Fix invocation of strace from a cygwin process.  Fixes:
  http://cygwin.com/ml/cygwin/2012-05/msg00292.html
</1.7.16>

The change in cgf-000004 introduced a problem for processes which load
cygwin1.dll dynamically.  strace.exe is the most prominent example of
this.

Since the parent handle is now closed for "non-Cygwin" processes, when
strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
and child_info_spawn::handle_spawn couldn't use retrieve information
from the parent.  This eventually led to a strace_printf error due to an
attempt to dereference an unavailable cygheap.  Probably have to fix
this someday.  You shouldn't use the cygheap while attempting to print
an error about the inavailability of said cygheap.

This was fixed by saving the parent pid in child_info_spawn and calling
OpenProcess for the parent pid and using that handle iff a process is
dynamically loaded.

2012-05-12  cgf-000006

<1.7.16>
- Fix hang when calling pthread_testcancel in a canceled thread.
  Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
</1.7.16>

This should fix the first part of the reported problem in the above
message.  The cancel seemed to actually be working but, the fprintf
eventually ended up calling pthread_testcancel.  Since we'd gotten here
via a cancel, it tried to recursively call the cancel handler causing a
recursive loop.

2012-05-12  cgf-000005

<1.7.16>
- Fix pipe creation problem which manifested as a problem creating a
fifo.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
</1.7.16>

My change on 2012-04-28 introduced a problem with fifos.  The passed
in name was overwritten.  This was because I wasn't properly keeping
track of the length of the generated pipe name when there was a
name passed in to fhandler_pipe::create.

There was also another problem in fhandler_pipe::create.  Since fifos
use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong.  So,
fifo creation attempted to add the pid of a pipe to the name which is
wrong for fifos.

2012-05-08  cgf-000004

The change for cgf-000003 introduced a new problem:
http://cygwin.com/ml/cygwin/2012-05/msg00154.html
http://cygwin.com/ml/cygwin/2012-05/msg00157.html

Since a handle associated with the parent is no longer being duplicated
into a non-cygwin "execed child", Windows is free to reuse the pid of
the parent when the parent exits.  However, since we *did* duplicate a
handle pointing to the pid's shared memory area into the "execed child",
the shared memory for the pid was still active.

Since the shared memory was still available, if a new process reuses the
previous pid, Cygwin would detect that the shared memory was not created
and had a "PID_REAPED" flag.  That was considered an error, and, so, it
would set procinfo to NULL and pinfo::thisproc would die since this
situation is not supposed to occur.

I fixed this in two ways:

1) If a shared memory region has a PID_REAPED flag then zero it and
reuse it.  This should be safe since you are not really supposed to be
querying the shared memory region for anything after PID_REAPED has been
set.

2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
child for exec.

It seems like 2) is a common theme and an audit of all of the handles
that are being passed to non-cygwin children is in order for 1.7.16.

The other minor modification that was made in this change was to add the
pid of the failing process to fork error output.  This helps slightly
when looking at strace output, even though in this case it was easy to
find what was failing by looking for '^---' when running the "stv"
strace dumper.  That found the offending exception quickly.

2012-05-07  cgf-000003

<1.7.15>
Don't make Cygwin wait for all children of a non-cygwin child program.
Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
       http://cygwin.com/ml/cygwin/2012-05/msg00075.html
</1.7.15>

This problem is due to a recent change which added some robustness and
speed to Cygwin's exec/spawn handling by not trying to force inheritance
every time a process is started.  See ChangeLog entries starting on
2012-03-20, and multiple on 2012-03-21.

Making the handle inheritable meant that, as usual, there were problems
with non-Cygwin processes.  When Cygwin "execs" a non-Cygwin process N,
all of its N + 1, N + 2, ...  children will also inherit the handle.
That means that Cygwin will wait until all subprocesses have exited
before it returns.

I was willing to make this a restriction of starting non-Cygwin
processes but the problem with allowing that is that it can cause the
creation of a "limbo" pid when N exits and N + 1 and friends are still
around.  In this scenario, Cygwin dutifully notices that process N has
died and sets the exit code to indicate that but N's parent will wait on
rd_proc_pipe and will only return when every N + ...  windows process
has exited.

The removal of cygheap::pid_handle was not related to the initial
problem that I set out to fix.  The change came from the realization
that we were duping the current process handle into the child twice and
only needed to do it once.  The current process handle is used by exec
to keep the Windows pid "alive" so that it will not be reused.  So, now
we just close parent in child_info_spawn::handle_spawn iff we're not
execing.

In debugging this it bothered me that 'ps' identified a nonactive pid as
active.  Part of the reason for this was the 'parent' handle in
child_info was opened in non-Cygwin processes, keeping the pid alive.
That has been kluged around (more changes after 1.7.15) but that didn't
fix the problem.  On further investigation, this seems to be caused by
the fact that the shared memory region pid handles were still being
passed to non-cygwin children, keeping the pid alive in a limbo-like
fashion.  This was easily fixed by having pinfo::init() consider a
memory region with PID_REAPED as not available.  A more robust fix
should be considered for 1.7.15+ where these handles are not passed
to non-cygwin processes.

This fixed the problem where a pid showed up in the list after a user
does something like: "bash$ cmd /c start notepad" but, for some reason,
it does not fix the problem where "bash$ setsid cmd /c start notepad".
That bears investigation after 1.7.15 is released but it is not a
regression and so is not a blocker for the release.

2012-05-03  cgf-000002

<1.7.15>
Fix problem where too much input was attempted to be read from a
pty slave.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
</1.7.15>

My change on 2012/04/05 reintroduced the problem first described by:
http://cygwin.com/ml/cygwin/2011-10/threads.html#00445

The problem then was, IIRC, due to the fact that bytes sent to the pty
pipe were not written as records.  Changing pipe to PIPE_TYPE_MESSAGE in
pipe.cc fixed the problem since writing lines to one side of the pipe
caused exactly that the number of characters to be read on the other
even if there were more characters in the pipe.

To debug this, I first replaced fhandler_tty.cc with the 1.258,
2012/04/05 version.  The test case started working when I did that.

So, then, I replaced individual functions, one at a time, in
fhandler_tty.cc with their previous versions.  I'd expected this to be a
problem with fhandler_pty_master::process_slave_output since that had
seen the most changes but was surprised to see that the culprit was
fhandler_pty_slave::read().

The reason was that I really needed the bytes_available() function to
return the number of bytes which would be read in the next operation
rather than the number of bytes available in the pipe.  That's because
there may be a number of lines available to be read but the number of
bytes which will be read by ReadFile should reflect the mode of the pty
and, if there is a line to read, only the number of bytes in the line
should be seen as available for the next read.

Having bytes_available() return the number of bytes which would be read
seemed to fix the problem but it could subtly change the behavior of
other callers of this function.  However, I actually think this is
probably a good thing since they probably should have been seeing the
line behavior.

2012-05-02  cgf-000001

<1.7.15>
Fix problem setting parent pid to 1 when process with children execs
itself.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
</1.7.15>

Investigating this problem with strace showed that ssh-agent was
checking the parent pid and getting a 1 when it shouldn't have.  Other
stuff looked ok so I chose to consider this a smoking gun.

Going back to the version that the OP said did not have the problem, I
worked forward until I found where the problem first occurred -
somewhere around 2012-03-19.  And, indeed, the getppid call returned the
correct value in the working version.  That means that this stopped
working when I redid the way the process pipe was inherited around
this time period.

It isn't clear why (and I suspect I may have to debug this further at
some point) this hasn't always been a problem but I made the obvious fix.
We shouldn't have been setting ppid = 1 when we're about to pass off to
an execed process.

As I was writing this, I realized that it was necessary to add some
additional checks.  Just checking for "have_execed" isn't enough.  If
we've execed a non-cygwin process then it won't know how to deal with
any inherited children.  So, always set ppid = 1 if we've execed a
non-cygwin process.
Commit	Line	Data
962f9a2c CF	1	2012-07-21 cgf-000013
	2
	3	These changes reflect a revamp of the "wait for signal" functionality
	4	which has existed in Cygwin through several signal massages.
	5
	6	We now create a signal event only when a thread is waiting for a signal
	7	and arm it only for that thread. The "set_thread_waiting" function is
	8	used to establish the event and set it in a location referencable by
	9	the caller.
	10
	11	I still do not handle all of the race conditions. What happens when
	12	a signal comes in just after a WF?O succeeds for some other purpose? I
	13	suspect that it will arm the next WF?O call and the subsequent call to
	14	call_signal_handler could cause a function to get an EINTR when possibly
	15	it shouldn't have.
	16
	17	I haven't yet checked all of the test cases for the URL listed in the
	18	previous entry.
	19
	20	Baby steps.
	21
4ae63783 CF	22	2012-06-12 cgf-000012
	23
	24	These changes are the preliminary for redoing the way threads wait for
	25	signals. The problems are shown by the test case mentioned here:
	26
	27	http://cygwin.com/ml/cygwin/2012-05/msg00434.html
	28
	29	I've known that the signal handling in threads wasn't quite right for
	30	some time. I lost all of my thread signal tests in the great "rm -r"
	31	debacle of a few years ago and have been less than enthusiastic about
8bec43b3	32	redoing everything (I had PCTS tests and everything). But it really is
4ae63783 CF	33	time to redo this signal handling to make it more like it is supposed to
	34	be.
	35
	36	This change should not introduce any new behavior. Things should
	37	continue to behave as before. The major differences are a change in the
	38	arguments to cancelable_wait and cygwait now uses cancelable_wait and,
	39	so, the returns from cygwait now mirror cancelable_wait.
	40
	41	The next change will consolidate cygwait and cancelable_wait into one
	42	cygwait function.
	43
3143cb7c CF	44	2012-06-02 cgf-000011
	45
	46	The refcnt handling was tricky to get right but I had convinced myself
	47	that the refcnt's were always incremented/decremented under a lock.
	48	Corinna's 2012-05-23 change to refcnt exposed a potential problem with
	49	dup handling where the fdtab could be updated while not locked.
	50
	51	That should be fixed by this change but, on closer examination, it seems
aa01a03c	52	like there are many places where it is possible for the refcnt to be
3143cb7c CF	53	updated while the fdtab is not locked since the default for
	54	cygheap_fdget is to not lock the fdtab (and that should be the default -
	55	you can't have read holding a lock).
	56
	57	Since refcnt was only ever called with 1 or -1, I broke it up into two
	58	functions but kept the Interlocked* operation. Incrementing a variable
	59	should not be as racy as adding an arbitrary number to it but we have
	60	InterlockedIncrement/InterlockedDecrement for a reason so I kept the
	61	Interlocked operation here.
	62
	63	In the meantime, I'll be mulling over whether the refcnt operations are
	64	actually safe as they are. Maybe just ensuring that they are atomically
	65	updated is enough since they control the destruction of an fh. If I got
	66	the ordering right with incrementing and decrementing then that should
	67	be adequate.
	68
45b61a88 CF	69	2012-06-02 cgf-000010
	70
	71	<1.7.16>
	72	- Fix emacs problem which exposed an issue with Cygwin's select() function.
	73	If a signal arrives while select is blocking and the program longjmps
	74	out of the signal handler then threads and memory may be left hanging.
	75	Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
	76	</1.7.16>
	77
	78	This was try #4 or #5 to get select() signal handling working right.
	79	It's still not there but it should now at least not leak memory or
	80	threads.
	81
	82	I mucked with the interface between cygwin_select and select_stuff::wait
	83	so that the "new" loop in select_stuff::wait() was essentially moved
	84	into the caller. cygwin_select now uses various enum states to decide
	85	what to do. It builds the select linked list at the beginning of the
	86	loop, allowing wait() to tear everything down and restart. This is
	87	necessary before calling a signal handler because the signal handler may
	88	longjmp away.
	89
	90	I initially had this all coded up to use a special signal_cleanup
	91	callback which could be called when a longjmp is called in a signal
	92	handler. And cygwin_select() set up and tore down this callback. Once
	93	I got everything compiling it, of course, dawned on me that just because
	94	you call a longjmp in a signal handler it doesn't mean that you are
	95	jumping out of the signal handler. So, if the signal handler invokes
	96	the callback and returns it will be very bad for select(). Hence, this
	97	slower, but hopefully more correct implementation.
	98
	99	(I still wonder if some sort of signal cleanup callback might still
	100	be useful in the future)
	101
	102	TODO: I need to do an audit of other places where this problem could be
	103	occurring.
	104
	105	As alluded to above, select's signal handling is still not right. It
	106	still acts as if it could call a signal handler from something other
	107	than the main thread but, AFAICT, from my STC, this doesn't seem to be
	108	the case. It might be worthwhile to extend cygwait to just magically
	109	figure this out and not even bother using w4[0] for scenarios like this.
	110
fe66a97a CF	111	2012-05-16 cgf-000009
	112
	113	<1.7.16>
	114	- Fix broken console mouse handling. Reported here:
	115	http://cygwin.com/ml/cygwin/2012-05/msg00360.html
	116	</1.7.16>
	117
	118	I did a cvs annotate on smallprint.cc and see that the code to translate
	119	%characters > 127 to 0x notation was in the 1.1 revision. Then I
	120	checked the smallprint.c predecessor. It was in the 1.1 version of that
	121	program too, which means that this odd change has probably been around
	122	since <= 2000.
	123
	124	Since __small_sprintf is supposed to emulate sprintf, I got rid of the
	125	special case handling. This may affect fhandler_socket::bind. If so, we
	126	should work around this problem there rather than keeping this strange
	127	hack in __small_printf.
	128
bd8afa5e CF	129	2012-05-14 cgf-000008
	130
	131	<1.7.16>
	132	- Fix hang when zero bytes are written to a pty using
	133	Windows WriteFile or equivalent. Fixes:
	134	http://cygwin.com/ml/cygwin/2012-05/msg00323.html
	135	</1.7.16>
	136
	137	cgf-000002, as usual, fixed one thing while breaking another. See
	138	Larry's predicament in: http://goo.gl/oGEr2 .
	139
	140	The problem is that zero byte writes to the pty pipe caused the dread
	141	end-of-the-world-as-we-know-it problem reported on the mailing list
	142	where ReadFile reads zero bytes even though there is still more to read
	143	on the pipe. This is because that change caused a 'record' to be read
	144	and a record can be zero bytes.
	145
	146	I was never really keen about using a throwaway buffer just to get a
	147	count of the number of characters available to be read in the pty pipe.
	148	On closer reading of the documentation for PeekNamedPipe it seemed like
	149	the sixth argument to PeekNamedPipe should return what I needed without
	150	using a buffer. And, amazingly, it did, except that the problem still
	151	remained - a zero byte message still screwed things up.
	152
	153	So, we now detect the case where there is zero bytes available as a
	154	message but there are bytes available in the pipe. In that scenario,
	155	return the bytes available in the pipe rather than the message length of
	156	zero. This could conceivably cause problems with pty pipe handling in
	157	this scenario but since the only way this scenario could possibly happen
	158	is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
	159	ok with that.
	160
3de7be4c CF	161	2012-05-14 cgf-000007
	162
	163	<1.7.16>
	164	- Fix invocation of strace from a cygwin process. Fixes:
	165	http://cygwin.com/ml/cygwin/2012-05/msg00292.html
	166	</1.7.16>
	167
	168	The change in cgf-000004 introduced a problem for processes which load
	169	cygwin1.dll dynamically. strace.exe is the most prominent example of
	170	this.
	171
	172	Since the parent handle is now closed for "non-Cygwin" processes, when
	173	strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
	174	and child_info_spawn::handle_spawn couldn't use retrieve information
	175	from the parent. This eventually led to a strace_printf error due to an
	176	attempt to dereference an unavailable cygheap. Probably have to fix
	177	this someday. You shouldn't use the cygheap while attempting to print
	178	an error about the inavailability of said cygheap.
	179
	180	This was fixed by saving the parent pid in child_info_spawn and calling
	181	OpenProcess for the parent pid and using that handle iff a process is
	182	dynamically loaded.
	183
1f994848 CF	184	2012-05-12 cgf-000006
	185
	186	<1.7.16>
	187	- Fix hang when calling pthread_testcancel in a canceled thread.
	188	Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
	189	</1.7.16>
	190
	191	This should fix the first part of the reported problem in the above
	192	message. The cancel seemed to actually be working but, the fprintf
	193	eventually ended up calling pthread_testcancel. Since we'd gotten here
	194	via a cancel, it tried to recursively call the cancel handler causing a
	195	recursive loop.
	196
348b56b5 CF	197	2012-05-12 cgf-000005
	198
	199	<1.7.16>
	200	- Fix pipe creation problem which manifested as a problem creating a
	201	fifo. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
	202	</1.7.16>
	203
	204	My change on 2012-04-28 introduced a problem with fifos. The passed
	205	in name was overwritten. This was because I wasn't properly keeping
	206	track of the length of the generated pipe name when there was a
	207	name passed in to fhandler_pipe::create.
	208
	209	There was also another problem in fhandler_pipe::create. Since fifos
	210	use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
	211	PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
	212	as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong. So,
	213	fifo creation attempted to add the pid of a pipe to the name which is
	214	wrong for fifos.
	215
dfd5d5be CF	216	2012-05-08 cgf-000004
dfd5d5be CF	217
61c33dbf CF	218	The change for cgf-000003 introduced a new problem:
	219	http://cygwin.com/ml/cygwin/2012-05/msg00154.html
	220	http://cygwin.com/ml/cygwin/2012-05/msg00157.html
dfd5d5be CF	221
	222	Since a handle associated with the parent is no longer being duplicated
	223	into a non-cygwin "execed child", Windows is free to reuse the pid of
	224	the parent when the parent exits. However, since we did duplicate a
	225	handle pointing to the pid's shared memory area into the "execed child",
	226	the shared memory for the pid was still active.
	227
	228	Since the shared memory was still available, if a new process reuses the
	229	previous pid, Cygwin would detect that the shared memory was not created
	230	and had a "PID_REAPED" flag. That was considered an error, and, so, it
	231	would set procinfo to NULL and pinfo::thisproc would die since this
	232	situation is not supposed to occur.
	233
	234	I fixed this in two ways:
	235
	236	1) If a shared memory region has a PID_REAPED flag then zero it and
	237	reuse it. This should be safe since you are not really supposed to be
	238	querying the shared memory region for anything after PID_REAPED has been
	239	set.
	240
	241	2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
	242	child for exec.
	243
	244	It seems like 2) is a common theme and an audit of all of the handles
	245	that are being passed to non-cygwin children is in order for 1.7.16.
	246
	247	The other minor modification that was made in this change was to add the
	248	pid of the failing process to fork error output. This helps slightly
	249	when looking at strace output, even though in this case it was easy to
	250	find what was failing by looking for '^---' when running the "stv"
	251	strace dumper. That found the offending exception quickly.
	252
	253	2012-05-07 cgf-000003
06bd0ef2 CF	254
	255	<1.7.15>
	256	Don't make Cygwin wait for all children of a non-cygwin child program.
	257	Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
	258	http://cygwin.com/ml/cygwin/2012-05/msg00075.html
	259	</1.7.15>
	260
	261	This problem is due to a recent change which added some robustness and
	262	speed to Cygwin's exec/spawn handling by not trying to force inheritance
	263	every time a process is started. See ChangeLog entries starting on
	264	2012-03-20, and multiple on 2012-03-21.
	265
	266	Making the handle inheritable meant that, as usual, there were problems
	267	with non-Cygwin processes. When Cygwin "execs" a non-Cygwin process N,
	268	all of its N + 1, N + 2, ... children will also inherit the handle.
	269	That means that Cygwin will wait until all subprocesses have exited
	270	before it returns.
	271
	272	I was willing to make this a restriction of starting non-Cygwin
	273	processes but the problem with allowing that is that it can cause the
	274	creation of a "limbo" pid when N exits and N + 1 and friends are still
	275	around. In this scenario, Cygwin dutifully notices that process N has
	276	died and sets the exit code to indicate that but N's parent will wait on
	277	rd_proc_pipe and will only return when every N + ... windows process
	278	has exited.
	279
	280	The removal of cygheap::pid_handle was not related to the initial
	281	problem that I set out to fix. The change came from the realization
	282	that we were duping the current process handle into the child twice and
	283	only needed to do it once. The current process handle is used by exec
	284	to keep the Windows pid "alive" so that it will not be reused. So, now
	285	we just close parent in child_info_spawn::handle_spawn iff we're not
	286	execing.
	287
	288	In debugging this it bothered me that 'ps' identified a nonactive pid as
	289	active. Part of the reason for this was the 'parent' handle in
	290	child_info was opened in non-Cygwin processes, keeping the pid alive.
	291	That has been kluged around (more changes after 1.7.15) but that didn't
	292	fix the problem. On further investigation, this seems to be caused by
	293	the fact that the shared memory region pid handles were still being
	294	passed to non-cygwin children, keeping the pid alive in a limbo-like
	295	fashion. This was easily fixed by having pinfo::init() consider a
51180c08 CF	296	memory region with PID_REAPED as not available. A more robust fix
	297	should be considered for 1.7.15+ where these handles are not passed
	298	to non-cygwin processes.
06bd0ef2 CF	299
	300	This fixed the problem where a pid showed up in the list after a user
	301	does something like: "bash$ cmd /c start notepad" but, for some reason,
	302	it does not fix the problem where "bash$ setsid cmd /c start notepad".
	303	That bears investigation after 1.7.15 is released but it is not a
	304	regression and so is not a blocker for the release.
	305
fb9d6318 CF	306	2012-05-03 cgf-000002
	307
	308	<1.7.15>
	309	Fix problem where too much input was attempted to be read from a
	310	pty slave. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
	311	</1.7.15>
	312
	313	My change on 2012/04/05 reintroduced the problem first described by:
	314	http://cygwin.com/ml/cygwin/2011-10/threads.html#00445
	315
	316	The problem then was, IIRC, due to the fact that bytes sent to the pty
	317	pipe were not written as records. Changing pipe to PIPE_TYPE_MESSAGE in
	318	pipe.cc fixed the problem since writing lines to one side of the pipe
	319	caused exactly that the number of characters to be read on the other
	320	even if there were more characters in the pipe.
	321
	322	To debug this, I first replaced fhandler_tty.cc with the 1.258,
	323	2012/04/05 version. The test case started working when I did that.
	324
	325	So, then, I replaced individual functions, one at a time, in
	326	fhandler_tty.cc with their previous versions. I'd expected this to be a
	327	problem with fhandler_pty_master::process_slave_output since that had
	328	seen the most changes but was surprised to see that the culprit was
	329	fhandler_pty_slave::read().
	330
	331	The reason was that I really needed the bytes_available() function to
	332	return the number of bytes which would be read in the next operation
	333	rather than the number of bytes available in the pipe. That's because
	334	there may be a number of lines available to be read but the number of
	335	bytes which will be read by ReadFile should reflect the mode of the pty
	336	and, if there is a line to read, only the number of bytes in the line
	337	should be seen as available for the next read.
	338
	339	Having bytes_available() return the number of bytes which would be read
	340	seemed to fix the problem but it could subtly change the behavior of
	341	other callers of this function. However, I actually think this is
	342	probably a good thing since they probably should have been seeing the
	343	line behavior.
	344
b79c0094 CF	345	2012-05-02 cgf-000001
b79c0094 CF	346
0386e4bf CF	347	<1.7.15>
	348	Fix problem setting parent pid to 1 when process with children execs
	349	itself. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
	350	</1.7.15>
b79c0094	351
0386e4bf CF	352	Investigating this problem with strace showed that ssh-agent was
	353	checking the parent pid and getting a 1 when it shouldn't have. Other
	354	stuff looked ok so I chose to consider this a smoking gun.
b79c0094 CF	355
	356	Going back to the version that the OP said did not have the problem, I
	357	worked forward until I found where the problem first occurred -
	358	somewhere around 2012-03-19. And, indeed, the getppid call returned the
	359	correct value in the working version. That means that this stopped
	360	working when I redid the way the process pipe was inherited around
	361	this time period.
	362
	363	It isn't clear why (and I suspect I may have to debug this further at
61c33dbf	364	some point) this hasn't always been a problem but I made the obvious fix.
b79c0094 CF	365	We shouldn't have been setting ppid = 1 when we're about to pass off to
	366	an execed process.
	367
	368	As I was writing this, I realized that it was necessary to add some
	369	additional checks. Just checking for "have_execed" isn't enough. If
	370	we've execed a non-cygwin process then it won't know how to deal with
	371	any inherited children. So, always set ppid = 1 if we've execed a
	372	non-cygwin process.