Expect goes crazy... spinning cpu in kill_pgrp

Tue Oct 20 07:43:00 GMT 2009

  I think this only ever happens when I've got multiple simultaneous very
long-running expect instances that have been thrashing the pid table quite
hard for a while.

  The symptom I see is that after running for many hours with (e.g.) two
expect instances ploughing through the tests, suddenly I will notice that
there are now three instances, one of which is hogging cpu and spinning
wildly.  Simultaneously with this, I will start seeing process fork failures
in other shells, all of the form

>    1465 [main] sh 13360 fork: child -1 - died waiting for longjmp before initialization, retry 0, exit code 0xC0000005, errno 11
>    1465 [main] sh 13360 fork: child -1 - died waiting for longjmp before initialization, retry 0, exit code 0xC0000005, errno 11

  The bug itself occurs in a forked child, which sends an ioctl to tty 0 that
ends up in the tty fhandler where it raises a SIGTTOU.  Somewhere in signal
handling, this gets turned into an abort, which leads to kill0 and thence to
kill_pgrp, and there is an exception during kill_pgrp:

> Program received signal SIGSEGV, Segmentation fault.
> 0x610b6d4d in kill_pgrp (pid=14592, si=@0x2234b0)
>     at /gnu/winsup/src/winsup/cygwin/pinfo.h:207

  The reported line number is slightly inaccurate; the fault actually happens
here:

> kill_pgrp (pid_t pid, siginfo_t& si)
> {
>   int res = 0;
>   int found = 0;
>   int killself = 0;
> 
>   sigproc_printf ("pid %d, signal %d", pid, si.si_signo);
> 
>   winpids pids ((DWORD) PID_MAP_RW);
>   for (unsigned i = 0; i < pids.npids; i++)
>     {
>       _pinfo *p = pids[i];
> 
>       if (!p->exists ())
> 	continue;
> 
>       /* Is it a process we want to kill?  */
>       if ((pid == 0 && (p->pgid != myself->pgid || p->ctty != myself->ctty)) ||
> 	  (pid > 1 && p->pgid != pid) ||
> 	  (si.si_signo < 0 && NOTSTATE (p, PID_STOPPED)))
> 	continue;

... during evaluation of the if condition.  Well, to cut a long story short:
the SEGV happens when dereferencing p the final time round the loop.

> (gdb) print pids.pinfolist[0x28]
> $73 = {h = 0xac, procinfo = 0x40d70000, destroy = true,
>   rd_proc_pipe = 0x40000000, hProcess = 0xb0, waiter_ready = false,
>   wait_thread = 0x0}
> (gdb) print pids.pinfolist[0x27]
> $74 = {h = 0xb4, procinfo = 0x40d60000, destroy = true, rd_proc_pipe = 0x0,
>   hProcess = 0xb8, waiter_ready = false, wait_thread = 0x0}
> (gdb) x/128xw 0x40d60000

  Although these two pinfos look pretty similar, the procinfo pointer in the
first one points to a block of memory that has only a single 4kB page
allocated, whereas the second one points to a block of 36kB of data.  The
offset to the ctty member is 0x8028, and so for the abnormally small pinfo
block that's just not there!  The pid is there, though, since that's the first
entry; it is 16356, which isn't a process that exists any more at the time of
the hang.
The process_state flag is 0x4001 - PID_EXECED | PID_IN_USE - and everything
else is zeros.
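
  To spell out the arithmetic, here is a minimal sketch of the bounds check
that the faulting access fails.  The offset and the two view sizes are the
ones from the gdb session above; field_is_mapped is a hypothetical helper of
mine, not anything in the Cygwin sources, and treating ctty as an int-sized
field is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Values observed in the debugging session above.
constexpr std::size_t kCttyOffset     = 0x8028;     // offset of the ctty member
constexpr std::size_t kSmallViewBytes = 4 * 1024;   // the single 4kB page
constexpr std::size_t kLargeViewBytes = 36 * 1024;  // the 36kB block

// Hypothetical helper: true if a field of `size` bytes starting at `offset`
// lies entirely inside a mapping that is `view_bytes` long.
constexpr bool field_is_mapped(std::size_t offset, std::size_t size,
                               std::size_t view_bytes)
{
  return offset <= view_bytes && size <= view_bytes - offset;
}

// Assumption: ctty is int-sized.  The conclusion does not depend on the exact
// width, because the offset alone is already past the end of the small view.
static_assert(!field_is_mapped(kCttyOffset, sizeof(int), kSmallViewBytes),
              "ctty is not reachable through the 4kB view");
static_assert(field_is_mapped(kCttyOffset, sizeof(int), kLargeViewBytes),
              "ctty is reachable through the 36kB view");
```

  The pid at offset 0 passes the same check even for the small view, which
matches what I see in the debugger: the pid is readable, ctty is not.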

  I'm not deeply familiar with the pinfo cache internals, but I seem to recall
some concern about the amount of memory taken up by the path[] element now
that we have long paths, and I see this #define PINFO_REDIR_SIZE, and how
pinfo::init uses it to map a large or small view of memory based on PID_EXECED.

  So, one thing I don't know is: is it wrong for kill_pgrp to want to know
about the ctty of an execed process, or should that data be moved into the
leading PINFO_REDIR_SIZE part?
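
  For the first of those two options, a rough sketch of what "don't look at
the ctty of an execed process" might mean in a loop of this shape.  This is
not a proposed patch: FakeEntry, full_record_available and
count_group_members are all hypothetical stand-ins, the struct is not the
real _pinfo layout, and the individual bit values are an assumption (all I
actually know is that 0x4001 decodes as PID_EXECED | PID_IN_USE).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed bit values; only their combination 0x4001 was observed.
constexpr std::uint32_t kPidInUse  = 0x0001;
constexpr std::uint32_t kPidExeced = 0x4000;

// Hypothetical stand-in for a process-table entry.
struct FakeEntry {
  int pid;
  std::uint32_t process_state;
  int pgid;  // stands for a field that may lie beyond a small view
  int ctty;  // likewise
};

// Hypothetical predicate: an execed stub only has its leading part mapped,
// so the later fields must not be touched.
inline bool full_record_available(const FakeEntry& e)
{
  return (e.process_state & kPidExeced) == 0;
}

// Count in-use entries in process group `pgid` on terminal `ctty`,
// skipping execed stubs before reading pgid or ctty.
inline int count_group_members(const std::vector<FakeEntry>& table,
                               int pgid, int ctty)
{
  int n = 0;
  for (const FakeEntry& e : table)
    {
      if (!(e.process_state & kPidInUse))
        continue;
      if (!full_record_available(e))
        continue;  // never read pgid/ctty of a stub
      if (e.pgid == pgid && e.ctty == ctty)
        ++n;
    }
  return n;
}
```

  The point of the ordering is only that the flag test comes before any
access to the fields that might not be mapped; whether skipping such entries
is actually the right semantics for kill_pgrp is exactly the question above.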

  When the SEGV happens, the exception handler correctly intercepts the SEH
exception thrown by the faulting access and converts it into a signal.  I
can't quite follow this part, but either way it ends up with abort being
called again, SIGABRT re-raised, and we end up in kill_pgrp again, where the
same SEGV happens all over again.  I don't suppose this is an entirely good
thing either; is there any mechanism to protect us against this kind of
re-entrancy if there is an exception during termination code like this?  Or do
we just have to take care not to cause exceptions?
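
  By "protect us against re-entrancy" I mean something along the lines of the
sketch below.  It is a generic once-only guard, not a mechanism I know to
exist in Cygwin; g_terminating, g_cleanup_runs and terminate_once are
hypothetical names, and the recursive call merely simulates the exception
handler re-raising the abort from inside the cleanup path.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical process-wide "termination already in progress" flag.
std::atomic_flag g_terminating = ATOMIC_FLAG_INIT;

// Counts how many times the full cleanup path actually ran.
int g_cleanup_runs = 0;

// Returns true if the full cleanup completed, false if re-entry was detected
// and the cleanup was skipped (a real implementation would presumably go
// straight to a hard exit at that point).
bool terminate_once(bool fault_during_cleanup)
{
  if (g_terminating.test_and_set())
    return false;  // already terminating: do not run the cleanup again

  ++g_cleanup_runs;

  if (fault_during_cleanup)
    // Simulates a fault inside cleanup whose handler calls abort again.
    return terminate_once(false);

  return true;
}
```

  With a guard like this the second entry bails out instead of walking the
pid table again, so a fault in the termination path would happen at most once
rather than looping.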

  The final thing that worries me is why this spinning causes process fork
failures elsewhere in the system.  I haven't managed to capture one of the
subsidiary failures yet.  It's an access violation, which makes me wonder if
it too is stumbling over a partially-deallocated pinfo that is being kept
around in memory because this looping process is hanging onto a handle to the
corresponding defunct win32 process.  But then I don't know why only some, but
not all, fork attempts would fail.

  I'm leaving this post somewhat inconclusive because I've been working on
this for a few hours now and it's a sidetrack from what I'm supposed to be
getting on with at the moment, so I'm going to switch tracks for a while.
Just wanted to let you all know what I found out; maybe the solution will seem
obvious to someone else who hasn't been up debugging code all night...

    cheers,
      DaveK