SIGKILL may no longer work after many SIGCONT/SIGSTOP signals

Takashi Yano takashi.yano@nifty.ne.jp
Mon Nov 25 12:40:21 GMT 2024


On Mon, 25 Nov 2024 21:23:45 +0900
Takashi Yano wrote:
> On Sun, 24 Nov 2024 01:15:09 +0900
> Takashi Yano wrote:
> > On Sat, 23 Nov 2024 16:53:21 +0100
> > Christian Franke wrote:
> > > Takashi Yano via Cygwin wrote:
> > > > On Wed, 20 Nov 2024 22:43:08 +0900
> > > > Takashi Yano wrote:
> > > >> On Tue, 19 Nov 2024 18:21:52 +0900
> > > >> Takashi Yano wrote:
> > > >>> On Tue, 12 Nov 2024 10:53:58 +0100
> > > >>> Christian Franke wrote:
> > > >>>> Found with 'stress-ng --cpu-sched' from current stress-ng upstream HEAD:
> > > >>>>
> > > >>>> Testcase (attached):
> > > >>>>
> > > >>>> $ gcc -O2 -o manysignals manysignals.c
> > > >>>>
> > > >>>> $ ./manysignals
> > > >>>> fork() = 1833
> > > >>>> ...
> > > >>>> fork() = 1848
> > > >>>> ...
> > > >>>> kill(1833, 17)
> > > >>>> ...
> > > >>>> kill(1848, 17)
> > > >>>> kill(1833, 9)
> > > >>>> ...
> > > >>>> kill(1848, 9)
> > > >>>> waitpid(1833, ., 0)
> > > >>>>
> > > >>>>
> > > >>>> Run this in second terminal:
> > > >>>>
> > > >>>> $ watch "ps | sed -n '1p;/manysignals/{/sed/d;p}'"
> > > >>>>
> > > >>>> If 'S' appears in the first column, the child processes have likely
> > > >>>> reached the final SIGSTOP state. This takes some time. The parent
> > > >>>> process may still hang in the first waitpid() but should not.
> > > >>>>
> > > >>>> If the parent process is aborted with ^C, child processes may be stopped
> > > >>>> or left behind. Occasionally a child process that can not be stopped by
> > > >>>> Cygwin (kill -9) is left behind.
> > > >>>>
> > > >>>> Tested with ancient (i7-2600K) and more recent (i7-14700K) CPU :-)
> > > >>>>
> > > >>>>
> > > >>>> Unrelated to the above, but related to 'stress-ng --cpu-sched' which
> > > >>>> uses sched_get/setscheduler():
> > > >>>>
> > > >>>> - sched_getscheduler() always returns SCHED_FIFO. As far as I understand
> > > >>>> Linux sched(7), this is a non-preemptive real-time policy. The
> > > >>>> preemptive SCHED_RR would possibly be a more reasonable value.
> > > >>>> Unfortunately SCHED_OTHER cannot be used because it would require
> > > >>>> ignoring the priority.
> > > >>>>
> > > >>>> - sched_setscheduler() always fails with ENOSYS. IMO it should allow
> > > >>>> setting 'param->sched_priority' if 'policy' is equal to the value
> > > >>>> returned by sched_getscheduler().
> > > >>> Thanks for the report and the test case. I'm now looking into
> > > >>> the issue. Please wait a while.
> > > >> Hopefully, I have found the cause.
> > > >>
> > > >> The deadlock happens between the main thread and the wait_sig thread.
> > > >> The main thread is waiting for the wait_sig thread to trigger the
> > > >> wakeup event, while the wait_sig thread is waiting for the previous
> > > >> signal to be processed by the main thread.
> > > >>
> > > >> Let me consider how to fix that.
> > > > I'd like to report my progress for this issue.
> > > >
> > > > The patch attached almost solves the problem. ...
> > > 
> > > Compile error if applied to current git main (3dbc8c3):
> > > 
> > >   ../../../../winsup/cygwin/exceptions.cc:1487:21: error: ‘struct _cygtls’ has no member named ‘sig’
> > >    1487 |   while (_main_tls->sig)
> > >         |                     ^~~
> > 
> > This is because Corinna's latest commit renamed 'sig' to
> > 'current_sig'.
> > 
> > commit	3dbc8c3fbdc99d3f0f68fab8ba2a814ecdc27e17
> > Cygwin: cygtls: rename sig to current_sig
> > 
> > > >   However, your test
> > > > case is paused for tens of seconds, then ends normally.
> > > 
> > > I guess this is as expected. Processing the
> > > SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence of each child process takes
> > > some time because all are locked to a single core.
> > 
> > I feel it's too slow even if 16 processes (with wait_sig threads) are
> > executed in one CPU core.
> > 
> > > > If the code:
> > > >        cpu_set_t cpus; CPU_ZERO(&cpus);
> > > >        CPU_SET(0, &cpus);
> > > >        if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
> > > >          perror("setaffinity");
> > > >
> > > >        for (;;)
> > > >          sched_yield();
> > > > is changed to just:
> > > >        for (;;) sleep(1);
> > > > the test case runs without pause.
> > > 
> > > The pause will possibly reappear if the number of child processes is 
> > > increased to some multiple of the available cores.
> > 
> > I tested with np = 16*32 without sched_setaffinity() call, the pause
> > does not happen. My CPU is Threadripper 1950X 16-core 32-thread.
> > 
> > > > I think there still is a bug in the signal handling.
> 
> I have just submitted 6 patches for this issue. With these patches,
> the reported problem no longer occurs in my environment.

As the patches show, this test case triggers several issues in
Cygwin that interact with each other. After a long struggle, I
think I have finally resolved them. The patches turned out to be
simple, considering how long it took.
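
For readers without the attachment, the following is only a rough sketch
of what a reproducer along the lines of the quoted test case might look
like. It is not the original manysignals.c. The child loop is taken from
the fragment quoted above; the helper names child_spin() and
signal_storm(), the number of rounds, and the exact order of the kill()
calls are assumptions made for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes sched_setaffinity() and CPU_SET() */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: pin to CPU 0 (as in the quoted fragment) and spin forever. */
static void child_spin(void)
{
#ifdef CPU_SET                 /* skipped if the affinity API is unavailable */
  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(0, &cpus);
  if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
    perror("setaffinity");
#endif
  for (;;)
    sched_yield();
}

/* Hypothetical helper (not from the original test case): fork nproc
   children, send each of them 'rounds' SIGCONT/SIGSTOP pairs, then
   SIGKILL, and return how many children were reaped as killed by
   SIGKILL. Returns -1 if memory allocation fails. */
static int signal_storm(int nproc, int rounds)
{
  assert(nproc > 0 && rounds >= 0);
  pid_t *pids = calloc((size_t)nproc, sizeof(*pids));
  if (!pids)
    return -1;

  /* Start the children; stop early if fork() fails. */
  for (int i = 0; i < nproc; i++) {
    pids[i] = fork();
    if (pids[i] == 0)
      child_spin();            /* never returns */
    if (pids[i] < 0) {
      perror("fork");
      nproc = i;
      break;
    }
  }

  /* Many SIGCONT/SIGSTOP pairs; each child ends in the stopped state. */
  for (int r = 0; r < rounds; r++)
    for (int i = 0; i < nproc; i++) {
      kill(pids[i], SIGCONT);
      kill(pids[i], SIGSTOP);
    }

  /* SIGKILL is expected to terminate even a stopped process. */
  for (int i = 0; i < nproc; i++)
    kill(pids[i], SIGKILL);

  /* Reap the children and count those terminated by SIGKILL. */
  int killed = 0;
  for (int i = 0; i < nproc; i++) {
    int status = 0;
    if (waitpid(pids[i], &status, 0) == pids[i]
        && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
      killed++;
  }
  free(pids);
  return killed;
}
```

Calling signal_storm(16, 1000) from main() would roughly correspond to
the 16-process case discussed above (the round count is a guess). Where
SIGKILL is handled correctly the call should return 16; with the
reported problem it would be expected to hang in waitpid() instead.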

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

