SIGKILL may no longer work after many SIGCONT/SIGSTOP signals
Takashi Yano
takashi.yano@nifty.ne.jp
Sat Nov 23 16:15:09 GMT 2024
On Sat, 23 Nov 2024 16:53:21 +0100
Christian Franke wrote:
> Takashi Yano via Cygwin wrote:
> > On Wed, 20 Nov 2024 22:43:08 +0900
> > Takashi Yano wrote:
> >> On Tue, 19 Nov 2024 18:21:52 +0900
> >> Takashi Yano wrote:
> >>> On Tue, 12 Nov 2024 10:53:58 +0100
> >>> Christian Franke wrote:
> >>>> Found with 'stress-ng --cpu-sched' from current stress-ng upstream HEAD:
> >>>>
> >>>> Testcase (attached):
> >>>>
> >>>> $ gcc -O2 -o manysignals manysignals.c
> >>>>
> >>>> $ ./manysignals
> >>>> fork() = 1833
> >>>> ...
> >>>> fork() = 1848
> >>>> ...
> >>>> kill(1833, 17)
> >>>> ...
> >>>> kill(1848, 17)
> >>>> kill(1833, 9)
> >>>> ...
> >>>> kill(1848, 9)
> >>>> waitpid(1833, ., 0)
> >>>>
> >>>>
> >>>> Run this in second terminal:
> >>>>
> >>>> $ watch "ps | sed -n '1p;/manysignals/{/sed/d;p}'"
> >>>>
> >>>> If 'S' appears in the first column, the child processes likely reached
> >>>> the final SIGSTOP state. This takes some time. The parent process may
> >>>> still hang in first waitpid() but should not.
> >>>>
> >>>> If the parent process is aborted with ^C, child processes may be stopped
> >>>> or left behind. Occasionally a child process that can not be stopped by
> >>>> Cygwin (kill -9) is left behind.
> >>>>
> >>>> Tested with ancient (i7-2600K) and more recent (i7-14700K) CPU :-)
> >>>>
> >>>>
> >>>> Unrelated to the above, but related to 'stress-ng --cpu-sched' which
> >>>> uses sched_get/setscheduler():
> >>>>
> >>>> - sched_getscheduler() always returns SCHED_FIFO. As far as I understand
> >>>> Linux sched(7), this is a non-preemptive real-time policy. The
> >>>> preemptive SCHED_RR would possibly be a more reasonable value.
> >>>> Unfortunately SCHED_OTHER cannot be used because it would require to
> >>>> ignore the priority.
> >>>>
> >>>> - sched_setscheduler() always fails with ENOSYS. It IMO should allow to
> >>>> set 'param->sched_priority' if 'policy' is equal to the value returned
> >>>> by sched_getscheduler().
> >>> Thanks for the report and the test case. I'm now looking into
> >>> the issue. Please wait a while.
> >> Hopefully, I have found the cause.
> >>
> >> The deadlock happens between main thread and wait_sig thread.
> >> The main thread is waiting for the wait_sig thread triggering
> >> wakeup event while the wait_sig thread is waiting previous
> >> signal being processed by main thread.
> >>
> >> Let me consider how to fix that.
> > I'd like to report my progress for this issue.
> >
> > The patch attached almost solves the problem. ...
>
> Compile error if applied to current git main (3dbc8c3):
>
> ../../../../winsup/cygwin/exceptions.cc:1487:21: error: 'struct
> _cygtls' has no member named 'sig'
>  1487 |   while (_main_tls->sig)
>       |                     ^~~
This is because Corinna's latest commit renames 'sig' to
'current_sig':
commit 3dbc8c3fbdc99d3f0f68fab8ba2a814ecdc27e17
Cygwin: cygtls: rename sig to current_sig
> > However, your test
> > case is paused for tens of seconds, then ends normally.
>
> I guess this is as expected. The processing of the
> SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence of each child process takes
> some time because all are locked to a single core.
I feel it's too slow, even with 16 processes (each with a wait_sig
thread) running on one CPU core.
> > If the code:
> >   cpu_set_t cpus;
> >   CPU_ZERO(&cpus);
> >   CPU_SET(0, &cpus);
> >   if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
> >     perror("setaffinity");
> >
> >   for (;;)
> >     sched_yield();
> > is changed to just:
> >   for (;;)
> >     sleep(1);
> > the test case runs without pause.
>
> The pause will possibly reappear if the number of child processes is
> increased to some multiple of the available cores.
I tested with np = 16*32 and without the sched_setaffinity() call; the
pause does not happen. My CPU is a Threadripper 1950X (16 cores, 32 threads).
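
For readers without the attachment, the kind of test discussed here can
be sketched roughly as below. This is NOT the attached manysignals.c; it
is a minimal reconstruction guessed from the output quoted at the top of
the thread. The names signal_storm() and child_loop() and the parameters
np, rounds and busy_yield are hypothetical, and the pinning to CPU 0 via
sched_setaffinity() is left out.

```c
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body (hypothetical): either spin on sched_yield() as in the
   original test, or idle in sleep(1) as in the modified variant.
   Never returns. */
static void child_loop(int busy_yield)
{
  for (;;) {
    if (busy_yield)
      sched_yield();
    else
      sleep(1);
  }
}

/* Hypothetical helper: fork np children, send each of them `rounds`
   SIGCONT/SIGSTOP pairs, then SIGKILL, then reap them one by one.
   Returns the number of children whose wait status reports
   termination by SIGKILL, or -1 if memory allocation fails. */
static int signal_storm(int np, int rounds, int busy_yield)
{
  assert(np > 0 && rounds >= 0);

  pid_t *pids = calloc((size_t)np, sizeof *pids);
  if (!pids)
    return -1;

  int forked = 0;
  for (int i = 0; i < np; i++) {
    pid_t pid = fork();
    if (pid < 0) {
      perror("fork");
      break;
    }
    if (pid == 0)
      child_loop(busy_yield);   /* child: never returns */
    pids[forked++] = pid;
  }

  /* Many SIGCONT/SIGSTOP pairs; every child ends up stopped. */
  for (int r = 0; r < rounds; r++)
    for (int i = 0; i < forked; i++) {
      kill(pids[i], SIGCONT);
      kill(pids[i], SIGSTOP);
    }

  /* SIGKILL must terminate a child even while it is stopped. */
  for (int i = 0; i < forked; i++)
    kill(pids[i], SIGKILL);

  /* A blocking waitpid() here is where the parent was seen to hang. */
  int killed = 0;
  for (int i = 0; i < forked; i++) {
    int status;
    if (waitpid(pids[i], &status, 0) == pids[i]
        && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
      killed++;
  }

  free(pids);
  return killed;
}
```

A call such as signal_storm(16, 1000, 1) from main() should return 16
if every child is killed and reaped; the round count of 1000 is an
arbitrary choice, not a value taken from the test case.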
> > I think there still is a bug in the signal handling.
>
> Possibly related:
> https://sourceware.org/pipermail/cygwin/2024-November/256808.html
I also looked into that report a bit, but I think it is a separate issue.
--
Takashi Yano <takashi.yano@nifty.ne.jp>