SIGKILL may no longer work after many SIGCONT/SIGSTOP signals

Christian Franke Christian.Franke@t-online.de
Sat Nov 23 15:53:21 GMT 2024


Takashi Yano via Cygwin wrote:
> On Wed, 20 Nov 2024 22:43:08 +0900
> Takashi Yano wrote:
>> On Tue, 19 Nov 2024 18:21:52 +0900
>> Takashi Yano wrote:
>>> On Tue, 12 Nov 2024 10:53:58 +0100
>>> Christian Franke wrote:
>>>> Found with 'stress-ng --cpu-sched' from current stress-ng upstream HEAD:
>>>>
>>>> Testcase (attached):
>>>>
>>>> $ gcc -O2 -o manysignals manysignals.c
>>>>
>>>> $ ./manysignals
>>>> fork() = 1833
>>>> ...
>>>> fork() = 1848
>>>> ...
>>>> kill(1833, 17)
>>>> ...
>>>> kill(1848, 17)
>>>> kill(1833, 9)
>>>> ...
>>>> kill(1848, 9)
>>>> waitpid(1833, ., 0)
>>>>
>>>>
>>>> Run this in second terminal:
>>>>
>>>> $ watch "ps | sed -n '1p;/manysignals/{/sed/d;p}'"
>>>>
>>>> If 'S' appears in the first column, the child processes have likely
>>>> reached the final SIGSTOP state. This takes some time. The parent process
>>>> may still hang in the first waitpid() but should not.
>>>>
>>>> If the parent process is aborted with ^C, child processes may be stopped
>>>> or left behind. Occasionally a child process that can not be stopped by
>>>> Cygwin (kill -9) is left behind.
>>>>
>>>> Tested with ancient (i7-2600K) and more recent (i7-14700K) CPU :-)
>>>>
>>>>
>>>> Unrelated to the above, but related to 'stress-ng --cpu-sched' which
>>>> uses sched_get/setscheduler():
>>>>
>>>> - sched_getscheduler() always returns SCHED_FIFO. As far as I understand
>>>> Linux sched(7), this is a non-preemptive real-time policy. The
>>>> preemptive SCHED_RR would possibly be a more reasonable value.
>>>> Unfortunately SCHED_OTHER cannot be used because it would require
>>>> ignoring the priority.
>>>>
>>>> - sched_setscheduler() always fails with ENOSYS. IMO it should allow
>>>> setting 'param->sched_priority' if 'policy' is equal to the value returned
>>>> by sched_getscheduler().
>>> Thanks for the report and the test case. I'm now looking into
>>> the issue. Please wait a while.
>> Hopefully, I have found the cause.
>>
>> The deadlock happens between the main thread and the wait_sig thread.
>> The main thread is waiting for the wait_sig thread to trigger the
>> wakeup event, while the wait_sig thread is waiting for the previous
>> signal to be processed by the main thread.
>>
>> Let me consider how to fix that.
> I'd like to report my progress for this issue.
>
> The patch attached almost solves the problem. ...

The patch fails to compile when applied to current git main (3dbc8c3):

  ../../../../winsup/cygwin/exceptions.cc:1487:21: error: ‘struct _cygtls’ has no member named ‘sig’
   1487 |   while (_main_tls->sig)
        |                     ^~~


>   However, your test
> case is paused for tens of seconds, then ends normally.

I guess this is as expected. Processing the
SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence for each child process takes
some time because all children are locked to a single core.


> If the code:
>        cpu_set_t cpus; CPU_ZERO(&cpus);
>        CPU_SET(0, &cpus);
>        if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
>          perror("setaffinity");
>
>        for (;;)
>          sched_yield();
> is changed to just:
>        for (;;) sleep(1);
> the test case runs without pause.

The pause will possibly reappear if the number of child processes is 
increased to some multiple of the available cores.
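
For readers without the attachment, the following is a minimal sketch of
what such a test might look like. It is not the attached manysignals.c:
the function names (child_loop, signal_all, run_manysignals) and the
parameters nchildren and rounds are hypothetical, and the real test case
may order or count the signals differently. Only the child body is taken
from the snippet quoted above.

```c
#define _GNU_SOURCE            /* for CPU_SET() and sched_setaffinity() */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body as quoted above: pin to CPU 0, then yield forever. */
void child_loop(void)
{
  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(0, &cpus);
  if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
    perror("setaffinity");

  for (;;)
    sched_yield();
}

/* Send one signal to every child in pids[0..n-1]. */
void signal_all(const pid_t *pids, int n, int sig)
{
  for (int i = 0; i < n; i++)
    if (kill(pids[i], sig))
      perror("kill");
}

/*
 * Hypothetical driver (assumption, not the original test case).
 * Forks nchildren processes, sends `rounds` SIGSTOP/SIGCONT pairs to all
 * of them, leaves them in a final SIGSTOP state, sends SIGKILL and reaps
 * them. Returns the number of children reported as terminated by SIGKILL,
 * or -1 if setup fails.
 */
int run_manysignals(int nchildren, int rounds)
{
  assert(nchildren > 0 && rounds >= 0);
  pid_t *pids = calloc((size_t)nchildren, sizeof(*pids));
  if (!pids)
    return -1;

  int forked = 0;
  for (; forked < nchildren; forked++) {
    pid_t pid = fork();
    if (pid < 0) {
      perror("fork");
      break;
    }
    if (pid == 0)
      child_loop();            /* never returns */
    pids[forked] = pid;
  }

  /* Skip the stress rounds if not all children could be created. */
  for (int r = 0; forked == nchildren && r < rounds; r++) {
    signal_all(pids, forked, SIGSTOP);
    signal_all(pids, forked, SIGCONT);
  }
  signal_all(pids, forked, SIGSTOP);   /* final stopped state */
  signal_all(pids, forked, SIGKILL);   /* must work on stopped children */

  int killed = 0;
  for (int i = 0; i < forked; i++) {
    int status = 0;
    if (waitpid(pids[i], &status, 0) == pids[i]
        && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
      killed++;
  }
  free(pids);
  return forked == nchildren ? killed : -1;
}
```

Calling run_manysignals() from main() with a child count above the
number of available cores and a large rounds value should exercise the
same path. If signal handling works, every waitpid() returns and the
result equals nchildren; a hang in the first waitpid() would match the
behavior reported in this thread.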


> I think there still is a bug in the signal handling.

Possibly related:
https://sourceware.org/pipermail/cygwin/2024-November/256808.html

-- 
Regards,
Christian


