Bug: Incorrect signal behavior in multi-threaded processes

Dan Bonachea dobonachea@lbl.gov
Sun Jan 20 20:33:00 GMT 2019


I'm writing to report some POSIX compliance problems with Cygwin
signal handling in the presence of multiple pthreads that our group
has encountered in our parallel scientific computing codes.

A minimal test program is copied below and also available here:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

I believe the test program is fully compliant with ISO C 99 and POSIX
1003.1-2016. In a nutshell, it registers one signal handler, spawns a
number of pthreads, and then synchronously generates a signal from
exactly one thread while others sit in a pthread_barrier_wait. The
"throwing" thread and signal number can be varied from the command
line, and diagnostic output indicates what happened.

As a basis for comparison, here are a few examples of the test program
running on x86_64/Linux-3.10.0(Scientific Linux 7.4)/gcc-4.8.5
demonstrating what I believe to be the *correct*/POSIX-required
behavior:

$ ./thread-signal 1 11    # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x7f8dd0b13700): Hello
thread 4 (0x7f8dcf310700): Hello
thread 2 (0x7f8dd0312700): Hello
thread 3 (0x7f8dcfb11700): Hello
thread 0 (0x7f8dd131a740): Hello
thread 1 (0x7f8dd0b13700): sending signal 11..
sig_handler: ENTERING
sig_handler: running on thread 0x7f8dd0b13700
sig_handler: calling _exit()

$ ./thread-signal 1 6    # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x7f1a2451d700): Hello
thread 2 (0x7f1a23d1c700): Hello
thread 0 (0x7f1a24d24740): Hello
thread 3 (0x7f1a2351b700): Hello
thread 4 (0x7f1a22d1a700): Hello
thread 1 (0x7f1a2451d700): sending signal 6..
sig_handler: ENTERING
sig_handler: running on thread 0x7f1a2451d700
sig_handler: calling _exit()

$ ./thread-signal 1 2        # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x7f2a29a3f700): Hello
thread 2 (0x7f2a2923e700): Hello
thread 0 (0x7f2a2a246740): Hello
thread 3 (0x7f2a28a3d700): Hello
thread 4 (0x7f2a2823c700): Hello
thread 1 (0x7f2a29a3f700): sending signal 2..
sig_handler: ENTERING
sig_handler: running on thread 0x7f2a29a3f700
sig_handler: calling _exit()

This output indicates that in all cases on Linux, the unique thread
generating the signal jumps to the pre-registered signal handler while
other threads remain stalled at the barrier, as required by POSIX
signalling semantics (e.g. see raise() on p.1765 of POSIX
1003.1-2016). The test program and commands above demonstrate
substantially the same, correct behavior on ALL of the following
platform combinations:

* Linux-3.10/{i686,x86_64}/{gcc-4.8.5,gcc-8.2.0,clang-7.0.0}
* Solaris-11.3/x86_64/{gcc-7.2.0,SunStudio-12.5}
* FreeBSD-12.0/x86_64/clang-6.0.1
* MicrosoftWSL-Ubuntu18.04/x86_64/{gcc-7.3.0,clang-6.0.0}
    - This notably runs on Microsoft Windows! (10.0.17763.288)

Unfortunately the observed behavior on Cygwin (various versions)
deviates far from our expectations and (based on my understanding)
from the behavior required by current POSIX specs. Here is example
output from Cygwin 2.11.1(0.329/5/3) 2018-09-05 on Windows 10, build
17763.288 with gcc 7.3.0:

$ ./thread-signal 1 11    # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 11..
<process terminated, without calling handler from ANY thread>

$ ./thread-signal 1 6    # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 4 (0x600048a70): Hello
thread 0 (0x600000010): Hello
thread 1 (0x600048770): sending signal 6..
sig_handler: ENTERING
Abort
<non-deterministic race leads to varying behavior here>

$ ./thread-signal 1 2    # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 2..
sig_handler: ENTERING
sig_handler: ERROR - signal delivered to wrong thread!
thread 1 (0x600048770): ERROR: STILL ALIVE!
sig_handler: running on thread 0x600000010
sig_handler: calling _exit()

The second case in particular (abort() called by one non-primordial
thread) appears to have non-deterministic/racing behavior. The
evidence seems to indicate the SIGABRT is delivered to the primordial
thread (the wrong thread) via the signal handler and concurrently also
delivered to the SIG_DFL handler of other threads, which then race to
invoke abortive process termination (which should not be reachable in
any correct execution of the program). It's worth noting POSIX
1003.1-2016 sec XRAT.B.2.4.1 (p.3577) specifically requires that any
given signal should be delivered to exactly one thread. Also the spec
for abort (p.565) requires the signal to be delivered as if by
`raise(SIGABRT)` (p.1765) aka. `pthread_kill(pthread_self(),SIGABRT)`
(p.1657), which implies any registered SIGABRT handler should run only
on the thread which called abort().

The choice of SIGINT in the third example is arbitrary, and
representative of similar deliver-to-wrong-thread behavior also
observed on Cygwin for all of the following signals:
  HUP, INT, QUIT, ILL, EMT, TRAP, FPE, BUS, SYS, PIPE, ALRM, TERM, URG,
  TSTP, CONT, CHLD, TTIN, TTOU, IO, USR1, USR2, and RTMIN..RTMAX
All of which consequently appear to be unreliable for thread-specific
signalling in Cygwin programs.

Note that in all cases examined, generating the signal from the
"primordial" thread 0 (by changing the 1 to a 0 in the commands above)
yields nominally correct behavior; in that case, the signal handler is
correctly invoked by the primordial thread and the others remain
undisturbed. However, it appears the primordial thread is the ONLY
thread that enjoys the special status of POSIX-compliant signal
behavior on Cygwin. Substantially similar broken behavior has been
observed for NON-primordial threads on ALL of the following Cygwin
version combinations (spread across three different workstations):

* Cygwin64-2.11.1(0.329/5/3)-{win7,win10}-{gcc-7.3.0,clang-5.0.1}
* Cygwin64-2.10.0(0.325/5/3)-{win7,win10}-{gcc-6.4.0,clang-5.0.1}
* Cygwin64-2.6.0(0.304/5/3)-win7-{gcc-5.4.0,clang-3.8.1}

Possibly of note, a 32-bit version of Cygwin (i686 2.11.1(0.329/5/3))
correctly handles SIGSEGV, but fails all the other cases in
substantially the same manner as Cygwin64.

In case you're wondering why we care: The SIGABRT and SIGSEGV
misbehaviors are particularly problematic for our distributed-memory
codes that register fatal signal handlers to ensure correct tear-down
of a multi-process job if/when any process crashes or aborts (e.g. due
to an assertion failure). Cygwin unfortunately makes it effectively
impossible to reliably handle abort()'s or SIGSEGV's generated by
programming errors in a multi-threaded program, unless one can arrange
to only generate the signal from the primordial thread (impractical
for our applications).

Searching around the Cygwin lists I find some evidence that
tangentially similar problems with signals and multithreading have
been discussed before, but perhaps not adequately
isolated/demonstrated.

Is there any hope of this situation ever improving?

Thanks for your consideration.
-Dan Bonachea

Test program code below, also available for download at:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

=====================================================================
// Thread/signal tester by Dan Bonachea
// compile with a command like:
//   gcc -D_GNU_SOURCE -std=c99 -pedantic -pthread thread-signal.c -o thread-signal
// usage:
//   thread-signal <thread_idx[0..4]> <signal_number>
//
// page numbers in comments below refer to POSIX IEEE Std 1003.1-2016
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <signal.h>

// Utilities
typedef void (*sig_handler_t)(int); // signal handler function pointer

unsigned long long thidtollu(pthread_t thid) { // map pthread_t to a unique value
  // non-portable but sufficient on all systems of interest
  return (unsigned long long)(uintptr_t)thid;
}

pthread_barrier_t barrier_object;
void barrier(void) {
  int res = pthread_barrier_wait(&barrier_object); // p.1595
  assert(res == 0 || res == PTHREAD_BARRIER_SERIAL_THREAD);
}

#define FD_STDOUT 1
#define FD_STDERR 2
void writeout(const char *msg) { // signal-safe string output and flush
  size_t sz = strlen(msg); // exclude the terminating NUL from the output
  ssize_t res = write(FD_STDOUT, msg, sz);
  if (res < 0 || (size_t)res != sz) {
    const char err[] = "write failed!\n";
    write(FD_STDERR, err, sizeof(err)-1);
    _exit(-1);
  }
  (void)fsync(FD_STDOUT);
}

#ifndef NUMTHREAD
#define NUMTHREAD 5
#endif

// state variables
int sigid = SIGSEGV;
int sender = 1;
volatile sig_atomic_t sender_aid = 0;
volatile sig_atomic_t errs = 0;

// registered signal handler function
void sig_handler(int signum) { // p.494 defines permitted calls
  pthread_t thid = pthread_self();
  writeout("sig_handler: ENTERING\n");
  sig_atomic_t my_aid = (sig_atomic_t)thidtollu(thid);
  if (my_aid != sender_aid) {
    errs++;
    writeout("sig_handler: ERROR - signal delivered to wrong thread!\n");
  }
  #if !STRICT // sprintf technically forbidden, but doesn't affect behavior in practice
  { char tmp[200];
    sprintf(tmp,"sig_handler: running on thread 0x%llx\n",thidtollu(thid));
    writeout(tmp);
  }
  #endif
  writeout("sig_handler: calling _exit()\n");
  _exit(errs);
}
struct thinfo {
  pthread_t thid;
  int idx;
} thread_info[NUMTHREAD];

// thread entry point
void * thread_main(void *arg) {
  struct thinfo *myinfo = arg;
  pthread_t thid = pthread_self();
  assert(pthread_equal(thid, myinfo->thid));

  printf("thread %i (0x%llx): Hello\n",myinfo->idx, thidtollu(thid));
  fflush(NULL);

  if (myinfo->idx == sender) { // this thread will send the signal
    sender_aid = (sig_atomic_t)thidtollu(thid); // record for signal handler
  }

  barrier(); // wait for all threads

  if (myinfo->idx == sender) { // this thread sends the signal
    printf("thread %i (0x%llx): sending signal %i..\n",
           myinfo->idx, thidtollu(thid), sigid);
    fflush(NULL);

    switch (sigid) {
      case SIGABRT:
        abort(); // p.565
      break;
      case SIGSEGV: {
        int *nullpt = NULL;
        *nullpt = 0; // SEGV
      }
      break;
      default: {
        int res = raise(sigid); // p.1765
        if (res) {
          errs++;
          printf("thread %i (0x%llx): ERROR: raise failed: %i %s\n",
                  myinfo->idx, thidtollu(thid), res, strerror(res));
          fflush(NULL);
        }
      }
    }
    errs++;
    printf("thread %i (0x%llx): ERROR: STILL ALIVE!\n",
           myinfo->idx, thidtollu(thid));
    fflush(NULL);
  }

  barrier(); // wait for all threads
  return NULL;
}

// process entry point
int main(int argc, char **argv) {

  if (argc > 1) sender = atoi(argv[1]);
  if (argc > 2) sigid = atoi(argv[2]);

  printf("Running test with %i threads and thread %i sending signal=%i\n",
         NUMTHREAD,sender,sigid); fflush(NULL);

  int ret = pthread_barrier_init(&barrier_object, NULL, NUMTHREAD); // p.1593
  assert(!ret);

  // establish a signal handler
  sig_handler_t init = signal(sigid, sig_handler); // p.1971
  assert(init == SIG_DFL || init == SIG_IGN);

  // ensure it is registered
  sig_handler_t res = signal(sigid, sig_handler);
  assert(res == sig_handler);

  printf("Spawning pthreads..\n"); fflush(NULL);

  for (int i=1; i < NUMTHREAD; i++) { // create threads
    thread_info[i].idx = i;
    int res = pthread_create(&(thread_info[i].thid), NULL,
                             thread_main, &(thread_info[i])); // p.1633
    assert(!res);
  }

  // primordial thread is "thread 0"
  thread_info[0].idx = 0;
  thread_info[0].thid = pthread_self();
  thread_main(&(thread_info[0]));

  // should never reach this point for a catchable signal
  for (int i=1; i < NUMTHREAD; i++) { // join threads
    int res = pthread_join(thread_info[i].thid, NULL); // p.1649
    assert(!res);
  }

  printf("all threads exited!\n");
  errs++;

  return errs;
}
