

Race and segmentation fault in pthread_kill() vs thread teardown


Hi David,

I just pinpointed a glibc bug that appears to have been reported and flagged as invalid in the past, and I really think we have an ugly race in glibc here. I'd like some advice on how to proceed with filing this bug, since it appears to have been opened twice already (back in 2007) and, IMO, inappropriately closed as "invalid". See:

https://sourceware.org/bugzilla/show_bug.cgi?id=4506
https://sourceware.org/bugzilla/show_bug.cgi?id=4509
http://udrepper.livejournal.com/16844.html

The race:

I use "pthread_kill(tid, 0)" to detect if a thread still exists. By reading pthread_kill(3), this appears to be encouraged:

       If sig is 0, then no signal is sent, but error checking is still
       performed; this can be used to check for the existence of a thread ID.

Nowhere is it specified that calling pthread_kill(tid, 0) on the thread ID of a recently terminated thread can trigger a segmentation fault inside glibc.
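
For reference, the check I perform looks roughly like this (thread_exists is just an illustrative wrapper name of mine, not the actual urcu-bp code):

#include <errno.h>
#include <pthread.h>
#include <signal.h>

static int
thread_exists (pthread_t thread_id)
{
  int ret = pthread_kill (thread_id, 0);

  if (ret == 0)
    return 1;           /* Thread still exists.  */
  if (ret == ESRCH)
    return 0;           /* No such thread.  */
  return -1;            /* Unexpected error (e.g. EINVAL).  */
}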

Looking at glibc 2.18:

nptl/sysdeps/unix/sysv/linux/pthread_kill.c

int
__pthread_kill (threadid, signo)
     pthread_t threadid;
     int signo;
{
  struct pthread *pd = (struct pthread *) threadid;

  /* Make sure the descriptor is valid.  */
  if (DEBUGGING_P && INVALID_TD_P (pd))

----> Here, we check whether "pd" is invalid, but only when DEBUGGING_P is set.

    /* Not a valid thread handle.  */
    return ESRCH;

  /* Force load of pd->tid into local variable or register.  Otherwise
     if a thread exits between ESRCH test and tgkill, we might return
     EINVAL, because pd->tid would be cleared by the kernel.  */
  pid_t tid = atomic_forced_read (pd->tid);

-----> Here, we dereference "pd". Note that the comment only anticipates that, if a thread exits concurrently, we might return EINVAL; it does not suggest that a segmentation fault is possible.

One thing worth noting: I am perfectly comfortable with the fact that a thread ID can be re-used by another thread. In that case, pthread_kill just reports a true fact: the thread ID does exist, even though it is now used by another thread. What I am far less comfortable with is that if, in parallel with this code, anything gives the memory reserved for "pd" back to the OS (join or destroy), we end up with a segmentation fault. I have a nice test case that reproduces just that.
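
To make the window concrete, here is a minimal self-contained sketch of the kind of interleaving that can hit it (all names below are mine; since NPTL often caches and re-uses freed descriptors, the fault is not guaranteed on every run, which is why the actual reproducer is the urcu-bp test below):

#include <errno.h>
#include <pthread.h>
#include <signal.h>

static pthread_t target;

static void *
target_fn (void *arg)
{
  return NULL;  /* Exits almost immediately.  */
}

static void *
checker_fn (void *arg)
{
  /* Probe the saved thread id in parallel with the join in main().
     Once the target has exited and been joined, the struct pthread
     that 'target' points to may be unmapped, and pthread_kill() can
     then fault inside glibc instead of returning ESRCH.  */
  for (;;)
    if (pthread_kill (target, 0) == ESRCH)
      break;
  return NULL;
}

int
main (void)
{
  pthread_t checker;

  pthread_create (&target, NULL, target_fn, NULL);
  pthread_create (&checker, NULL, checker_fn, NULL);

  pthread_join (target, NULL);   /* May release the target's descriptor.  */
  pthread_join (checker, NULL);
  return 0;
}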

Within git://git.urcu.so/urcu.git branch: master
at commit: 95b94246b745a49ec7e5a33661ef638b0dd2950b   (just before my fix)

Build (follow README)
cd tests/benchmark
./test_urcu_bp 40 4 10

Here is the backtrace:

     #0  __pthread_kill (threadid=140723681437440, signo=0) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:42
     42 ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c: No such file or directory.
     (gdb) bt full
     #0  __pthread_kill (threadid=140723681437440, signo=0) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:42
             __x = <optimized out>
             pd = 0x7ffcc90b2700
             tid = <optimized out>
             val = <optimized out>
     #1  0x0000000000403009 in rcu_gc_registry () at ../../urcu-bp.c:437
             tid = 140723681437440
             ret = 0
             chunk = 0x7ffcca0b8000
             rcu_reader_reg = 0x7ffcca0b8120
             __PRETTY_FUNCTION__ = "rcu_gc_registry"
     #2  0x0000000000402b9c in synchronize_rcu_bp () at ../../urcu-bp.c:230
             cur_snap_readers = {next = 0x7ffcb4888cc0, prev = 0x7ffcb4888cc0}
             qsreaders = {next = 0x7ffcb4888cd0, prev = 0x7ffcb4888cd0}
             newmask = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}}
             oldmask = {__val = {0, 140723337334144, 0, 0, 0, 140723690351643, 0, 140723127058464, 4, 0, 140723698253920, 140723693868864, 4096, 140723690370432,
                 140723698253920, 140723059951840}}
             ret = 0
             __PRETTY_FUNCTION__ = "synchronize_rcu_bp"
     #3  0x0000000000401803 in thr_writer (_count=0x76b2f0) at test_urcu_bp.c:223
             count = 0x76b2f0
             new = 0x7ffca80008c0
             old = 0x7ffca40008c0
     #4  0x00007ffcc9c83f8e in start_thread (arg=0x7ffcb4889700) at pthread_create.c:311
             __res = <optimized out>
             pd = 0x7ffcb4889700
             now = <optimized out>
             unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140723337336576, 6546223316613858487, 0, 140723698253920, 140723693868864, 4096, -6547756131873848137,
                     -6547872135220034377}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
             not_first_call = 0
             pagesize_m1 = <optimized out>
             sp = <optimized out>
             freesize = <optimized out>
             __PRETTY_FUNCTION__ = "start_thread"
     #5  0x00007ffcc99ade1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

FWIW, I have found a work-around for my library: I use a pthread key destructor as a teardown notifier instead of pthread_kill() to manage this cleanup.
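
Roughly, the work-around looks like this (identifiers below are illustrative, not the actual urcu-bp names): the exiting thread runs the key's destructor itself, so the library no longer has to probe another thread's descriptor.

#include <pthread.h>

struct reader_registration
{
  int registered;               /* Per-thread state tracked by the library.  */
};

static pthread_key_t cleanup_key;

/* Runs in the context of the exiting thread, before its resources are
   released: a safe point to remove it from the registry.  */
static void
unregister_reader (void *arg)
{
  struct reader_registration *reg = arg;

  reg->registered = 0;
}

/* Call once, e.g. from the library constructor.  */
static void
library_init (void)
{
  (void) pthread_key_create (&cleanup_key, unregister_reader);
}

/* Each thread registers itself; the non-NULL key value ensures the
   destructor is invoked at thread exit.  */
static void
library_register_thread (struct reader_registration *reg)
{
  reg->registered = 1;
  (void) pthread_setspecific (cleanup_key, reg);
}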

However, I think this pthread_kill() race is worth fixing. What we seem to want is a guarantee that the memory backing a thread ID is never released back to the OS, but may be re-used for future thread descriptors, providing what I believe could be called "type-safe memory".
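
By "type-safe memory" I mean an allocation discipline along these lines (a conceptual sketch of mine, not a proposed glibc patch): freed descriptors go onto a free list and are only ever re-used for objects of the same type, so a stale pointer can observe a recycled descriptor but can never fault.

#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct descriptor
{
  struct descriptor *next_free;
  pid_t tid;
};

static struct descriptor *free_list;
static pthread_mutex_t free_list_lock = PTHREAD_MUTEX_INITIALIZER;

static struct descriptor *
descriptor_alloc (void)
{
  struct descriptor *d;

  pthread_mutex_lock (&free_list_lock);
  d = free_list;
  if (d != NULL)
    free_list = d->next_free;
  pthread_mutex_unlock (&free_list_lock);

  if (d == NULL)
    d = calloc (1, sizeof (*d));   /* Grow the pool; never shrink it.  */
  return d;
}

static void
descriptor_free (struct descriptor *d)
{
  /* Back to the free list, never back to the OS.  */
  pthread_mutex_lock (&free_list_lock);
  d->next_free = free_list;
  free_list = d;
  pthread_mutex_unlock (&free_list_lock);
}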

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

