Bug 29740 - Race condition between pthread_exit and fork
Summary: Race condition between pthread_exit and fork
Status: RESOLVED INVALID
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-11-01 14:12 UTC by George Prekas
Modified: 2022-11-01 17:01 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fw: security-


Attachments
glibc patch to add delay in do_lookup_x (353 bytes, patch)
2022-11-01 14:12 UTC, George Prekas
Details | Diff
Reproduction (365 bytes, text/x-csrc)
2022-11-01 14:14 UTC, George Prekas
Details

Note You need to log in before you can comment on or make changes to this bug.
Description George Prekas 2022-11-01 14:12:51 UTC
Created attachment 14427 [details]
glibc patch to add delay in do_lookup_x

I have observed a crash in a Python program that uses threads and fork. Below, I attach the following:

* a patch for the latest glibc master (commit b4174c28d21e1672ef3cc15a058558e97b8471c6) that introduces a delay to force the race condition.
* a minimal C reproduction.

My understanding of the problem is that the following sequence of events takes place:

parent: main-thread: starts thread-1
parent: thread-1: exits
parent: thread-1: tries to call _Unwind_Find_FDE
parent: thread-1: ld.so loads libgcc_s.so in memory
parent: main-thread: fork()
parent: thread-1: resolves _Unwind_Find_FDE@got.plt
child: main-thread: starts thread-1
child: thread-1: exits
child: thread-1: calls _Unwind_Find_FDE using the unresolved _Unwind_Find_FDE@got.plt

The child process crashes with a SIGSEGV with the following backtrace:

#0  0x0000000000003230 in ?? ()
#1  0x00007ffff75c2868 in uw_frame_state_for (context=0x7ffff7dcdc70, fs=0x7ffff7dcdab0) at ../../../src/libgcc/unwind-dw2.c:1263
#2  0x00007ffff75c3a20 in uw_init_context_1 (context=0x7ffff7dcdc70, outer_cfa=0x7ffff7dcdea0,
    outer_ra=0x7ffff7e64e36 <__GI___pthread_unwind+70>) at ../../../src/libgcc/unwind-dw2.c:1592
#3  0x00007ffff75c431a in _Unwind_ForcedUnwind (exc=0x7ffff7dced30, stop=stop@entry=0x7ffff7e64ca0 <unwind_stop>,
    stop_argument=0x7ffff7dcdef0) at ../../../src/libgcc/unwind.inc:211
#4  0x00007ffff7e64e36 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:130
#5  0x00007ffff7e5ddba in __do_cancel () at ../sysdeps/nptl/pthreadP.h:276
#6  __GI___pthread_exit (value=0x0) at pthread_exit.c:36
#7  0x00007ffff7fbf283 in ?? ()
#8  0x0000000000000000 in ?? ()

0x3230 is the value of _Unwind_Find_FDE@got.plt in the binary:

$ gdb -q /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) p '_Unwind_Find_FDE@got.plt'
$1 = (<text from jump slot in .got.plt, no debug info>) 0x3230

I have also a Python reproduction that fails without the forced delay and by just enabling LD_DEBUG=all (which seems to add the needed delays). Let me know if it's valuable to attach in this bug.
Comment 1 George Prekas 2022-11-01 14:14:44 UTC
Created attachment 14428 [details]
Reproduction
Comment 2 Andreas Schwab 2022-11-01 15:03:55 UTC
The child may only call async-signal-safe functions, which pthread_create isn't.
Comment 3 Florian Weimer 2022-11-01 16:38:45 UTC
What's your glibc version? I don't expect unwind-link to have this problem. See __libc_unwind_link_after_fork. The unwind-link facility was added in glibc 2.34.
Comment 4 George Prekas 2022-11-01 16:50:53 UTC
This was tested and reproduces on the latest master (commit b4174c28d21e1672ef3cc15a058558e97b8471c6).

After reading Andreas' response, I found this snippet in fork's man page which renders the reproduction code invalid:

       *  After a fork() in a multithreaded program, the child can
          safely call only async-signal-safe functions (see
          signal-safety(7)) until such time as it calls execve(2).
Comment 5 Florian Weimer 2022-11-01 17:01:19 UTC
I see. So it's probably the half-completed dlopen that confuses the subprocess. We do this after fork:

      /* Reset the lock the dynamic loader uses to protect its data.  */
      __rtld_lock_initialize (GL(dl_load_lock));

      /* Reset the lock protecting dynamic TLS related data.  */
      __rtld_lock_initialize (GL(dl_load_tls_lock));

That's unfortunately not particularly safe.