29740 – Race condition between pthread_exit and fork

Bug 29740 - Race condition between pthread_exit and fork

Summary: Race condition between pthread_exit and fork

Status:	RESOLVED INVALID

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	dynamic-link (show other bugs)
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2022-11-01 14:12 UTC by George Prekas
Modified:	2022-11-01 17:01 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fw: security-

Attachments
glibc patch to add delay in do_lookup_x (353 bytes, patch) 2022-11-01 14:12 UTC, George Prekas	Details \| Diff
Reproduction (365 bytes, text/x-csrc) 2022-11-01 14:14 UTC, George Prekas	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description George Prekas 2022-11-01 14:12:51 UTC

Created attachment 14427 [details]
glibc patch to add delay in do_lookup_x

I have observed a crash in a Python program that uses threads and fork. Below, I attach the following:

* a patch for the latest glibc master (commit b4174c28d21e1672ef3cc15a058558e97b8471c6) that introduces a delay to force the race condition.
* a minimal C reproduction.

My understanding of the problem is that the following sequence of events takes place:

parent: main-thread: starts thread-1
parent: thread-1: exits
parent: thread-1: tries to call _Unwind_Find_FDE
parent: thread-1: ld.so loads libgcc_s.so in memory
parent: main-thread: fork()
parent: thread-1: resolves _Unwind_Find_FDE@got.plt
child: main-thread: starts thread-1
child: thread-1: exits
child: thread-1: calls _Unwind_Find_FDE using the unresolved _Unwind_Find_FDE@got.plt

The child process crashes with a SIGSEGV with the following backtrace:

#0  0x0000000000003230 in ?? ()
#1  0x00007ffff75c2868 in uw_frame_state_for (context=0x7ffff7dcdc70, fs=0x7ffff7dcdab0) at ../../../src/libgcc/unwind-dw2.c:1263
#2  0x00007ffff75c3a20 in uw_init_context_1 (context=0x7ffff7dcdc70, outer_cfa=0x7ffff7dcdea0,
    outer_ra=0x7ffff7e64e36 <__GI___pthread_unwind+70>) at ../../../src/libgcc/unwind-dw2.c:1592
#3  0x00007ffff75c431a in _Unwind_ForcedUnwind (exc=0x7ffff7dced30, stop=stop@entry=0x7ffff7e64ca0 <unwind_stop>,
    stop_argument=0x7ffff7dcdef0) at ../../../src/libgcc/unwind.inc:211
#4  0x00007ffff7e64e36 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:130
#5  0x00007ffff7e5ddba in __do_cancel () at ../sysdeps/nptl/pthreadP.h:276
#6  __GI___pthread_exit (value=0x0) at pthread_exit.c:36
#7  0x00007ffff7fbf283 in ?? ()
#8  0x0000000000000000 in ?? ()

0x3230 is the value of _Unwind_Find_FDE@got.plt in the binary:

$ gdb -q /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) p '_Unwind_Find_FDE@got.plt'
$1 = (<text from jump slot in .got.plt, no debug info>) 0x3230

I have also a Python reproduction that fails without the forced delay and by just enabling LD_DEBUG=all (which seems to add the needed delays). Let me know if it's valuable to attach in this bug.

Comment 1 George Prekas 2022-11-01 14:14:44 UTC

Created attachment 14428 [details]
Reproduction

Comment 2 Andreas Schwab 2022-11-01 15:03:55 UTC

The child may only call async-signal-safe functions, which pthread_create isn't.

Comment 3 Florian Weimer 2022-11-01 16:38:45 UTC

What's your glibc version? I don't expect unwind-link to have this problem. See __libc_unwind_link_after_fork. The unwind-link facility was added in glibc 2.34.

Comment 4 George Prekas 2022-11-01 16:50:53 UTC

This was tested and reproduces on the latest master (commit b4174c28d21e1672ef3cc15a058558e97b8471c6).

After reading Andreas' response, I found this snippet in fork's man page which renders the reproduction code invalid:

       *  After a fork() in a multithreaded program, the child can
          safely call only async-signal-safe functions (see
          signal-safety(7)) until such time as it calls execve(2).

Comment 5 Florian Weimer 2022-11-01 17:01:19 UTC

I see. So it's probably the half-completed dlopen that confuses the subprocess. We do this after fork:

      /* Reset the lock the dynamic loader uses to protect its data.  */
      __rtld_lock_initialize (GL(dl_load_lock));

      /* Reset the lock protecting dynamic TLS related data.  */
      __rtld_lock_initialize (GL(dl_load_tls_lock));

That's unfortunately not particularly safe.