This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug linuxthreads/3597] New: Possible race condition in pthread_exit() function resulting in core dump.
- From: "rsomla at mysql dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: 27 Nov 2006 17:48:43 -0000
- Subject: [Bug linuxthreads/3597] New: Possible race condition in pthread_exit() function resulting in core dump.
- Reply-to: sourceware-bugzilla at sourceware dot org
Here at MySQL we got a core dump inside pthread_exit() function. One of our
developers did the analysis (quoted below) which shows that the problem might be
related to concurrent execution of pthread_exit() code and aggressive
optimizations made by current compilers.
Suggested fix: declare the libgcc_s_getcfa and/or libgcc_s_forcedunwind
variables volatile to prevent them from being kept in processor registers, or
disable context switches during pthread_exit() execution.
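A minimal sketch of the first suggestion, using a simplified stand-in rather than the real glibc code. The names unwind_fn, lazy_init, real_unwind and call_unwind are hypothetical and do not exist in glibc. Declaring the pointer object itself volatile obliges the compiler to load it from memory again at the call instead of reusing the value it loaded for the NULL test. Note that volatile alone gives no inter-thread ordering guarantees on weakly ordered hardware, so this addresses only the stale-register part of the problem.

```c
#include <stddef.h>

/* Hypothetical stand-in for the function resolved from libgcc_s. */
static int real_unwind(int arg) { return arg + 1; }

/* The pointer object itself is volatile (note where the qualifier sits),
   so every read written in the source is a real load from memory. */
static int (*volatile unwind_fn)(int) = NULL;

/* Hypothetical lazy initializer, playing the role of pthread_cancel_init(). */
static void lazy_init(void)
{
    if (unwind_fn != NULL)
        return;
    unwind_fn = real_unwind;
}

/* Same shape as _Unwind_ForcedUnwind(): test, initialize, call. */
int call_unwind(int arg)
{
    if (unwind_fn == NULL)
        lazy_init();
    /* Because unwind_fn is volatile, this is a fresh load, not the value
       that was tested above. */
    return unwind_fn(arg);
}
```

Calling call_unwind(41) on a fresh process goes through lazy_init() once and then dispatches through the reloaded pointer.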
The analysis:
Here is the stacktrace from the core:
(gdb) bt
#0 0x005227a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x007898bb in pthread_kill () from /lib/tls/libpthread.so.0
#2 0x084a1775 in write_core (sig=11) at stacktrace.c:245
#3 0x0826c995 in handle_segfault (sig=11) at mysqld.cc:2115
#4 <signal handler called>
#5 0x00000000 in ?? ()
#6 0x0078d2aa in _Unwind_ForcedUnwind () from /lib/tls/libpthread.so.0
#7 0x0078af81 in __pthread_unwind () from /lib/tls/libpthread.so.0
#8 0x00786f00 in pthread_exit () from /lib/tls/libpthread.so.0
#9 0x084885d7 in handle_slave_io (arg=0x9f6c858) at slave.cc:3769
#10 0x00786341 in start_thread () from /lib/tls/libpthread.so.0
#11 0x006066fe in clone () from /lib/tls/libc.so.6
I have a plausible theory about what is going on.
The crash is in this piece of code from glibc (as found by Google code
search), where pthread_cancel_init() is inlined in _Unwind_ForcedUnwind():
_Unwind_Reason_Code
_Unwind_ForcedUnwind (struct _Unwind_Exception *exc, _Unwind_Stop_Fn stop,
                      void *stop_argument)
{
  if (__builtin_expect (libgcc_s_forcedunwind == NULL, 0))
    pthread_cancel_init ();
  return libgcc_s_forcedunwind (exc, stop, stop_argument);
}

void
pthread_cancel_init (void)
{
  void *resume, *personality, *forcedunwind, *getcfa;
  void *handle;

  if (__builtin_expect (libgcc_s_getcfa != NULL, 1))
    return;

  handle = __libc_dlopen ("libgcc_s.so.1");

  if (handle == NULL
      || (resume = __libc_dlsym (handle, "_Unwind_Resume")) == NULL
      || (personality = __libc_dlsym (handle, "__gcc_personality_v0")) == NULL
      || (forcedunwind = __libc_dlsym (handle, "_Unwind_ForcedUnwind"))
         == NULL
      || (getcfa = __libc_dlsym (handle, "_Unwind_GetCFA")) == NULL
#ifdef ARCH_CANCEL_INIT
      || ARCH_CANCEL_INIT (handle)
#endif
      )
    __libc_fatal ("libgcc_s.so.1 must be installed for pthread_cancel to work\n");

  libgcc_s_resume = resume;
  libgcc_s_personality = personality;
  libgcc_s_forcedunwind = forcedunwind;
  libgcc_s_getcfa = getcfa;
}
Note that there is actually a race in this code:
- Thread A finds libgcc_s_forcedunwind==NULL and enters
  pthread_cancel_init(). A context switch then occurs before thread A has
  had time to check the libgcc_s_getcfa variable.
- Thread B finds libgcc_s_forcedunwind==NULL and enters
  pthread_cancel_init(). It finds libgcc_s_getcfa==NULL, and goes on to set
  libgcc_s_forcedunwind = forcedunwind and libgcc_s_getcfa = getcfa.
- Thread A is later re-scheduled, and now finds libgcc_s_getcfa!=NULL so
  returns immediately from pthread_cancel_init(). It then proceeds to
  execute the call (*libgcc_s_forcedunwind)() using the _old_, previously
  loaded value in %edx, which is still NULL. Hence a segfault.
So the problem is that the libgcc_s_getcfa variable is checked and modified
without any kind of synchronization.
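The interleaving described above can be replayed deterministically in a single thread, which makes the stale value easy to see. This is only an illustrative sketch: forcedunwind_ptr, getcfa_ptr, cancel_init and replay_stale_pointer are hypothetical stand-ins for the glibc variables and functions, and the local variable cached_by_a models the copy that the compiled code keeps in %edx.

```c
#include <stddef.h>

typedef int (*unwind_fn_t)(int);

/* Hypothetical stand-in for the function resolved from libgcc_s. */
static int real_unwind(int arg) { return arg + 1; }

/* Hypothetical stand-ins for libgcc_s_forcedunwind and libgcc_s_getcfa. */
static unwind_fn_t forcedunwind_ptr = NULL;
static void *getcfa_ptr = NULL;
static int dummy_getcfa;

/* Same structure as pthread_cancel_init(): guarded by the getcfa variable. */
static void cancel_init(void)
{
    if (getcfa_ptr != NULL)
        return;
    forcedunwind_ptr = real_unwind;
    getcfa_ptr = &dummy_getcfa;
}

/* Replays the three steps of the race in program order.
   Returns 1 if thread A would end up calling through a NULL pointer
   even though the shared variable has already been initialized. */
int replay_stale_pointer(void)
{
    forcedunwind_ptr = NULL;
    getcfa_ptr = NULL;

    /* Thread A: loads the pointer into a register and sees NULL. */
    unwind_fn_t cached_by_a = forcedunwind_ptr;
    if (cached_by_a != NULL)
        return 0;

    /* Context switch: thread B runs the whole initializer. */
    cancel_init();

    /* Thread A resumes in the inlined initializer; getcfa is already
       set, so it returns at once without touching cached_by_a. */
    cancel_init();

    /* Thread A would now execute "call *%edx" with the cached value. */
    return cached_by_a == NULL && forcedunwind_ptr != NULL;
}
```

replay_stale_pointer() reports the situation seen in the core file: the variable in memory is non-NULL, but the copy thread A is about to call through is still NULL.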
I actually found some evidence in the core file that this race is exactly what
happened. Here are the registers at the point of crash:
(gdb) info reg
eax 0x78ade0 7908832
ecx 0xb6fa6dd0 -1225101872
edx 0x0 0
ebx 0x78fff4 7929844
esp 0xb6fa6328 0xb6fa6328
ebp 0xb6fa6348 0xb6fa6348
esi 0xb6fa6480 -1225104256
edi 0xb6fa6dd0 -1225101872
eip 0x78d2aa 0x78d2aa
And here is the disassembly, with some comments.
0x0078d270 <_Unwind_ForcedUnwind+0>: push %ebp
0x0078d271 <_Unwind_ForcedUnwind+1>: mov %esp,%ebp
0x0078d273 <_Unwind_ForcedUnwind+3>: sub $0x20,%esp
0x0078d276 <_Unwind_ForcedUnwind+6>: mov %ebx,0xfffffff4(%ebp)
0x0078d279 <_Unwind_ForcedUnwind+9>: call 0x7852da <__i686.get_pc_thunk.bx>
0x0078d27e <_Unwind_ForcedUnwind+14>: add $0x2d76,%ebx
0x0078d284 <_Unwind_ForcedUnwind+20>: mov %esi,0xfffffff8(%ebp)
0x0078d287 <_Unwind_ForcedUnwind+23>: mov 0x21ac(%ebx),%edx
libgcc_s_forcedunwind is now loaded in %edx.
0x0078d28d <_Unwind_ForcedUnwind+29>: mov %edi,0xfffffffc(%ebp)
0x0078d290 <_Unwind_ForcedUnwind+32>: test %edx,%edx
0x0078d292 <_Unwind_ForcedUnwind+34>: je 0x78d2b7 <_Unwind_ForcedUnwind+71>
From the register dump, %edx is 0, so this jump is taken.
0x0078d294 <_Unwind_ForcedUnwind+36>: mov 0x10(%ebp),%esi
0x0078d297 <_Unwind_ForcedUnwind+39>: mov 0x8(%ebp),%edi
0x0078d29a <_Unwind_ForcedUnwind+42>: mov 0xc(%ebp),%eax
0x0078d29d <_Unwind_ForcedUnwind+45>: mov %esi,0x8(%esp)
0x0078d2a1 <_Unwind_ForcedUnwind+49>: mov %edi,(%esp)
0x0078d2a4 <_Unwind_ForcedUnwind+52>: mov %eax,0x4(%esp)
0x0078d2a8 <_Unwind_ForcedUnwind+56>: call *%edx
This is where it crashes since %edx (libgcc_s_forcedunwind) is NULL.
0x0078d2aa <_Unwind_ForcedUnwind+58>: mov 0xfffffff4(%ebp),%ebx
0x0078d2ad <_Unwind_ForcedUnwind+61>: mov 0xfffffff8(%ebp),%esi
0x0078d2b0 <_Unwind_ForcedUnwind+64>: mov 0xfffffffc(%ebp),%edi
0x0078d2b3 <_Unwind_ForcedUnwind+67>: mov %ebp,%esp
0x0078d2b5 <_Unwind_ForcedUnwind+69>: pop %ebp
0x0078d2b6 <_Unwind_ForcedUnwind+70>: ret
This is where the code jumps to from above when it finds libgcc_s_forcedunwind
to be NULL (it is the inlined pthread_cancel_init() code).
0x0078d2b7 <_Unwind_ForcedUnwind+71>: mov 0x21b0(%ebx),%eax
0x0078d2bd <_Unwind_ForcedUnwind+77>: test %eax,%eax
0x0078d2bf <_Unwind_ForcedUnwind+79>: jne 0x78d294 <_Unwind_ForcedUnwind+36>
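And here it returns immediately, since it loads libgcc_s_getcfa into %eax and
finds it non-NULL. It then jumps back to +36 and calls through the stale NULL
in %edx -> crash. (The 0x78ade0 shown for %eax in the register dump is not the
libgcc_s_getcfa value, since %eax is reloaded from 0xc(%ebp) at +42 before the
call.)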
Some more dumps to show this:
(gdb) x $ebx+0x21ac
0x7921a0 <libgcc_s_forcedunwind>: 0x009340e4
(gdb) x $ebx+0x21b0
0x7921a4 <libgcc_s_getcfa>: 0x00932b98
(gdb) x 0x009340e4
0x9340e4 <_Unwind_ForcedUnwind>: 0x57e58955
(gdb) x 0x00932b98
0x932b98 <_Unwind_GetCFA>: 0x8be58955
So the variable libgcc_s_forcedunwind is actually non-NULL at the time of
crash (set by the other thread in the race). But the compiled code naturally
uses the previously loaded value in %edx, having no reason to believe that it
might have changed since it was last loaded. Hence the crash.
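One way to make the reload explicit, shown here only as a sketch under stated assumptions and not as the fix that was or should be applied to glibc, is to read the pointer again after the initializer returns and to use C11 acquire/release atomics for the publication. The names forcedunwind_ptr, cancel_init, real_unwind and call_unwind_reloaded are hypothetical, and the initializer is assumed to be idempotent so that two threads running it concurrently is harmless.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef int (*unwind_fn_t)(int);

/* Hypothetical stand-in for the function resolved from libgcc_s. */
static int real_unwind(int arg) { return arg + 1; }

/* Hypothetical stand-in for libgcc_s_forcedunwind, as an atomic pointer. */
static _Atomic(unwind_fn_t) forcedunwind_ptr = NULL;

/* Idempotent initializer: publishing the same value twice is harmless. */
static void cancel_init(void)
{
    /* Release store: work done before this point is visible to any
       thread that observes the new value with an acquire load. */
    atomic_store_explicit(&forcedunwind_ptr, real_unwind,
                          memory_order_release);
}

int call_unwind_reloaded(int arg)
{
    unwind_fn_t fn = atomic_load_explicit(&forcedunwind_ptr,
                                          memory_order_acquire);
    if (fn == NULL) {
        cancel_init();
        /* Reload after initialization instead of reusing the value
           that was tested above. */
        fn = atomic_load_explicit(&forcedunwind_ptr,
                                  memory_order_acquire);
    }
    return fn(arg);
}
```

The design point is that the value used for the call is always one loaded after the initializer has completed, whichever thread ran it.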
$ ./configure --prefix=/usr/local/mysql --enable-assembler \
    --with-extra-charsets=complex --enable-thread-safe-client \
    --with-readline --with-big-tables --with-debug --disable-shared \
    --with-innodb --with-berkeley-db --with-ndbcluster \
    --with-archive-storage-engine --with-big-tables \
    --with-blackhole-storage-engine --with-federated-storage-engine \
    --with-csv-storage-engine --with-yassl --with-embedded-server \
    --enable-local-infile
Some build info from config.log follows:
## --------- ##
## Platform. ##
## --------- ##
hostname = <cut>
uname -m = i686
uname -r = 2.6.9-22.0.1.ELsmp
uname -s = Linux
uname -v = #1 SMP Tue Oct 18 18:39:27 EDT 2005
/usr/bin/uname -p = unknown
/bin/uname -X = unknown
/bin/arch = i686
/usr/bin/arch -k = unknown
/usr/convex/getsysinfo = unknown
hostinfo = unknown
/bin/machine = unknown
/usr/bin/oslevel = unknown
/bin/universe = unknown
--
Summary: Possible race condition in pthread_exit() function
resulting in core dump.
Product: glibc
Version: 2.3.4
Status: NEW
Severity: normal
Priority: P2
Component: linuxthreads
AssignedTo: drow at false dot org
ReportedBy: rsomla at mysql dot com
CC: glibc-bugs at sources dot redhat dot com
http://sourceware.org/bugzilla/show_bug.cgi?id=3597