I think there is a race condition in the code in nptl/sysdeps/pthread/unwind-forcedunwind.c which can lead to a segfault. I found this in a redhat build, but it exists in CVS glibc too (as of May 6 2006). Consider a call to _Unwind_ForcedUnwind, when libgcc_s_forcedunwind has not been loaded. _Unwind_ForcedUnwind checks the value against null, and jumps to pthread_cancel_init. Meanwhile another thread comes in and initialises all these pointers, so the first check in pthread_cancel_init shows that libgcc_s_getcfa is non-null, so we return to _Unwind_ForcedUnwind and execute libgcc_s_forcedunwind. As the function pointer libgcc_s_forcedunwind has not been marked volatile, the compiler does not need to reload this value, and attempts to call the address it previously loaded, ie. 0. I have a test case which shows the problem and patch which I believe fixes it.
Created attachment 1004 [details] Patch to make the unwind function pointers volatile. I think this patch fixes the bug. It's a little ugly because I wasn't sure of the syntax for making the function pointers volatile.
Created attachment 1005 [details] Test case which shows the problem This test shows the bug for me on an 8 way xeon. I run it with the following command: for i in `seq 0 100 10000`; do echo $i; for j in `seq 1 100`; do ./pthread_exit_race; done; done and eventually I get some segs: 0 100 received segv! received segv! 200 ... I built the test with g++ pthread_exit_race.cc -o pthread_exit_race -lpthread -DMAXTHREADS=100
To clarify how the problem manifests, here's a snippet of the compiled _Unwind_ForcedUnwind code from the unpatched libc: 00941360 <_Unwind_ForcedUnwind>: [...] 941377: 8b 93 ac 21 00 00 mov 0x21ac(%ebx),%edx 94137d: 89 7d fc mov %edi,0xfffffffc(%ebp) 941380: 85 d2 test %edx,%edx 941382: 74 23 je 9413a7 941384: 8b 75 10 mov 0x10(%ebp),%esi 941387: 8b 4d 0c mov 0xc(%ebp),%ecx 94138a: 8b 45 08 mov 0x8(%ebp),%eax 94138d: 89 74 24 08 mov %esi,0x8(%esp) 941391: 89 4c 24 04 mov %ecx,0x4(%esp) 941395: 89 04 24 mov %eax,(%esp) 941398: ff d2 call *%edx [...] 9413a6: c3 ret 9413a7: 8b 83 b0 21 00 00 mov 0x21b0(%ebx),%eax 9413ad: 85 c0 test %eax,%eax 9413af: 75 d3 jne 941384 If the test at 941380 shows that edx (which is libgcc_s_forcedunwind) contains 0, and the test at 9413ad shows that eax (libgcc_s_getcfa) is non-zero, we jump back to 941384 and try to call edx without having changed it from 0. After patching and rebuilding on my system the following code results: 0000c360 <_Unwind_ForcedUnwind>: [...] c377: 8b 83 ac 21 00 00 mov 0x21ac(%ebx),%eax c37d: 89 7d fc mov %edi,0xfffffffc(%ebp) c380: 85 c0 test %eax,%eax c382: 74 29 je c3ad <_Unwind_ForcedUnwind+0x4d> c384: 8b 7d 10 mov 0x10(%ebp),%edi c387: 8b 75 0c mov 0xc(%ebp),%esi c38a: 8b 4d 08 mov 0x8(%ebp),%ecx c38d: 89 7c 24 08 mov %edi,0x8(%esp) c391: 8b 83 ac 21 00 00 mov 0x21ac(%ebx),%eax c397: 89 74 24 04 mov %esi,0x4(%esp) c39b: 89 0c 24 mov %ecx,(%esp) c39e: ff d0 call *%eax [...] c3ac: c3 ret c3ad: 8b 93 b0 21 00 00 mov 0x21b0(%ebx),%edx c3b3: 85 d2 test %edx,%edx c3b5: 75 cd jne c384 <_Unwind_ForcedUnwind+0x24> In this case eax is used for libgcc_s_forcedunwind, and after returning from the (inlined) pthread_cancel_init we now reload eax at c391 before calling it. This bug seems to exist in current CVS and could probably affect most platforms. I've searched for any previous reports and the closest I got was the patch proposed in http://sourceware.org/ml/libc-hacker/2005-11/msg00010.html, which fixes a different problem with the same code. I can provide more details of compiler versions etc if required.
What compiler do you use? Mine doesn't generate this code. Anyway, this is a pretty heavy handed solution. It forces the compiler to load the value twice while in fact you only want to reload when pthread_cancel_init was called. Instead, try replacing each call of pthread_cancel_init with an approriate call like { pthread_cancel_init (); asm volatile ("" : "=m" (libgcc_s_result)); }
Thanks for the quick response. Your solution does sound better, I'm just rebuilding to check that it works as expected. The compiler I'm using is GCC 3.4.3; to be precise, it's version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4). If it's of any use, the first piece of code I quoted matches that from the libpthread.so.0 that comes from the glibc-2.3.4-2 package on Red Hat EL 4. I'm building with rpmbuild at the moment, but from what I can tell the CFLAGS are set to '-march=i686 -DNDEBUG=1 -g -O3'. Please let me know if you need more details on the configuration. It seems to me that although not all compilers necessarily will plant code that is unsafe, it is at least valid for them to do so from the current source.
Created attachment 1006 [details] Patch to force reload of the pointers only when required As discussed, this patch forces the function pointers to be reloaded when required, without needing them all to marked as volatile. I've used the '+' modifier in the asm, when I used '=' gcc decided to dead-code one of the stores and everything broke.
For completeness, here's the compiler output for the new patched version: 0000c360 <_Unwind_ForcedUnwind>: [...] c377: 8b 93 ac 21 00 00 mov 0x21ac(%ebx),%edx c37d: 89 7d fc mov %edi,0xfffffffc(%ebp) c380: 85 d2 test %edx,%edx c382: 74 23 je c3a7 <_Unwind_ForcedUnwind+0x47> c384: 8b 75 10 mov 0x10(%ebp),%esi c387: 8b 4d 0c mov 0xc(%ebp),%ecx c38a: 8b 45 08 mov 0x8(%ebp),%eax c38d: 89 74 24 08 mov %esi,0x8(%esp) c391: 89 4c 24 04 mov %ecx,0x4(%esp) c395: 89 04 24 mov %eax,(%esp) c398: ff d2 call *%edx [...] c3a6: c3 ret c3a7: 8b 83 b0 21 00 00 mov 0x21b0(%ebx),%eax c3ad: 85 c0 test %eax,%eax c3af: 74 08 je c3b9 <_Unwind_ForcedUnwind+0x59> c3b1: 8b 93 ac 21 00 00 mov 0x21ac(%ebx),%edx c3b7: eb cb jmp c384 <_Unwind_ForcedUnwind+0x24> The common case is now just as it was before, but in the case where we have to do the initialisation the value is correctly loaded at c3b1. I haven't finished testing with this version, but it looks good so far.
I checked in the patch. (Next time, change the state from WAITING when you add comments.)
(In reply to comment #8) > I checked in the patch. Thank you! > (Next time, change the state from WAITING when you add comments.) Sorry, will do.
I think instead of the patch you checked in we should just mark pthread_cancel_init with __attribute__((noinline)). That will do everything that's needed and is desirable anyway.
*** Bug 3597 has been marked as a duplicate of this bug. ***
Hi, I am a phd student in University of Illinois in CS dept. Recently I have been working on memory model related bugs in software. I was experimenting with this bug. And I found out that the bug is not properly fixed. pthread_cancel_init() uses libgcc_s_getcfa as a flag. To make it work, we need to use 2 barrier - one before writing into libgcc_s_getcfa and one after reading it in line 40 (just before returning). The fix puts the first barrier but not the second one. Now consider the following scenario where Thread 1 in inside pthread_cancel_init and is actually initializing the pointers. Thread 2 is in _Unwind_Resume, finds libgcc_s_resume to be NULL and calls the init function. Thread 1 Thread 2 libgcc_s_resume = resume; if(__builtin_expect(libgcc_s_getcfa != NULL,1)) ... ... atomic_write_barrier(); libgcc_s_getcfa = getcfa; libgcc_s_resume(exc); Now in Power-PC memory model, it is perfectly valid to execute read operations to different addresses out of order as long as there is no barrier in between them. Although thread 2 issues the instructions in order, it is possible that the second read (i.e. reading of the pointer libgcc_s_resume) will execute before the first read of libgcc_s_getcfa. This is shown here. Thread 1 Thread 2 libgcc_s_resume(exc); libgcc_s_resume = resume; ... atomic_write_barrier(); libgcc_s_getcfa = getcfa; if(__builtin_expect(libgcc_s_getcfa != NULL,1)) As a result, although the condition of the if statement for thread 2 becomes true, it will end up using NULL value for libgcc_s_resume. This will crash the program. So, you need to put a read_barrier after reading libgcc_s_getcfa in the if statement (i.e at line 42 before returning from pthread_cancel_init). This pattern is very similar to double checked locking (DCL) which also requires 2 barrier to make it work. More on DCL can be found here http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html Thanks. -Abdullah Muzahid PhD Student CS, UIUC
I added some patches and don't reopen bugs.
*** Bug 260998 has been marked as a duplicate of this bug. *** Seen from the domain http://volichat.com Page where seen: http://volichat.com/adult-chat-rooms Marked for reference. Resolved as fixed @bugzilla.