This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



linuxthreads bug in 2.2.4 under ppc linux


Hi,

I recently upgraded to glibc-2.2.4 and seem to have run into a 
linuxthreads problem under PPC Linux.

This problem is very timing dependent.  It does not happen every time, but 
after 10 or 20 attempts with the code I can usually get it to segfault, 
and the segfault always happens in exactly the same place.


Here is a quick analysis of the problem:

#0  0xfdcce70 in __pthread_alt_unlock () at eval.c:88
#1  0xfdc895c in pthread_mutex_unlock () at eval.c:88
#2  0xfe680d4 in ChildStatusProc ()
   from /src2/openoffice-641c/solver/641/unxlngppc.pro/lib/libsal.so.3
#3  0xfe667e4 in oslWorkerWrapperFunction ()
   from /src2/openoffice-641c/solver/641/unxlngppc.pro/lib/libsal.so.3
#4  0xfdc7448 in pthread_start_thread () at eval.c:88
#5  0xfafe5a8 in clone () at eval.c:88

We are using normal mutexes here.

Based on disassembling the code, the problem is here in

void __pthread_alt_unlock(struct _pthread_fastlock *lock);

After the test to see if the node was abandoned:

 if (p_node->abandoned) {
        /* Remove abandoned node. */

It turns out it was not, so the else clause is taken and the following 
code is run:

     } else if ((prio = p_node->thr->p_priority) >= maxprio) {
        /* Otherwise remember it if its thread has a higher or equal
           priority compared to that of any node seen thus far. */
        maxprio = prio;
        pp_max_prio = pp_node;
        p_max_prio = p_node;
      }

But the wait_node structure being looked at contained all zeros.

In the code r4 is the address of the fastlock and its status value is 
0x0fb57250 which is the pointer to the wait_node.

(gdb) x/10 $r4
0x7fffd3a4:     0x0fb57250      0x0fde66cc      0x0fb56e40      0x7fffd3c0
0x7fffd3b4:     0x0fdc8588      0x0fde66cc      0x0fb57250      0x7fffd3d0
0x7fffd3c4:     0x0fdc895c      0x0fb5f974

Unfortunately the wait node itself is all zeros (p_node->abandoned was 0, 
but the thr and next pointers were 0 as well).

(gdb) x/10 $r11
0xfb57250 <main_arena+1040>:    0x00000000      0x00000000      0x00000000      0x00000000


This results in a segfault trying to access the p_priority of a 0 thr 
pointer at 0xfdcce70 since r9 is 0 (the thr value).

0xfdcce6c <__pthread_alt_unlock+240>:   lwz     r9,4(r11)
0xfdcce70 <__pthread_alt_unlock+244>:   lwz     r0,88(r9)
0xfdcce74 <__pthread_alt_unlock+248>:   cmpw    r0,r6
0xfdcce78 <__pthread_alt_unlock+252>:   blt     0xfdcce88 <__pthread_alt_unlock+268>


So the question is: is this a legal state?

Is it possible to have a nonzero status in a fastlock while the wait_node 
it points at is all zeros?

If so, we should check that the thr pointer is not zero before trying to 
access its fields.
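
To make the suggestion concrete, here is a minimal sketch of what such a 
check could look like.  It is not the spinlock.c code: the struct and 
function names (toy_thread, toy_wait_node, pick_max_prio) are hypothetical 
stand-ins, and the field order is only assumed from the disassembly above 
(thr loaded from offset 4 of the node).  A guard like this would only hide 
the symptom; it does not explain why the node was zeroed.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the linuxthreads types. */
struct toy_thread {
    int p_priority;
};

struct toy_wait_node {
    struct toy_wait_node *next;   /* next waiter in the list      */
    struct toy_thread    *thr;    /* thread waiting on the lock   */
    int                   abandoned;
};

/* Walk the wait list and return the live node whose thread has the
   highest (or equal, later) priority.  Nodes with a NULL thr pointer
   are skipped instead of dereferenced, and *saw_null_thr is set so
   the caller can tell that the list looked inconsistent. */
static struct toy_wait_node *
pick_max_prio(struct toy_wait_node *head, int *saw_null_thr)
{
    struct toy_wait_node *best = NULL;
    int maxprio = INT_MIN;

    assert(saw_null_thr != NULL);
    *saw_null_thr = 0;

    for (struct toy_wait_node *p = head; p != NULL; p = p->next) {
        if (p->abandoned)
            continue;               /* the real code unlinks these */
        if (p->thr == NULL) {       /* the state seen in the crash */
            *saw_null_thr = 1;
            continue;
        }
        if (p->thr->p_priority >= maxprio) {
            maxprio = p->thr->p_priority;
            best = p;
        }
    }
    return best;
}
```

With an all-zero node like the one in the gdb dump, this returns NULL and 
sets the flag rather than faulting on the p_priority load.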

I am sorry I can't be more help here, but the code in spinlock.c seems to 
be much more complicated than the way mutexes were done in earlier 
glibc-2.2 releases.

I see lots of reservation pairs (lwarx/stwcx.) used throughout the code 
I disassembled.  I am very unsure whether the proper syncs and isyncs 
(BARRIERS) are being used here.

Here are 3 examples taken from this routine that all differ in their use 
of sync and isync.

(this one does no sync to start)

    bde0:       7d 20 20 28     lwarx   r9,r0,r4
    bde4:       7d 69 4a 79     xor.    r9,r11,r9
    bde8:       40 82 00 0c     bne-    bdf4 <__pthread_alt_unlock+0x78>
    bdec:       7c 00 21 2d     stwcx.  r0,r0,r4
    bdf0:       40 a2 ff f0     bne-    bde0 <__pthread_alt_unlock+0x64>
    bdf4:       4c 00 01 2c     isync


(this one does a sync to start and an isync after)

    be38:       7c 00 04 ac     sync
    be3c:       7d 20 50 28     lwarx   r9,r0,r10
    be40:       7c 09 4a 79     xor.    r9,r0,r9
    be44:       40 82 00 0c     bne-    be50 <__pthread_alt_unlock+0xd4>
    be48:       7d 60 51 2d     stwcx.  r11,r0,r10
    be4c:       40 a2 ff f0     bne-    be3c <__pthread_alt_unlock+0xc0>
    be50:       4c 00 01 2c     isync

(this one does a sync to start but no isync after) 

    bf38:       7c 00 04 ac     sync
    bf3c:       7d 20 18 28     lwarx   r9,r0,r3
    bf40:       7c 09 4a 79     xor.    r9,r0,r9
    bf44:       40 82 00 0c     bne-    bf50 <__pthread_alt_unlock+0x1d4>
    bf48:       7d 80 19 2d     stwcx.  r12,r0,r3
    bf4c:       40 a2 ff f0     bne-    bf3c <__pthread_alt_unlock+0x1c0>
    bf50:       7d 20 4b 78     mr      r0,r9


My (limited) understanding of this is that when you grab a lock you use 
the lwarx/stwcx. pair and follow it with an isync.  When you write 0 to a 
lock to free it, you do a sync first and then simply write it.  Therefore I 
think the last one, which does a sync before the reservation but no isync 
after, is wrong.
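
To illustrate the pattern I have in mind, here is a small sketch of the 
acquire/release pairing written with C11 atomics rather than the PPC 
instructions themselves.  The names (toy_lock, toy_trylock, toy_unlock) are 
made up for this example and are not from linuxthreads.  My assumption, 
which I have not verified against this compiler, is that on PowerPC an 
acquire compare-and-swap is typically emitted as a lwarx/stwcx. loop 
followed by isync, and a release store as a barrier followed by a plain 
store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical toy lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int status;
} toy_lock;

/* Try to take the lock once.  Returns 1 on success, 0 if it was
   already held.  Acquire ordering on success keeps the critical
   section's loads and stores from moving above the lock. */
static int toy_trylock(toy_lock *l)
{
    int expected = 0;

    assert(l != NULL);
    return atomic_compare_exchange_strong_explicit(
        &l->status, &expected, 1,
        memory_order_acquire,    /* ordering if the swap succeeds */
        memory_order_relaxed);   /* ordering if it fails          */
}

/* Free the lock.  Release ordering makes every write done inside the
   critical section visible before the lock is seen as free. */
static void toy_unlock(toy_lock *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->status, 0, memory_order_release);
}
```

A caller would spin on toy_trylock() until it returns 1, do its work, and 
then call toy_unlock().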

Maybe Geoff or Franz or David knows for sure.

Any guidance on how to address this would be greatly appreciated.

Thanks,

Kevin

