
Example ex11 fails on SMP systems


While porting glibc-2.2.5 to PPC64 I noticed that "make check" would
occasionally fail during the linuxthreads/Examples tests. These
failures seem to happen only on SMP systems. At least three tests
(ex9, ex11, and ex17) fail frequently.

As an experiment I compiled these examples for PPC32 using the
installed tool chain (gcc 2.95.3 and glibc 2.2.4) and was able to
reproduce the failures. The test case is simply a shell script which
runs ex9, ex11, and ex17 in a loop. On an SMP system ex9 will
occasionally segfault, while ex11 or ex17 will eventually hang. I have
since run these tests with gcc 3.0.5 and glibc 2.2.5 and saw the same
failures. I also examined the glibc cvs tree and verified that no
pertinent changes have been made since 2.2.4.

I coded a simple flight recorder and instrumented the pthread code to
trace the sequence of operations up to the point of the hang in ex11.
This testing was all based on PPC32, gcc 2.95.3, and glibc-2.2.4. The
trace below is from an ex11 hang where the signal was sent during the
timed suspend (__libc_nanosleep) but the signal delivery was delayed.

In this case the thread represented by the pthread_descr address
0xfcbffc00 is attempting a pthread_rwlock_timedrdlock. "rwlock.c" uses
spinlocks (&rwlock->__rw_lock) to control access to its wait queues.
So the pthread_lock, pthread_lock exit, and pthread_unlock entries in
the 0x13dbf0-0x13dc40 range represent the spinlock controlling access
to the wait queue while the thread "enqueues" itself.

Since thread 0xfcb... can't get the lock yet, it calls
pthread_timedsuspend_new, which arms a sigsetjmp and unmasks the
restart signal before calling __libc_nanosleep. If the signal arrives
while in __libc_nanosleep, the signal handler will longjmp back into
pthread_timedsuspend_new, which returns 1. If the time expires, it
remasks the restart signal and disarms the sigsetjmp before returning
0. In this case the time expired before the signal was delivered,
because pthread_timedsuspend_new exited with "value 1" == 0.
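
For reference, the shape of pthread_timedsuspend_new is roughly the
following. This is a condensed paraphrase of the linuxthreads/pthread.c
code, with the resume-count bookkeeping and the nanosleep retry loop
omitted:

   static int pthread_timedsuspend_new(pthread_descr self,
                                       const struct timespec *abstime)
   {
     sigjmp_buf jmpbuf;
     sigset_t unblock, initial_mask;

     if (sigsetjmp(jmpbuf, 1) == 0) {
       THREAD_SETMEM(self, p_signal_jmp, &jmpbuf);
       THREAD_SETMEM(self, p_signal, 0);

       /* Unmask the restart signal, then sleep until abstime. */
       sigemptyset(&unblock);
       sigaddset(&unblock, __pthread_sig_restart);
       sigprocmask(SIG_UNBLOCK, &unblock, &initial_mask);

       /* ... loop on __libc_nanosleep() until abstime expires ... */

       /* Timed out: remask the restart signal, disarm the jump
          buffer and report that no signal arrived. */
       sigprocmask(SIG_SETMASK, &initial_mask, NULL);
       THREAD_SETMEM(self, p_signal_jmp, NULL);
       return 0;
     }

     /* The restart handler siglongjmp'ed us here: signal received. */
     THREAD_SETMEM(self, p_signal_jmp, NULL);
     return 1;
   }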

However, thread 0xfe3... did call pthread_restart_new to signal thread
0xfcb... while that thread was in __libc_nanosleep. The entry
0x13ea00... pthread_restart_new_exit marks the return from the "kill"
syscall. Thread 0xfe3... did this because it called
pthread_rwlock_unlock and thread 0xfcb... was on the wait list (from
the "enqueue" earlier).

Now that pthread_rwlock_timedrdlock has returned from
pthread_timedsuspend_new, it must take the spinlock again before
dequeueing itself from the wait list (remove_from_queue
(&rwlock->__rw_read_waiting, self)). But this spinlock is contended
and currently held by another thread (0xfd1...), so pthread_lock calls
pthread_wait_for_restart_signal, which unmasks the restart signal and
suspends.
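
For context, pthread_wait_for_restart_signal is essentially the
following (paraphrased from linuxthreads/pthread.c). Note that it
accepts the first restart signal that arrives, whoever it was meant
for:

   void __pthread_wait_for_restart_signal(pthread_descr self)
   {
     sigset_t mask;

     sigprocmask(SIG_SETMASK, NULL, &mask);   /* current mask      */
     sigdelset(&mask, __pthread_sig_restart); /* unblock restart   */
     THREAD_SETMEM(self, p_signal, 0);
     do {
       sigsuspend(&mask);                     /* wait for a signal */
     } while (THREAD_GETMEM(self, p_signal) != __pthread_sig_restart);
   }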

Unfortunately, the signal from thread 0xfe3... finally gets delivered
(entry 13ef00, pthread_handle_sigrestart). So
pthread_wait_for_restart_signal returns to pthread_lock, which checks
its internal state and decides this signal isn't the one it was
expecting. So it increments the spurious_wakeup count and calls
pthread_wait_for_restart_signal again. Finally the signal (from thread
0xfd1...'s pthread_unlock) arrives, and
pthread_wait_for_restart_signal returns to pthread_lock again. This
time pthread_lock finds the signal to its liking and returns (entry
13f720... pthread_lock exit) to pthread_rwlock_timedrdlock.

pthread_rwlock_timedrdlock can tell that someone tried to send it a
signal (remove_from_queue reports that it was already dequeued) but
assumes that signal is still to be delivered (because the
__libc_nanosleep timed out in pthread_timedsuspend_new). So
pthread_rwlock_timedrdlock suspends once more, via
pthread_wait_for_restart_signal (trace entry 13f840), to "/* Eat the
outstanding restart() from the signaller */".
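
The timeout path in pthread_rwlock_timedrdlock is roughly the
following (paraphrased from linuxthreads/rwlock.c):

   if (timedsuspend(self, abstime) == 0) {
     int was_on_queue;

     /* Timed out: take the spinlock and dequeue ourselves. */
     __pthread_lock(&rwlock->__rw_lock, self);
     was_on_queue = remove_from_queue(&rwlock->__rw_read_waiting, self);
     __pthread_unlock(&rwlock->__rw_lock);

     if (was_on_queue)
       return ETIMEDOUT;

     /* Someone dequeued us first, so a restart is (supposedly)
        in flight. */
     /* Eat the outstanding restart() from the signaller */
     suspend(self);
   }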

And it stays stuck there.


             pthread-
entry addr   descr addr value 1    value 2    entry type
------------ ---------- ---------- ---------- --------------------------
<0x0013dbe0> 0xfcbffc00 0x100942ac 0x10003580 pthread_rwlock_timedrdlock
<0x0013dbf0> 0xfcbffc00 0x100942ac 0x100042f4 pthread_lock
<0x0013dc10> 0xfcbffc00 0x100942ac 0x100043dc pthread_lock exit
<0x0013dc40> 0xfcbffc00 0x100942ac 0x10004690 pthread_unlock
<0x0013dc50> 0xfcbffc00 0x00000000 0x10002ca0 pthread_timedsuspend_new
<0x0013e940> 0xfe3ffc00 0xfcbffc00 0x10002c1c pthread_restart_new
<0x0013ea00> 0xfe3ffc00 0xfcbffc00 0x10002c44 pthread_restart_new_exit
<0x0013eb40> 0xfcbffc00 0x00000000 0x10002dcc pthread_timedsuspend_new exit
<0x0013eb50> 0xfcbffc00 0x100942ac 0x100042f4 pthread_lock
<0x0013ec40> 0xfd1ffc00 0xfcbffc00 0x10002c1c pthread_restart_new
<0x0013ec60> 0xfcbffc00 0x00000020 0x100027e8 pthread_wait_for_restart_signal
<0x0013eda0> 0xfd1ffc00 0xfcbffc00 0x10002c44 pthread_restart_new_exit
<0x0013ef00> 0xfcbffc00 0x00000020 0x100023cc pthread_handle_sigrestart
<0x0013ef50> 0xfcbffc00 0x00000000 0x10002828 pthread_wait_for_restart_signal exit
<0x0013ef80> 0xfcbffc00 0x00000020 0x100027e8 pthread_wait_for_restart_signal
<0x0013f080> 0xfcbffc00 0x00000020 0x100023cc pthread_handle_sigrestart
<0x0013f0e0> 0xfcbffc00 0x00000000 0x10002828 pthread_wait_for_restart_signal exit
<0x0013f0f0> 0xfcbffc00 0x00000001 0x10004534 pthread_lock_spurious_wakeup
<0x0013f140> 0xfcbffc00 0x00000020 0x100027e8 pthread_wait_for_restart_signal
<0x0013f450> 0xfc3ffc00 0xfcbffc00 0x10002c1c pthread_restart_new
<0x0013f4f0> 0xfc3ffc00 0xfcbffc00 0x10002c44 pthread_restart_new_exit
<0x0013f680> 0xfcbffc00 0x00000020 0x100023cc pthread_handle_sigrestart
<0x0013f6b0> 0xfcbffc00 0x00000000 0x10002828 pthread_wait_for_restart_signal exit
<0x0013f720> 0xfcbffc00 0x100942ac 0x100043dc pthread_lock exit
<0x0013f750> 0xfcbffc00 0x100942ac 0x10004690 pthread_unlock
<0x0013f760> 0xfcbffc00 0xfc5ffc00 0x10002c1c pthread_restart_new
<0x0013f830> 0xfcbffc00 0xfc5ffc00 0x10002c44 pthread_restart_new_exit
<0x0013f840> 0xfcbffc00 0x00000020 0x100027e8 pthread_wait_for_restart_signal


The core problem seems to be that rwlock.c uses functions from
spinlock.c and both depend on the same restart signal. On large SMP
systems under heavy load the delivery of signals becomes less
predictable, and one pthread function (spinlock.c) will occasionally
get the signal intended for another (rwlock.c). In this case
pthread_lock eats the signal intended for pthread_rwlock_timedrdlock.
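
Both subsystems ultimately ride on the same signal: the suspend(),
timedsuspend() and restart() macros in internals.h dispatch through
the same function pointers for rwlock.c and spinlock.c alike, and on
the sending side a restart is just a kill() with __pthread_sig_restart
(condensed from pthread.c):

   void __pthread_restart_new(pthread_descr th)
   {
     /* Any restart, whether meant for spinlock.c's suspend or for
        rwlock.c's timed suspend, is this one signal. */
     WRITE_MEMORY_BARRIER();
     kill(th->p_pid, __pthread_sig_restart);
   }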

pthread_lock does handle and count spurious wakeups, and it has code
to reissue the restart signals. However, the introduction of SMP
support (spinlock.c cvs version 1.23) effectively disabled this
spurious_wakeup logic: the lock can now be acquired in the SMP spin
loop, which returns early and never executes the resignal code below.

    /* Put back any resumes we caught that don't belong to us. */
    while (spurious_wakeup_count--)
      restart(self);
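
Condensed, the post-1.23 flow of __pthread_lock looks like the outline
below (a sketch, not the verbatim source). A retry that re-enters the
spin loop via "goto again" and wins the lock there returns
immediately, with spurious_wakeup_count still pending:

   again:
     /* On SMP, try spinning to get the lock. */
     if (__pthread_smp_kernel) {
       for (spin_count = 0; spin_count < max_count; spin_count++) {
         if (((oldstatus = lock->__status) & 1) == 0
             && __compare_and_swap(&lock->__status, oldstatus,
                                   oldstatus | 1)) {
           /* Early return: skips the resignal loop below. */
           return;
         }
       }
     }

     /* ... try once more or enqueue and suspend; spurious wakeups
        are counted here, followed by "goto again" ... */

     /* Put back any resumes we caught that don't belong to us. */
     while (spurious_wakeup_count--)
       restart(self);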

I have several more traces like the above that all point to the same
problem. I don't believe this is a PowerPC problem, but a generic
libc/linuxthreads SMP problem.

Unfortunately my attempts so far to reactivate the spurious_wakeup restart
logic have only resulted in a different type of hang (the wait list can
become circular and pthread_unlock gets into an infinite loop).

I attempted to work around the spurious_wakeup problem by patching the
code to ensure that the spurious_wakeup restarts are issued. My first
attempt involved moving the label "again:" from before the "SMP spin
loop" to after that loop.


>>>>>>>
diff -rc2P ./glibc-2.2.5-base/linuxthreads/spinlock.c
./glibc-2.2.5/linuxthreads/spinlock.c
*** ./glibc-2.2.5-base/linuxthreads/spinlock.c Sun Apr 21 13:29:24 2002
--- ./glibc-2.2.5/linuxthreads/spinlock.c Sun Apr 21 13:36:16 2002
***************
*** 88,93 ****
    spin_count = 0;

- again:
-
    /* On SMP, try spinning to get the lock. */

--- 88,91 ----
***************
*** 116,119 ****
--- 114,119 ----
      lock->__spinlock += (spin_count - lock->__spinlock) / 8;
    }
+
+ again:

    /* No luck, try once more or suspend. */
<<<<<<<

This removes the "early return" from the "goto again" retry path, so
the code naturally falls through to the "spurious_wakeup" restart
loop. Unfortunately, this exposed another problem.

In the trace that follows, thread 0xfc3 attempts to pthread_lock the
spinlock (value 1) 0x1009400c. This spinlock is busy, so the thread is
enqueued on the wait list by the following snippet from spinlock.c:

   again:
     do {
       oldstatus = lock->__status;
       successful_seizure = 0;

       if ((oldstatus & 1) == 0) {
         newstatus = oldstatus | 1;
         successful_seizure = 1;
       } else {
         if (self == NULL)
           self = thread_self();
         newstatus = (long) self | 1;
       }

       if (self != NULL) {
         THREAD_SETMEM(self, p_nextlock, (pthread_descr) (oldstatus & ~1L));
         MEMORY_BARRIER();
       }
     } while(! __compare_and_swap(&lock->__status, oldstatus, newstatus));

Then pthread_lock falls into the suspend loop:

     if (!successful_seizure) {
       for (;;) {
         suspend(self);
         if (self->p_nextlock != NULL) {
           /* Count resumes that don't belong to us. */
           spurious_wakeup_count++;
           continue;
         }
         break;
       }
       goto again;
     }

"successful_seizure" is false so the code falls into the "suspend" loop. In
the trace below entries <0x001b1150> through <0x001b1190> show the restart
signal (intended for rwlock's timedsuspend) being generated and the signal
being delivered at trace entry <0x001b1d00>. This is clearly a
spurious_wakeup signal but is not recognized as such.

             pthread-
entry addr   descr addr value 1    time base  entry type
------------ ---------- ---------- ---------- --------------------------
<0x001afd90> 0xfc3ffc00 0x1009400c 0xa1b86a5d pthread_rwlock_timedrdlock
<0x001afda0> 0xfc3ffc00 0x1009400c 0xa1b86b1d pthread_lock
<0x001afdb0> 0xfc3ffc00 0x1009400c 0xa1b86b9a pthread_lock exit
<0x001afdc0> 0xfc3ffc00 0x1009400c 0xa1b86fb1 pthread_unlock
<0x001afdd0> 0xfc3ffc00 0x00000000 0xa1b86feb pthread_timedsuspend_new
<0x001b1150> 0xfdbffc00 0x1009400c 0xa26ec66a pthread_rwlock_unlock
<0x001b1160> 0x00000000 0x1009400c 0xa26ec7c4 pthread_lock
<0x001b1170> 0x00000000 0x1009400c 0xa26ec8b6 pthread_lock exit
<0x001b1180> 0xfdbffc00 0x1009400c 0xa26ec924 pthread_unlock
<0x001b1190> 0xfdbffc00 0xfc3ffc00 0xa26eca72 pthread_restart_new
<0x001b1c40> 0xfc3ffc00 0x00000000 0xa2712512 pthread_timedsuspend_new exit
<0x001b1c50> 0xfc3ffc00 0x1009400c 0xa27125cd pthread_lock
<0x001b1c70> 0xfc3ffc00 0x00000000 0xa27130e2 pthread_lock_enqueue
<0x001b1c90> 0xfc3ffc00 0x00000020 0xa271358a pthread_wait_for_restart_signal
<0x001b1d00> 0xfc3ffc00 0x00000020 0xa2714ade pthread_handle_sigrestart
<0x001b1d40> 0xfc3ffc00 0x00000000 0xa27151f6 pthread_wait_for_restart_signal exit
<0x001b1d50> 0xfc3ffc00 0xfc3ffc00 0xa27152a3 pthread_lock_enqueue
<0x001b1d60> 0xfc3ffc00 0x00000020 0xa2715549 pthread_wait_for_restart_signal
<0x001b1d70> 0xfc9ffc00 0xfc3ffc00 0xa2715d7a pthread_lock_enqueue

The test "if (self->p_nextlock != NULL)" is not sufficient. In the
case where only one thread is waiting (or a thread is the last on the
list of waiting threads), its self->p_nextlock will also be NULL. In
this case thread 0xfc3 was the only thread waiting, so its "oldstatus"
was 1 on entry, and the assignment self->p_nextlock = (oldstatus &
~1L) set that field to zero (NULL). Thus the logic "breaks" out of the
for loop and executes the "goto again".
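
To spell that out: pthread_unlock sets the dequeued thread's
p_nextlock to NULL just before restarting it, so p_nextlock == NULL is
meant to mean "this resume is ours". But for a tail waiter the enqueue
itself already produces NULL:

   /* Tail waiter: lock busy, no other waiters, so oldstatus == 1. */
   self->p_nextlock = (pthread_descr)(oldstatus & ~1L);
                 /* = (pthread_descr)(0x00000001 & ~1L)
                    = (pthread_descr)0 == NULL */
   /* A spurious resume of this thread is therefore indistinguishable
      from a genuine dequeue-and-restart. */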

Since the spinlock is still busy (this was a spurious signal), the
code enqueues itself on the wait list (again). As lock->__status is
now equal to self | 1 (0xfc3ffc01) from the previous enqueue, the
enqueue logic creates a circular wait list, with __status pointing to
the pthread_descr at 0xfc3ffc00 and that thread's self->p_nextlock
pointing to itself (trace entry <0x001b1d50>).

So when the thread holding this lock calls pthread_unlock, its
"priority search" loop:

     while (thr != 0) {
       if (thr->p_priority >= maxprio) {
         maxptr = ptr;
         maxprio = thr->p_priority;
       }
       ptr = &(thr->p_nextlock);
       thr = *ptr;
     }

gets stuck in an infinite loop!
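
A standalone toy program (hypothetical code, not from glibc) shows
why: with the tail thread's p_nextlock pointing at itself, thr never
becomes 0 and the search never terminates:

   #include <stdio.h>

   struct descr {                    /* stand-in for pthread_descr  */
     struct descr *p_nextlock;
     int p_priority;
   };

   int main(void)
   {
     struct descr waiter = { NULL, 0 };
     waiter.p_nextlock = &waiter;    /* the circular second enqueue */

     struct descr *head = &waiter;   /* stand-in for lock->__status */
     struct descr **ptr = &head;
     struct descr **maxptr = ptr;
     struct descr *thr = head;
     int maxprio = 0;
     long n = 0;

     while (thr != 0) {              /* pthread_unlock's search     */
       if (thr->p_priority >= maxprio) {
         maxptr = ptr;
         maxprio = thr->p_priority;
       }
       ptr = &(thr->p_nextlock);
       thr = *ptr;
       if (++n > 1000000) {          /* cut short what never ends   */
         printf("still looping after %ld iterations\n", n);
         break;
       }
     }
     (void) maxptr;
     return 0;
   }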

I will continue "peeling the onion" on this problem to find a complete fix,
but thought getting more brains involved was appropriate.

One thought I had is that the root cause is spinlock and rwlock
"sharing" the restart signal. This makes it difficult to track which
signals are intended for which functions. One possible solution would
be to allocate another signal number (say 35) for the exclusive use of
rwlock and leave the current restart signal (32) to spinlock and join.
This would eliminate the possibility of spinlock receiving a
"spurious_wakeup". Is this signal number in use? Could another signal
number be allocated to linuxthreads?
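
For what it's worth, linuxthreads already allocates three realtime
signals when the kernel provides them (condensed from pthread.c); the
sketch below simply adds a fourth. The name __pthread_sig_rwlock is my
invention:

   /* Existing allocation in pthread.c (when __SIGRTMIN is usable): */
   int __pthread_sig_restart = __SIGRTMIN;      /* 32: restart      */
   int __pthread_sig_cancel  = __SIGRTMIN + 1;  /* 33: cancellation */
   int __pthread_sig_debug   = __SIGRTMIN + 2;  /* 34: gdb support  */

   /* Proposed: a restart signal reserved for rwlock's suspends
      (hypothetical name and number): */
   int __pthread_sig_rwlock  = __SIGRTMIN + 3;  /* 35 */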


