fork hang with corrupted list_all_lock

Fri Jun 25 22:42:00 GMT 2010

I have discovered an anomaly whose investigation has led to glibc and I'm
wondering if this has been seen before.

I have a cluster of machines running RHEL5.4 (glibc-2.5 based) on Nehalem
E5530 processors (16 hyperthreaded CPUs, stepping 5) that are running a Java
process (hadoop TaskTracker).  TaskTracker is 32-bit and multithreaded (~80
threads).  The kernel is 64-bit, running 2.6.18-164.2.1.el5.

I have caught the process in a relatively rare event that is one of those
"can't happen" scenarios.

Whenever a process forks, __libc_fork (nptl/sysdeps/unix/sysv/linux/fork.c)
calls _IO_list_lock() to acquire list_all_lock before calling the fork system
call.  list_all_lock contains three fields: lock, cnt, and owner.  After the
fork system call, the child resets the lock and the parent releases it.
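
The sequence above can be sketched with a small stand-in.  This is only an
illustration under stated assumptions: struct demo_lock and the demo_*
functions are hypothetical names, the lock is a plain struct rather than
glibc's real type, and nothing here is taken from the glibc sources.  It
shows the expected behavior, namely that a reset done by the child after
fork() leaves the parent's copy of the lock untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for list_all_lock: same three field names as in
   the description above, but NOT glibc's actual type or locking code. */
struct demo_lock {
    int   lock;   /* 0 = free, 1 = held */
    int   cnt;    /* recursion count */
    void *owner;  /* identity of the holder, NULL when free */
};

static struct demo_lock demo_list_lock;   /* zero-initialized: free */

static void demo_acquire(struct demo_lock *l, void *self)
{
    assert(l != NULL && self != NULL);
    if (l->owner != self) {   /* first acquisition by this holder */
        l->lock = 1;          /* single-threaded demo: never has to wait */
        l->owner = self;
    }
    ++l->cnt;
}

static void demo_release(struct demo_lock *l)
{
    assert(l != NULL);
    if (--l->cnt == 0) {      /* last release: hand the lock back */
        l->owner = NULL;
        l->lock = 0;
    }
}

static void demo_reset(struct demo_lock *l)
{
    assert(l != NULL);
    l->lock = 0;
    l->cnt = 0;
    l->owner = NULL;
}

/* Acquire, fork, reset in the child, release in the parent.
   *cnt_before_release receives the parent's cnt as seen after the child
   has exited but before the parent releases.  Returns 0 on success and
   -1 if fork() or waitpid() failed. */
static int demo_fork_sequence(int *cnt_before_release)
{
    void *self = &demo_list_lock;   /* any non-NULL token works as "self" */
    int status;
    pid_t pid;

    demo_acquire(&demo_list_lock, self);

    pid = fork();
    if (pid < 0) {
        demo_release(&demo_list_lock);
        return -1;
    }
    if (pid == 0) {
        /* Child: the reset touches only the child's copy of the page. */
        demo_reset(&demo_list_lock);
        _exit(0);
    }

    /* Parent: wait for the child so its reset has certainly happened. */
    if (waitpid(pid, &status, 0) < 0) {
        demo_release(&demo_list_lock);
        return -1;
    }
    *cnt_before_release = demo_list_lock.cnt;
    demo_release(&demo_list_lock);
    return 0;
}
```

Calling demo_fork_sequence() from a small driver should leave the parent's
cnt at 1 until its own release, after which all three fields are zero again.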

Normally, this works as you would expect, but when it fails, the parent's
lock is zeroed (.lock=0, .cnt=0, .owner=0) while it is still held, and the
subsequent release leaves the lock in an invalid state.  At that time, the
lock has these values:
         list_all_lock.lock:  2
         list_all_lock.cnt:   -1
         list_all_lock.owner: <thread that released the lock>
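
One way to see how a zeroed lock could end up with values like these is to
replay the sequence against a simplified model.  The sketch below is an
assumption-laden illustration, not glibc's code: struct model_lock and the
model_* functions are hypothetical, it is single-threaded, and a "would
block" return value stands in for a thread sleeping on the lock.  It assumes
a recursive lock that skips the low-level lock when the caller is already
the owner and only frees it when cnt drops to exactly zero.  The replay is
just one sequence that is consistent with the values listed; it does not
establish what actually happens on the affected machines.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, single-threaded model of a recursive lock with the three
   fields named in the report.  Hypothetical; NOT glibc's implementation. */
struct model_lock {
    int   lock;   /* 0 = free, 1 = held, 2 = held with a waiter */
    int   cnt;    /* recursion count */
    void *owner;  /* current holder, NULL when free */
};

/* Returns 1 if the caller now holds the lock, 0 if it would block. */
static int model_lock_acquire(struct model_lock *l, void *self)
{
    assert(l != NULL && self != NULL);
    if (l->owner != self) {
        if (l->lock != 0) {
            l->lock = 2;      /* mark contended; a real thread would sleep */
            return 0;
        }
        l->lock = 1;
        l->owner = self;
    }
    ++l->cnt;
    return 1;
}

static void model_lock_release(struct model_lock *l)
{
    assert(l != NULL);
    if (--l->cnt == 0) {      /* only an exact zero frees the lock */
        l->owner = NULL;
        l->lock = 0;
    }
}

/* Replays: A takes the lock, the lock is zeroed underneath A, A releases,
   A takes and releases it once more, then B tries to take it.
   Returns the result of B's attempt (0 means B would block). */
static int model_replay(struct model_lock *l, void *a, void *b)
{
    model_lock_acquire(l, a);   /* lock=1 cnt=1  owner=a */

    l->lock = 0;                /* the unexplained zeroing while held */
    l->cnt = 0;
    l->owner = NULL;

    model_lock_release(l);      /* lock=0 cnt=-1 owner=NULL */
    model_lock_acquire(l, a);   /* lock=1 cnt=0  owner=a */
    model_lock_release(l);      /* lock=1 cnt=-1 owner=a: never freed */
    return model_lock_acquire(l, b);   /* lock=2, b would wait forever */
}
```

In this model, once cnt has been pushed to -1 it can never pass through zero
on release again, so the low-level lock stays held, the owner field keeps
pointing at the last thread that released, and every other thread blocks.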

From this state, no additional forks can be made.  Many of the threads in the
process are waiting for a lock in the malloc code (malloc_atfork) that runs
when a fork is currently outstanding.  The process is hung at this point.

So, the "can't happen" event is that some thread/process has scrozzled the
lock while it is being held by a thread.  Unless there is some glibc code that
is just writing out-of-bounds zeroes, it looks like the lock is being reset
with _IO_list_resetlock().  Since only the child calls this code in its own
address space, it ought not affect the parent's version of the lock.

This anomaly occurs only when running RHEL5.4 on the Nehalem processors.  I
have not been able to reproduce the issue running either RHEL5.4 or RHEL5.1 on
older E5420 processors.

Remediations tried so far have all resulted in the same TaskTracker hang.
         * latest java (jdk1.6_20)
         * set UseMemBar in java
         * use latest microcode from Intel for E5530
         * restrict the CPU set to all CPUs on a single processor
         * disable HyperThreading in the BIOS
         * latest RHEL glibc: glibc-2.5-49

I have tried a couple of tests under which the issue did not reproduce.
         * restrict all threads to the same CPU
         * add glibc debugging, which rearranged the cache line containing
           list_all_lock

I have looked at
http://sourceware.org/ml/libc-hacker/2007-02/msg00009.html, but this doesn't
seem quite like the issue that I'm seeing.  If that were the bug, then I would
expect to see a deadlock situation, not corrupted lock fields.

While it looks like this may be a silicon bug, it is possible that it is not
and so I'm looking for anyone who might have seen this kind of behavior in
glibc.

Wayne

--
Wayne Badger
Yahoo!