This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



fork hang with corrupted list_all_lock


I have discovered an anomaly whose investigation has led to glibc and I'm
wondering if this has been seen before.


I have a cluster of machines running RHEL5.4 (glibc-2.5 based) on Nehalem
E5530 processors (16 hyperthreaded CPUs, stepping 5) that are running a Java
process (hadoop TaskTracker). TaskTracker is 32-bit and multithreaded (~80
threads). The kernel is 64-bit, version 2.6.18-164.2.1.el5.


I have caught the process in a relatively rare event that is one of those
"can't happen" scenarios.


Whenever a process forks, __libc_fork (nptl/sysdeps/unix/sysv/linux/fork.c)
calls _IO_list_lock() to acquire list_all_lock before calling the fork system
call. list_all_lock contains three fields: lock, cnt, and owner. After the
fork system call, the child resets the lock and the parent releases it.
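As a rough sketch of the sequence just described, the following models the
lock / fork / reset-or-release steps in a single-threaded program. It is not
the glibc code: struct model_lock, the model_* helpers, and
fork_with_model_lock() are hypothetical names, and the acquire/release logic
is an assumed simplification of a recursive lock with the three fields above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the three fields of list_all_lock. */
struct model_lock {
    int   lock;   /* 0 = free, 1 = held (contention is not modeled here) */
    int   cnt;    /* recursion count */
    void *owner;  /* identity of the owning thread, NULL when free */
};

static struct model_lock list_lock_model;  /* zero-initialized: free */

static void model_acquire(struct model_lock *l, void *self) {
    if (l->owner != self) {   /* first acquisition by this thread */
        l->lock = 1;          /* single-threaded model: never has to wait */
        l->owner = self;
    }
    ++l->cnt;
}

static void model_release(struct model_lock *l) {
    if (--l->cnt == 0) {      /* outermost release frees the lock */
        l->owner = NULL;
        l->lock = 0;
    }
}

static void model_reset(struct model_lock *l) {
    l->lock = 0;
    l->cnt = 0;
    l->owner = NULL;
}

static int lock_is_free(const struct model_lock *l) {
    return l->lock == 0 && l->cnt == 0 && l->owner == NULL;
}

/* Take the lock, fork, reset it in the child, release it in the parent.
   Returns 0 if both sides end up with a free lock, -1 otherwise. */
static int fork_with_model_lock(void) {
    int self;  /* its address serves as this thread's identity */
    int status;
    pid_t pid;

    model_acquire(&list_lock_model, &self);
    pid = fork();
    if (pid < 0) {
        model_release(&list_lock_model);
        return -1;
    }
    if (pid == 0) {                       /* child: reset its own copy */
        model_reset(&list_lock_model);
        _exit(lock_is_free(&list_lock_model) ? 0 : 1);
    }
    model_release(&list_lock_model);      /* parent: ordinary release */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return lock_is_free(&list_lock_model) ? 0 : -1;
}
```

In this model the parent's copy is free again after every fork, which is the
"works as you would expect" case.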


Normally this works as you would expect. When it fails, the parent's lock is
zeroed (.lock=0, .cnt=0, .owner=0) while it is still held, and the subsequent
release leaves the lock in an invalid state. At that point the lock has these
values:
list_all_lock.lock: 2
list_all_lock.cnt: -1
list_all_lock.owner: <thread that released the lock>
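To make the arithmetic concrete, here is one sequence that would end in those
values. It is only a guess at the ordering, not something taken from a trace.
It assumes a recursive lock whose release decrements cnt and frees the lock
only when cnt reaches exactly 0, and a low-level lock word where 2 means
"held, with waiters". The names model_lock, model_acquire, model_release,
model_contend, and replay_corruption are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct model_lock {
    int   lock;   /* 0 = free, 1 = held, 2 = held with waiters (assumed) */
    int   cnt;    /* recursion count */
    void *owner;  /* owning thread identity, NULL when free */
};

static void model_acquire(struct model_lock *l, void *self) {
    if (l->owner != self) {
        assert(l->lock == 0);  /* in this model the acquirer never waits */
        l->lock = 1;
        l->owner = self;
    }
    ++l->cnt;
}

static void model_release(struct model_lock *l) {
    if (--l->cnt == 0) {       /* only an exact 0 frees the lock */
        l->owner = NULL;
        l->lock = 0;
    }
}

/* A different thread tries to take the lock. Returns 1 if it would block. */
static int model_contend(struct model_lock *l) {
    if (l->lock != 0) {
        l->lock = 2;           /* mark as contended and wait */
        return 1;
    }
    return 0;
}

/* Replay: thread A forks twice, with a stray zeroing during the first fork. */
static struct model_lock replay_corruption(void *thread_a) {
    struct model_lock l = { 0, 0, NULL };

    model_acquire(&l, thread_a);  /* fork #1 begins: lock=1 cnt=1 owner=A */
    l.lock = 0;                   /* the "can't happen" zeroing ...        */
    l.cnt = 0;
    l.owner = NULL;               /* ... while A still holds the lock      */
    model_release(&l);            /* cnt becomes -1, nothing is freed      */

    model_acquire(&l, thread_a);  /* fork #2 begins: lock=1 cnt=0 owner=A */
    model_release(&l);            /* cnt becomes -1 again, lock stays held */

    model_contend(&l);            /* any other forking thread: lock=2      */
    return l;
}
```

Because cnt goes from 0 to -1 and never equals 0 on release, the lock word is
never cleared in this model, so every later acquirer would wait indefinitely.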


From this state, no additional forks can be made. Many of the threads in the
process are waiting for a lock in the malloc code (malloc_atfork) that runs
while a fork is in progress. The process is hung at this point.


So, the "can't happen" event is that some thread or process has clobbered the
lock while it is being held by a thread. Unless there is some glibc code that
is writing out-of-bounds zeroes, it looks like the lock is being reset
with _IO_list_resetlock(). Since only the child calls this code, in its own
address space, it ought not to affect the parent's copy of the lock.
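The address-space argument can be checked with a small test along these
lines. It is a minimal sketch: shared_word and value_after_child_zeroes() are
hypothetical names, and it covers only an ordinary fork() with private
memory (not vfork(), CLONE_VM, or MAP_SHARED mappings).

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stands in for a lock word that lives in ordinary process memory. */
static volatile int shared_word = 1;

/* Fork, let the child zero its copy, and report what the parent still sees.
   Returns the parent's value, or -1 on error. */
static int value_after_child_zeroes(void) {
    int status;
    pid_t pid;

    shared_word = 1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        shared_word = 0;   /* writes to the child's private copy only */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return shared_word;    /* parent's copy should be untouched */
}
```

On a correctly behaving system the parent should still read 1 here, which is
why a reset in the child should not be able to explain the zeroed parent lock.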


This anomaly occurs only when running RHEL5.4 on the Nehalem processors. I
have not been able to reproduce the issue running either RHEL5.4 or RHEL5.1 on
older E5420 processors.


Remediations tried so far have all resulted in the same TaskTracker hang.
* latest java (jdk1.6_20)
* set UseMemBar in java
* use latest microcode from Intel for E5530
* restrict the CPU set to all CPUs on a single processor
* disable HyperThreading in the BIOS
* latest RHEL glibc: glibc-2.5-49


Two tests made the issue stop reproducing:
* restrict all threads to the same CPU
* add glibc debugging code, which rearranged the cache line containing list_all_lock
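For anyone wanting to repeat the single-CPU test from inside a program, a
sketch using the Linux affinity calls follows. This is not necessarily how
the restriction was applied in the tests above, and
pin_self_to_first_allowed_cpu() is a hypothetical helper. It pins only the
calling thread; threads created afterwards inherit the mask, but threads that
already exist would each need the same call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to the lowest-numbered CPU it is currently
   allowed to run on. Returns that CPU number, or -1 on error. */
static int pin_self_to_first_allowed_cpu(void) {
    cpu_set_t allowed;
    cpu_set_t one;
    int cpu;

    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;  /* no allowed CPU found */
}
```

Calling this at the very start of main(), before any threads are created,
would keep the whole process on one CPU.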


I have looked at http://sourceware.org/ml/libc-hacker/2007-02/msg00009.html ,
but this doesn't seem quite like the issue that I'm seeing. If that were
the bug, then I would expect to see a deadlock situation, not corrupted
lock fields.


This looks like it may be a silicon bug, but it is possible that it is not,
so I'm looking for anyone who might have seen this kind of behavior in
glibc.


Wayne

--
Wayne Badger
Yahoo!

