_int_free() (malloc/malloc.c), which is called from free() with the arena mutex locked, checks the chunk and eventually prints/logs an error message. So if the malloc_printerr() handling does any malloc()/free() on the same memory arena, a deadlock can occur. vsyslog() can call free() during timezone manipulation. Yes, this deadlock is triggered by buggy code, but the deadlock itself happens entirely inside libc; it is not caused by actual memory corruption.
Created attachment 424 [details] testcase for deadlock on double-free logging

Simplified testcase to trigger the deadlock (originally it was detected in a daemon which didn't want to exit on SIGPIPE when run on NPTL libc - it tried to do a double shutdown). The deadlock occurs on NPTL libc, or when the program is linked with the linuxthreads libpthread. When run on plain linuxthreads libc it logs the double-free error and aborts (as there is no real locking in single-threaded code).
*** Bug 771 has been marked as a duplicate of this bug. ***
*** Bug 772 has been marked as a duplicate of this bug. ***
This happens to us about once a day (obviously the double free that causes the problem is periodic in some fashion). For us, deadlocking is mostly worse than just allowing some potential corruption due to a double free. We are using glibc-2.5-12 x86-64 on a CentOS 5 machine. Here are some musings from #glibc IRC where I reported that I had seen this bug.

<ryanarn> hrm.. that one's been about for a while. I wonder why there's been no movement on it.
<sjmunroe> the mutex is locked in free(), which calls _int_free(), so _int_free does not know about the lock
<sjmunroe> to avoid this, _int_free would have to report the error back to free so free could do the unlock before reporting the error
I cannot reproduce this on a newer glibc, does this error still happen to you?
Any conceivable testcase is worth nothing compared to the concise explanation of the problem already recorded in this bug. Tweaks to the allocator could easily change the exact circumstances and thus invalidate the testcase, while the bug would still bite real applications. If the problem described wasn't fixed, the bug still exists. So, did you fix the bug? Please provide either a reference for the change where you believe you fixed the bug, or else go fix the bug rather than wasting people's time by asking silly questions.
Why so aggressive? Hmm, it is curious that the bug should still be there, but I cannot trigger it no matter how hard I try so far (well, I didn't try that hard yet). The new ATOMIC_FASTBIN mode shouldn't have the bug anymore, but it is still experimental.
I'd say rather, conservative. For me this is a bug which caused a big clustered production (revenue generating) system to grind to a halt, often at inopportune moments. To prove the assertion "somehow this bug has magically gone away without any action" I have to risk every member of technical staff being woken in the middle of the night as the app deadlocks suddenly and monitoring systems start sending alerts. To even prepare for such an escapade might be a week's work. So I don't want to do that. If the bug is gone, there'll be a patch that fixed it. I can justify testing such a patch. If there isn't a patch, the bug is just hiding and this bug report should remain open. I can confirm (if it's any help) that the testcase supplied by Jakub Bogusz doesn't deadlock on a modern glibc. But there could be lots of reasons for that.
Probably the same as: http://sourceware.org/bugzilla/show_bug.cgi?id=10282 (fixed)
Possibly. It does seem as though there's been considerable churn on that code. Here we no longer have a double free (we eventually managed to reproduce it on a test system under valgrind and fixed it) with which to test, and as already observed Jakub's original testcase no longer shows the problem in modern glibc. So I guess closing this as a DUP doesn't hurt if that's what you'd like to do.
Created attachment 5729 [details] gdb backtrace of deadlocked X server

I believe that I just hit this bug. My X server locks hard every time I log out -- apparently trying to print an error message about a corrupted double-linked list.

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:100
#1  0x000000351127ec11 in _L_lock_10461 () from /lib64/libc.so.6
#2  0x000000351127c9d7 in __libc_malloc (bytes=227924349408) at malloc.c:3657
#3  0x000000351127235d in __libc_message (do_abort=2, fmt=0x351135d8c8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:137
#4  0x000000351127896a in malloc_printerr (action=3, str=0x351135a8d2 "corrupted double-linked list", ptr=<optimized out>) at malloc.c:6283
#5  0x0000003511278d80 in malloc_consolidate (av=0x35115991e0) at malloc.c:5169
#6  0x0000003511279669 in malloc_consolidate (av=0x35115991e0) at malloc.c:5115
#7  _int_free (av=0x35115991e0, p=<optimized out>, have_lock=0) at malloc.c:5034
#8  0x0000000000461094 in FreeOsBuffers (oc=0x1237b30) at io.c:1101
#9  0x000000000045f283 in CloseDownConnection (client=0x1237b70) at connection.c:1068
#10 0x000000000042e1c6 in CloseDownClient (client=0x1237b70) at dispatch.c:3432
#11 0x000000000042ec3a in Dispatch () at dispatch.c:441
#12 0x0000000000422e1a in main (argc=<optimized out>, argv=0x7fff630ba4b8, envp=<optimized out>) at main.c:287

Needless to say, this is really painful, since the only way to recover the system is to ssh in and kill -9 the X server. (A simple crash would be a lot easier to deal with, since the X server is shutting down anyway.) Is there any way to suppress the printing of this message? (This is on Fedora 15 with glibc-2.13.90-11.x86_64, BTW.)
Ian, what you're hitting should have been fixed with commit id f8a3b5bf. The X backtrace is waiting on a lock in a malloc within __libc_message. The commit I mentioned replaces these malloc calls with mmap() calls to avoid getting tangled up in the arena locks.
(In reply to comment #12) > Ian, what you're hitting should have been fixed with commit id f8a3b5bf. The X > backtrace is waiting on a lock in a malloc within __libc_message. The commit I > mentioned replaces these malloc calls with mmap() calls to avoid getting > tangled up in the arena locks. Thanks Siddhesh! Fortunately, I haven't seen this problem in quite a while. (I'd like to think that the real problem in X has been fixed, but it seems more likely that my system has changed enough that the race condition that causes the memory corruption/detection isn't being triggered any more. Oh well.)
Fixed in duplicate bug as was mentioned earlier in thread. *** This bug has been marked as a duplicate of bug 10282 ***
*** Bug 260998 has been marked as a duplicate of this bug. ***