Bug 770 - possible deadlock on double-free logging
Summary: possible deadlock on double-free logging
Status: RESOLVED DUPLICATE of bug 10282
Alias: None
Product: glibc
Classification: Unclassified
Component: malloc
Version: 2.3.4
Importance: P2 normal
Target Milestone: ---
Assignee: GOTO Masanori
URL:
Keywords:
Duplicates: 771 772
Depends on:
Blocks:
 
Reported: 2005-02-26 21:29 UTC by Jakub Bogusz
Modified: 2014-06-17 04:01 UTC
CC List: 5 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
testcase for deadlock on double-free logging (298 bytes, text/x-c)
2005-02-26 21:35 UTC, Jakub Bogusz
gdb backtrace of deadlocked X server (1.13 KB, text/plain)
2011-05-17 02:38 UTC, Ian Pilcher

Description Jakub Bogusz 2005-02-26 21:29:07 UTC
_int_free() (malloc/malloc.c), which is called from free() with the arena mutex
locked, checks the chunk and eventually prints/logs an error message.
So if the malloc_printerr() handling does any malloc()/free() on the same memory
arena, a deadlock can occur.
vsyslog() can call free() during timezone manipulation.

Yes, this deadlock is triggered by buggy code.
But it's all inside libc, not caused by actual memory corruption.
Comment 1 Jakub Bogusz 2005-02-26 21:35:01 UTC
Created attachment 424
testcase for deadlock on double-free logging

Simplified testcase to trigger the deadlock (it was originally detected in a
daemon which didn't want to exit on SIGPIPE when run on NPTL libc - it tried
to do a double shutdown).

The deadlock occurs on NPTL libc, or when linuxthreads libc is linked with the
linuxthreads libpthread. When run on plain linuxthreads libc it logs the
double-free error and aborts (as there is no real locking in single-threaded
code).
Comment 2 Jakub Bogusz 2005-02-26 21:37:48 UTC
*** Bug 771 has been marked as a duplicate of this bug. ***
Comment 3 Jakub Bogusz 2005-02-26 21:38:16 UTC
*** Bug 772 has been marked as a duplicate of this bug. ***
Comment 4 Nick Lamb 2008-04-30 20:27:35 UTC
This happens to us about once a day (evidently the double-free that causes the
problem is periodic in some fashion). For us, deadlocking is considerably worse
than just allowing some potential corruption from a double-free.

We are using glibc-2.5-12 x86-64 on a CentOS 5 machine.

Here are some musings from #glibc IRC where I reported that I had seen this bug.

<ryanarn> hrm.. that one's been about for a while.  I wonder why there's been no
movement on it.
<sjmunroe> the mutex is locked in free(), which calls _int_free(), so _int_free
does not know about the lock
<sjmunroe> to avoid this, _int_free would have to report the error back to free
so free could do the unlock before reporting the error
Comment 5 Petr Baudis 2010-06-01 01:50:06 UTC
I cannot reproduce this on a newer glibc, does this error still happen to you?
Comment 6 Nick Lamb 2010-06-01 15:19:11 UTC
Any conceivable testcase is worth nothing compared to the concise explanation of
the problem already recorded in this bug. Tweaks to the allocator could easily
change the exact circumstances and thus invalidate the testcase, while the bug
would still bite real applications. If the problem described wasn't fixed, the
bug still exists. So, did you fix the bug?

Please provide either a reference for the change where you believe you fixed the
bug, or else go fix the bug rather than wasting people's time by asking silly
questions.
Comment 7 Petr Baudis 2010-06-01 16:10:46 UTC
Why so aggressive?

Hmm, it is curious that the bug should still be there, but I cannot trigger it
no matter how hard I try so far (well, I didn't try that hard yet). The new
ATOMIC_FASTBIN mode shouldn't have the bug anymore, but it is still experimental.
Comment 8 Nick Lamb 2010-06-01 18:34:26 UTC
I'd say rather, conservative. For me this is a bug which caused a big clustered
production (revenue generating) system to grind to a halt, often at inopportune
moments. To prove the assertion "somehow this bug has magically gone away
without any action" I have to risk every member of technical staff being woken
in the middle of the night as the app deadlocks suddenly and monitoring systems
start sending alerts. To even prepare for such an escapade might be a week's work.

So I don't want to do that. If the bug is gone, there'll be a patch that fixed
it. I can justify testing such a patch. If there isn't a patch, the bug is just
hiding and this bug report should remain open.

I can confirm (if it's any help) that the testcase supplied by Jakub Bogusz
doesn't deadlock on a modern glibc. But there could be lots of reasons for that.
Comment 9 Matt Wilson 2010-09-17 04:48:10 UTC
Probably the same as: http://sourceware.org/bugzilla/show_bug.cgi?id=10282 (fixed)
Comment 10 Nick Lamb 2010-09-17 10:25:06 UTC
Possibly. It does seem as though there's been considerable churn on that code.
Here we no longer have a double free (we eventually managed to reproduce it on a
test system under valgrind and fixed it) with which to test, and as already
observed Jakub's original testcase no longer shows the problem in modern glibc.
So I guess closing this as a DUP doesn't hurt if that's what you'd like to do.
Comment 11 Ian Pilcher 2011-05-17 02:38:49 UTC
Created attachment 5729
gdb backtrace of deadlocked X server

I believe that I just hit this bug.  My X server locks hard every time I log
out -- apparently trying to print an error message about a corrupted double-linked list.

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:100
#1  0x000000351127ec11 in _L_lock_10461 () from /lib64/libc.so.6
#2  0x000000351127c9d7 in __libc_malloc (bytes=227924349408) at malloc.c:3657
#3  0x000000351127235d in __libc_message (do_abort=2, 
    fmt=0x351135d8c8 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:137
#4  0x000000351127896a in malloc_printerr (action=3, 
    str=0x351135a8d2 "corrupted double-linked list", ptr=<optimized out>) at malloc.c:6283
#5  0x0000003511278d80 in malloc_consolidate (av=0x35115991e0) at malloc.c:5169
#6  0x0000003511279669 in malloc_consolidate (av=0x35115991e0) at malloc.c:5115
#7  _int_free (av=0x35115991e0, p=<optimized out>, have_lock=0) at malloc.c:5034
#8  0x0000000000461094 in FreeOsBuffers (oc=0x1237b30) at io.c:1101
#9  0x000000000045f283 in CloseDownConnection (client=0x1237b70) at connection.c:1068
#10 0x000000000042e1c6 in CloseDownClient (client=0x1237b70) at dispatch.c:3432
#11 0x000000000042ec3a in Dispatch () at dispatch.c:441
#12 0x0000000000422e1a in main (argc=<optimized out>, argv=0x7fff630ba4b8, envp=<optimized out>)
    at main.c:287

Needless to say, this is really painful, since the only way to recover the
system is to ssh in and kill -9 the X server.  (A simple crash would be a
lot easier to deal with, since the X server is shutting down anyway.)  Is
there any way to suppress the printing of this message?

(This is on Fedora 15 with glibc-2.13.90-11.x86_64, BTW.)
Comment 12 Siddhesh Poyarekar 2012-04-16 13:48:38 UTC
Ian, what you're hitting should have been fixed with commit id f8a3b5bf. The X backtrace is waiting on a lock in a malloc within __libc_message. The commit I mentioned replaces these malloc calls with mmap() calls to avoid getting tangled up in the arena locks.
Comment 13 Ian Pilcher 2012-04-17 04:39:21 UTC
(In reply to comment #12)
> Ian, what you're hitting should have been fixed with commit id f8a3b5bf. The X
> backtrace is waiting on a lock in a malloc within __libc_message. The commit I
> mentioned replaces these malloc calls with mmap() calls to avoid getting
> tangled up in the arena locks.

Thanks Siddhesh!

Fortunately, I haven't seen this problem in quite a while.  (I'd like to think that the real problem in X has been fixed, but it seems more likely that my system has changed enough that the race condition that causes the memory corruption/detection isn't being triggered any more.  Oh well.)
Comment 14 Ondrej Bilka 2013-10-13 07:36:31 UTC
Fixed in the duplicate bug, as mentioned earlier in the thread.

*** This bug has been marked as a duplicate of bug 10282 ***