Summary: | race condition: pthread_atfork() called before first malloc() results in unexpected locking behaviour/deadlocks | ||
---|---|---|---|
Product: | glibc | Reporter: | prumpf |
Component: | malloc | Assignee: | Florian Weimer <fweimer> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | fweimer, prumpf |
Priority: | P2 | Flags: | fweimer:
security-
|
Version: | 2.19 | ||
Target Milestone: | 2.24 | ||
See Also: | https://sourceware.org/bugzilla/show_bug.cgi?id=19431 | ||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Attachments: |
Test case
proposed patch |
Description
prumpf
2014-03-22 18:35:17 UTC
Created attachment 7633 [details]
proposed patch
I've attached a patch that would fix this issue (and a closely-related one described at https://sourceware.org/ml/libc-alpha/2013-01/msg01051.html), but which I'm not entirely satisfied is the right thing to do yet (it also contains a debugging XXX). Still, it might serve as a starting point for fixing this. The draft message I'd prepared for it follows: ------------ This is a bug report and proposed patch for the git version of glibc; it is specific to NPTL. There appear to be at least two deadlock problems caused by the order in which atfork handlers, in particular ptmalloc_lock_all, are run. https://sourceware.org/bugzilla/show_bug.cgi?id=16742 means ptmalloc_lock_all must be run after other atfork handlers, even if registered after them. https://sourceware.org/ml/libc-alpha/2013-01/msg01051.html is a bit trickier: it demonstrates that lock ordering should be "lock IO list before taking malloc locks", which means ptmalloc_lock_all has to be run with the IO list lock already held. Other solutions would work as well, but I've focussed on that one since it puts the least constraints on libio. This bug can be reproduced by running a test program with the right breakpoints set in gdb. I've prepared a patch that I believe fixes the problem, but I'm not overly attached to doing it this way. This patch fixes both deadlocks and introduces the concept of an atfork handler's priority—a high-priority handler is run inside of a low-priority one; it creates a new atfork handler for the IO lock, and gives it priority intermediate between low-priority user handlers and the high-priority malloc atfork handler; it uses a high priority for malloc's atfork handler; it adjusts register-atfork.c and unregister-atfork.c so they keep the __fork_handlers list sorted by priority. The complication introduced by this patch is unfortunate, but I believe it is the best approach. However, an alternative would be: don't introduce a priority level; put _IO_list_lock into an atfork handler that's installed first; make thread_atfork install malloc's atfork handler at the end of the list, no matter what. (I believe that approach would leave us with very fragile code and without the ability to put more of our MT-specific fork code into atfork handlers.) I believe there are no new restrictions on libio code, but there is a (new?) restriction on malloc: code must not hold malloc locks while waiting for I/O to finish, unless it gets the I/O lock first. In particular, this applies to diagnostic output and malloc.c's assert(), which might deadlock rather than printing a message. In addition to the alternative mentioned above, I've also considered another approach: pretend to take both the I/O and the malloc lock or neither: first lock A, then try to lock B; in the failure case, unlock A, lock B, then try to lock A, and so on. However, that's complicated, doesn't extend well to the case of several high-priority atfork handlers, and might result in a livelock. I do not suggest making the priority level accessible from user space: it's easy enough for glibc users to implement their own priority system or call pthread_atfork() in the right order. Can you retest with current master? I believe this bug is fixed there, but in a completely different fashion (the fork handler mechanism is not used by malloc anymore). Excellent, that's a much more elegant fix. I can confirm the test case now works as expected (commit 186fe877f3df0b84d57dfbf0386f6332c6aa69bc), so I consider this bug fixed. Thank you! |