dlopen calls add_to_global (from dl_open_worker) after running ELF constructors for the new objects. At this point, recovery from malloc failure is quite complicated: We need to run the ELF destructors and close all opened objects. The current implementation does none of this and leaves the process in an inconsistent state. The proper fix to deal with this is to allocate the required memory before running ELF constructors.
Patch posted: https://sourceware.org/ml/libc-alpha/2019-10/msg00534.html
The problem is more general. This loop in dl_open_worker adds new maps to the scope of maps that already exist when dlopen was called: /* If the file is not loaded now as a dependency, add the search list of the newly loaded object to the scope. */ bool any_tls = false; unsigned int first_static_tls = new->l_searchlist.r_nlist; for (unsigned int i = 0; i < new->l_searchlist.r_nlist; ++i) { struct link_map *imap = new->l_searchlist.r_list[i]; int from_scope = 0; /* If the initializer has been called already, the object has not been loaded here and now. */ if (imap->l_init_called && imap->l_type == lt_loaded) … /* First terminate the extended list. Otherwise a thread might use the new last element and then use the garbage at offset IDX+1. */ imap->l_scope[cnt + 1] = NULL; atomic_write_barrier (); imap->l_scope[cnt] = &new->l_searchlist; A concurrent dlsym call or lazy binding operation therefore has access to the new map at this point. But subsequent malloc calls can still fail, resulting in a corrupted overall state. This is simlar to the originally reported add_to_global issue, but it involves different allocations. We need to make objects available in this way before calling ELF constructors because they may use dlsym or lazy binding (including on other threads). That part is unavoidable. But after bug 24304, recoverable dlopen failure is no longer possible at this point, so we will eventually reach a consistent process state, once the constructors have run.
The master branch has been updated by Florian Weimer <fw@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=440b7f8653e4ed8f6e1425145208050b795e9a6c commit 440b7f8653e4ed8f6e1425145208050b795e9a6c Author: Florian Weimer <fweimer@redhat.com> Date: Thu Oct 31 18:25:39 2019 +0100 Avoid late failure in dlopen in global scope update [BZ #25112] The call to add_to_global in dl_open_worker happens after running ELF constructors for new objects. At this point, proper recovery from malloc failure would be quite complicated: We would have to run the ELF destructors and close all opened objects, something that we currently do not do. Instead, this change splits add_to_global into two phases, add_to_global_resize (which can raise an exception, called before ELF constructors run), and add_to_global_update (which cannot, called after ELF constructors). A complication arises due to recursive dlopen: After the inner dlopen consumes some space, the pre-allocation in the outer dlopen may no longer be sufficient. A new member in the namespace structure, _ns_global_scope_pending_adds keeps track of the maximum number of objects that need to be added to the global scope. This enables the inner add_to_global_resize call to take into account the needs of an outer dlopen. Most code in the dynamic linker assumes that the number of global scope entries fits into an unsigned int (matching the r_nlist member of struct r_scop_elem). Therefore, change the type of _ns_global_scope_alloc to unsigned int (from size_t), and add overflow checks. Change-Id: Ie08e2f318510d5a6a4bcb1c315f46791b5b77524
The master branch has been updated by Florian Weimer <fw@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a509eb117fac1d764b15eba64993f4bdb63d7f3c commit a509eb117fac1d764b15eba64993f4bdb63d7f3c Author: Florian Weimer <fweimer@redhat.com> Date: Wed Nov 27 16:37:17 2019 +0100 Avoid late dlopen failure due to scope, TLS slotinfo updates [BZ #25112] This change splits the scope and TLS slotinfo updates in dlopen into two parts: one to resize the data structures, and one to actually apply the update. The call to add_to_global_resize in dl_open_worker is moved before the demarcation point at which no further memory allocations are allowed. _dl_add_to_slotinfo is adjusted to make the list update optional. There is some optimization possibility here because we could grow the slotinfo list of arrays in a single call, one the largest TLS modid is known. This commit does not fix the fatal meory allocation failure in _dl_update_slotinfo. Ideally, this error during dlopen should be recoverable. The update order of scopes and TLS data structures is retained, although it appears to be more correct to fully initialize TLS first, and then expose symbols in the newly loaded objects via the scope update. Tested on x86_64-linux-gnu. Change-Id: I240c58387dabda3ca1bcab48b02115175fa83d6c
Fixed for glibc 2.31.
*** Bug 14577 has been marked as a duplicate of this bug. ***