The fix for Bug 13882 ("Cycle detection in dynamic loader is broken ") fixed premature termination of the inner loop of _dl_sort_fini, making the function (closer to) correct, but it also changes this function's runtime from O(n^2) to O(n^3) where n is the number of items (resident DSOs) to be sorted. (The same is true for the corresponding init sorts in dl-open.c and dl-deps.c.) This can be readily seen by looking at Jeff Law's example in the description of bug 13882 (a linear chain of dependencies that _dl_sort_fini needs to completely reverse). In this case, each of O(n) objects gets moved O(n) times; furthermore the analysis leading to each such move (as well as the move itself) takes O(n) time. That's O(n)*O(n)*O(n) = O(n^3). Another easy way to get O(n^3) behavior is with cycles: any node that's part of a nontrivial cycle is guaranteed to keep getting moved repeatedly until its moved-too-many-times counter expires, which is O(n) times (for O(n) of the items anyway). So for example, if the dependency graph consists of mutually dependent pairs of DSOs: A<->B C<->D E<->F ... that will result in O(n^3) run time as well. We observed the O(n^3) behavior in real life, in our application that had 575 DSOs upon exit-- in RHEL5.3 (glibc 2.5), it took less than 1 second to exit; upon upgrading to RHEL6.3 (glibc 2.12), the same app took 15 seconds to exit. Instrumenting _dl_sort_fini (i.e. putting a counter in it and printing it at the end) revealed that the innermost loop body was entered more than 1.7 billion times, roughly confirming the O(n^3) claim in practice. This is just a topsort, which can be done simply in O(n) time with no fancy data structures.
Created attachment 6951 [details] Makefile that constructs the linear chain test case demonstrating the O(n^3) behavior. Attaching a Makefile that quickly constructs the linear dependency chain test case, demonstrating the O(n^3) behavior. See the comment at the top of this file for a description of exactly what it does and how to use it.
Don, I agree that the sorting could be made *far* faster. Thanks for submitting this. We were well aware that the minimal fix for bug 13882 would cause some kind of performance regression, but it was a balance between a minimal fix and low risk of breakage. I reviewed the patch for 13882 and even build a minimal framework for testing that dynamic loader function outside of the build. Do you have the time to investigate this and propose a patch (requires copyright assignment)?
(In reply to comment #2) > Don, > > I agree that the sorting could be made *far* faster. > > Thanks for submitting this. We were well aware that the minimal fix for bug > 13882 would cause some kind of performance regression, but it was a balance > between a minimal fix and low risk of breakage. I reviewed the patch for 13882 > and even build a minimal framework for testing that dynamic loader function > outside of the build. > > Do you have the time to investigate this and propose a patch (requires > copyright assignment)? I do. I am working on a patch that resolves both this and bug 15311, and I'll submit it here in a day or two. I am very interested in what you came up with in the way of a unit testing scheme for this function... I could certainly use it. I've found it frustrating that the existing tests run by "make check" (the ones I saw anyway) involve just creating/compiling/running a handful of real programs... to really stress test an implementation of _dl_sort_fini properly, I'd want to (at least) enumerate all possible graphs of up to 3 or 4 nodes, and call it on each of them, which would be millions of examples... and a few million randomly generated larger examples as well. It's *really* easy to get this stuff wrong otherwise. Also I'd like to start by moving the init sorting code into a function. It looks to me like this code is duplicated in two places (dl-open.c and dl-deps.c), and (after the fix for bug 15309) it's identical in both places except that one of them starts at i0=0 and the other starts at i0=1. So this could be expressed cleanly as a new function _dl_sort_init that takes i0 as a parameter. Should I start by submitting a patch that does that, with no functional change, and go from there? Or should I let you or someone else do this refactoring (possibly in conjunction with making these sorting functions unit testable)? Let me know how to proceed.
On Wed, Mar 27, 2013 at 07:47:46AM +0000, dhatch at ilm dot com wrote: > http://sourceware.org/bugzilla/show_bug.cgi?id=15310 > > This is just a topsort, which can be done simply in O(n) time > with no fancy data structures. > I do not have domain knowledge to understand what is necessary to consider. However I could write topsort if I knew dependencies. From code I think following pseudocode should work upto possibility that some arrows need be reversed. add_edge(g,x,y) // add edge from x to y topsort(g) // returns topologic ordering of g graph *g; for(k=0;k<nmaps;k++) { /* Add dependencies of the object. */ struct link_map **runp = maps[k]->l_initfini; if (runp != NULL) while (*runp != NULL) add_edge(g,maps[k],*runp); /* Add relocation dependencies. */ if (__builtin_expect (maps[k]->l_reldeps != NULL, 0)) { unsigned int m = maps[k]->l_reldeps->act; struct link_map **relmaps = &maps[k]->l_reldeps->list[0]; for (j=0;j<m;j++) add_edge(g,relmaps[k],*runp) } } memcpy(maps,topsort(g));
(In reply to comment #3) > (In reply to comment #2) > > Don, > > > > I agree that the sorting could be made *far* faster. > > > > Thanks for submitting this. We were well aware that the minimal fix for bug > > 13882 would cause some kind of performance regression, but it was a balance > > between a minimal fix and low risk of breakage. I reviewed the patch for 13882 > > and even build a minimal framework for testing that dynamic loader function > > outside of the build. > > > > Do you have the time to investigate this and propose a patch (requires > > copyright assignment)? > > I do. I am working on a patch that resolves both this and bug 15311, > and I'll submit it here in a day or two. I look forward to the patch. > I am very interested in what you came up with in the way of a unit > testing scheme for this function... I could certainly use it. I aggrandized a bit here. I copied the relevant sorting code out of the loader code, wrapped it up in a function, created some static arrays to simulate DSOs loaded in a certain order, and then ran the sorting function against the the simulated DSO list. The best description is that I ran a simulation. I looked for the code I used to test bug 13882, but I've lost it (changed employers). You can see my comments here: http://sourceware.org/ml/libc-alpha/2012-06/msg00560.html > I've found it frustrating that the existing tests run by "make check" > (the ones I saw anyway) > involve just creating/compiling/running a handful of real programs... > to really stress test an implementation of _dl_sort_fini properly, > I'd want to (at least) enumerate all possible graphs > of up to 3 or 4 nodes, and call it on each of them, > which would be millions of examples... > and a few million randomly generated larger examples as well. > It's *really* easy to get this stuff wrong otherwise. I fully agree. As a volunteer project we live and die by the companies and individuals that choose to contribute to the project. I would be more than happy to see all possible 3 or 4 node graphs tested. The random testing is more problematic as you are probably well aware of; you can still auto-generate millions of test cases just make it deterministic :-) > Also I'd like to start by moving the init sorting code into a function. > It looks to me like this code is duplicated in two places (dl-open.c > and dl-deps.c), and (after the fix for bug 15309) > it's identical in both places except that > one of them starts at i0=0 and the other starts at i0=1. > So this could be expressed cleanly as a new function _dl_sort_init that takes > i0 > as a parameter. That sounds like a great idea. > Should I start by submitting a patch that does that, > with no functional change, and go from there? Or should I let you > or someone else do this refactoring (possibly in conjunction > with making these sorting functions unit testable)? > Let me know how to proceed. Always break the work into as small a piece as conceivably possible. Doing just the refactoring is a great first step. Once we have that in place we can talk about next steps. The Contribution Checklist for this project is rather long, but we are a conservative project and it helps to have everything documented and well specified. You can see the checklist here: http://sourceware.org/glibc/wiki/Contribution%20checklist
Don, Do you have copyright papers in place with the FSF for glibc?
Carlos, This is the tester QE uses for the correctness of the sort: Basically it takes as an input the number of DSOs you want to test. Then builds a random orderings and a pathological ordering of those DSOs on the link line (the script knows the dependencies between DSOs). deps.cpp: #include <stdio.h> #ifdef MAIN int main( int argc, char *argv[] ) { printf("main\n"); } #else static bool deps_init( void ) { puts( NAME ); } bool x = deps_init(); #endif driver: #!/bin/bash # based on reproducer from bug 11724 N=${1:-2} TmpDir=$(mktemp -d) cp deps.cpp $TmpDir pushd $TmpDir > /dev/null 2>&1 for((all=0,k=2; k<=$N; ++all,++k)); do cmd="g++ -fPIC -shared -o libdeps${k}.so -DNAME=\\\"${k}\\\" deps.cpp" printf "# %s\n" "$cmd" > ${all}r.log eval "$cmd" for ((i=k-1; i>=1; --i)); do cmd="g++ -L. -fPIC -shared -o libdeps${i}.so -ldeps$((i+1)) -DNAME=\\\"${i}\\\" deps.cpp" printf "# %s\n" "$cmd" >> ${all}r.log eval "$cmd" done; unset i # random link order # it sucks that rhel5 coreutils don't have shuf or sort -R yet :) roptions=($(seq 1 $k | sed 's/^/ -ldeps/')) for ((i=(${#roptions[@]}-1); i >= 1; --i)); do random=$((RANDOM % i)) m=${roptions[$random]} roptions[$random]=${roptions[$i]} roptions[$i]=$m done; unset i rcmd="g++ -Wl,-rpath -Wl,$(pwd) -L. -o rmain ${roptions[@]} -DMAIN deps.cpp" # pathological link order poptions=($(seq $k -1 1 | sed 's/^/ -ldeps/')) pcmd="g++ -Wl,-rpath -Wl,$(pwd) -L. -o pmain ${poptions[@]} -DMAIN deps.cpp" cp ${all}r.log ${all}p.log for i in r p; do cmd="${i}cmd" printf "# %s\n\n" "${!cmd}" >> ${all}${i}.log eval "${!cmd}" fail=0 cp ${all}${i}.log ${i}out.golden ./${i}main >> ${all}${i}.log seq $k -1 1 >> ${i}out.golden echo main >> ${i}out.golden diff ${all}${i}.log ${i}out.golden > /dev/null 2>&1 || fail=1 # it's ok, we don't care if [ 0 -eq $fail ]; then rm ${all}${i}.log fi done; unset i rm -rf *.so *main *out.golden *out.real done; unset k rm deps.cpp test 0 -eq $(ls | wc -l) && { rm -rf $(pwd); echo PASS; } || { echo FAIL; printf "results are in %s\n" $(pwd); } popd > /dev/null 2>&1
(In reply to comment #6) > Do you have copyright papers in place with the FSF for glibc? Working on it with my legal department. My employer (ILM) is supportive so I don't anticipate a problem.
(In reply to comment #4) > On Wed, Mar 27, 2013 at 07:47:46AM +0000, dhatch at ilm dot com wrote: > > http://sourceware.org/bugzilla/show_bug.cgi?id=15310 > > > > This is just a topsort, which can be done simply in O(n) time > > with no fancy data structures. > > > I do not have domain knowledge to understand what is necessary to > consider. > > However I could write topsort if I knew dependencies. From code I think > following pseudocode should work upto possibility that some arrows need > be reversed. > > add_edge(g,x,y) // add edge from x to y > topsort(g) // returns topologic ordering of g > > graph *g; > for(k=0;k<nmaps;k++) > { > /* Add dependencies of the object. */ > struct link_map **runp = maps[k]->l_initfini; > if (runp != NULL) > while (*runp != NULL) > add_edge(g,maps[k],*runp); > > /* Add relocation dependencies. */ > if (__builtin_expect (maps[k]->l_reldeps != NULL, 0)) > { > unsigned int m = maps[k]->l_reldeps->act; > struct link_map **relmaps = &maps[k]->l_reldeps->list[0]; > for (j=0;j<m;j++) > add_edge(g,relmaps[k],*runp) > } > } > > memcpy(maps,topsort(g)); That looks essentially right, although there are a couple of nuances in the existing code. - under the comment "Do not handle ld.so in secondary namespace" (whatever that means), it ignores a dependency "B depends on A" if A!=A->l_real (but doesn't check whether B!=B->l_real). I don't really know what this means, or whether this assymmetry is intentional... QUESTION: can I just omit all edges involving such items altogether, not caring where these items come out in the output order? - it also ignores "B depends on A" if A->l_idx == -1 (i.e. IDX_STILL_USED)-- this one I believe I understand-- it definitely doesn't matter where such items come out in the final sort order. - static dependendies should override dynamic ones if there's a conflict. I'll do two passes in order to get this right, and more well-behaved than the existing code-- see bug 15311 for details. I hadn't thought about creating the graph structure explicitly as you suggest... arguably that would result in cleaner topsort code, but I was just going to work directly with the data structure that's already there instead. Note that an explicit auxiliary graph representation would have size O(N+E) (number of nodes plus number of edges) whereas the size of needed auxiliary structures is currently only O(N) (making it feasible to allocate them as local variables on the runtime stack) and I wasn't planning on changing that. The topsort algorithm will be essentially Tarjan's SCC algorithm: http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm (actually two passes of it). This is a bit more involved than a simple reverse postorder, but keeping SCCs contiguous is desireable (for one thing, the existing code does it-- furthermore, it provides some nice well-behavedness between the two passes that I wouldn't be able to prove otherwise). Tarjan's algorithm is preferable to Kosaraju's since it doesn't require explicitly creating the reverse graph. One other issue/question-- I was assuming a recursive implementation would be out of the question (due to increased runtime stack requirement); but I'm finding that an iterative implementation of Tarjan's SCC algorithm is... well, really unpleasant. I can do it, but it's no fun, and it won't be any fun to review or maintain it either. A recursive implementation would be MUCH cleaner, simpler, more readable and maintainable. Any thoughts/feelings on whether it's okay to make it recursive?
On Thu, Mar 28, 2013 at 12:31:40AM +0000, dhatch at ilm dot com wrote: > I hadn't thought about creating the graph structure explicitly as you > suggest... > arguably that would result in cleaner topsort code, but > I was just going to work directly with the data structure that's > already there instead. > Note that an explicit auxiliary graph representation would have size O(N+E) > (number of nodes plus number of edges) > whereas the size of needed auxiliary structures is currently only O(N) > (making it feasible to allocate them as local variables on the runtime stack) > and I wasn't planning on changing that. > I mainly wanted do it to separate alg from domain specific bits. Next one where I do not know if its necesarry or just optimization is /* We can skip looking for the binary itself which is at the front of the search list for the main namespace. */
Created attachment 6955 [details] Proposed initial patch, rearranging init/fini sorting in prep for overhaul. diffs from master at 7a86be6 Here's the proposed initial patch. Commit message would be something like: ========================================================================== Refactor ld.so init/fini sorting code in preparation for overhaul. - Init sorting code was in two places: dl-open.c and dl-deps.c; it was identical in both places except for the starting index which was i0=0 in dl-open.c and i0=1 in dl-deps.c. Moved this common code out into a new function _dl_sort_init that takes i0 as a parameter. - Moved _dl_sort_init and _dl_sort_fini out into two new files dl-sort-init.c and dl-sort-fini.c. This allows easy diffing between the two (which can be interesting) and may also help facilitate future efforts to unit test these two functions. No functional change here. ========================================================================== I don't have the copyright agreement in place yet, but it seems to me that I'm not introducing any copyrightable material in this initial patch-- I'm just moving existing code around. (okay I did add a comment at the top of each function). Let me know if I'm doing this right...
Actually I think the fix for bug 15309 (dl_open_worker doesn't fully initialize seen array during init sort) needs to go in first... it's a 1-liner and is easily patchable to various downstream code bases, but that won't be true if it ends up on top of my code rearrangement. Also I can't truthfully say "No functional change here" unless the fix for 15309 goes in first, since I'm combining code from two functions which aren't exactly the same yet... they will be the same, after the fix for 15309 is in place.
Actually, I'm pretty sure the problematical sort shows up in 3 places: dl_map_object_deps dl_sort_fini dl_open_worker
(In reply to comment #13) > Actually, I'm pretty sure the problematical sort shows up in 3 places: > > dl_map_object_deps > dl_sort_fini > dl_open_worker that's correct. my initial proposed patch combines two of them into a single function _dl_sort_init.
Progress on unit/stress test... I got something working, in such a way that it can go cleanly into the build (assuming the init sort has been separated out into a function, as in my "Proposed initial patch" attachment). I'll submit the test as soon as I get it polished (and I get legally cleared). It tests _dl_sort_init on all 17854749 graphs of up to 4 nodes in 1min8sec, and _dl_sort_fini on all 16777846 pairs of graphs (static dep graph and dynamic dep graph) of up to 3 nodes in 48sec (on an Intel Xeon L5630 @ 2.13GHz which is a pretty fast machine I guess). That's probably a bit long for a confidence test run during "make check", so for that, I'd probably do something less, augmented by some randomized testing (with deterministic seed of course). As Carlos pointed out in the referenced email thread, this test would have prevented the last several generations of bugs in these sorting routines, and it should help ensure that the impending overhaul properly fixes the bugs it intends to fix, without introducing new ones, and that no similar bugs appear in the future. In addition to lots of expected failures of _dl_sort_fini due to bug 15311, the testing of _dl_sort_init on graphs up to 4 nodes revealed that the existing algorithm doesn't keep SCCs contiguous, which was a surprise to me. I just opened bug 15329 on that. (Not sure whether that's technically a bug... but it's certainly not a nice behavior, and we can do better.) Is it okay to add a test that currently fails due to bugs that we intend to fix very soon?
On Tue, Apr 02, 2013 at 09:54:17AM +0000, dhatch at ilm dot com wrote: > http://sourceware.org/bugzilla/show_bug.cgi?id=15310 > > --- Comment #15 from Don Hatch <dhatch at ilm dot com> 2013-04-02 09:54:17 UTC --- > Progress on unit/stress test... > I got something working, in such a way that it can go cleanly into the build > (assuming the init sort has been separated out into a function, > as in my "Proposed initial patch" attachment). > I'll submit the test as soon as I get it polished (and I get legally cleared). > > It tests _dl_sort_init on all 17854749 graphs of up to 4 nodes in 1min8sec, > and _dl_sort_fini on all 16777846 pairs of graphs > (static dep graph and dynamic dep graph) of up to 3 nodes in 48sec How did you get these numbers? Directed graph on 4 nodes has 12 arcs. So there only 2^{12} = 4096 directed graphs on 4 nodes. What extra data do you need to generate? > (on an Intel Xeon L5630 @ 2.13GHz which is a pretty fast machine I guess). > That's probably a bit long for a confidence test run during "make check", > so for that, I'd probably do something less, > augmented by some randomized testing (with deterministic seed of course). > First optimization would be eliminate isomorphic graphs. I could assist if I got code.
(In reply to comment #16) > > > > It tests _dl_sort_init on all 17854749 graphs of up to 4 nodes in 1min8sec, > > and _dl_sort_fini on all 16777846 pairs of graphs > > (static dep graph and dynamic dep graph) of up to 3 nodes in 48sec > > How did you get these numbers? Directed graph on 4 nodes has 12 arcs. So > there only 2^{12} = 4096 directed graphs on 4 nodes. What extra data do you > need to generate? I was considering the complete graph on 4 nodes to have 16 arcs, not 12, so there are 2^16 = 65536 possible graphs (though that might be a worthwhile corner to cut... once we convince ourselves that an edge from a node to itself doesn't affect the algorithm, then yes, we could just consider 4096 of them)... and all permutations of edges lists have to be considered different, so I was taking the sum, over all 65536 graphs, of the product of the factorials of the node arities. (plus similar sums for 0,1,2,3 nodes, though they are small in comparison) > > > (on an Intel Xeon L5630 @ 2.13GHz which is a pretty fast machine I guess). > > That's probably a bit long for a confidence test run during "make check", > > so for that, I'd probably do something less, > > augmented by some randomized testing (with deterministic seed of course). > > > First optimization would be eliminate isomorphic graphs. I could assist > if I got code. I don't think we can eliminate isomorphic graphs. Whether a bug appears or not often depends on the data order.
Created attachment 6965 [details] Proposed initial patch, rearranging init/fini sorting in prep for overhaul. diffs from master at 60c414c Tweaked the proposed initial refactoring patch a bit to make the calling convention cleaner: consistently let the sort functions handle too-small inputs in whatever way they need to, rather than haphazardly having the caller do it sometimes. This makes for a cleaner contract so callers and unit test can be simpler and less error prone. (found this when unit testing _dl_sort_init on a list of length 0, and it crashed)
> --> http://sourceware.org/bugzilla/attachment.cgi?id=6965 > Proposed initial patch, rearranging init/fini sorting in prep for overhaul. > diffs from master at 60c414c > Please post that on libc-alpha as [RFC] to get comments.
Ping; did anything ever happen with this?
(In reply to Brooks Moses from comment #20) > Ping; did anything ever happen with this? No, but it would be awesome if someone moved this forward. Unfortunately Don doesn't have a copyright assignment as far as I can tell, so please don't look at the code in this patch if you plan to implement it. Instead I suggest we look at bug 17645: https://sourceware.org/bugzilla/show_bug.cgi?id=17645 Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=1162810 libc-alpha post: https://sourceware.org/ml/libc-alpha/2014-11/msg00702.html And take a serious look at Paulo's work. It still needs someone to write a test that actually does some decent coverage over the permutation of DT_NEEDED trees.
(In reply to Carlos O'Donell from comment #21) > And take a serious look at Paulo's work. In full disclosure, Paulo works at Red Hat.
Did this get accepted and committed for a given release of glibc? Or is it still waiting for someone to write a test that does the coverage over the permutation of DT_NEEDED trees?
(In reply to Patrick Stahr from comment #23) > Did this get accepted and committed for a given release of glibc? Or is it > still waiting for someone to write a test that does the coverage over the > permutation of DT_NEEDED trees? This is not fixed in any version of glibc. Yes, we still need better coverage testing of the permutation of DT_NEEDED trees to ensure the loader is doing the right thing. Each tree needs to verify it was loaded in the correct order.
(In reply to Don Hatch from comment #3) > I'll submit it here in a day or two. Hello from 2019. We're still suffering from O(n^3). Don, unfortunately you haven't published your code anywhere (even if it still lacks an exhaustive set of tests). For those brave enough out there who would like to try solve this, here's a PDF named "A Space-Efficient Algorithm for Finding Strongly Connected Components" with non-recursive Tarjan SCC implementation: http://homepages.ecs.vuw.ac.nz/~djp/files/IPL15-preprint.pdf "Proposed initial patch" that is attached here is no longer needed because it only moves code around that was done in commit c2c299fd24e87b83c63191ff979d41a86b37d714 Author: Andreas Schwab <schwab@suse.de> Date: Tue Nov 7 15:24:19 2017 +0100 Consolidate link map sorting Combine the four places where link maps are sorted into a single function. This also moves the logic to skip the first map (representing the main binary) to the callers. So, there's currently only a single sorting function in elf/dl-sort-maps.c BTW, has anyone tried looking at how other libcs do this sorting? Maybe there's no need to reinvent the wheel?
Created attachment 11908 [details] Add a linear time function for sorting maps. Hi I have prepared a patch based on my understanding of the discussion above. I am uploading it to seek some feedback and see if there is appetite for this change. The patch implements a function to topologically sort the link_maps. The current implementation requires time that is linear in the number of link_maps and the dependencies amongst them. We do need to allocate more memory than the prior work, though the allocated memory is again linear function of the size of the input. I have not compiled this patch (and therefore not tested). Once I receive some feedback, I would work further on it. Thanks!
The testcase which creates a linear chain test to demostrate the O(n^3) behavior shows an improvement with the new DSO sorting algorithm from 15a0c5730d1d5ae: $ time ./testrun.sh /tmp/test/main1000 2>&1 >/dev/null real 0m1.641s user 0m1.565s sys 0m0.076s $ time GLIBC_TUNABLES=glibc.rtld.dynamic_sort=2 ./testrun.sh /tmp/test/main1000 2>&1 >/dev/null real 0m0.316s user 0m0.219s sys 0m0.095s And profile shows similar improvement: $ perf record ./testrun.sh /tmp/test/main1000 [...] $ perf report --stdio -q [...] 78.28% ld-linux-x86-64 ld.so [.] _dl_sort_maps 6.77% ld-linux-x86-64 ld.so [.] do_lookup_x 5.03% ld-linux-x86-64 ld.so [.] memmove 2.64% ld-linux-x86-64 ld.so [.] strcmp $ perf record env GLIBC_TUNABLES=glibc.rtld.dynamic_sort=2 ./testrun.sh /tmp/test/main1000 [...] $ perf report --stdio -q --dso=ld.so [...] 41.44% ld-linux-x86-64 [.] do_lookup_x 13.76% ld-linux-x86-64 [.] strcmp 6.01% ld-linux-x86-64 [.] _dl_map_object 5.17% ld-linux-x86-64 [.] _dl_name_match_p 1.34% ld-linux-x86-64 [.] _dl_map_object_from_fd 0.69% ld-linux-x86-64 [.] _dl_add_to_namespace_list