Created attachment 7970 [details] Sample "redistributable" test case provided by a customer After glibc-2.11, there are 3 slightly different instances of a very CPU intensive loop in elf/dl-deps.c, elf/dl-fini.c and elf/dl-open.c. The loop does not scale well when there are cyclic dependencies, and may take more than 8 seconds to sort less than 300 shared objects when there a some cycles. Most of the time is spent in memmove. Sample perf output of it taking 8+ secs on a core i5: 31,24% timedlopen ld-2.17.so [.] _dl_map_object_deps 18,36% timedlopen ld-2.17.so [.] _wordcopy_fwd_aligned 16,50% timedlopen ld-2.17.so [.] _dl_sort_fini 15,47% timedlopen ld-2.17.so [.] dl_open_worker 13,08% timedlopen ld-2.17.so [.] _wordcopy_fwd_dest_aligned 4,95% timedlopen ld-2.17.so [.] memmove 0,07% timedlopen ld-2.17.so [.] do_lookup_x I have been using a patch for some weeks in rhel-6.3 (for basic tests where the problem was detected/reported), and my day to day computers, running rhel-7 and fedora rawhide, and cannot see any problems running kde, gnome, libreoffice and several other large applications or environments that load/unload or have large dso dependency chains. The proposed patch runs the test case in 10 ms, and a sample perf output is: 17,45% timedlopen ld-2.17.so [.] do_lookup_x 9,54% timedlopen ld-2.17.so [.] strcmp 6,91% timedlopen ld-2.17.so [.] _dl_map_object_deps 6,12% timedlopen ld-2.17.so [.] _dl_name_match_p 4,35% timedlopen ld-2.17.so [.] _dl_sort_fini 4,33% timedlopen ld-2.17.so [.] dl_open_worker

Created attachment 7972 [details] Proposed patch This patch implements a simple stable topological sort that moves an entry at most once, by keeping track of, and resetting the weight of the entries after every move. It breaks loops by choosing the ones that appear first with the lowest weight, that is, less dependencies. If there is a loop, the "c" variable in the "sort" function will be larger than 0.

The change to elf/Makefile is to make the test case have only one valid result. The expected result is: start_a1 start_a2 start_b1 start_b2 start_a3 start_a4 main finish_a4 finish_a3 finish_b2 finish_b1 finish_a2 finish_a1 but "a1" and "b1" do not have any dependencies, and without the elf/Makefile patch it would result in: start_a1 start_b1 start_a2 start_b2 start_a3 start_a4 main finish_a4 finish_a3 finish_b2 finish_b1 finish_a2 finish_a1 The "input" of that test case, wihtout the patch is: [0] = "" () [1] = a4 (a4 a3 libc.so) [2] = a1 (a1 libc.so) [3] = b2 (b2 b1 a2 libc.so) [4] = libc.so (libc.so ld.so) [5] = a3 (a3 b2 b1 libc.so) [6] = b1 (b1 libc.so) [7] = a2 (a2 a1 libc.so) [8] = ld.so () where items in () are the dependencies of an object, and [] is the input order.

Created attachment 7973 [details] simple sh based helper script to run the test case Simple script to make it easier to run the test case. Just unpack the test case and run sourceme.sh to set environment variables and have it to echo how to run the tests. This allows keeping unchanged the test case (not done by me).

(In reply to Paulo Andrade from comment #1) > Created attachment 7972 [details] > Proposed patch > > This patch implements a simple stable topological sort that moves > an entry at most once, by keeping track of, and resetting the weight > of the entries after every move. It breaks loops by choosing the > ones that appear first with the lowest weight, that is, less > dependencies. > If there is a loop, the "c" variable in the "sort" function will > be larger than 0. At the high-level your patch is almost ready. You refactor the sorting code out into a static function that the compiler can inline and you use the same pattern in all places where sorting happens. This is good. You are missing two important things: * More code comments. You should comment each sort routine with a high-level comment about what you're doing and why. This is particularly important for the weight functions. * Tests. The existing set of tests are insufficient to cover the changes you are proposing. While it seems unfair, and I understand that, this kind of change has not been undertaken because of the lack of testing. You need to add more tests. The *best* testing is actually a test which deterministically generates a sequence of permutations of linked objects and verifies they are sorted correctly by the algorithms in question. This way we can show that the sorting works for a large set of object dependency orderings. When we get a bug report, and we will, we can enlarge the set of automated testing to include the case the bug report claims and see what went wrong. If you can cover those two things, then you're ready to post the patch to libc-alpha for more formal review. This looks really good though and you're making great progress in a difficult part of the dynamic loader.

(In reply to Carlos O'Donell from comment #4) > (In reply to Paulo Andrade from comment #1) > > Created attachment 7972 [details] > > Proposed patch > > > > This patch implements a simple stable topological sort that moves > > an entry at most once, by keeping track of, and resetting the weight > > of the entries after every move. It breaks loops by choosing the > > ones that appear first with the lowest weight, that is, less > > dependencies. > > If there is a loop, the "c" variable in the "sort" function will > > be larger than 0. > > At the high-level your patch is almost ready. > > You refactor the sorting code out into a static function that the compiler > can inline and you use the same pattern in all places where sorting happens. > This is good. The code is small enough that this way (static functions) should be better. But it also cannot be really shared, without extra parameters, as there are 3 slightly different variants of the loop. > You are missing two important things: > > * More code comments. You should comment each sort routine with a high-level > comment about what you're doing and why. This is particularly important for > the weight functions. Ok. I will work on that. A quick explanation is somewhat like this: """ The algorithm implements a simple stable topological sort. The algorithm is expected to be fast and break cycles in a reasonable way. If there are no cycles, it does a plain topological sort, that does not reorder the input, that is, it is stable. Every link_map instance has a l_idx field, that in special cases, currently only when a DSO is being unloaded, can be less than zero. If the l_idx is not less than zero, and not in the dl_sort_fini call that already initializes the l_idx entries, the sort function initializes the l_idx field of every link_map pointer to an unique identifier. The sort is done in-place, by decrementing the insert position (the "k" variable) in the maps vector. Sorting is done by swapping an entry with the current insert position, and decrementing the insert position. The "sort" function allocates in the stack the "seen" vector, that works as a boolean flag for every link_map, to know if it has been already moved. Moving, actually swapping, is done at most once, and not done if it is already in the proper position. The "sort" function also allocates in stack the "w" vector, that is a counter of how many, not self and not already "seen" direct references exist to a link_map entry. The "weight" function resets the "w" entries at every call. Note that other than the first call, every call reduces the number of entries by one, as the last moved link_map pointer does not need to be verified again. In the "weight" function, once a link_map entry l_initfini vector does not find any dependency, either the entry is a self dependency, the object is being unloaded or the dependency has been moved (in which case a cyclic dependency was just broken), the "w" entry for it will be 0. Note that the "w" vector is indexed by position in the input/output maps vector, while the "seen" vector is indexed by the "l_idx" field of the "struct link_map" representing the DSO. Every loop, after calling the "weight" function, the "sort" searches all "w" entries for the first entry matching the "c" variable. The "c" variable is initialized as zero, and reset to zero after an entry is moved. If it scans all "w" entries and none is equal to "c", it increments "c"; if "c" is larger than zero it means there is a loop, in which case it chooses the first one with the lowest weight. Once an entry is moved, it is marked as "seen", and if it was part of a cycle, the cycle is now broken, as it will no longer influence its direct dependencies. Recognized possible problems: o The algorithm is not expected to scale well beyond a few thousand entries, due to the need to reinitialize the "w" vector. o If there are multiple "solutions", it will frequently not match sorting done by the previous algorithm. But, it guarantees an stable sorting, that is, does not change input order. o It does not attempt to be smart when finding loops, a better solution probably would be to break the smallest loop first, when there are multiple entries with the same weight, instead of choosing the first entry. """ > * Tests. The existing set of tests are insufficient to cover the changes you > are proposing. While it seems unfair, and I understand that, this kind of > change has not been undertaken because of the lack of testing. You need to > add more tests. The *best* testing is actually a test which > deterministically generates a sequence of permutations of linked objects and > verifies they are sorted correctly by the algorithms in question. This way > we can show that the sorting works for a large set of object dependency > orderings. When we get a bug report, and we will, we can enlarge the set of > automated testing to include the case the bug report claims and see what > went wrong. I will work on test cases. So far my tests were glibc "make check" and running rawhide and rhel-7 with the patch. I am particularly interested in knowing if there is a test case for this commit: https://sourceware.org/git/?p=glibc.git;a=blame;f=elf/dl-fini.c;h=c35577565eb66ac241365c05efe7c7fb791e0df2#l102 line 102, because it is another special variant of the loop, and what I did to address it is to loop over the entries, and increase the "w" vector of the dependencies, to ensure the dependencies are not moved before the entry, but on very complex cycles, and if having conflicting rules, it cannot guarantee it will really " preserve a link time dependency". But I suspect the same is true for the current algorithm; worse if the link time dependencies generate conflicting dependencies in cycles. > If you can cover those two things, then you're ready to post the patch to > libc-alpha for more formal review. Ok. I followed wiki instructions and sent the first version to libc-alpha, but will now wait for a more concrete situation based on bugzilla. > This looks really good though and you're making great progress in a > difficult part of the dynamic loader.

> > If you can cover those two things, then you're ready to post the patch to > > libc-alpha for more formal review. > > Ok. I followed wiki instructions and sent the first version to > libc-alpha, but will now wait for a more concrete situation based > on bugzilla. There is nothing wrong with what you did, but you are unlikely to get serious review until your patch has the kind of quality expected by libc-alpha. You are making great progress here though, and I appreciate your efforts. It's not easy to make changes to core runtimes, and as a community we try to hold ourselves to a high standard.

Created attachment 7977 [details] Scratch test case Carlos, I would appreciate any comments about the scratch test case as well as suggestions about other patterns to test. There is a detailed description on how to run the tests in the README file. For a quick visualization, the tests with different solutions (due to cycles or multiple dsos without dependencies) are like this, (old) means current glibc, (new) means with the proposed patch: a -> b b -> c c -> d -> link old new ------------------------------ abcd dcba_abcd cbad_abcd abdc cdba_abdc dcba_abdc acbd dcba_abcd cdba_abcd acdb cbda_adbc cbad_adbc adbc cbda_adbc dcba_adbc adcb cbda_adbc dcba_abcd bacd dcba_abcd cbda_abcd badc cdba_abdc dcba_abdc bcad dcba_abcd cbad_abcd bcda cbad_dabc cbda_dabc bdac cbad_dabc dcba_dabc bdca cbad_dabc dcba_abcd cabd dcba_abcd cdba_abcd cadb cbda_adbc cbda_dabc cbad dcba_abcd cdba_abcd cbda cbad_dabc cbad_abdc cdab cbad_dabc cbad_adbc cdba cbad_dabc cdba_dabc dabc cbad_dabc dcba_dabc dacb cbad_dabc dcba_abcd dbac cbad_dabc dcba_abdc dbca cbad_dabc dcba_abcd dcab cbad_dabc dcba_abcd dcba cbad_dabc dcba_abcd A -> B B -> C C -> D D -> A link old new ------------------------------ ABCD DCBA_ABCD ADCB_ABCD ABDC DCBA_ABCD ADCB_DABC ACBD BADC_CDAB ADCB_ABCD ACDB DCBA_ABCD ADCB_CDAB ADBC DCBA_ABCD ADCB_DABC ADCB DCBA_ABCD ADCB_CDAB BACD ADCB_BCDA BADC_ABCD BADC ADCB_BCDA BADC_DABC BCAD ADCB_BCDA BADC_ABCD BCDA ADCB_BCDA BADC_BCDA BDAC ADCB_BCDA BADC_DABC BDCA CBAD_DABC BADC_BCDA CABD BADC_CDAB CBAD_ABCD CADB DCBA_ABCD CBAD_CDAB CBAD BADC_CDAB CBAD_ABCD CBDA BADC_CDAB CBAD_BCDA CDAB BADC_CDAB CBAD_CDAB CDBA BADC_CDAB CBAD_BCDA DABC CBAD_DABC DCBA_DABC DACB CBAD_DABC DCBA_CDAB DBAC ADCB_BCDA DCBA_DABC DBCA CBAD_DABC DCBA_BCDA DCAB CBAD_DABC DCBA_CDAB DCBA CBAD_DABC DCBA_BCDA

Created attachment 8193 [details] Proposed patch Updated proposed patch, without ChangeLog entry, just for asking for extra comments. I understand this patch touches a too sensible part of glibc, and the only reason I wrote it is due to the test case, where the proposed patch runs 5+ orders of magnitude faster, and appears to be fully functional, when tested in fedora and running very complex applications. This is basically the same patch, but now it scans the list in reverse order. This is is actually the correct way to do it, so that on most common/simple cases/cycles it outputs the same sorting as current glibc. But I would like to have one clarification, that maybe is being already tested, and the Makefile patch to have only one result is in fault. tst-initorder is basically this: a2 a1 # a2 is linked to a1 b2 b1 a2 # b2 is linked to b1 and a2 a3 b2 b1 # a3 is linked to b2 and b1 a4 a3 # a4 is linked to a3 main a4 a1 b2 # "main" is linked to a4, a1 and b2 this will cause the sort input like this: [ main a2 a1 b2 b1 a3 a4 ] My question is, is some documentation that says that dsos must be kept together in the ordering? I mean, there are 2 correct results: main a4 a3 b2 b1 a2 a1 and main a4 a3 b2 a2 a1 b1 but if there is some specification that says it should have "b2 b1" together, then the patch is invalid, because it only breaks cycles and does ordering, and if there are no dependencies, it keeps it in the order it received the list to sort.

Created attachment 8194 [details] dso_deps.c - simple test implementing standalone propose sort and current glibc Compile it as gcc -O0 -g3 dso_deps.c -o dso_deps It accepts a very simple grammar input file, it outputs "ORG: ..." to show how input was given, "NEW: ..." as how the proposed patch would sort it, and "CUR: ..." as the result of current glibc sort. For example: ---8<--- $ cat dso_1 a b c b c c a $ ./dso_deps -v dso_1 a(b c) b(c) c(a) ORG: a b c NEW: a b c Real time: 4e-06 sec CUR: a b c Real time: 5e-06 sec $ cat dso_2 a b c d e f b c d e $ ./dso_deps -v dso_2 a(b c) b(c d e) c() d() e(f) f() ORG: a b c d e f NEW: a b c d e f Real time: 5e-06 sec CUR: a b c d e f Real time: 3e-06 sec $ cat dso_4 a b b c c d d e e f $ ./dso_deps -v dso_4 a(b) b(c) c(d) d(e) e(f) f() ORG: a b c d e f NEW: a b c d e f Real time: 5e-06 sec CUR: a b c d e f Real time: 3e-06 sec $ cat dso_5 b a c b d c e d f e $ ./dso_deps -v dso_5 b(a) a() c(b) d(c) e(d) f(e) ORG: b a c d e f NEW: f e d c b a Real time: 5e-06 sec CUR: f e d c b a Real time: 6e-06 sec ---8<--- This is an example matching the current glibc build time test, and that the proposed patch has a different result from current glibc: ---8<--- $ cat dso_6 a2 a1 b2 b1 a2 a3 b2 b1 a4 a3 main a4 a1 b2 $ ./dso_deps -v dso_6 a2(a1) a1() b2(b1 a2) b1() a3(b2 b1) a4(a3) main(a4 a1 b2) ORG: a2 a1 b2 b1 a3 a4 main NEW: main a4 a3 b2 a2 a1 b1 Real time: 6e-06 sec CUR: main a4 a3 b2 b1 a2 a1 Real time: 7e-06 sec ---8<---

Created attachment 8195 [details] dso_3 - Sample input for previous test case, that summarizes the reason of the bug report Since output is quite large, just showing the time result. If built in "debug mode", -O0 -g3: NEW: Real time: 0.004542 sec CUR: Real time: 62.1445 sec If built more close to default glibc, -O3: NEW: Real time: 0.001232 sec CUR: Real time: 22.4437 sec

(In reply to Paulo Andrade from comment #8) > My question is, is some documentation that says > that dsos must be kept together in the ordering? > I mean, there are 2 correct results: > > main a4 a3 b2 b1 a2 a1 > > and > > main a4 a3 b2 a2 a1 b1 > > but if there is some specification that says > it should have "b2 b1" together, then the patch > is invalid, because it only breaks cycles and > does ordering, and if there are no dependencies, > it keeps it in the order it received the list > to sort. I'm not aware of any such spefication, but current glibc implementation does keep direct dependencies together and this looks like A Good Thing. #15310 suggest using Tarjan SCC for toposort and it is expected to keep direct deps together. With regard to your patch - what big-O complexity does it have? If I read the code correctly, you recalculate (num_maps - num_seen) weights whenever you mark link_map as seen. This is just a better than O(n^2). But the problem we're talking about (toposort) is believed to be linear.

(In reply to Marat Radchenko from comment #11) > > My question is, is some documentation that says > > that dsos must be kept together in the ordering? > > I mean, there are 2 correct results: > > > > main a4 a3 b2 b1 a2 a1 > > > > and > > > > main a4 a3 b2 a2 a1 b1 > > > > but if there is some specification that says > > it should have "b2 b1" together, then the patch > > is invalid, because it only breaks cycles and > > does ordering, and if there are no dependencies, > > it keeps it in the order it received the list > > to sort. > > I'm not aware of any such spefication, but current glibc implementation does > keep direct dependencies together and this looks like A Good Thing. > > #15310 suggest using Tarjan SCC for toposort and it is expected to keep > direct deps together. > > With regard to your patch - what big-O complexity does it have? If I read > the code correctly, you recalculate (num_maps - num_seen) weights whenever > you mark link_map as seen. This is just a better than O(n^2). But the > problem we're talking about (toposort) is believed to be linear. The weight computation is basically O(n*m) where 'm' is the number of dependencies of the dso, otherwise, the sorting is linear time. It recomputes the weight once a dso is moved out of the sort input, as the remaining set may no longer have a dependency on the moved one. The idea is to attempt to mimic the current implementation, without doing memmove calls to reorder entries based on dependencies. It also does it this way to make the code simpler, and work on the worst scenario where there are cycles. It does not know details about cycles, so, it just breaks them in the 'lowest weight'. The sample parser with the reverse_sort function in the attachment at https://bugzilla.redhat.com/show_bug.cgi?id=1162810 should probably be the best version of the proposed logic, but otherwise, if there is only one solution, either variant always finds it. The major problem are possible uncertainties, like some kind of missing dependency information, and then, rely order of existing the algorithm...