Bug 15310

Summary: _dl_sort_fini is O(n^3) causing slow exit when many dsos
Product: glibc Reporter: Don Hatch <hatch>
Component: dynamic-linkAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED    
Severity: normal CC: adhemerval.zanella, brooks, carlos, fweimer, law, marat, nilayvaish, patrick.stahr, paulo.cesar.pereira.de.andrade, ppluzhnikov, yann
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: 2.35   
Host: Target:
Build: Last reconfirmed:
Bug Depends on: 15309    
Bug Blocks:    
Attachments: Makefile that constructs the linear chain test case demonstrating the O(n^3) behavior.
Proposed initial patch, rearranging init/fini sorting in prep for overhaul. diffs from master at 7a86be6
Proposed initial patch, rearranging init/fini sorting in prep for overhaul. diffs from master at 60c414c
Add a linear time function for sorting maps.

Description Don Hatch 2013-03-27 07:47:46 UTC
The fix for Bug 13882 ("Cycle detection in dynamic loader is broken ")
fixed premature termination of the inner loop of _dl_sort_fini,
making the function (closer to) correct,
but it also changes this function's runtime from O(n^2) to O(n^3)
where n is the number of items (resident DSOs) to be sorted.
(The same is true for the corresponding init sorts in dl-open.c and dl-deps.c.)

This can be readily seen by looking at Jeff Law's example
in the description of bug 13882 (a linear chain of dependencies
that _dl_sort_fini needs to completely reverse).
In this case, each of O(n) objects gets moved O(n) times;
furthermore the analysis leading to each such move
(as well as the move itself) takes O(n) time.
That's O(n)*O(n)*O(n) = O(n^3).

Another easy way to get O(n^3) behavior
is with cycles: any node that's part of a nontrivial cycle
is guaranteed to keep getting moved repeatedly until its moved-too-many-times
counter expires, which is O(n) times (for O(n) of the items anyway).
So for example, if the dependency graph consists
of mutually dependent pairs of DSOs:
    A<->B  C<->D  E<->F ...
that will result in O(n^3) run time as well.

We observed the O(n^3) behavior in real life, in our application that had 575 DSOs
upon exit-- in RHEL5.3 (glibc 2.5), it took less than 1 second to exit;
upon upgrading to RHEL6.3 (glibc 2.12), the same app took 15 seconds to exit.
Instrumenting _dl_sort_fini (i.e. putting a counter in it
and printing it at the end) revealed that the innermost loop body
was entered more than 1.7 billion times,
roughly confirming the O(n^3) claim in practice.

This is just a topsort, which can be done simply in O(n) time
with no fancy data structures.
Comment 1 Don Hatch 2013-03-27 08:12:18 UTC
Created attachment 6951 [details]
Makefile that constructs the linear chain test case demonstrating the O(n^3) behavior.

Attaching a Makefile that quickly constructs the linear dependency chain
test case, demonstrating the O(n^3) behavior.

See the comment at the top of this file for a description
of exactly what it does and how to use it.
Comment 2 Carlos O'Donell 2013-03-27 12:57:37 UTC
Don,

I agree that the sorting could be made *far* faster.

Thanks for submitting this. We were well aware that the minimal fix for bug 13882 would cause some kind of performance regression, but it was a balance between a minimal fix and low risk of breakage. I reviewed the patch for 13882 and even build a minimal framework for testing that dynamic loader function outside of the build.

Do you have the time to investigate this and propose a patch (requires copyright assignment)?
Comment 3 Don Hatch 2013-03-27 20:33:23 UTC
(In reply to comment #2)
> Don,
> 
> I agree that the sorting could be made *far* faster.
> 
> Thanks for submitting this. We were well aware that the minimal fix for bug
> 13882 would cause some kind of performance regression, but it was a balance
> between a minimal fix and low risk of breakage. I reviewed the patch for 13882
> and even build a minimal framework for testing that dynamic loader function
> outside of the build.
> 
> Do you have the time to investigate this and propose a patch (requires
> copyright assignment)?

I do. I am working on a patch that resolves both this and bug 15311,
and I'll submit it here in a day or two.

I am very interested in what you came up with in the way of a unit
testing scheme for this function... I could certainly use it.
I've found it frustrating that the existing tests run by "make check"
(the ones I saw anyway)
involve just creating/compiling/running a handful of real programs...
to really stress test an implementation of _dl_sort_fini properly,
I'd want to (at least) enumerate all possible graphs
of up to 3 or 4 nodes, and call it on each of them,
which would be millions of examples...
and a few million randomly generated larger examples as well.
It's *really* easy to get this stuff wrong otherwise.

Also I'd like to start by moving the init sorting code into a function.
It looks to me like this code is duplicated in two places (dl-open.c
and dl-deps.c), and (after the fix for bug 15309)
it's identical in both places except that
one of them starts at i0=0 and the other starts at i0=1.
So this could be expressed cleanly as a new function _dl_sort_init that takes i0
as a parameter.
Should I start by submitting a patch that does that,
with no functional change, and go from there?  Or should I let you
or someone else do this refactoring (possibly in conjunction
with making these sorting functions unit testable)?
Let me know how to proceed.
Comment 4 Ondrej Bilka 2013-03-27 20:50:40 UTC
On Wed, Mar 27, 2013 at 07:47:46AM +0000, dhatch at ilm dot com wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=15310
> 
> This is just a topsort, which can be done simply in O(n) time
> with no fancy data structures.
> 
I do not have domain knowledge to understand what is necessary to
consider.

However I could write topsort if I knew dependencies. From code I think
following pseudocode should work upto possibility that some arrows need
be reversed.

add_edge(g,x,y) // add edge from x to y
topsort(g)      // returns topologic ordering of g

graph *g; 
for(k=0;k<nmaps;k++)
  {
      /* Add dependencies of the object.  */
      struct link_map **runp = maps[k]->l_initfini;
      if (runp != NULL)
        while (*runp != NULL) 
          add_edge(g,maps[k],*runp);
		
  		/* Add relocation dependencies.  */
      if (__builtin_expect (maps[k]->l_reldeps != NULL, 0))
        {
          unsigned int m = maps[k]->l_reldeps->act;
          struct link_map **relmaps = &maps[k]->l_reldeps->list[0];
          for (j=0;j<m;j++)
            add_edge(g,relmaps[k],*runp) 
        }
  }

	memcpy(maps,topsort(g));
Comment 5 Carlos O'Donell 2013-03-27 21:00:34 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Don,
> > 
> > I agree that the sorting could be made *far* faster.
> > 
> > Thanks for submitting this. We were well aware that the minimal fix for bug
> > 13882 would cause some kind of performance regression, but it was a balance
> > between a minimal fix and low risk of breakage. I reviewed the patch for 13882
> > and even build a minimal framework for testing that dynamic loader function
> > outside of the build.
> > 
> > Do you have the time to investigate this and propose a patch (requires
> > copyright assignment)?
> 
> I do. I am working on a patch that resolves both this and bug 15311,
> and I'll submit it here in a day or two.

I look forward to the patch.

> I am very interested in what you came up with in the way of a unit
> testing scheme for this function... I could certainly use it.

I aggrandized a bit here. I copied the relevant sorting code out of the loader code, wrapped it up in a function, created some static arrays to simulate DSOs loaded in a certain order, and then ran the sorting function against the the simulated DSO list. The best description is that I ran a simulation. I looked for the code I used to test bug 13882, but I've lost it (changed employers).

You can see my comments here:
http://sourceware.org/ml/libc-alpha/2012-06/msg00560.html

> I've found it frustrating that the existing tests run by "make check"
> (the ones I saw anyway)
> involve just creating/compiling/running a handful of real programs...
> to really stress test an implementation of _dl_sort_fini properly,
> I'd want to (at least) enumerate all possible graphs
> of up to 3 or 4 nodes, and call it on each of them,
> which would be millions of examples...
> and a few million randomly generated larger examples as well.
> It's *really* easy to get this stuff wrong otherwise.

I fully agree.

As a volunteer project we live and die by the companies and individuals that choose to contribute to the project. I would be more than happy to see all possible 3 or 4 node graphs tested. The random testing is more problematic as you are probably well aware of; you can still auto-generate millions of test cases just make it deterministic :-)

> Also I'd like to start by moving the init sorting code into a function.
> It looks to me like this code is duplicated in two places (dl-open.c
> and dl-deps.c), and (after the fix for bug 15309)
> it's identical in both places except that
> one of them starts at i0=0 and the other starts at i0=1.
> So this could be expressed cleanly as a new function _dl_sort_init that takes
> i0
> as a parameter.

That sounds like a great idea.

> Should I start by submitting a patch that does that,
> with no functional change, and go from there?  Or should I let you
> or someone else do this refactoring (possibly in conjunction
> with making these sorting functions unit testable)?
> Let me know how to proceed.

Always break the work into as small a piece as conceivably possible. 

Doing just the refactoring is a great first step.

Once we have that in place we can talk about next steps.

The Contribution Checklist for this project is rather long, but we are a conservative project and it helps to have everything documented and well specified. You can see the checklist here:
http://sourceware.org/glibc/wiki/Contribution%20checklist
Comment 6 Carlos O'Donell 2013-03-27 21:07:00 UTC
Don,

Do you have copyright papers in place with the FSF for glibc?
Comment 7 law 2013-03-27 21:13:39 UTC
Carlos,

This is the tester QE uses for the correctness of the sort:  

Basically it takes as an input the number of DSOs you want to test.  Then builds a random orderings and a pathological ordering of those DSOs on the link line (the script knows the dependencies between DSOs).
deps.cpp:

#include <stdio.h>

#ifdef MAIN
int
main( int argc, char *argv[] )
{
  printf("main\n");
}
#else
static bool deps_init( void ) { puts( NAME ); }
bool x = deps_init();
#endif


driver:
#!/bin/bash
# based on reproducer from bug 11724

N=${1:-2}
TmpDir=$(mktemp -d)
cp deps.cpp $TmpDir
pushd $TmpDir > /dev/null 2>&1

for((all=0,k=2; k<=$N; ++all,++k)); do
    cmd="g++ -fPIC -shared -o libdeps${k}.so -DNAME=\\\"${k}\\\" deps.cpp"
    printf "# %s\n" "$cmd" > ${all}r.log
    eval "$cmd"
    for ((i=k-1; i>=1; --i)); do
        cmd="g++ -L. -fPIC -shared -o libdeps${i}.so -ldeps$((i+1)) -DNAME=\\\"${i}\\\" deps.cpp"
        printf "# %s\n" "$cmd" >> ${all}r.log
        eval "$cmd"
    done; unset i
    # random link order
    # it sucks that rhel5 coreutils don't have shuf or sort -R yet :)
    roptions=($(seq 1 $k | sed 's/^/ -ldeps/'))
    for ((i=(${#roptions[@]}-1); i >= 1; --i)); do
        random=$((RANDOM % i))
        m=${roptions[$random]}
        roptions[$random]=${roptions[$i]}
        roptions[$i]=$m
    done; unset i
    rcmd="g++ -Wl,-rpath -Wl,$(pwd) -L. -o rmain ${roptions[@]} -DMAIN deps.cpp"
    # pathological link order
    poptions=($(seq $k -1 1 | sed 's/^/ -ldeps/'))
    pcmd="g++ -Wl,-rpath -Wl,$(pwd) -L. -o pmain ${poptions[@]} -DMAIN deps.cpp"

    cp ${all}r.log ${all}p.log

    for i in r p; do
        cmd="${i}cmd"
        printf "# %s\n\n" "${!cmd}" >> ${all}${i}.log
        eval "${!cmd}"

        fail=0
        cp ${all}${i}.log ${i}out.golden
        ./${i}main >> ${all}${i}.log
        seq $k -1 1 >> ${i}out.golden
        echo main >> ${i}out.golden
        diff ${all}${i}.log ${i}out.golden > /dev/null 2>&1 || fail=1

        # it's ok, we don't care
        if [ 0 -eq $fail ]; then
            rm ${all}${i}.log
        fi 
    done; unset i

    rm -rf *.so *main *out.golden *out.real
done; unset k

rm deps.cpp
test 0 -eq $(ls | wc -l) && { rm -rf $(pwd); echo PASS; } || { echo FAIL; printf "results are in %s\n" $(pwd); }
popd > /dev/null 2>&1
Comment 8 Don Hatch 2013-03-27 23:44:45 UTC
(In reply to comment #6)
> Do you have copyright papers in place with the FSF for glibc?
Working on it with my legal department.
My employer (ILM) is supportive so I don't anticipate a problem.
Comment 9 Don Hatch 2013-03-28 00:31:40 UTC
(In reply to comment #4)
> On Wed, Mar 27, 2013 at 07:47:46AM +0000, dhatch at ilm dot com wrote:
> > http://sourceware.org/bugzilla/show_bug.cgi?id=15310
> > 
> > This is just a topsort, which can be done simply in O(n) time
> > with no fancy data structures.
> > 
> I do not have domain knowledge to understand what is necessary to
> consider.
> 
> However I could write topsort if I knew dependencies. From code I think
> following pseudocode should work upto possibility that some arrows need
> be reversed.
> 
> add_edge(g,x,y) // add edge from x to y
> topsort(g)      // returns topologic ordering of g
> 
> graph *g; 
> for(k=0;k<nmaps;k++)
>   {
>       /* Add dependencies of the object.  */
>       struct link_map **runp = maps[k]->l_initfini;
>       if (runp != NULL)
>         while (*runp != NULL) 
>           add_edge(g,maps[k],*runp);
> 
>           /* Add relocation dependencies.  */
>       if (__builtin_expect (maps[k]->l_reldeps != NULL, 0))
>         {
>           unsigned int m = maps[k]->l_reldeps->act;
>           struct link_map **relmaps = &maps[k]->l_reldeps->list[0];
>           for (j=0;j<m;j++)
>             add_edge(g,relmaps[k],*runp) 
>         }
>   }
> 
>     memcpy(maps,topsort(g));

That looks essentially right,
although there are a couple of nuances in the existing code.
    - under the comment "Do not handle ld.so in secondary namespace"
      (whatever that means), it ignores a dependency "B depends on A"
      if A!=A->l_real  (but doesn't check whether B!=B->l_real).
      I don't really know what this means, or whether this assymmetry
      is intentional...
      QUESTION: can I just omit all edges involving such items
      altogether, not caring where these items come out
      in the output order?
    - it also ignores "B depends on A" if A->l_idx == -1 (i.e. IDX_STILL_USED)--
      this one I believe I understand--
      it definitely doesn't matter where such items
      come out in the final sort order.
    - static dependendies should override dynamic ones if there's a conflict.
      I'll do two passes in order to get this right, and more well-behaved
      than the existing code--
      see bug 15311 for details.

I hadn't thought about creating the graph structure explicitly as you suggest...
arguably that would result in cleaner topsort code, but
I was just going to work directly with the data structure that's
already there instead.
Note that an explicit auxiliary graph representation would have size O(N+E)
(number of nodes plus number of edges)
whereas the size of needed auxiliary structures is currently only O(N)
(making it feasible to allocate them as local variables on the runtime stack)
and I wasn't planning on changing that.

The topsort algorithm will be essentially Tarjan's SCC algorithm:
http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm
(actually two passes of it).  This is a bit more involved than
a simple reverse postorder, but keeping SCCs contiguous is desireable
(for one thing, the existing code does it-- furthermore,
it provides some nice well-behavedness between the two passes
that I wouldn't be able to prove otherwise).
Tarjan's algorithm is preferable to Kosaraju's since it doesn't require
explicitly creating the reverse graph.

One other issue/question-- I was assuming a recursive implementation
would be out of the question (due to increased runtime stack requirement);
but I'm finding that an iterative implementation of Tarjan's SCC algorithm
is... well, really unpleasant.
I can do it, but it's no fun, and it won't be any fun
to review or maintain it either.  A recursive implementation
would be MUCH cleaner, simpler, more readable and maintainable.
Any thoughts/feelings on whether it's okay to make it recursive?
Comment 10 Ondrej Bilka 2013-03-28 07:42:31 UTC
On Thu, Mar 28, 2013 at 12:31:40AM +0000, dhatch at ilm dot com wrote:
 
 
> I hadn't thought about creating the graph structure explicitly as you
> suggest...
> arguably that would result in cleaner topsort code, but
> I was just going to work directly with the data structure that's
> already there instead.
> Note that an explicit auxiliary graph representation would have size O(N+E)
> (number of nodes plus number of edges)
> whereas the size of needed auxiliary structures is currently only O(N)
> (making it feasible to allocate them as local variables on the runtime stack)
> and I wasn't planning on changing that.
>
I mainly wanted do it to separate alg from domain specific bits.

Next one where I do not know if its necesarry or just optimization is

  /* We can skip looking for the binary itself which is at the front
     of the search list for the main namespace.  */
Comment 11 Don Hatch 2013-03-28 10:00:12 UTC
Created attachment 6955 [details]
Proposed initial patch, rearranging init/fini sorting in prep for overhaul. diffs from master at 7a86be6

Here's the proposed initial patch.
Commit message would be something like:
==========================================================================
Refactor ld.so init/fini sorting code in preparation for overhaul.

- Init sorting code was in two places: dl-open.c and dl-deps.c;
  it was identical in both places except for the starting index
  which was i0=0 in dl-open.c and i0=1 in dl-deps.c.
  Moved this common code out into a new function _dl_sort_init
  that takes i0 as a parameter.

- Moved _dl_sort_init and _dl_sort_fini out into two new files
  dl-sort-init.c and dl-sort-fini.c.
  This allows easy diffing between the two (which can be interesting)
  and may also help facilitate future efforts to unit test
  these two functions.

No functional change here.
==========================================================================

I don't have the copyright agreement in place yet, but it seems to me
that I'm not introducing any copyrightable material in this initial patch--
I'm just moving existing code around.
(okay I did add a comment at the top of each function).

Let me know if I'm doing this right...
Comment 12 Don Hatch 2013-03-28 10:19:01 UTC
Actually I think the fix for bug 15309 (dl_open_worker doesn't fully initialize seen array during init sort) needs to go in first...
it's a 1-liner and is easily patchable to various downstream code bases,
but that won't be true if it ends up on top of my code rearrangement.

Also I can't truthfully say "No functional change here"
unless the fix for 15309 goes in first, since I'm combining code
from two functions which aren't exactly the same yet...
they will be the same, after the fix for 15309 is in place.
Comment 13 law 2013-03-28 17:09:25 UTC
Actually, I'm pretty sure the problematical sort shows up in 3 places:

dl_map_object_deps
dl_sort_fini
dl_open_worker
Comment 14 Don Hatch 2013-03-28 17:31:26 UTC
(In reply to comment #13)
> Actually, I'm pretty sure the problematical sort shows up in 3 places:
> 
> dl_map_object_deps
> dl_sort_fini
> dl_open_worker

that's correct. my initial proposed patch combines two of them
into a single function _dl_sort_init.
Comment 15 Don Hatch 2013-04-02 09:54:17 UTC
Progress on unit/stress test...
I got something working, in such a way that it can go cleanly into the build
(assuming the init sort has been separated out into a function,
as in my "Proposed initial patch" attachment).
I'll submit the test as soon as I get it polished (and I get legally cleared).

It tests _dl_sort_init on all 17854749 graphs of up to 4 nodes in 1min8sec,
and _dl_sort_fini on all 16777846 pairs of graphs
(static dep graph and dynamic dep graph) of up to 3 nodes in 48sec
(on an Intel Xeon L5630 @ 2.13GHz which is a pretty fast machine I guess).
That's probably a bit long for a confidence test run during "make check",
so for that, I'd probably do something less,
augmented by some randomized testing (with deterministic seed of course).

As Carlos pointed out in the referenced email thread,
this test would have prevented the last several generations
of bugs in these sorting routines, and it should help ensure that
the impending overhaul properly fixes the bugs it intends to fix, without
introducing new ones, and that no similar bugs appear in the future.

In addition to lots of expected failures of _dl_sort_fini due to bug 15311,
the testing of _dl_sort_init on graphs up to 4 nodes
revealed that the existing algorithm doesn't keep SCCs contiguous,
which was a surprise to me.  I just opened bug 15329 on that.
(Not sure whether that's technically a bug... but it's certainly
not a nice behavior, and we can do better.)

Is it okay to add a test that currently fails
due to bugs that we intend to fix very soon?
Comment 16 Ondrej Bilka 2013-04-02 11:31:00 UTC
On Tue, Apr 02, 2013 at 09:54:17AM +0000, dhatch at ilm dot com wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=15310
> 
> --- Comment #15 from Don Hatch <dhatch at ilm dot com> 2013-04-02 09:54:17 UTC ---
> Progress on unit/stress test...
> I got something working, in such a way that it can go cleanly into the build
> (assuming the init sort has been separated out into a function,
> as in my "Proposed initial patch" attachment).
> I'll submit the test as soon as I get it polished (and I get legally cleared).
> 
> It tests _dl_sort_init on all 17854749 graphs of up to 4 nodes in 1min8sec,
> and _dl_sort_fini on all 16777846 pairs of graphs
> (static dep graph and dynamic dep graph) of up to 3 nodes in 48sec

How did you get these numbers? Directed graph on 4 nodes has 12 arcs. So
there only 2^{12} = 4096 directed graphs on 4 nodes. What extra data do you
need to generate?

> (on an Intel Xeon L5630 @ 2.13GHz which is a pretty fast machine I guess).
> That's probably a bit long for a confidence test run during "make check",
> so for that, I'd probably do something less,
> augmented by some randomized testing (with deterministic seed of course).
>
First optimization would be eliminate isomorphic graphs. I could assist
if I got code.
Comment 17 Don Hatch 2013-04-02 13:07:00 UTC
(In reply to comment #16)
> > 
> > It tests _dl_sort_init on all 17854749 graphs of up to 4 nodes in 1min8sec,
> > and _dl_sort_fini on all 16777846 pairs of graphs
> > (static dep graph and dynamic dep graph) of up to 3 nodes in 48sec
> 
> How did you get these numbers? Directed graph on 4 nodes has 12 arcs. So
> there only 2^{12} = 4096 directed graphs on 4 nodes. What extra data do you
> need to generate?

I was considering the complete graph on 4 nodes to have 16 arcs, not 12,
so there are 2^16 = 65536 possible graphs
(though that might be a worthwhile corner to cut... once we convince ourselves
that an edge from a node to itself doesn't affect the algorithm,
then yes, we could just consider 4096 of them)...
and all permutations of edges lists have to be considered different,
so I was taking the sum, over all 65536 graphs,
of the product of the factorials of the node arities.

(plus similar sums for 0,1,2,3 nodes, though they are small in comparison)

> 
> > (on an Intel Xeon L5630 @ 2.13GHz which is a pretty fast machine I guess).
> > That's probably a bit long for a confidence test run during "make check",
> > so for that, I'd probably do something less,
> > augmented by some randomized testing (with deterministic seed of course).
> >
> First optimization would be eliminate isomorphic graphs. I could assist
> if I got code.

I don't think we can eliminate isomorphic graphs.
Whether a bug appears or not often depends on the data order.
Comment 18 Don Hatch 2013-04-02 23:37:34 UTC
Created attachment 6965 [details]
Proposed initial patch, rearranging init/fini sorting in prep for overhaul. diffs from master at 60c414c

Tweaked the proposed initial refactoring patch a bit to make the calling convention cleaner:
consistently let the sort functions handle too-small inputs
in whatever way they need to,
rather than haphazardly having the caller do it sometimes.
This makes for a cleaner contract
so callers and unit test can be simpler and less error prone.
(found this when unit testing _dl_sort_init
on a list of length 0, and it crashed)
Comment 19 Ondrej Bilka 2013-04-03 07:57:18 UTC
>   --> http://sourceware.org/bugzilla/attachment.cgi?id=6965
> Proposed initial patch, rearranging init/fini sorting in prep for overhaul.
> diffs from master at 60c414c
> 
Please post that on libc-alpha as [RFC] to get comments.
Comment 20 Brooks Moses 2015-11-04 19:42:58 UTC
Ping; did anything ever happen with this?
Comment 21 Carlos O'Donell 2015-11-04 19:56:13 UTC
(In reply to Brooks Moses from comment #20)
> Ping; did anything ever happen with this?

No, but it would be awesome if someone moved this forward.

Unfortunately Don doesn't have a copyright assignment as far as I can tell, so please don't look at the code in this patch if you plan to implement it.

Instead I suggest we look at bug 17645:
https://sourceware.org/bugzilla/show_bug.cgi?id=17645

Fedora bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1162810

libc-alpha post:
https://sourceware.org/ml/libc-alpha/2014-11/msg00702.html

And take a serious look at Paulo's work.

It still needs someone to write a test that actually does some decent coverage over the permutation of DT_NEEDED trees.
Comment 22 Carlos O'Donell 2015-11-04 20:32:13 UTC
(In reply to Carlos O'Donell from comment #21)
> And take a serious look at Paulo's work.

In full disclosure, Paulo works at Red Hat.
Comment 23 Patrick Stahr 2016-08-24 21:40:43 UTC
Did this get accepted and committed for a given release of glibc?  Or is it still waiting for someone to write a test that does the coverage over the permutation of DT_NEEDED trees?
Comment 24 Carlos O'Donell 2016-08-26 05:00:35 UTC
(In reply to Patrick Stahr from comment #23)
> Did this get accepted and committed for a given release of glibc?  Or is it
> still waiting for someone to write a test that does the coverage over the
> permutation of DT_NEEDED trees?

This is not fixed in any version of glibc. Yes, we still need better coverage testing of the permutation of DT_NEEDED trees to ensure the loader is doing the right thing. Each tree needs to verify it was loaded in the correct order.
Comment 25 Marat Radchenko 2019-04-14 21:23:13 UTC
(In reply to Don Hatch from comment #3)
> I'll submit it here in a day or two.

Hello from 2019. We're still suffering from O(n^3). Don, unfortunately you haven't published your code anywhere (even if it still lacks an exhaustive set of tests).

For those brave enough out there who would like to try solve this, here's a PDF named "A Space-Efficient Algorithm for Finding Strongly Connected Components" with non-recursive Tarjan SCC implementation: http://homepages.ecs.vuw.ac.nz/~djp/files/IPL15-preprint.pdf

"Proposed initial patch" that is attached here is no longer needed because it only moves code around that was done in

commit c2c299fd24e87b83c63191ff979d41a86b37d714
Author: Andreas Schwab <schwab@suse.de>
Date:   Tue Nov 7 15:24:19 2017 +0100

    Consolidate link map sorting
    
    Combine the four places where link maps are sorted into a single function.
    This also moves the logic to skip the first map (representing the main
    binary) to the callers.

So, there's currently only a single sorting function in elf/dl-sort-maps.c

BTW, has anyone tried looking at how other libcs do this sorting? Maybe there's no need to reinvent the wheel?
Comment 26 Nilay Vaish 2019-07-16 03:08:44 UTC
Created attachment 11908 [details]
Add a linear time function for sorting maps.

Hi

I have prepared a patch based on my understanding of the discussion above.  I am uploading it to seek some feedback and see if there is appetite for this change.  The patch implements a function to topologically sort the link_maps.  The current implementation requires time that is linear in the number of link_maps and the dependencies amongst them.  We do need to allocate more memory than the prior work, though the allocated memory is again linear function of the size of the input.  I have not compiled this patch (and therefore not tested).  Once I receive some feedback, I would work further on it.

Thanks!
Comment 27 Adhemerval Zanella 2021-10-27 14:58:58 UTC
The testcase which creates a linear chain test to demostrate the O(n^3) behavior shows an improvement with the new DSO sorting algorithm from 15a0c5730d1d5ae:

$ time ./testrun.sh /tmp/test/main1000 2>&1 >/dev/null 

real    0m1.641s
user    0m1.565s
sys     0m0.076s

$ time GLIBC_TUNABLES=glibc.rtld.dynamic_sort=2 ./testrun.sh /tmp/test/main1000 2>&1 >/dev/null 

real    0m0.316s
user    0m0.219s
sys     0m0.095s

And profile shows similar improvement:

$ perf record ./testrun.sh /tmp/test/main1000
[...]
$ perf report --stdio -q
[...]
    78.28%  ld-linux-x86-64  ld.so             [.] _dl_sort_maps
     6.77%  ld-linux-x86-64  ld.so             [.] do_lookup_x
     5.03%  ld-linux-x86-64  ld.so             [.] memmove
     2.64%  ld-linux-x86-64  ld.so             [.] strcmp

$ perf record env GLIBC_TUNABLES=glibc.rtld.dynamic_sort=2 ./testrun.sh /tmp/test/main1000
[...]
$ perf report --stdio -q --dso=ld.so
[...]
    41.44%  ld-linux-x86-64  [.] do_lookup_x
    13.76%  ld-linux-x86-64  [.] strcmp
     6.01%  ld-linux-x86-64  [.] _dl_map_object
     5.17%  ld-linux-x86-64  [.] _dl_name_match_p
     1.34%  ld-linux-x86-64  [.] _dl_map_object_from_fd
     0.69%  ld-linux-x86-64  [.] _dl_add_to_namespace_list