Bug 17645 - RFE: Improve performance of dynamic loader for deeply nested DSO dependencies.
Summary: RFE: Improve performance of dynamic loader for deeply nested DSO dependencies.
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-25 13:37 UTC by Paulo Andrade
Modified: 2016-11-28 20:09 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Sample "redistributable" test case provided by a customer (329.60 KB, application/gzip)
2014-11-25 13:37 UTC, Paulo Andrade
Details
Proposed patch (3.08 KB, patch)
2014-11-25 13:52 UTC, Paulo Andrade
Details | Diff
simple sh based helper script to run the test case (138 bytes, text/plain)
2014-11-25 14:37 UTC, Paulo Andrade
Details
Scratch test case (4.72 KB, application/octet-stream)
2014-11-29 22:30 UTC, Paulo Andrade
Details
Proposed patch (3.07 KB, patch)
2015-03-17 22:50 UTC, Paulo Andrade
Details | Diff
dso_deps.c - simple test implementing standalone propose sort and current glibc (2.48 KB, text/plain)
2015-03-18 17:20 UTC, Paulo Andrade
Details
dso_3 - Sample input for previous test case, that summarizes the reason of the bug report (4.21 KB, text/plain)
2015-03-18 17:26 UTC, Paulo Andrade
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Paulo Andrade 2014-11-25 13:37:04 UTC
Created attachment 7970 [details]
Sample "redistributable" test case provided by a customer

After glibc-2.11, there are 3 slightly different instances
of a very CPU intensive loop in elf/dl-deps.c, elf/dl-fini.c
and elf/dl-open.c. The loop does not scale well when there
are cyclic dependencies, and may take more than 8 seconds
to sort less than 300 shared objects when there a some
cycles. Most of the time is spent in memmove.

Sample perf output of it taking 8+ secs on a core i5:

 31,24%  timedlopen  ld-2.17.so         [.] _dl_map_object_deps
 18,36%  timedlopen  ld-2.17.so         [.] _wordcopy_fwd_aligned
 16,50%  timedlopen  ld-2.17.so         [.] _dl_sort_fini
 15,47%  timedlopen  ld-2.17.so         [.] dl_open_worker
 13,08%  timedlopen  ld-2.17.so         [.] _wordcopy_fwd_dest_aligned
  4,95%  timedlopen  ld-2.17.so         [.] memmove
  0,07%  timedlopen  ld-2.17.so         [.] do_lookup_x

I have been using a patch for some weeks in rhel-6.3
(for basic tests where the problem was detected/reported),
and my day to day computers, running rhel-7 and fedora
rawhide, and cannot see any problems running kde, gnome,
libreoffice and several other large applications or
environments that load/unload or have large dso dependency
chains.

The proposed patch runs the test case in 10 ms, and a
sample perf output is:

 17,45%  timedlopen  ld-2.17.so         [.] do_lookup_x
  9,54%  timedlopen  ld-2.17.so         [.] strcmp
  6,91%  timedlopen  ld-2.17.so         [.] _dl_map_object_deps
  6,12%  timedlopen  ld-2.17.so         [.] _dl_name_match_p
  4,35%  timedlopen  ld-2.17.so         [.] _dl_sort_fini
  4,33%  timedlopen  ld-2.17.so         [.] dl_open_worker
Comment 1 Paulo Andrade 2014-11-25 13:52:48 UTC
Created attachment 7972 [details]
Proposed patch

This patch implements a simple stable topological sort that moves
an entry at most once, by keeping track of, and resetting the weight
of the entries after every move. It breaks loops by choosing the
ones that appear first with the lowest weight, that is, less
dependencies.
If there is a loop, the "c" variable in the "sort" function will
be larger than 0.
Comment 2 Paulo Andrade 2014-11-25 14:10:17 UTC
The change to elf/Makefile is to make the test case have
only one valid result.

The expected result is:

start_a1
start_a2
start_b1
start_b2
start_a3
start_a4
main
finish_a4
finish_a3
finish_b2
finish_b1
finish_a2
finish_a1

but "a1" and "b1" do not have any dependencies, and
without the elf/Makefile patch it would result in:

start_a1
start_b1
start_a2
start_b2
start_a3
start_a4
main
finish_a4
finish_a3
finish_b2
finish_b1
finish_a2
finish_a1

The "input" of that test case, wihtout the patch is:

[0] = ""		()
[1] = a4		(a4 a3 libc.so)
[2] = a1		(a1 libc.so)
[3] = b2		(b2 b1 a2 libc.so)
[4] = libc.so		(libc.so ld.so)
[5] = a3		(a3 b2 b1 libc.so)
[6] = b1		(b1 libc.so)
[7] = a2		(a2 a1 libc.so)
[8] = ld.so		()

where items in () are the dependencies of an object,
and [] is the input order.
Comment 3 Paulo Andrade 2014-11-25 14:37:37 UTC
Created attachment 7973 [details]
simple sh based helper script to run the test case

Simple script to make it easier to run the test case.
Just unpack the test case and run sourceme.sh to set
environment variables and have it to echo how to run
the tests. This allows keeping unchanged the test case
(not done by me).
Comment 4 Carlos O'Donell 2014-11-26 16:41:50 UTC
(In reply to Paulo Andrade from comment #1)
> Created attachment 7972 [details]
> Proposed patch
> 
> This patch implements a simple stable topological sort that moves
> an entry at most once, by keeping track of, and resetting the weight
> of the entries after every move. It breaks loops by choosing the
> ones that appear first with the lowest weight, that is, less
> dependencies.
> If there is a loop, the "c" variable in the "sort" function will
> be larger than 0.

At the high-level your patch is almost ready.

You refactor the sorting code out into a static function that the compiler can inline and you use the same pattern in all places where sorting happens. This is good.

You are missing two important things:

* More code comments. You should comment each sort routine with a high-level comment about what you're doing and why. This is particularly important for the weight functions.

* Tests. The existing set of tests are insufficient to cover the changes you are proposing. While it seems unfair, and I understand that, this kind of change has not been undertaken because of the lack of testing. You need to add more tests. The *best* testing is actually a test which deterministically generates a sequence of permutations of linked objects and verifies they are sorted correctly by the algorithms in question. This way we can show that the sorting works for a large set of object dependency orderings. When we get a bug report, and we will, we can enlarge the set of automated testing to include the case the bug report claims and see what went wrong.

If you can cover those two things, then you're ready to post the patch to libc-alpha for more formal review.

This looks really good though and you're making great progress in a difficult part of the dynamic loader.
Comment 5 Paulo Andrade 2014-11-26 21:56:41 UTC
(In reply to Carlos O'Donell from comment #4)
> (In reply to Paulo Andrade from comment #1)
> > Created attachment 7972 [details]
> > Proposed patch
> > 
> > This patch implements a simple stable topological sort that moves
> > an entry at most once, by keeping track of, and resetting the weight
> > of the entries after every move. It breaks loops by choosing the
> > ones that appear first with the lowest weight, that is, less
> > dependencies.
> > If there is a loop, the "c" variable in the "sort" function will
> > be larger than 0.
> 
> At the high-level your patch is almost ready.
> 
> You refactor the sorting code out into a static function that the compiler
> can inline and you use the same pattern in all places where sorting happens.
> This is good.

  The code is small enough that this way (static functions) should be
better. But it also cannot be really shared, without extra parameters,
as there are 3 slightly different variants of the loop.

> You are missing two important things:
> 
> * More code comments. You should comment each sort routine with a high-level
> comment about what you're doing and why. This is particularly important for
> the weight functions.

  Ok. I will work on that. A quick explanation is somewhat like this:
"""
  The algorithm implements a simple stable topological sort.

  The algorithm is expected to be fast and break cycles in a reasonable
way. If there are no cycles, it does a plain topological sort, that
does not reorder the input, that is, it is stable.

  Every link_map instance has a l_idx field, that in special cases,
currently only when a DSO is being unloaded, can be less than zero.
If the l_idx is not less than zero, and not in the dl_sort_fini call
that already initializes the l_idx entries, the sort function initializes
the l_idx field of every link_map pointer to an unique identifier.

  The sort is done in-place, by decrementing the insert position (the
"k" variable) in the maps vector. Sorting is done by swapping an entry
with the current insert position, and decrementing the insert position.

  The "sort" function allocates in the stack the "seen" vector, that
works as a boolean flag for every link_map, to know if it has been
already moved. Moving, actually swapping, is done at most once, and not
done if it is already in the proper position.

  The "sort" function also allocates in stack the "w" vector, that
is a counter of how many, not self and not already "seen" direct
references exist to a link_map entry.

  The "weight" function resets the "w" entries at every call. Note
that other than the first call, every call reduces the number of
entries by one, as the last moved link_map pointer does not need
to be verified again.

  In the "weight" function, once a link_map entry l_initfini vector
does not find any dependency, either the entry is a self dependency,
the object is being unloaded or the dependency has been moved
(in which case a cyclic dependency was just broken), the "w" entry
for it will be 0.

  Note that the "w" vector is indexed by position in the input/output
maps vector, while the "seen" vector is indexed by the "l_idx" field
of the "struct link_map" representing the DSO.

  Every loop, after calling the "weight" function, the "sort" searches
all "w" entries for the first entry matching the "c" variable. The "c"
variable is initialized as zero, and reset to zero after an entry
is moved. If it scans all "w" entries and none is equal to "c", it
increments "c"; if "c" is larger than zero it means there is a loop,
in which case it chooses the first one with the lowest weight.

  Once an entry is moved, it is marked as "seen", and if it was
part of a cycle, the cycle is now broken, as it will no longer
influence its direct dependencies.

  Recognized possible problems:
o The algorithm is not expected to scale well beyond a few thousand
  entries, due to the need to reinitialize the "w" vector.
o If there are multiple "solutions", it will frequently not match
  sorting done by the previous algorithm. But, it guarantees an
  stable sorting, that is, does not change input order.
o It does not attempt to be smart when finding loops, a better solution
  probably would be to break the smallest loop first, when there are
  multiple entries with the same weight, instead of choosing the first
  entry.
"""

> * Tests. The existing set of tests are insufficient to cover the changes you
> are proposing. While it seems unfair, and I understand that, this kind of
> change has not been undertaken because of the lack of testing. You need to
> add more tests. The *best* testing is actually a test which
> deterministically generates a sequence of permutations of linked objects and
> verifies they are sorted correctly by the algorithms in question. This way
> we can show that the sorting works for a large set of object dependency
> orderings. When we get a bug report, and we will, we can enlarge the set of
> automated testing to include the case the bug report claims and see what
> went wrong.

  I will work on test cases. So far my tests were glibc "make check" and
running rawhide and rhel-7 with the patch. I am particularly interested
in knowing if there is a test case for this commit:
https://sourceware.org/git/?p=glibc.git;a=blame;f=elf/dl-fini.c;h=c35577565eb66ac241365c05efe7c7fb791e0df2#l102
line 102, because it is another special variant of the loop, and what I
did to address it is to loop over the entries, and increase the "w"
vector of the dependencies, to ensure the dependencies are not moved
before the entry, but on very complex cycles, and if having conflicting
rules, it cannot guarantee it will really " preserve a link time dependency".
But I suspect the same is true for the current algorithm; worse if the
link time dependencies generate conflicting dependencies in cycles.

> If you can cover those two things, then you're ready to post the patch to
> libc-alpha for more formal review.

  Ok. I followed wiki instructions and sent the first version to
libc-alpha, but will now wait for a more concrete situation based
on bugzilla.

> This looks really good though and you're making great progress in a
> difficult part of the dynamic loader.
Comment 6 Carlos O'Donell 2014-11-28 15:51:05 UTC
> > If you can cover those two things, then you're ready to post the patch to
> > libc-alpha for more formal review.
> 
>   Ok. I followed wiki instructions and sent the first version to
> libc-alpha, but will now wait for a more concrete situation based
> on bugzilla.

There is nothing wrong with what you did, but you are unlikely to get serious review until your patch has the kind of quality expected by libc-alpha.

You are making great progress here though, and I appreciate your efforts. It's not easy to make changes to core runtimes, and as a community we try to hold ourselves to a high standard.
Comment 7 Paulo Andrade 2014-11-29 22:30:41 UTC
Created attachment 7977 [details]
Scratch test case

Carlos, I would appreciate any comments about the scratch test
case as well as suggestions about other patterns to test. There
is a detailed description on how to run the tests in the README
file.

For a quick visualization, the tests with different solutions
(due to cycles or multiple dsos without dependencies) are like
this, (old) means current glibc, (new) means with the proposed
patch:

a -> b
b -> c
c ->
d ->

link    old          new
------------------------------
abcd    dcba_abcd    cbad_abcd
abdc    cdba_abdc    dcba_abdc
acbd    dcba_abcd    cdba_abcd
acdb    cbda_adbc    cbad_adbc
adbc    cbda_adbc    dcba_adbc
adcb    cbda_adbc    dcba_abcd
bacd    dcba_abcd    cbda_abcd
badc    cdba_abdc    dcba_abdc
bcad    dcba_abcd    cbad_abcd
bcda    cbad_dabc    cbda_dabc
bdac    cbad_dabc    dcba_dabc
bdca    cbad_dabc    dcba_abcd
cabd    dcba_abcd    cdba_abcd
cadb    cbda_adbc    cbda_dabc
cbad    dcba_abcd    cdba_abcd
cbda    cbad_dabc    cbad_abdc
cdab    cbad_dabc    cbad_adbc
cdba    cbad_dabc    cdba_dabc
dabc    cbad_dabc    dcba_dabc
dacb    cbad_dabc    dcba_abcd
dbac    cbad_dabc    dcba_abdc
dbca    cbad_dabc    dcba_abcd
dcab    cbad_dabc    dcba_abcd
dcba    cbad_dabc    dcba_abcd


A -> B
B -> C
C -> D
D -> A

link    old          new
------------------------------
ABCD    DCBA_ABCD    ADCB_ABCD
ABDC    DCBA_ABCD    ADCB_DABC
ACBD    BADC_CDAB    ADCB_ABCD
ACDB    DCBA_ABCD    ADCB_CDAB
ADBC    DCBA_ABCD    ADCB_DABC
ADCB    DCBA_ABCD    ADCB_CDAB
BACD    ADCB_BCDA    BADC_ABCD
BADC    ADCB_BCDA    BADC_DABC
BCAD    ADCB_BCDA    BADC_ABCD
BCDA    ADCB_BCDA    BADC_BCDA
BDAC    ADCB_BCDA    BADC_DABC
BDCA    CBAD_DABC    BADC_BCDA
CABD    BADC_CDAB    CBAD_ABCD
CADB    DCBA_ABCD    CBAD_CDAB
CBAD    BADC_CDAB    CBAD_ABCD
CBDA    BADC_CDAB    CBAD_BCDA
CDAB    BADC_CDAB    CBAD_CDAB
CDBA    BADC_CDAB    CBAD_BCDA
DABC    CBAD_DABC    DCBA_DABC
DACB    CBAD_DABC    DCBA_CDAB
DBAC    ADCB_BCDA    DCBA_DABC
DBCA    CBAD_DABC    DCBA_BCDA
DCAB    CBAD_DABC    DCBA_CDAB
DCBA    CBAD_DABC    DCBA_BCDA
Comment 8 Paulo Andrade 2015-03-17 22:50:05 UTC
Created attachment 8193 [details]
Proposed patch

Updated proposed patch, without ChangeLog entry,
just for asking for extra comments.

I understand this patch touches a too sensible
part of glibc, and the only reason I wrote it is
due to the test case, where the proposed patch
runs 5+ orders of magnitude faster, and appears to
be fully functional, when tested in fedora and
running very complex applications.

This is basically the same patch, but now it scans
the list in reverse order. This is is actually the
correct way to do it, so that on most common/simple
cases/cycles it outputs the same sorting as current
glibc.

But I would like to have one clarification, that maybe
is being already tested, and the Makefile patch to
have only one result is in fault.

tst-initorder is basically this:

a2 a1          # a2 is linked to a1
b2 b1 a2       # b2 is linked to b1 and a2
a3 b2 b1       # a3 is linked to b2 and b1
a4 a3          # a4 is linked to a3
main a4 a1 b2  # "main" is linked to a4, a1 and b2

this will cause the sort input like this:

[ main a2 a1 b2 b1 a3 a4 ]

My question is, is some documentation that says
that dsos must be kept together in the ordering?
I mean, there are 2 correct results:

main a4 a3 b2 b1 a2 a1

and

main a4 a3 b2 a2 a1 b1

but if there is some specification that says
it should have "b2 b1" together, then the patch
is invalid, because it only breaks cycles and
does ordering, and if there are no dependencies,
it keeps it in the order it received the list
to sort.
Comment 9 Paulo Andrade 2015-03-18 17:20:55 UTC
Created attachment 8194 [details]
dso_deps.c - simple test implementing standalone propose sort and current glibc

Compile it as gcc -O0 -g3 dso_deps.c -o dso_deps

It accepts a very simple grammar input file, it
outputs "ORG: ..." to show how input was given,
"NEW: ..." as how the proposed patch would sort
it, and "CUR: ..." as the result of current
glibc sort. For example:

---8<---
$ cat dso_1
a b c
b c
c a
$ ./dso_deps -v dso_1
a(b c)
b(c)
c(a)
ORG: a b c
NEW: a b c
Real time: 4e-06 sec

CUR: a b c
Real time: 5e-06 sec

$ cat dso_2
a b c
d
e f
b c d e

$ ./dso_deps -v dso_2
a(b c)
b(c d e)
c()
d()
e(f)
f()
ORG: a b c d e f
NEW: a b c d e f
Real time: 5e-06 sec

CUR: a b c d e f
Real time: 3e-06 sec

$ cat dso_4
a b
b c
c d
d e
e f
$ ./dso_deps -v dso_4
a(b)
b(c)
c(d)
d(e)
e(f)
f()
ORG: a b c d e f
NEW: a b c d e f
Real time: 5e-06 sec

CUR: a b c d e f
Real time: 3e-06 sec

$ cat dso_5
b a
c b
d c
e d
f e
$ ./dso_deps -v dso_5
b(a)
a()
c(b)
d(c)
e(d)
f(e)
ORG: b a c d e f
NEW: f e d c b a
Real time: 5e-06 sec

CUR: f e d c b a
Real time: 6e-06 sec

---8<---


This is an example matching the current glibc build
time test, and that the proposed patch has a different
result from current glibc:
---8<---
$ cat dso_6
a2 a1
b2 b1 a2
a3 b2 b1
a4 a3
main a4 a1 b2
$ ./dso_deps -v dso_6
a2(a1)
a1()
b2(b1 a2)
b1()
a3(b2 b1)
a4(a3)
main(a4 a1 b2)
ORG: a2 a1 b2 b1 a3 a4 main
NEW: main a4 a3 b2 a2 a1 b1
Real time: 6e-06 sec

CUR: main a4 a3 b2 b1 a2 a1
Real time: 7e-06 sec

---8<---
Comment 10 Paulo Andrade 2015-03-18 17:26:00 UTC
Created attachment 8195 [details]
dso_3 - Sample input for previous test case, that summarizes the reason of the bug report

Since output is quite large, just showing the time result.

If built in "debug mode", -O0 -g3:

NEW: Real time: 0.004542 sec
CUR: Real time: 62.1445 sec

If built more close to default glibc, -O3:

NEW: Real time: 0.001232 sec
CUR: Real time: 22.4437 sec