Bug 17645 - RFE: Improve performance of dynamic loader for deeply nested DSO dependencies.
Summary: RFE: Improve performance of dynamic loader for deeply nested DSO dependencies.
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: 2.35
Assignee: Carlos O'Donell
URL:
Keywords:
: 26179 (view as bug list)
Depends on:
Blocks:
 
Reported: 2014-11-25 13:37 UTC by Paulo Andrade
Modified: 2023-11-24 07:24 UTC (History)
11 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Sample "redistributable" test case provided by a customer (329.60 KB, application/gzip)
2014-11-25 13:37 UTC, Paulo Andrade
Details
Proposed patch (3.08 KB, patch)
2014-11-25 13:52 UTC, Paulo Andrade
Details | Diff
simple sh based helper script to run the test case (138 bytes, text/plain)
2014-11-25 14:37 UTC, Paulo Andrade
Details
Scratch test case (4.72 KB, application/octet-stream)
2014-11-29 22:30 UTC, Paulo Andrade
Details
Proposed patch (3.07 KB, patch)
2015-03-17 22:50 UTC, Paulo Andrade
Details | Diff
dso_deps.c - simple test implementing standalone propose sort and current glibc (2.48 KB, text/plain)
2015-03-18 17:20 UTC, Paulo Andrade
Details
dso_3 - Sample input for previous test case, that summarizes the reason of the bug report (4.21 KB, text/plain)
2015-03-18 17:26 UTC, Paulo Andrade
Details
Proposed patch v3 updated to master branch as of 27th december 2020 (5.46 KB, text/plain)
2020-12-27 18:31 UTC, Romain Geissler
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Paulo Andrade 2014-11-25 13:37:04 UTC
Created attachment 7970 [details]
Sample "redistributable" test case provided by a customer

After glibc-2.11, there are 3 slightly different instances
of a very CPU intensive loop in elf/dl-deps.c, elf/dl-fini.c
and elf/dl-open.c. The loop does not scale well when there
are cyclic dependencies, and may take more than 8 seconds
to sort less than 300 shared objects when there a some
cycles. Most of the time is spent in memmove.

Sample perf output of it taking 8+ secs on a core i5:

 31,24%  timedlopen  ld-2.17.so         [.] _dl_map_object_deps
 18,36%  timedlopen  ld-2.17.so         [.] _wordcopy_fwd_aligned
 16,50%  timedlopen  ld-2.17.so         [.] _dl_sort_fini
 15,47%  timedlopen  ld-2.17.so         [.] dl_open_worker
 13,08%  timedlopen  ld-2.17.so         [.] _wordcopy_fwd_dest_aligned
  4,95%  timedlopen  ld-2.17.so         [.] memmove
  0,07%  timedlopen  ld-2.17.so         [.] do_lookup_x

I have been using a patch for some weeks in rhel-6.3
(for basic tests where the problem was detected/reported),
and my day to day computers, running rhel-7 and fedora
rawhide, and cannot see any problems running kde, gnome,
libreoffice and several other large applications or
environments that load/unload or have large dso dependency
chains.

The proposed patch runs the test case in 10 ms, and a
sample perf output is:

 17,45%  timedlopen  ld-2.17.so         [.] do_lookup_x
  9,54%  timedlopen  ld-2.17.so         [.] strcmp
  6,91%  timedlopen  ld-2.17.so         [.] _dl_map_object_deps
  6,12%  timedlopen  ld-2.17.so         [.] _dl_name_match_p
  4,35%  timedlopen  ld-2.17.so         [.] _dl_sort_fini
  4,33%  timedlopen  ld-2.17.so         [.] dl_open_worker
Comment 1 Paulo Andrade 2014-11-25 13:52:48 UTC
Created attachment 7972 [details]
Proposed patch

This patch implements a simple stable topological sort that moves
an entry at most once, by keeping track of, and resetting the weight
of the entries after every move. It breaks loops by choosing the
ones that appear first with the lowest weight, that is, less
dependencies.
If there is a loop, the "c" variable in the "sort" function will
be larger than 0.
Comment 2 Paulo Andrade 2014-11-25 14:10:17 UTC
The change to elf/Makefile is to make the test case have
only one valid result.

The expected result is:

start_a1
start_a2
start_b1
start_b2
start_a3
start_a4
main
finish_a4
finish_a3
finish_b2
finish_b1
finish_a2
finish_a1

but "a1" and "b1" do not have any dependencies, and
without the elf/Makefile patch it would result in:

start_a1
start_b1
start_a2
start_b2
start_a3
start_a4
main
finish_a4
finish_a3
finish_b2
finish_b1
finish_a2
finish_a1

The "input" of that test case, wihtout the patch is:

[0] = ""		()
[1] = a4		(a4 a3 libc.so)
[2] = a1		(a1 libc.so)
[3] = b2		(b2 b1 a2 libc.so)
[4] = libc.so		(libc.so ld.so)
[5] = a3		(a3 b2 b1 libc.so)
[6] = b1		(b1 libc.so)
[7] = a2		(a2 a1 libc.so)
[8] = ld.so		()

where items in () are the dependencies of an object,
and [] is the input order.
Comment 3 Paulo Andrade 2014-11-25 14:37:37 UTC
Created attachment 7973 [details]
simple sh based helper script to run the test case

Simple script to make it easier to run the test case.
Just unpack the test case and run sourceme.sh to set
environment variables and have it to echo how to run
the tests. This allows keeping unchanged the test case
(not done by me).
Comment 4 Carlos O'Donell 2014-11-26 16:41:50 UTC
(In reply to Paulo Andrade from comment #1)
> Created attachment 7972 [details]
> Proposed patch
> 
> This patch implements a simple stable topological sort that moves
> an entry at most once, by keeping track of, and resetting the weight
> of the entries after every move. It breaks loops by choosing the
> ones that appear first with the lowest weight, that is, less
> dependencies.
> If there is a loop, the "c" variable in the "sort" function will
> be larger than 0.

At the high-level your patch is almost ready.

You refactor the sorting code out into a static function that the compiler can inline and you use the same pattern in all places where sorting happens. This is good.

You are missing two important things:

* More code comments. You should comment each sort routine with a high-level comment about what you're doing and why. This is particularly important for the weight functions.

* Tests. The existing set of tests are insufficient to cover the changes you are proposing. While it seems unfair, and I understand that, this kind of change has not been undertaken because of the lack of testing. You need to add more tests. The *best* testing is actually a test which deterministically generates a sequence of permutations of linked objects and verifies they are sorted correctly by the algorithms in question. This way we can show that the sorting works for a large set of object dependency orderings. When we get a bug report, and we will, we can enlarge the set of automated testing to include the case the bug report claims and see what went wrong.

If you can cover those two things, then you're ready to post the patch to libc-alpha for more formal review.

This looks really good though and you're making great progress in a difficult part of the dynamic loader.
Comment 5 Paulo Andrade 2014-11-26 21:56:41 UTC
(In reply to Carlos O'Donell from comment #4)
> (In reply to Paulo Andrade from comment #1)
> > Created attachment 7972 [details]
> > Proposed patch
> > 
> > This patch implements a simple stable topological sort that moves
> > an entry at most once, by keeping track of, and resetting the weight
> > of the entries after every move. It breaks loops by choosing the
> > ones that appear first with the lowest weight, that is, less
> > dependencies.
> > If there is a loop, the "c" variable in the "sort" function will
> > be larger than 0.
> 
> At the high-level your patch is almost ready.
> 
> You refactor the sorting code out into a static function that the compiler
> can inline and you use the same pattern in all places where sorting happens.
> This is good.

  The code is small enough that this way (static functions) should be
better. But it also cannot be really shared, without extra parameters,
as there are 3 slightly different variants of the loop.

> You are missing two important things:
> 
> * More code comments. You should comment each sort routine with a high-level
> comment about what you're doing and why. This is particularly important for
> the weight functions.

  Ok. I will work on that. A quick explanation is somewhat like this:
"""
  The algorithm implements a simple stable topological sort.

  The algorithm is expected to be fast and break cycles in a reasonable
way. If there are no cycles, it does a plain topological sort, that
does not reorder the input, that is, it is stable.

  Every link_map instance has a l_idx field, that in special cases,
currently only when a DSO is being unloaded, can be less than zero.
If the l_idx is not less than zero, and not in the dl_sort_fini call
that already initializes the l_idx entries, the sort function initializes
the l_idx field of every link_map pointer to an unique identifier.

  The sort is done in-place, by decrementing the insert position (the
"k" variable) in the maps vector. Sorting is done by swapping an entry
with the current insert position, and decrementing the insert position.

  The "sort" function allocates in the stack the "seen" vector, that
works as a boolean flag for every link_map, to know if it has been
already moved. Moving, actually swapping, is done at most once, and not
done if it is already in the proper position.

  The "sort" function also allocates in stack the "w" vector, that
is a counter of how many, not self and not already "seen" direct
references exist to a link_map entry.

  The "weight" function resets the "w" entries at every call. Note
that other than the first call, every call reduces the number of
entries by one, as the last moved link_map pointer does not need
to be verified again.

  In the "weight" function, once a link_map entry l_initfini vector
does not find any dependency, either the entry is a self dependency,
the object is being unloaded or the dependency has been moved
(in which case a cyclic dependency was just broken), the "w" entry
for it will be 0.

  Note that the "w" vector is indexed by position in the input/output
maps vector, while the "seen" vector is indexed by the "l_idx" field
of the "struct link_map" representing the DSO.

  Every loop, after calling the "weight" function, the "sort" searches
all "w" entries for the first entry matching the "c" variable. The "c"
variable is initialized as zero, and reset to zero after an entry
is moved. If it scans all "w" entries and none is equal to "c", it
increments "c"; if "c" is larger than zero it means there is a loop,
in which case it chooses the first one with the lowest weight.

  Once an entry is moved, it is marked as "seen", and if it was
part of a cycle, the cycle is now broken, as it will no longer
influence its direct dependencies.

  Recognized possible problems:
o The algorithm is not expected to scale well beyond a few thousand
  entries, due to the need to reinitialize the "w" vector.
o If there are multiple "solutions", it will frequently not match
  sorting done by the previous algorithm. But, it guarantees an
  stable sorting, that is, does not change input order.
o It does not attempt to be smart when finding loops, a better solution
  probably would be to break the smallest loop first, when there are
  multiple entries with the same weight, instead of choosing the first
  entry.
"""

> * Tests. The existing set of tests are insufficient to cover the changes you
> are proposing. While it seems unfair, and I understand that, this kind of
> change has not been undertaken because of the lack of testing. You need to
> add more tests. The *best* testing is actually a test which
> deterministically generates a sequence of permutations of linked objects and
> verifies they are sorted correctly by the algorithms in question. This way
> we can show that the sorting works for a large set of object dependency
> orderings. When we get a bug report, and we will, we can enlarge the set of
> automated testing to include the case the bug report claims and see what
> went wrong.

  I will work on test cases. So far my tests were glibc "make check" and
running rawhide and rhel-7 with the patch. I am particularly interested
in knowing if there is a test case for this commit:
https://sourceware.org/git/?p=glibc.git;a=blame;f=elf/dl-fini.c;h=c35577565eb66ac241365c05efe7c7fb791e0df2#l102
line 102, because it is another special variant of the loop, and what I
did to address it is to loop over the entries, and increase the "w"
vector of the dependencies, to ensure the dependencies are not moved
before the entry, but on very complex cycles, and if having conflicting
rules, it cannot guarantee it will really " preserve a link time dependency".
But I suspect the same is true for the current algorithm; worse if the
link time dependencies generate conflicting dependencies in cycles.

> If you can cover those two things, then you're ready to post the patch to
> libc-alpha for more formal review.

  Ok. I followed wiki instructions and sent the first version to
libc-alpha, but will now wait for a more concrete situation based
on bugzilla.

> This looks really good though and you're making great progress in a
> difficult part of the dynamic loader.
Comment 6 Carlos O'Donell 2014-11-28 15:51:05 UTC
> > If you can cover those two things, then you're ready to post the patch to
> > libc-alpha for more formal review.
> 
>   Ok. I followed wiki instructions and sent the first version to
> libc-alpha, but will now wait for a more concrete situation based
> on bugzilla.

There is nothing wrong with what you did, but you are unlikely to get serious review until your patch has the kind of quality expected by libc-alpha.

You are making great progress here though, and I appreciate your efforts. It's not easy to make changes to core runtimes, and as a community we try to hold ourselves to a high standard.
Comment 7 Paulo Andrade 2014-11-29 22:30:41 UTC
Created attachment 7977 [details]
Scratch test case

Carlos, I would appreciate any comments about the scratch test
case as well as suggestions about other patterns to test. There
is a detailed description on how to run the tests in the README
file.

For a quick visualization, the tests with different solutions
(due to cycles or multiple dsos without dependencies) are like
this, (old) means current glibc, (new) means with the proposed
patch:

a -> b
b -> c
c ->
d ->

link    old          new
------------------------------
abcd    dcba_abcd    cbad_abcd
abdc    cdba_abdc    dcba_abdc
acbd    dcba_abcd    cdba_abcd
acdb    cbda_adbc    cbad_adbc
adbc    cbda_adbc    dcba_adbc
adcb    cbda_adbc    dcba_abcd
bacd    dcba_abcd    cbda_abcd
badc    cdba_abdc    dcba_abdc
bcad    dcba_abcd    cbad_abcd
bcda    cbad_dabc    cbda_dabc
bdac    cbad_dabc    dcba_dabc
bdca    cbad_dabc    dcba_abcd
cabd    dcba_abcd    cdba_abcd
cadb    cbda_adbc    cbda_dabc
cbad    dcba_abcd    cdba_abcd
cbda    cbad_dabc    cbad_abdc
cdab    cbad_dabc    cbad_adbc
cdba    cbad_dabc    cdba_dabc
dabc    cbad_dabc    dcba_dabc
dacb    cbad_dabc    dcba_abcd
dbac    cbad_dabc    dcba_abdc
dbca    cbad_dabc    dcba_abcd
dcab    cbad_dabc    dcba_abcd
dcba    cbad_dabc    dcba_abcd


A -> B
B -> C
C -> D
D -> A

link    old          new
------------------------------
ABCD    DCBA_ABCD    ADCB_ABCD
ABDC    DCBA_ABCD    ADCB_DABC
ACBD    BADC_CDAB    ADCB_ABCD
ACDB    DCBA_ABCD    ADCB_CDAB
ADBC    DCBA_ABCD    ADCB_DABC
ADCB    DCBA_ABCD    ADCB_CDAB
BACD    ADCB_BCDA    BADC_ABCD
BADC    ADCB_BCDA    BADC_DABC
BCAD    ADCB_BCDA    BADC_ABCD
BCDA    ADCB_BCDA    BADC_BCDA
BDAC    ADCB_BCDA    BADC_DABC
BDCA    CBAD_DABC    BADC_BCDA
CABD    BADC_CDAB    CBAD_ABCD
CADB    DCBA_ABCD    CBAD_CDAB
CBAD    BADC_CDAB    CBAD_ABCD
CBDA    BADC_CDAB    CBAD_BCDA
CDAB    BADC_CDAB    CBAD_CDAB
CDBA    BADC_CDAB    CBAD_BCDA
DABC    CBAD_DABC    DCBA_DABC
DACB    CBAD_DABC    DCBA_CDAB
DBAC    ADCB_BCDA    DCBA_DABC
DBCA    CBAD_DABC    DCBA_BCDA
DCAB    CBAD_DABC    DCBA_CDAB
DCBA    CBAD_DABC    DCBA_BCDA
Comment 8 Paulo Andrade 2015-03-17 22:50:05 UTC
Created attachment 8193 [details]
Proposed patch

Updated proposed patch, without ChangeLog entry,
just for asking for extra comments.

I understand this patch touches a too sensible
part of glibc, and the only reason I wrote it is
due to the test case, where the proposed patch
runs 5+ orders of magnitude faster, and appears to
be fully functional, when tested in fedora and
running very complex applications.

This is basically the same patch, but now it scans
the list in reverse order. This is is actually the
correct way to do it, so that on most common/simple
cases/cycles it outputs the same sorting as current
glibc.

But I would like to have one clarification, that maybe
is being already tested, and the Makefile patch to
have only one result is in fault.

tst-initorder is basically this:

a2 a1          # a2 is linked to a1
b2 b1 a2       # b2 is linked to b1 and a2
a3 b2 b1       # a3 is linked to b2 and b1
a4 a3          # a4 is linked to a3
main a4 a1 b2  # "main" is linked to a4, a1 and b2

this will cause the sort input like this:

[ main a2 a1 b2 b1 a3 a4 ]

My question is, is some documentation that says
that dsos must be kept together in the ordering?
I mean, there are 2 correct results:

main a4 a3 b2 b1 a2 a1

and

main a4 a3 b2 a2 a1 b1

but if there is some specification that says
it should have "b2 b1" together, then the patch
is invalid, because it only breaks cycles and
does ordering, and if there are no dependencies,
it keeps it in the order it received the list
to sort.
Comment 9 Paulo Andrade 2015-03-18 17:20:55 UTC
Created attachment 8194 [details]
dso_deps.c - simple test implementing standalone propose sort and current glibc

Compile it as gcc -O0 -g3 dso_deps.c -o dso_deps

It accepts a very simple grammar input file, it
outputs "ORG: ..." to show how input was given,
"NEW: ..." as how the proposed patch would sort
it, and "CUR: ..." as the result of current
glibc sort. For example:

---8<---
$ cat dso_1
a b c
b c
c a
$ ./dso_deps -v dso_1
a(b c)
b(c)
c(a)
ORG: a b c
NEW: a b c
Real time: 4e-06 sec

CUR: a b c
Real time: 5e-06 sec

$ cat dso_2
a b c
d
e f
b c d e

$ ./dso_deps -v dso_2
a(b c)
b(c d e)
c()
d()
e(f)
f()
ORG: a b c d e f
NEW: a b c d e f
Real time: 5e-06 sec

CUR: a b c d e f
Real time: 3e-06 sec

$ cat dso_4
a b
b c
c d
d e
e f
$ ./dso_deps -v dso_4
a(b)
b(c)
c(d)
d(e)
e(f)
f()
ORG: a b c d e f
NEW: a b c d e f
Real time: 5e-06 sec

CUR: a b c d e f
Real time: 3e-06 sec

$ cat dso_5
b a
c b
d c
e d
f e
$ ./dso_deps -v dso_5
b(a)
a()
c(b)
d(c)
e(d)
f(e)
ORG: b a c d e f
NEW: f e d c b a
Real time: 5e-06 sec

CUR: f e d c b a
Real time: 6e-06 sec

---8<---


This is an example matching the current glibc build
time test, and that the proposed patch has a different
result from current glibc:
---8<---
$ cat dso_6
a2 a1
b2 b1 a2
a3 b2 b1
a4 a3
main a4 a1 b2
$ ./dso_deps -v dso_6
a2(a1)
a1()
b2(b1 a2)
b1()
a3(b2 b1)
a4(a3)
main(a4 a1 b2)
ORG: a2 a1 b2 b1 a3 a4 main
NEW: main a4 a3 b2 a2 a1 b1
Real time: 6e-06 sec

CUR: main a4 a3 b2 b1 a2 a1
Real time: 7e-06 sec

---8<---
Comment 10 Paulo Andrade 2015-03-18 17:26:00 UTC
Created attachment 8195 [details]
dso_3 - Sample input for previous test case, that summarizes the reason of the bug report

Since output is quite large, just showing the time result.

If built in "debug mode", -O0 -g3:

NEW: Real time: 0.004542 sec
CUR: Real time: 62.1445 sec

If built more close to default glibc, -O3:

NEW: Real time: 0.001232 sec
CUR: Real time: 22.4437 sec
Comment 11 Marat Radchenko 2019-04-14 21:49:42 UTC
(In reply to Paulo Andrade from comment #8)
> My question is, is some documentation that says
> that dsos must be kept together in the ordering?
> I mean, there are 2 correct results:
> 
> main a4 a3 b2 b1 a2 a1
> 
> and
> 
> main a4 a3 b2 a2 a1 b1
> 
> but if there is some specification that says
> it should have "b2 b1" together, then the patch
> is invalid, because it only breaks cycles and
> does ordering, and if there are no dependencies,
> it keeps it in the order it received the list
> to sort.

I'm not aware of any such spefication, but current glibc implementation does keep direct dependencies together and this looks like A Good Thing.

#15310 suggest using Tarjan SCC for toposort and it is expected to keep direct deps together.

With regard to your patch - what big-O complexity does it have? If I read the code correctly, you recalculate (num_maps - num_seen) weights whenever you mark link_map as seen. This is just a better than O(n^2). But the problem we're talking about (toposort) is believed to be linear.
Comment 12 Paulo Andrade 2019-06-17 20:01:12 UTC
(In reply to Marat Radchenko from comment #11)
> > My question is, is some documentation that says
> > that dsos must be kept together in the ordering?
> > I mean, there are 2 correct results:
> > 
> > main a4 a3 b2 b1 a2 a1
> > 
> > and
> > 
> > main a4 a3 b2 a2 a1 b1
> > 
> > but if there is some specification that says
> > it should have "b2 b1" together, then the patch
> > is invalid, because it only breaks cycles and
> > does ordering, and if there are no dependencies,
> > it keeps it in the order it received the list
> > to sort.
> 
> I'm not aware of any such spefication, but current glibc implementation does
> keep direct dependencies together and this looks like A Good Thing.
> 
> #15310 suggest using Tarjan SCC for toposort and it is expected to keep
> direct deps together.
> 
> With regard to your patch - what big-O complexity does it have? If I read
> the code correctly, you recalculate (num_maps - num_seen) weights whenever
> you mark link_map as seen. This is just a better than O(n^2). But the
> problem we're talking about (toposort) is believed to be linear.

  The weight computation is basically O(n*m) where 'm' is the number of dependencies
of the dso, otherwise, the sorting is linear time. It recomputes the weight once a
dso is moved out of the sort input, as the remaining set may no longer have a dependency
on the moved one. The idea is to attempt to mimic the current implementation, without
doing memmove calls to reorder entries based on dependencies.
  It also does it this way to make the code simpler, and work on the worst scenario
where there are cycles. It does not know details about cycles, so, it just breaks
them in the 'lowest weight'.
  The sample parser with the reverse_sort function in the attachment at
https://bugzilla.redhat.com/show_bug.cgi?id=1162810 should probably be the best
version of the proposed logic, but otherwise, if there is only one solution, either
variant always finds it. The major problem are possible uncertainties, like some
kind of missing dependency information, and then, rely order of existing the algorithm...
Comment 13 Romain Geissler 2020-05-06 16:30:49 UTC
Hi,

As far as I understand, a patch to fix this bug (and the ones related 15310 and 15311) has been posted on the list last summer:
 - The testing framework (which was the reason all past attempt did not eventually got merged): https://sourceware.org/pipermail/libc-alpha/2019-July/105199.html
 - the actual patch: https://sourceware.org/pipermail/libc-alpha/2019-July/105200.html

Carlos, I also understood that back then during the 2.31 release period, at some point you did consider these patches for a review, and potentially a merge into 2.31. It looks like you lacked time to review this. May I gently ping this patch for a review again ?

Thanks,
Romain
Comment 14 Carlos O'Donell 2020-05-06 16:54:01 UTC
(In reply to Romain Geissler from comment #13)
> Hi,
> 
> As far as I understand, a patch to fix this bug (and the ones related 15310
> and 15311) has been posted on the list last summer:
>  - The testing framework (which was the reason all past attempt did not
> eventually got merged):
> https://sourceware.org/pipermail/libc-alpha/2019-July/105199.html
>  - the actual patch:
> https://sourceware.org/pipermail/libc-alpha/2019-July/105200.html
> 
> Carlos, I also understood that back then during the 2.31 release period, at
> some point you did consider these patches for a review, and potentially a
> merge into 2.31. It looks like you lacked time to review this. May I gently
> ping this patch for a review again ?

I reviewed the patches from Chung-Ling Tang here:
https://sourceware.org/pipermail/libc-alpha/2019-December/109676.html
Comment 15 Carlos O'Donell 2020-07-03 16:18:21 UTC
Reviewed v2.1:
https://sourceware.org/pipermail/libc-alpha/2020-June/115184.html
https://sourceware.org/pipermail/libc-alpha/2020-June/115538.html

Looking like we're almost done, but a few design points need ironing out.
Comment 16 Carlos O'Donell 2020-07-03 16:18:27 UTC
*** Bug 26179 has been marked as a duplicate of this bug. ***
Comment 17 Romain Geissler 2020-12-27 18:31:44 UTC
Created attachment 13079 [details]
Proposed patch v3 updated to master branch as of 27th december 2020

Proposed patch v3 written by Chung-Lin Tang and posted in https://sourceware.org/pipermail/libc-alpha/2020-July/116546.html, updated to master branch as of 27th december 2020
Comment 18 Romain Geissler 2020-12-27 18:33:03 UTC
Hi,

I am not sure if this proposed patch is still being considered for eventual review/merge.

Anyway, in comment #17 is an updated version of the version 3 patch written by Chung-Lin Tang and posted on the mailing list here: https://sourceware.org/pipermail/libc-alpha/2020-July/116546.html (only the part of the patch adding the code, *NOT* the part of the patch updating the test suite), as what was posted in July doesn't apply cleanly anymore.

Cheers,
Romain
Comment 19 .san 2021-03-03 22:29:22 UTC
Hi all,

It is probably not common knowledge that the Unreal Engine 4, a popular and very advanced real-time 3D creation tool used for game and other real-time rendering applications, is affected by the behavior described in this ticket. 

The patches, version 4, as provided by Chung-Lin Tang reduce the startup time of UE4 from 200+ seconds to ~15 seconds. 

Not sure if it should affect the priority on this ticket but I wanted to let you all know that there are real world examples of this issue and having a fix out would be a very significant improvement for Game Developers on Linux. 

Considering an opt-in behavior change, toggle-able with an global/environment variable, would allow us devs to use upstream glibc again.

Cheers!
.san
Comment 20 Marat Radchenko 2021-07-06 11:08:29 UTC
It looks like Chung-Lin Tang patch (that is currently at V5) got stuck again: https://patchwork.sourceware.org/project/glibc/list/?q=17645

Meanwhile, there now exist several unofficial sources that provide patched glibc:

For Arch Linux: https://aur.archlinux.org/packages/glibc-dso/
For Ubuntu Linux: https://launchpad.net/~slonopotamus/+archive/ubuntu/glibc-dso
Gentoo Linux users just put patches into /etc/portage/patches/sys-libs/glibc/ and rebuild glibc.
Comment 21 Romain Geissler 2021-08-10 13:03:55 UTC
Hi,

As stated some months ago on the list, we are using this patch (an earlier version) in our own glibc and we need that for some specific production workloads. We will upgrade to the latest patch version with the move to glibc 2.34.

I hope Florian/Carlos will have the time to do another round of review with the latest v6 patch before the release of glibc 2.35 ;)

Cheers,
Romain
Comment 22 Sourceware Commits 2021-10-21 19:21:18 UTC
The master branch has been updated by Adhemerval Zanella <azanella@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e6fd79f3795d46dfb583e124be49fc063bc3d58b

commit e6fd79f3795d46dfb583e124be49fc063bc3d58b
Author: Chung-Lin Tang <cltang@codesourcery.com>
Date:   Thu Oct 21 21:41:21 2021 +0800

    elf: Testing infrastructure for ld.so DSO sorting (BZ #17645)
    
    This is the first of a 2-part patch set that fixes slow DSO sorting behavior in
    the dynamic loader, as reported in BZ #17645. In order to facilitate such a
    large modification to the dynamic loader, this first patch implements a testing
    framework for validating shared object sorting behavior, to enable comparison
    between old/new sorting algorithms, and any later enhancements.
    
    This testing infrastructure consists of a Python script
    scripts/dso-ordering-test.py' which takes in a description language, consisting
    of strings that describe a set of link dependency relations between DSOs, and
    generates testcase programs and Makefile fragments to automatically test the
    described situation, for example:
    
      a->b->c->d          # four objects linked one after another
    
      a->[bc]->d;b->c     # a depends on b and c, which both depend on d,
                          # b depends on c (b,c linked to object a in fixed order)
    
      a->b->c;{+a;%a;-a}  # a, b, c serially dependent, main program uses
                          # dlopen/dlsym/dlclose on object a
    
      a->b->c;{}!->[abc]  # a, b, c serially dependent; multiple tests generated
                          # to test all permutations of a, b, c ordering linked
                          # to main program
    
     (Above is just a short description of what the script can do, more
      documentation is in the script comments.)
    
    Two files containing several new tests, elf/dso-sort-tests-[12].def are added,
    including test scenarios for BZ #15311 and Redhat issue #1162810 [1].
    
    Due to the nature of dynamic loader tests, where the sorting behavior and test
    output occurs before/after main(), generating testcases to use
    support/test-driver.c does not suffice to control meaningful timeout for ld.so.
    Therefore a new utility program 'support/test-run-command', based on
    test-driver.c/support_test_main.c has been added. This does the same testcase
    control, but for a program specified through a command-line rather than at the
    source code level. This utility is used to run the dynamic loader testcases
    generated by dso-ordering-test.py.
    
    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1162810
    
    Signed-off-by: Chung-Lin Tang  <cltang@codesourcery.com>
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Comment 23 Adhemerval Zanella 2021-10-21 19:23:28 UTC
A new algorithm based on topologic sort has been added on master (15a0c5730d1d5aeb95f50c9ec7470640084feae8) and it can be selected throug a new tunable (glibc.rtld.dynamic_sort).  I will close this task and coordenate upstream when and how to move to this algorithm as the default one.