Bug 27695 - ld.so has poor performance characteristics when loading large quantities of .so files
Status: WAITING
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: 2.28
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-05 17:58 UTC by steve.gargolinski
Modified: 2022-03-03 23:36 UTC
CC List: 4 users

See Also:
Host:
Target:
Build:
Last reconfirmed: 2022-01-27 00:00:00
fweimer: security-


Attachments
Core dump from SIGSEGV in ld (deleted)
2022-03-03 22:56 UTC, Pieter-Jan Briers

Description steve.gargolinski 2021-04-05 17:58:32 UTC
Our application is growing and our startup time is increasing significantly on Linux while remaining fairly consistent on Windows. A typical startup workflow that we've been measuring takes about 10 seconds on Windows and over 60 seconds on Linux with comparable hardware.

Profiling the platform startup time difference attributes the time completely to ld.so. We did a bunch of experimentation and investigation and realized that our growing quantity of dynamic libraries is a major contributor to this change.

To replicate this outside of our product, we generated a small sample application that measures the time to load 100,000 small generated classes (constructor, virtual destructor) spread across a varying quantity of dynamic libraries. Loading these 100,000 classes in one dynamic library takes about 0.3 seconds. Loading the same 100,000 classes spread across 1,000 libraries takes over 9 seconds!

Back to our real world use case. In our product we generally load libraries with RTLD_GLOBAL. One of the main performance bottlenecks we were able to identify is in _dl_lookup_symbol_x(). When searching the global scope (symbol_scope[0]), the search found nothing more than 50% of the time, and each of those failed searches took time linear in the number of loaded libraries.

return _dl_lookup_symbol_x(undef_name, undef_map, ref, symbol_scope, version, type_class, flags, skip_map);
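
To illustrate why a miss is the expensive case, here is a minimal model of a scope walk. It is only a sketch and not glibc code: struct module and scope_lookup() are hypothetical names, and real objects use per-object hash tables rather than a plain string list. The point it shows is that a symbol absent from the scope is only known to be absent after every module has been consulted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one loaded shared object: its name and a
   NULL-terminated list of the symbols it defines. Not a glibc type. */
struct module {
    const char *name;
    const char *const *symbols;
};

/* Walk the scope in load order. Returns the first module defining sym,
   or NULL if none does. *probes is set to the number of modules that
   had to be consulted before the answer was known. */
static const struct module *
scope_lookup(const struct module *scope, size_t n, const char *sym,
             size_t *probes)
{
    assert(scope != NULL || n == 0);
    *probes = 0;
    for (size_t i = 0; i < n; i++) {
        ++*probes;
        for (const char *const *s = scope[i].symbols; *s != NULL; s++)
            if (strcmp(*s, sym) == 0)
                return &scope[i];
    }
    return NULL; /* a miss always costs n probes */
}
```

In this model a hit in the first module costs one probe, while a miss over 1,000 modules costs 1,000.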

A major portion of our 60 second startup time is spent here. We experimented with adding a hashset of symbols previously loaded into the global scope (updated in add_to_global()) so that we could get constant time lookup on this check instead of linear. This was a major improvement to both our test application and our real product.
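
As a rough sketch of the idea (not our actual patch, and not glibc code), a set of names known to be in the global scope can answer "definitely not present" without walking the libraries. All names here (struct symset, symset_insert(), symset_contains()) are hypothetical, and the sketch ignores symbol versioning and removal of entries on dlclose.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical open-addressing set of symbol names. Names are stored by
   pointer, so the caller must keep the strings alive. */
struct symset {
    const char **slots; /* NULL marks an empty slot */
    size_t cap;         /* always a power of two */
    size_t len;
};

static uint32_t hash_name(const char *s) /* 32-bit FNV-1a */
{
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

static int symset_init(struct symset *set, size_t cap)
{
    assert(cap != 0 && (cap & (cap - 1)) == 0);
    set->slots = calloc(cap, sizeof *set->slots);
    set->cap = cap;
    set->len = 0;
    return set->slots != NULL ? 0 : -1;
}

/* Expected constant time: linear probing stops at the first empty slot. */
static int symset_contains(const struct symset *set, const char *name)
{
    size_t mask = set->cap - 1;
    for (size_t i = hash_name(name) & mask; set->slots[i] != NULL;
         i = (i + 1) & mask)
        if (strcmp(set->slots[i], name) == 0)
            return 1;
    return 0;
}

static void symset_place(const char **slots, size_t cap, const char *name)
{
    size_t i = hash_name(name) & (cap - 1);
    while (slots[i] != NULL)
        i = (i + 1) & (cap - 1);
    slots[i] = name;
}

/* Returns 0 on success (including duplicates), -1 on allocation failure. */
static int symset_insert(struct symset *set, const char *name)
{
    if (symset_contains(set, name))
        return 0;
    if ((set->len + 1) * 2 > set->cap) { /* keep load factor <= 1/2 */
        size_t ncap = set->cap * 2;
        const char **nslots = calloc(ncap, sizeof *nslots);
        if (nslots == NULL)
            return -1;
        for (size_t i = 0; i < set->cap; i++)
            if (set->slots[i] != NULL)
                symset_place(nslots, ncap, set->slots[i]);
        free(set->slots);
        set->slots = nslots;
        set->cap = ncap;
    }
    symset_place(set->slots, set->cap, name);
    set->len++;
    return 0;
}
```

The set would be filled whenever a library is added to the global scope, and consulted before the per-library walk: if symset_contains() returns 0, the global-scope search can be skipped entirely.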

The test application mentioned above, which previously took 9 seconds to load 1,000 libraries, now performs the same operation in 1 second.

We've prototyped a strategy to dynamically patch ld.so at startup of our application and our workflow time measurements improved from 60 seconds to 30 seconds. Still not nearly as fast as Windows, but a major improvement. We've tested this on a bunch of versions of multiple distributions and have been able to improve all of them.

With this change we're adding some memory overhead. Also, applications that load only a small number of dynamic libraries will see no timing improvement (and may even regress because of the time spent populating the hashset), but it's a huge improvement for our use case.

I'm happy to share any of the fixes or investigations in more detail. Improving ld.so performance as dynamic library quantity scales is really important to our use case and we're looking for input on whether this can be a useful addition to the glibc codebase.
Comment 1 H.J. Lu 2021-04-05 20:57:42 UTC
This may be related to PR 17645.
Comment 2 steve.gargolinski 2021-04-22 13:15:44 UTC
@H.J. Lu - We experimented with PR 17645 and it had no effect on our use case at all. Performance was the same with or without that patch.
Comment 3 Florian Weimer 2022-01-28 11:07:33 UTC
I assume you use BIND_NOW (which is definitely the forward-looking thing to do). This means that you already have to handle inter-module dependencies in some way. Why do you need to use RTLD_GLOBAL, then?

Is it perhaps because you want to use dlsym as some sort of global service lookup mechanism, without having to find the implementing module first?

Typically, switching to RTLD_LOCAL provides a nice speedup for dlopen, for non-degenerate dependency graphs.
Comment 4 steve.gargolinski 2022-02-08 22:07:09 UTC
We use a combination of RTLD_LOCAL and RTLD_GLOBAL in our program. For the core libraries that most other libraries and plugins depend on, we load with RTLD_GLOBAL to avoid cross-library exception and RTTI issues. Also, localizing the core libraries prevents our program from “stacked” dynamic loading (e.g., at some later time dynamically loading another shared object that depends on the same core libraries).

For small plugins that are self-contained, our program uses RTLD_LOCAL to load them. As to RTLD_NOW, we use that to avoid runtime symbol resolution issues, which can only be found via exhaustive testing when using RTLD_LAZY.
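
A minimal sketch of that loading policy, for illustration only: module_flags() and load_module() are hypothetical helpers, not code from our product. Only dlopen(), dlerror() and the RTLD_* constants are the real <dlfcn.h> interface. On glibc older than 2.34 this needs to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical policy helper: core libraries are made globally visible,
   self-contained plugins stay local. Both are bound eagerly so that
   unresolved symbols show up at load time rather than at first call. */
static int module_flags(int is_core)
{
    return RTLD_NOW | (is_core ? RTLD_GLOBAL : RTLD_LOCAL);
}

/* Returns the dlopen handle, or NULL after reporting the loader's
   error message on stderr. */
static void *load_module(const char *path, int is_core)
{
    assert(path != NULL);
    void *handle = dlopen(path, module_flags(is_core));
    if (handle == NULL)
        fprintf(stderr, "dlopen(%s): %s\n", path, dlerror());
    return handle;
}
```

With RTLD_NOW, a library with a missing dependency fails in load_module() itself instead of at some later call site.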

If a targeted discussion about our use case would be helpful to dig into more details, we would definitely be willing to do so.
Comment 5 Florian Weimer 2022-03-01 10:48:26 UTC
(In reply to steve.gargolinski from comment #4)
> We use a combination of RTLD_LOCAL and RTLD_GLOBAL in our program. For the
> core libraries that most other libraries and plugins depend on, we use
> RTLD_GLOBAL to load to avoid cross-library exception and RTTI issues.

Is this really necessary? Do you use the RTTI/exception handling implementation from GCC? I think it disregards type metadata addresses and falls back on type name comparisons. (Historically, this was not the case.)

Maybe narrowing the search scope is the way to tackle this.

> Also,
> localizing the core libraries prevent our program from “stacked” dynamic
> loading (e.g., at some later time dynamically loading another shared object
> that depends on the same core libraries).

Sorry, I don't understand this comment.

> For small plugins that are self-contained, our program uses RTLD_LOCAL to
> load them. As to RTLD_NOW, we use that to avoid runtime symbol resolution
> issues, which can only be found via exhaustive testing when using RTLD_LAZY.

Right, that makes a lot of sense to me. RTLD_NOW does greatly amplify issues with binding performance, but we prefer it for its predictability (and the full RELRO hardening it enables).

> If a targeted discussion about our use case would be helpful to dig into
> more details, we would definitely be willing to do so.

It's very hard to improve the general case. It may be possible to lift some ideas from JOIN optimization for relational databases.

Are many of your shared objects loaded from the system search paths covered by /etc/ld.so.cache (after they have been found by ldconfig)?
Comment 6 Pieter-Jan Briers 2022-03-03 22:56:36 UTC
Created attachment 14003
Core dump from SIGSEGV in ld
Comment 7 Pieter-Jan Briers 2022-03-03 23:04:12 UTC
Ok I have no idea how I managed to do this but I could have sworn I just attached that file to my bug report (I have no idea how I even ended up here??)

Maybe I shouldn't be doing this kind of stuff at midnight?
Comment 8 Frank Ch. Eigler 2022-03-03 23:36:10 UTC
The content of attachment 14003 has been deleted for the following reason:

erroneous addition