This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: A per-user or per-application ld.so.cache?


> On Feb 8, 2016, at 12:19 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> 
> On 02/08/2016 02:10 PM, Florian Weimer wrote:
>> On 02/08/2016 07:40 PM, Carlos O'Donell wrote:
>>> Under what conditions might it make sense to implement
>>> a per-user ld.so.cache?
>>> 
>>> At Red Hat we have some customers, particularly in HPC,
>>> which deploy quite large applications across systems that
>>> they don't themselves maintain. In this case the given
>>> application could have thousands of DSOs. When you load
>>> up such an application the normal search paths apply
>>> and that's not very optimal.
>> 
>> Are these processes short-lived?
> 
> No.
> 
> See [1]. 
> 
>> Is symbol lookup performance an issue as well?
> 
> Yes. So are the various O(n^2) algorithms we need to fix
> inside the loader, particularly the DSO sorts we use.
> 
>> What's the total size of all relevant DSOs, combined?  What does the
>> directory structure look like?
> 
> I don't know. We should ask Ben Woodard to get us that data.
> 
> Ben?

I just talked to one of the developers to get a good sense of the current problem. 
The sum of the on-disk ELF files, including debuginfo, for one app that we looked at is around 3GB, but looking at just the text in all the ELF files it is 100-200MB depending on the architecture, spread across about 1400 DSOs.

Not counting the directories already covered by the system runtime linker cache, there were around 15 directories being pointed to.

> 
>> Which ELF dynamic linking features are used?
> 
> I don't know.

Currently, to improve performance, they use a lot of rpaths and environment-specific link paths set up by something quite a lot like Environment Modules.
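
To make that concrete, here is a quick sketch (mine, not from their setup) that dumps the DT_RPATH/DT_RUNPATH entries the loader will walk for the running binary; it assumes nothing beyond glibc's <link.h> and the _DYNAMIC array it declares:

#include <link.h>
#include <stdio.h>

int
main (void)
{
  const char *strtab = NULL;

  /* First pass: find the dynamic string table.  glibc relocates the
     d_ptr entries in _DYNAMIC in place, so this is an absolute
     address by the time main runs.  */
  for (ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; d++)
    if (d->d_tag == DT_STRTAB)
      strtab = (const char *) d->d_un.d_ptr;
  if (strtab == NULL)
    return 1;

  /* Second pass: report any rpath/runpath entries.  */
  for (ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; d++)
    if (d->d_tag == DT_RPATH)
      printf ("DT_RPATH:   %s\n", strtab + d->d_un.d_val);
    else if (d->d_tag == DT_RUNPATH)
      printf ("DT_RUNPATH: %s\n", strtab + d->d_un.d_val);

  return 0;
}

Building with e.g. "gcc -Wl,-rpath,/opt/demo/lib rpaths.c" (path and file name made up) and running the result shows the per-application entry the loader will consult before the cache and defaults.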

As you pointed out, they also have a tool called Spindle which assists in loading all of these libraries for a large MPI job. It makes use of the audit interface.
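
For those who have not poked at it, the hook Spindle relies on is roughly this shape -- a bare-bones rtld-audit module (see rtld-audit(7)); the actual redirection Spindle performs is elided, this sketch just observes the probes:

/* audit.c: gcc -shared -fPIC -o audit.so audit.c
   Run a program under it with LD_AUDIT=./audit.so ./app.  */
#include <link.h>
#include <stdio.h>

unsigned int
la_version (unsigned int version)
{
  return LAV_CURRENT;
}

/* Called for every path the loader is about to try for every DSO;
   returning a different string would redirect the open, which is the
   lever a Spindle-like tool uses to point at a node-local copy.  */
char *
la_objsearch (const char *name, uintptr_t *cookie, unsigned int flag)
{
  fprintf (stderr, "probe: %s\n", name);
  return (char *) name;
}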

They wrote a benchmark which demonstrates some of the challenges that they face: https://codesign.llnl.gov/pynamic.php
 
> 
>> Is the bulk of those DSOs pulled in with dlopen, after the initial
>> dynamic link?  If yes, does this happen directly (many DSOs dlopen'ed
>> individually) or indirectly (few of them pull in a huge cascade of
>> dependencies)?
> 
> I do not believe the bulk of the DSOs are pulled in with dlopen.
> 
> Though for python code I know that might be the reverse with each
> python module being a DSO that is loaded by the interpreter.

Unfortunately, that is in fact the case. Many of the applications are glued together with python while the bulk of the computation occurs in C++, so much of the code is loaded by a python interpreter.
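
To spell out why that hurts: every import of a C extension is in effect the pattern below, one dlopen per module, so each module pays the full search-path walk on its own (the module name and flags are illustrative, not what CPython literally uses):

#include <dlfcn.h>
#include <stdio.h>

int
main (void)
{
  /* Link with -ldl on older glibc.  */
  void *handle = dlopen ("./foo.so", RTLD_NOW);
  if (handle == NULL)
    {
      fprintf (stderr, "dlopen: %s\n", dlerror ());
      return 1;
    }
  /* The interpreter then looks up the module's init entry point.  */
  void *init = dlsym (handle, "PyInit_foo");
  printf ("init symbol %s\n", init != NULL ? "found" : "missing");
  dlclose (handle);
  return 0;
}
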
> 
> Which means we probably have two cases:
> * Long chains of DSOs (non-python applications)
> * Short single DSO chains, but lots of them (python modules).

I brought these two cases up with the developer, and there is a third: python loads a computational library which in turn has quite a few dependencies. In particular, essentially every high-level physics library needs to use MPI to communicate between adjacent cells in the mesh.
> 
>> If the processes are not short-lived and most of the DSOs are loaded
>> after user code has started executing, I doubt an on-disk cache is the
>> right solution.

Except for the fact that the process starts on literally thousands of nodes simultaneously, with its libraries scattered across about 15 non-system project directories. This leads to a phenomenal number of NFS operations as the compute nodes search through 20 or so directories for all their components, and that brings even very powerful NFS servers to their knees.
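
To put a toy number on it, this emulates the loader's directory walk for a single DSO; the directory list and library name are made up, but the failure pattern is the real one -- every miss is an NFS round trip, and it repeats for every DSO on every node:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* Stand-ins for the ~20 project directories on the search path.  */
  static const char *const dirs[] =
    { "/proj/a/lib", "/proj/b/lib", "/proj/c/lib", "/usr/lib64" };
  const char *lib = "libphysics.so.1";   /* made-up DSO name */
  char path[4096];
  int probes = 0;

  for (size_t i = 0; i < sizeof dirs / sizeof dirs[0]; i++)
    {
      snprintf (path, sizeof path, "%s/%s", dirs[i], lib);
      probes++;
      int fd = open (path, O_RDONLY | O_CLOEXEC);
      if (fd >= 0)
        {
          printf ("found after %d probes: %s\n", probes, path);
          close (fd);
          return 0;
        }
    }
  printf ("%d probes, nothing found\n", probes);
  return 1;
}

Multiply the misses by ~1400 DSOs and thousands of nodes and you get the NFS storm described above.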

> 
> Why would a long-lived process that uses dlopen fail to benefit from an
> on-disk cache? The on-disk cache, as it is today, is used for a similar
> situation already, why not extend it? The biggest difference is that
> we trust the cache we have today and mmap it into memory. We would have to
> harden the code that processes that cache, but it should not be that
> hard.
> 
> Would you mind expanding on your concern that the solution would not work?
> 
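
I cannot answer for Florian, but from here the hardening does look tractable. A rough sketch of the kind of validation an untrusted per-user cache would need before ld.so could safely mmap it -- the header layout below is purely hypothetical, not the real ld.so.cache format:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define USER_CACHE_MAGIC "user-ld.so.cache-0"   /* hypothetical */

struct user_cache_header          /* hypothetical layout */
{
  char magic[sizeof USER_CACHE_MAGIC];
  uint32_t nentries;              /* number of (name, path) entries */
  uint32_t strtab_offset;         /* offset of the string table */
};

/* Map FILE read-only and reject it unless the header is internally
   consistent; every field is attacker-controlled in the per-user
   case.  */
static void *
map_user_cache (const char *file, size_t *sizep)
{
  int fd = open (file, O_RDONLY | O_CLOEXEC);
  if (fd < 0)
    return NULL;

  struct stat st;
  if (fstat (fd, &st) < 0
      || (size_t) st.st_size < sizeof (struct user_cache_header))
    {
      close (fd);
      return NULL;
    }

  void *map = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close (fd);
  if (map == MAP_FAILED)
    return NULL;

  const struct user_cache_header *h = map;
  if (memcmp (h->magic, USER_CACHE_MAGIC, sizeof USER_CACHE_MAGIC) != 0
      || h->strtab_offset > (size_t) st.st_size
      /* Assume at least 8 bytes per entry when bounding the count.  */
      || h->nentries > ((size_t) st.st_size - sizeof *h) / 8)
    {
      munmap (map, st.st_size);
      return NULL;
    }

  *sizep = st.st_size;
  return map;
}

int
main (int argc, char **argv)
{
  size_t size;
  return map_user_cache (argc > 1 ? argv[1] : "user-cache", &size) == NULL;
}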

-ben

> Cheers,
> Carlos.
> 
> 
> [1] http://computation.llnl.gov/projects/spindle/spindle-paper.pdf

