This is the mail archive of the mailing list for the glibc project.

Re: A per-user or per-application

> On Feb 8, 2016, at 11:18 PM, Florian Weimer wrote:
> On 02/08/2016 11:29 PM, Ben Woodard wrote:
>> I just talked to one of the developers to get a good sense of the current problem. 
>> The sum of the on-disk ELF files including debuginfo for one app that we looked at is around 3GB, but when we just look at the text in all the ELF files it is 100-200MB depending on architecture, spread across about 1400 DSOs.
> This means that copying the text files together into a single file would
> be feasible.

Am I understanding you correctly? You’re suggesting linking?
We have hundreds of applications and hundreds of libraries all with their own development teams and release schedules. What you seem to be suggesting sounds like combinatoric insanity. 

Some people, like the weather service, may have one or a couple of apps that they run over and over, but we are a national lab and we have literally thousands of users doing all sorts of things. We have around 10,000 active users. What we have is more on the scale of building a distribution like Fedora, including GNOME, Firefox, and OpenOffice, with the huge tangled web of dependencies.

>> Except for the fact that the process is starting on literally thousands of nodes simultaneously and its libraries are scattered around about 15 non-system project directories. This leads to a phenomenal number of NFS operations as the compute nodes search through 20 or so directories for all their components. That brings even very powerful NFS servers to their knees. 
> Okay, this is the critical bit which was missing so far.  I think Linux
> has pretty good caching for lookup failures, so the whole performance
> issue was a bit puzzling.  If the whole thing runs on many nodes against
> storage which lacks such caching, then I can see that this could turn
> into a problem.

It isn’t just caching; it comes down to the number of IOPS generated by the thundering herd of all the compute nodes participating in an MPI job. In the base OS there isn’t anything like a distributed cache, where once one node figures out that a particular library is in a particular place, the 3000 nodes participating in the job all know not to bother looking in all the places where the library wasn’t found. That kind of distributed per-application cache is what Carlos is suggesting.
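
As a rough illustration of the scale, here is a back-of-envelope sketch using the approximate figures quoted in this thread (3000 nodes, about 1400 DSOs, 20 or so directories). The model is an assumption: every node searches independently, and in the worst case probes every directory for every DSO. It is not a measurement of any real job.

```python
# Back-of-envelope estimate of path lookups at job start.
# The figures are the approximate ones quoted in this thread; the model
# (independent nodes, no shared knowledge of where a library lives)
# is an assumption made for illustration only.

def lookups(nodes, dsos, dirs_probed_per_dso):
    """Total path lookups issued across the whole job."""
    return nodes * dsos * dirs_probed_per_dso

NODES = 3000   # compute nodes participating in the job
DSOS = 1400    # shared objects loaded by the application
DIRS = 20      # directories on the library search path

worst_case = lookups(NODES, DSOS, DIRS)  # every directory probed for every DSO
with_cache = lookups(NODES, DSOS, 1)     # the right path is known up front

print(f"worst case : {worst_case:,} lookups")
print(f"with cache : {with_cache:,} lookups")
print(f"reduction  : {worst_case // with_cache}x")
```

Even with a perfect cache the job still issues millions of opens, but the failed lookups that dominate the worst case disappear.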

In essence spindle is a tool which solves this problem as well as providing an efficient mechanism for distributing the ELF files to the compute nodes without hammering the NFS servers.

> The main question is: Will the storage be able to cope with millions of
> file opens if they magically pick the right file name (avoiding ENOENT)?
> If not, the only viable optimization seems to be the single file approach.

That is a storage system design constraint. Honestly, that is not really hard, because every single compute node is asking for the same thing, so the server has the blocks in cache and just spits them out to all the nodes through its high-speed network interfaces. Yes, it would be better to have them flood-fill out to all the nodes, but that is a different problem.

> How will the storage react to parallel read operations on those 15
> directories from many nodes?

Once again, that is a different, tangentially related problem that is solved at the center-wide storage system design level.

> I'm worried a bit that this turns into a request to tune to very
> peculiar storage stack behavior.

I don’t see that. I think that this is part of the larger change in the division of labor as computing became cheaper in relation to manpower as well as the commodification of the OS and distribution. 

As computing became cheaper it is being used for a broader range of applications. Instead of shipping a huge array of every conceivable library and piece of software, we OS distributors have trimmed the system libraries down to a small supportable subset. This means that an increasing percentage of the libraries used to accomplish some task are not part of the OS distribution.

Then we moved from a system administrator maintaining a small number of systems, each carefully and deliberately configured with the software and libraries for the applications running on it, to a more devops model. In that model a system administrator oversees the provisioning of hundreds or thousands of machines, and the developers, who may not have root access, must install and maintain their own software and libraries above and beyond the OS instance. As a result, the notion of a universal system-wide /etc/ feeding into a cache becomes less and less practical.

This isn’t tuning for a peculiar storage stack behavior. This is adapting to the reality of the way things exist now, where the OS vendor and the system administrator do not have the time, inclination, or sometimes ability to configure a system for the app that is going to run on it. That is why we need to push the capability down to allow non-root users to make use of the benefits of ldconfig and to optimize the load time for their work environment or application.

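To make the idea concrete, here is a minimal sketch of what a per-application lookup cache buys, assuming a plain name-to-path table built once by scanning the search directories. The function names (`build_cache`, `resolve_by_probing`, `demo`) and the file name `libexample.so` are hypothetical; this does not reproduce the format of the cache that ldconfig writes, nor any glibc interface.

```python
import os
import tempfile

def build_cache(search_dirs):
    """Scan each directory once and map file name -> full path."""
    cache = {}
    for d in search_dirs:
        try:
            names = sorted(os.listdir(d))
        except OSError:
            continue  # missing or unreadable directory: skip it
        for name in names:
            # setdefault keeps the first hit, so earlier directories win,
            # mirroring search-path order
            cache.setdefault(name, os.path.join(d, name))
    return cache

def resolve_by_probing(name, search_dirs):
    """Uncached lookup: probe directories in order, counting the attempts."""
    probes = 0
    for d in search_dirs:
        probes += 1
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate, probes
    return None, probes

def demo():
    """Place one library in the last of 15 directories and look it up both ways."""
    with tempfile.TemporaryDirectory() as root:
        dirs = [os.path.join(root, f"proj{i:02d}") for i in range(15)]
        for d in dirs:
            os.mkdir(d)
        target = os.path.join(dirs[-1], "libexample.so")
        open(target, "w").close()

        path, probes = resolve_by_probing("libexample.so", dirs)
        cache = build_cache(dirs)  # built once, e.g. by the non-root user
        return probes, path == target, cache.get("libexample.so") == target

if __name__ == "__main__":
    probes, probed_ok, cached_ok = demo()
    print(f"probing needed {probes} lookups; cache needed a single table lookup")
```

Probing costs one filesystem lookup per directory until the file is found, on every node, while the table is built once and could be shared. How such a table would be stored, invalidated, or consulted by the dynamic loader is left open here.
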
> Depending on what they do with Python, the Python module importer will
> still cause a phenomenal amount of ENOENT traffic, and there is nothing
> we can do about that because it's not related to dlopen.
> Florian
