Bug 11767 - RFE: dlopen of in-memory ET_DYN or ET_EXEC object
Status: ASSIGNED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: 2.12
Importance: P2 enhancement
Target Milestone: ---
Assignee: Paul Pluzhnikov
Reported: 2010-06-29 18:02 UTC by John Reiser
Modified: 2016-09-21 18:32 UTC (History)
fweimer: security-


Attachments
attachment-3251-0.html (112 bytes, text/html)
2013-11-05 04:25 UTC, Gregory P. Smith
Description John Reiser 2010-06-29 18:02:13 UTC
Request For Enhancement (RFE): Please implement some way to add an ELF object
(ET_DYN or ET_EXEC) that already is in memory, into the set of modules that are
managed by the runtime loader rtld, instead of requiring dlopen() of a file from
the filesystem.  This capability is useful for managing modules that are created
at runtime, and/or to help implement protection and access controls, etc.

Here is a suggestion of syntax and semantics: a new function
   #include <link.h>  /* macro-ize Elf32_Xxx vs. Elf64_Xxx */
   void *dlopen_phdr(ElfW(Phdr) const *phdr, int n_phdr, int flag);  /* returns a handle */
which is much like dlopen() except that the mmap()s already have been done
before calling dlopen_phdr().  The "slide" value (adjustment in load address)
may be computed from the difference between the input phdr parameter and the
PT_PHDR.p_vaddr that is found in the vector of ElfW(Phdr)s.
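
The slide computation described above can be sketched in C. This is only an illustration, not part of the proposal: the helper names slide_from_phdr, check_module and count_slide_mismatches are hypothetical, and the check relies on the existing dl_iterate_phdr() interface to compare the computed slide against the dlpi_addr that the loader already reports for each module that has a PT_PHDR.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdint.h>

/* Slide = actual address of the program header table minus the address
   that PT_PHDR.p_vaddr says it should have.  Returns 0 if the table
   contains no PT_PHDR entry. */
static ElfW(Addr) slide_from_phdr(const ElfW(Phdr) *phdr, int n_phdr)
{
    for (int i = 0; i < n_phdr; i++)
        if (phdr[i].p_type == PT_PHDR)
            return (ElfW(Addr))(uintptr_t)phdr - phdr[i].p_vaddr;
    return 0;
}

/* dl_iterate_phdr() callback.  `data` points to an int that counts the
   modules whose PT_PHDR-derived slide differs from the loader's value. */
static int check_module(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    int has_pt_phdr = 0;
    for (int i = 0; i < info->dlpi_phnum; i++)
        if (info->dlpi_phdr[i].p_type == PT_PHDR)
            has_pt_phdr = 1;
    if (has_pt_phdr &&
        slide_from_phdr(info->dlpi_phdr, info->dlpi_phnum) != info->dlpi_addr)
        ++*(int *)data;
    return 0;  /* zero means: keep iterating */
}

/* Walk all loaded modules and return the number of mismatches. */
static int count_slide_mismatches(void)
{
    int mismatches = 0;
    dl_iterate_phdr(check_module, &mismatches);
    return mismatches;
}
```

Modules without a PT_PHDR are skipped by the check, since for them the sketch falls back to a slide of zero.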
Comment 1 John Tobey 2010-11-12 23:36:28 UTC
I've been itching for this for quite some time.  Hope it happens!  Meanwhile, I'll implement a fallback using a temp file.
Thanks,
John
Comment 2 Gregory P. Smith 2012-03-05 23:22:36 UTC
Indeed. I've wanted this as well.

It would also be useful for bundled applications where everything lives in a single archive file, potentially including multiple .so files embedded at known offsets, when you cannot or do not want to extract the .so files to local storage (if any exists) in order to run the binary.
Comment 3 Rich Felker 2012-07-18 13:59:57 UTC
If you want to embed the .so's in the main program binary, why are you not just static linking them to begin with? That will give much better performance (no time wasted on relocations, no PIC overhead, etc.) and make your program more portable (no dependence on glibc-specific dynamic loading features or even on POSIX dlopen).

With that said, I really question the validity of this feature request. Using the .so in-place from data embedded in the main program would only be possible if it's page-aligned, and would be a security risk anyway since the whole .so image would be writable, thus requiring write+exec permission on the pages that most security-enhanced systems don't even allow these days. Thus you'd have to make a copy of the whole .so image, and the copy would have to be anonymous memory that consumes actual physical runtime memory and commit charge, rather than being a file-backed mapping. In other words, it wastes a good deal more memory than loading a .so from a separate file.
Comment 4 Paul Pluzhnikov 2012-07-18 14:16:59 UTC
(In reply to comment #3)
> If you want to embed the .so's in the main program binary, why are you not just
> static linking them to begin with?

How is static linking useful when managing modules created at runtime?

Various JITters do that. Usually they don't generate full ELF, but then GDB doesn't know how to debug them. 

> With that said, I really question the validity of this feature request. Using
> the .so in-place from data embedded in the main program would only be possible
> if it's page-aligned, and would be a security risk anyway since the whole .so
> image would be writable, thus requiring write+exec permission on the pages that
> most security-enhanced systems don't even allow these days. Thus you'd have to
> make a copy of the whole .so image, and the copy would have to be anonymous
> memory that consumes actual physical runtime memory and commit charge, rather
> than being a file-backed mapping. In other words, it wastes a good deal more
> memory than loading a .so from a separate file.

You appear to be making a whole lot of unwarranted assumptions in your argument.

Think UPX: it has the main executable (and could have shared libraries) compressed. It decompresses them to memory, and can arrange for them to be properly aligned, and mprotect()ed RO. It is wasteful to require UPX to write such images to disk only so they can be dlopen()ed and immediately unlink()ed.
Comment 5 Rich Felker 2012-07-18 18:15:50 UTC
My argument was based on the usage cases presented in this bug tracker thread.

Anyway, it's wasteful and backwards for things like UPX to exist at all. They trade startup time (valuable) and runtime memory usage (valuable) for disk space (dirt cheap). Even if disk space is valuable, using a compressed filesystem managed by the kernel (where demand paging will be available) is the right solution. Putting a runtime binary decompressor in your application is just bad design.

I maintain that any use of this feature would also be bad design. If you really want the possibility of putting embedded .so files in your binary, it makes more sense to add a toolchain feature for embedding them in the ELF binary using the linker (where they'll be aligned and mapped with the proper permissions) rather than supporting loading from arbitrary buffers.
Comment 6 Remy Blank 2012-07-18 19:51:20 UTC
Comment 2 was another use case: creating single-file executables for scripting languages. For example, Python applications can be bundled into a single executable .zip file. However, when the application uses C extensions (and most applications do), it has to extract the .so from the .zip to a temporary directory, just to allow dlopen() to load it. This is not only slow, but also creates various race conditions. If it were possible to dlopen() the .so *within the zip file* (it could be stored there uncompressed, with the right alignment if necessary), or to load it and dlopen() it from a buffer, the extraction wouldn't be necessary.

Note that dlopen()ing the libraries by mmap()ing selected parts of the zip file would allow for sharing between processes, and would therefore not consume more memory.
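
As a rough sketch of the file-backed mapping mentioned above, the following C helper maps a region of an archive at a given offset. The name map_embedded is hypothetical; the sketch only shows that mmap() needs a page-aligned file offset and yields a shareable, file-backed mapping. A real loader would map each PT_LOAD segment separately with its own protections.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of `path`, starting at `offset`, read-only and
   file-backed.  Returns NULL if the offset is not page-aligned or if
   any system call fails. */
static const void *map_embedded(const char *path, off_t offset, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || offset < 0 || offset % page != 0 || len == 0)
        return NULL;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

An embedded .so stored at an unaligned offset cannot be mapped this way, which is why the archive would need to store it uncompressed and page-aligned.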
Comment 7 John Reiser 2012-07-18 20:32:34 UTC
(In reply to comment #5)
> My argument was based on the usage cases presented in this bug tracker thread.

"An ELF object that already is in memory" means that the bytes are in the right
place and have the correct access permissions, whether by mmap(), read(), or
store-to-memory, followed by mprotect() as appropriate.  Approximately, the
bytes will occupy an interval of pages.  Exactly, they will be an image of the
PT_LOADs, slid by some whole number of pages.  Equivalently, they will be what
is described by struct dl_phdr_info during a callback from dl_iterate_phdr(). 
The pages need to be "blessed" as an in-memory module that the dynamic linker
recognizes, and connected to the rest of the collection of modules in memory. 
No "mass copying" or re-arranging is necessary.

In the proposed dlopen_phdr(), one of the ElfXX_Phdr will be a PT_PHDR, and the
slide value for the module is equal to the difference between the actual
address and the PT_PHDR.p_vaddr.  (If there is no such PT_PHDR, then use zero.)
Knowing that, then rtld can find the PT_DYNAMIC, and process it.  Create the
internal structures which keep track of a loaded module, apply the DT_SONAME,
load the DT_NEEDED dependencies, connect the DT_SYMTAB, DT_STRTAB, and
DT_{GNU_}HASH, perform the indicated relocations, call the DT_INIT_ARRAY
functions, etc.
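
The first of those steps, locating PT_DYNAMIC from the program headers and the slide, can be sketched as follows. This is an illustration only: count_needed, first_module and main_program_needed are hypothetical names, and the input here comes from the existing dl_iterate_phdr() rather than from any new interface. It counts the DT_NEEDED entries that rtld would have to load.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdint.h>

/* Find PT_DYNAMIC via the program headers and the slide, then count the
   DT_NEEDED entries.  Returns -1 if the module has no PT_DYNAMIC. */
static int count_needed(const ElfW(Phdr) *phdr, int n_phdr, ElfW(Addr) slide)
{
    for (int i = 0; i < n_phdr; i++) {
        if (phdr[i].p_type != PT_DYNAMIC)
            continue;
        const ElfW(Dyn) *dyn =
            (const ElfW(Dyn) *)(uintptr_t)(slide + phdr[i].p_vaddr);
        int needed = 0;
        for (; dyn->d_tag != DT_NULL; dyn++)
            if (dyn->d_tag == DT_NEEDED)
                needed++;
        return needed;
    }
    return -1;
}

/* dl_iterate_phdr() callback: inspect the first module reported (the
   main program) and stop. */
static int first_module(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    *(int *)data = count_needed(info->dlpi_phdr, info->dlpi_phnum,
                                info->dlpi_addr);
    return 1;  /* non-zero stops the iteration */
}

/* Number of DT_NEEDED entries of the main program, or -1. */
static int main_program_needed(void)
{
    int n = -1;
    dl_iterate_phdr(first_module, &n);
    return n;
}
```

The remaining steps (symbol tables, relocations, initializers) would be driven from the same dynamic array, using the other DT_* tags.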

Regarding the use case(s): Storage that is "dirt cheap" tends to be "dirt
slow."  A class 4 SDHC flash memory device supplies less than 4 MB/s, whereas
RAM usually gives at least 100 MB/s.  Most hand-held mobile devices do not
offer a compressed filesystem.  Managing files (including updates) using
something like jffs2 requires complex code, battery energy, and perhaps a
somewhat sophisticated user to understand the behavior of fragmentation.  "Dirt
cheap" does not mean a cost of zero, and every $0.10 matters.  A device with
8MB RAM and 8GB flash storage cannot afford to use 6MB to store a program with
library, if 3.5MB would be enough because of compression.  "Decompress to
filesystem, then dlopen" costs time (and an unneeded write to flash storage.) 
Distributing 3.5MB "over the air" is understandable: it's a "song" (same size
as typical MP3 audio).  Distributing 6MB gets noticed.  I would like to live in
a world where such costs did not matter (or were absorbed by somebody else),
but today I am forced to pay, and probably will be for at least a couple more
years.
Comment 8 John Tobey 2012-07-18 20:48:27 UTC
Another use case: languages like Lisp require a global, canonical, dynamic mapping of strings to symbols.  When the language runtime implements this, it duplicates a lot of the dynamic loader's work.  I would like to implement Lisp symbols as ELF symbols.  This way, the system handles the common case of symbol names known at compile time, whether in the executable or libraries.  The only things it doesn't handle are dynamically created symbols (INTERN in Lisp).

Currently, my best option is to create the symbols as memory addresses arranged to resemble a DSO.  I intercept calls to dlopen() and "flush" the current set of dynamically allocated symbols.  This "flushing" operation involves writing an ELF object with a fixed load address that lines up with the symbol values in use, then loading the object.  I do this so the dlopen'd library will see any symbols it may refer to.  I suspect this will require a mutex around mmap and friends as well as DSO operations.  All this file writing and mutexing would be unneeded if we had dlopen_phdr.
Comment 9 Rich Felker 2012-07-19 01:59:40 UTC
If the proposal is to require the in-memory dso to already be properly mapped (alignment, permissions, etc. issues) then I withdraw most of my criticisms. However I still disagree with John Reiser's arguments about costs. If storage is really slow and it's desirable to use ram instead, you have tmpfs at your disposal. And since we're talking about systems on which the GNU C library can be used, the idea that compressed filesystems might not be available is unconvincing. UPX is backwards technology that makes no sense on a system with virtual memory or any nontrivial kernel.
Comment 10 Gregory P. Smith 2012-07-19 04:32:05 UTC
I see no problem with a requirement to already have the dso mapped and
aligned with proper permissions beforehand. That makes sense.

Remy described my "comment 2" use case in much better detail. The .so's are
extension modules for a runtime being executed via the #! line on the
bundle or similar. Python in my case but this applies equally to any
dynamic language runtime.

tmpfs is not an ideal solution as now you would be required to setup tmpfs,
mount it, use it, and require some separate process configured not to be
OOM killed to sit around and monitor your process that is using the tmpfs
to be able to unmount it when the process dies for whatever reason to free
up the resources. Not to mention that systems run without swap so a tmpfs
would pin the full dso in memory rather than demand paging the parts being
used as a mapping would do.
Comment 11 Roni Simonian 2013-02-22 01:55:27 UTC
I am planning to work on this.  If anyone has done any implementation work, please speak up.
Comment 12 Ondrej Bilka 2013-10-21 07:55:15 UTC
> I am planning to work on this.  If anyone has done any implementation work, 
> please speak up.

How far did you get with implementing this?
A new gcc jit would benefit from this functionality.
Comment 13 Paul Pluzhnikov 2013-11-04 20:01:48 UTC
(In reply to Ondrej Bilka from comment #12)

> How far did you get with implementing this?

Not very far. We had a prototype, but it proved trickier than we expected, and in particular the semantics of dlclose() on such an in-memory object proved unclear.

I am currently working on a dlopen_with_offset(), which is just like dlopen, but with a given offset into the file.

That would meet our actual needs, but not necessarily those of UPX or gcc/jit.
Comment 14 Ondrej Bilka 2013-11-04 21:03:06 UTC
As a workaround, the closest that I could think of is to use shm_open with a random filename to create a file descriptor.
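
A minimal sketch of that workaround, under several assumptions: the helper name dlopen_via_shm and the "/memdso-..." naming scheme are made up for illustration, the image is a complete ELF file already sitting in a buffer, and /proc is mounted so the descriptor can be reopened by path. It still copies the image, so it does not provide the in-place "blessing" that this bug asks for.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy an in-memory ELF image into a POSIX shared memory object and
   dlopen() it through /proc/self/fd.  Returns NULL on any failure. */
static void *dlopen_via_shm(const void *image, size_t len, int flags)
{
    static unsigned counter;  /* not thread-safe; enough for a sketch */
    char name[64], path[64];

    snprintf(name, sizeof name, "/memdso-%ld-%u", (long)getpid(), counter++);
    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0700);
    if (fd < 0)
        return NULL;
    shm_unlink(name);  /* the open descriptor keeps the object alive */

    const char *p = image;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w <= 0) {
            close(fd);
            return NULL;
        }
        p += w;
        left -= (size_t)w;
    }

    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    void *handle = dlopen(path, flags);
    close(fd);  /* mappings made by dlopen() stay valid after close */
    return handle;
}
```

If the shared memory filesystem is mounted noexec, dlopen() is expected to fail when it tries to map the text segment; on older glibc versions this may also need -lrt and -ldl at link time.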
Comment 15 John Reiser 2013-11-04 21:34:05 UTC
(In reply to Ondrej Bilka from comment #14)
> For workarounds a closest that I could think is use shm_open with random
> filename to create file descriptor.

That shows good imagination!

The main desired functionality is that of "blessing" as a loaded module the data that is already resident in pages at the appropriate addresses, without creating new copies of pages.  This is somewhat like reversing dl_iterate_phdr(); see Comment #7.

Related to Comment #13: dlclose() would "remove" the accounting information and "forget" the internal object that was created by the corresponding dlopen(), but otherwise leave the data alone.  Do not call munmap(), etc.
Comment 16 Rich Felker 2013-11-04 21:40:21 UTC
As I mentioned before, using an already-mapped-in-vm DSO with dlopen is not viable. Usually, DSOs have at least one page (where the end of .text and the beginning of .data share a page on disk) that must be mapped twice at different offsets, and likewise all subsequent data pages must be mapped offset by one page from their location in the image. Further, you need to have empty VM space for sufficiently many .bss pages past the end of the mapping. It would be possible to require the caller to arrange all of these things, but that's basically offloading A LOT of the ELF loading process onto the calling program and I don't think that makes for a reasonable public interface for glibc to provide.
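
To make the page arithmetic behind this objection concrete, here is a small sketch of what a caller would have to compute for each PT_LOAD segment. The load_plan type and plan_segment function are hypothetical; only the ElfW(Phdr) fields come from <link.h>.

```c
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* What must be mapped for one PT_LOAD segment (addresses before slide). */
typedef struct {
    ElfW(Addr) map_start;   /* page-aligned start address */
    ElfW(Off)  map_offset;  /* page-aligned file offset */
    size_t     file_len;    /* bytes of file-backed pages */
    size_t     anon_len;    /* bytes of extra zero pages (.bss) */
} load_plan;

static load_plan plan_segment(const ElfW(Phdr) *ph, size_t page)
{
    /* page must be a power of two */
    assert(ph->p_type == PT_LOAD && page != 0 && (page & (page - 1)) == 0);

    ElfW(Addr) mask = ~(ElfW(Addr))(page - 1);
    ElfW(Addr) start = ph->p_vaddr & mask;
    ElfW(Addr) file_end = (ph->p_vaddr + ph->p_filesz + page - 1) & mask;
    ElfW(Addr) mem_end = (ph->p_vaddr + ph->p_memsz + page - 1) & mask;

    load_plan plan = {
        .map_start = start,
        .map_offset = ph->p_offset & (ElfW(Off))mask,
        .file_len = (size_t)(file_end - start),
        .anon_len = (size_t)(mem_end - file_end),
    };
    return plan;
}
```

For example, with 4 KiB pages, a text segment covering file bytes 0..0x1234 and a data segment starting at file offset 0x1234 both need file page 0x1000: the first plan covers offsets 0..0x2000 and the second starts at offset 0x1000. That shared page is the one that must be mapped twice, and a non-zero anon_len is the empty VM space needed for .bss.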

If you don't demand this crazy in-place usage of the DSO image, simply copying it to a temp file or shared memory object and loading it from there would work perfectly well.
Comment 17 Paul Pluzhnikov 2013-11-04 21:46:45 UTC
(In reply to John Reiser from comment #15)

> Related to Comment #13: dlclose() would "remove" the accounting information
> and "forget" the internal object that was created by the corresponding
> dlopen(), but otherwise leave the data alone.  Do not call munmap(), etc.

The problematic semantics weren't about munmap -- that part was easy.
They were about relocations.

In the UPX case, you probably don't have any.

In our case, we actually do have a bona-fide DSO with relocations that is
at some offset in another file. Calling "pretend"-dlopen() applies them.

I guess dlclose() could undo them, but we didn't get that far.

Not undoing them would cause them to be re-applied again on re-dlopen(),
which would be wrong.
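
One way to see why re-applying relocations goes wrong is a simplified model of a "relative" relocation. This is an assumption for illustration: the comment above does not say which relocation types were involved, and the two helper names are hypothetical. With a REL-style relocation the addend is stored in the relocated word itself, so a second pass adds the base address twice; with a RELA-style relocation the addend is kept in the relocation record, so a second pass yields the same value.

```c
#include <stddef.h>
#include <stdint.h>

/* REL-style relative relocation: the addend is the current content of
   the word, so the operation is not idempotent. */
static void apply_rel_relative(uintptr_t *word, uintptr_t base)
{
    *word += base;
}

/* RELA-style relative relocation: the addend comes from the relocation
   record, so repeating the operation gives the same result. */
static void apply_rela_relative(uintptr_t *word, uintptr_t base,
                                uintptr_t addend)
{
    *word = base + addend;
}
```

In this model, a word holding 0x100 relocated at base 0x7000 becomes 0x7100 after one REL-style pass and 0xe100 after a second, whereas the RELA-style result stays 0x7100.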
Comment 18 John Tobey 2013-11-05 01:47:09 UTC
(In reply to Rich Felker from comment #16)
> It would be possible to require the caller to arrange all of these
> things, but that's basically offloading A LOT of the ELF loading process
> onto the calling program and I don't think that makes for a reasonable
> public interface for glibc to provide.

Well, it's all in a day's work for the compiler writers who would directly use this.  I like to make simple things easy and complex things possible.
Comment 19 Rich Felker 2013-11-05 03:37:10 UTC
It's already possible: you write into a temp file and call dlopen on the temp file. What you're asking for is not "making simple things easy and complex things possible" but rather "making simple things complex as a dubious premature optimization".

As for your proposed Lisp implementation usage case, it's probably a bad idea. Even aside from the issue of avoiding symbol clashes with the C namespace (which you could avoid with prefixing of some sort, at the cost of added hashing/lookup cost), dlsym is simply not very efficient. POSIX requires it to accept invalid DSO handles (which glibc currently does not tolerate; see bug #14989) and report an error rather than crashing, which adds a good deal of otherwise-unnecessary overhead. I'm also unclear on how lookup time and space requirements scale with number of DSOs loaded (of which you may have a lot). But even if not for all these issues, it's just bad design to write one thing that depends on the implementation internals of another.
Comment 20 John Tobey 2013-11-05 04:14:18 UTC
(In reply to Rich Felker from comment #19)
> It's already possible: you write into a temp file and call dlopen on the
> temp file.

"Just bad design" in your words.

> POSIX
> requires it to accept invalid DSO handles (which glibc currently does not
> tolerate; see bug #14989)

Interesting, thanks!  Have you thought about a hash table (or similar) mapping handle to header?

> lookup time and space requirements scale with number of DSOs loaded (of
> which you may have a lot).

I grant that there may exist good reasons not to implement this feature in this time and place.  Once we get our foot in the door with a minimal implementation, if scaling issues arise later, we optimize.
Comment 21 Gregory P. Smith 2013-11-05 04:25:04 UTC
Created attachment 7266
attachment-3251-0.html

There is no writable storage or the ability to mount any in the situation
Paul and I are looking to support.
Comment 23 dholth 2016-09-21 18:29:51 UTC
I would also appreciate this feature, for Python.
Comment 24 dholth 2016-09-21 18:32:03 UTC
Here is someone's proof of concept implementation for 64-bit Linux https://github.com/m1m1x/memdlopen