I need to dlopen() an solib into a previously mmapped buffer. Currently the load address is chosen in _dl_map_segments(): the ELF preferred address is taken, which is usually 0, so any address is used. I can think of 2 possible solutions. One would be to add a new LD_AUDIT function that passes the needed length to the user and expects the address of a buffer in return. This would allow the user to mmap a MAP_SHARED buffer if they want, but the downside is that ld.so would then need to use read() instead of mmap() to avoid trashing the user's shared mapping. This would likely also take some effort to implement. The other solution is trivial: just add a new function dlopen3(file, flags, addr) that provides the base address for dlopen. This would not allow using a pre-allocated buffer (the user doesn't know the needed buffer size at that point), but it's trivial to code up and would likely also solve my problem. It was also already requested here by someone else: https://stackoverflow.com/questions/62064806/is-there-a-way-to-specify-the-base-address-of-a-shared-library-using-dlopen What do people think about such an extension?
Any GNU extension requires a specific use case that can't be easily accomplished with the current API. What problem are you trying to solve that requires mapping a shared library to a specific pre-allocated address?
(In reply to Adhemerval Zanella from comment #1) > Any GNU extension requires a specific usercase that can't be easily > accomplished with current API. What problem are trying to solve that you > require to map a shared library to an specific pre-allocated address? It needs to interact with legacy 32bit code that is running in VM. The memory of the VM is mapped in a 64bit space under a particular address. I need to be able to load solib within a 4Gb range from the aforementioned address, in which case the 32bit code will be able to create the pointers to that solib's objects. Another way of solving that, is to put the solib into a MAP_SHARED buffer. In this case I will be able to create the "mirror" of that solib under any address I need, so the 32bit pointers will likely work in that case too (I will not execute functions via pointers to that window). For that, I'd probably need the following API: void *dladdr(void *handle, int *buffer_size_out); So the ability to get the address and length is probably already enough for my needs, as that will allow me to do the MAP_SHARED trick. And that can probably be made an LD_AUDIT extension, instead of a new global function. Of course I still need to test either way to make sure it really works. Which may mean that eventually I'll have to implement that extension myself. So for now this is just a query to find out if it is acceptable, and if so - in what form.
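The address and length asked for in the comment above can be approximated with the existing dl_iterate_phdr() interface, which may help when prototyping the MAP_SHARED trick before any glibc change. The following is only a sketch: the names load_span, span_cb and find_load_span are hypothetical helpers, not glibc API, and the span is computed from the PT_LOAD program headers without rounding to page boundaries. (Note also that the name dladdr proposed above is already taken by an existing function with a different signature.)

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper state: the address span covered by the PT_LOAD
   segments of the first loaded object whose name contains name_substr. */
struct load_span {
    const char *name_substr;
    uintptr_t start;   /* lowest address of any PT_LOAD segment */
    uintptr_t end;     /* one past the highest PT_LOAD address */
    int found;
};

static int span_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    struct load_span *s = data;
    (void)size;
    if (strstr(info->dlpi_name, s->name_substr) == NULL)
        return 0;                       /* not our object, keep iterating */

    uintptr_t lo = UINTPTR_MAX, hi = 0;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD)
            continue;
        uintptr_t b = info->dlpi_addr + ph->p_vaddr;
        uintptr_t e = b + ph->p_memsz;  /* p_memsz includes .bss */
        if (b < lo) lo = b;
        if (e > hi) hi = e;
    }
    if (lo >= hi)
        return 0;                       /* no PT_LOAD segments found */
    s->start = lo;
    s->end = hi;
    s->found = 1;
    return 1;                           /* non-zero stops the iteration */
}

/* Returns 0 and fills *start / *len on success, -1 if nothing matched. */
static int find_load_span(const char *name_substr, uintptr_t *start, size_t *len)
{
    assert(start != NULL && len != NULL);
    struct load_span s = { name_substr, 0, 0, 0 };
    dl_iterate_phdr(span_cb, &s);
    if (!s.found)
        return -1;
    *start = s.start;
    *len = (size_t)(s.end - s.start);
    return 0;
}
```

After a successful dlopen() of, say, a hypothetical "libfoo.so", calling find_load_span("libfoo.so", &start, &len) would report where it landed. It does not let the caller choose the address, which is the part this RFE is about.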
This seems a very specific use case, and I am not sure it fits the usual case of dlopen usage. Mapping to a preferred address basically defeats ASLR and has other issues: what should this dlopen extension do if the required mapping is not large enough, if the address is already occupied, or if the mapping does not have the correct permissions? If the passed mapping does not suffice and dlopen falls back to normal mapping, the result might not be related to the address that was passed, so the user will need to call dladdr to check whether the library was mapped into the passed buffer or through the fallback mechanism. So it would be a cumbersome interface. We used to have hacks to force mmap to load executables to 32-bit addresses; they were added to overcome some particular architecture limitations, caused some issues, and were eventually removed (check ea5814467a02c9d2d7608b6445c5d60e2a81d3ee). So I am not very fond of this extension.
(In reply to Adhemerval Zanella from comment #3) > This seems a very specific usercase that I am not sure if it fits on usual > case of dlopen usage. Mapping to a preferred address basically defeats ASRL > and has other issues: what this dlopen extension should do if the required > mapping is not large enough, if the address is already occupied, or if the > mapping does not have the correct permissions? My current plan (which may of course change) is to have the LD_AUDIT func that will tell the length and get the address back. The user have to make sure there is no any mapping at that address for a specified length (I am looking in /proc/self/maps to find the needed hole). If eventually it appears the hole is not large enough, dlopen() should just fail. That basically addresses the aforementioned concern of yours. After dlopen() succeeded, knowing the length I will mmap(MAP_SHARED) another buffer, memcpy() the solib there and mmap() it back to its addr, but in a shared buffer. So basically all the dirty work is on my side, dlopen() should only tell me the length and get the address back. > In the case of passing mapping not sufficing Passing mapping was just an option to consider. I already realized it would be a bad extension, so currently my plan is to pass only address in response to the length, and then, knowing length, create that "other mapping" by hands, then memcpy(), then remap back. More work for me, but much smaller extension to glibc. > We used to have hacks to force mmap to load executables to 32-bits, it was > added to overcome some particular architecture limitations, and it has > caused some issues and it was eventually removed (check > ea5814467a02c9d2d7608b6445c5d60e2a81d3ee). Wow, just recently removed! Quite sad I haven't had a chance to try it out... But I think we can come up with something much more flexible. Which is why I created that ticket before actually prototyping the thing myself. 
E.g. LD_PREFER_MAP_32BIT_EXEC wouldn't give me the length, so I'd have difficulty wrapping that into a shared mapping. I really aim for something very small and simple on the glibc side, with more work on the user's side. But if glibc isn't helpful, I'll need to implement or use an alternative dynamic linker, which would be quite bad.
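The "memcpy() into a MAP_SHARED buffer, then map it back" step described in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the posted patch: make_region_shared is a hypothetical helper, it uses memfd_create() where the comment leaves the backing object open, it assumes addr is page-aligned and len is a multiple of the page size, and it applies one protection to the whole range, whereas a real solib has several segments with different protections. It is also not atomic with respect to other threads touching the range.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: replace the (readable) region [addr, addr+len) with a
   MAP_SHARED mapping of a fresh memory-backed file holding the same bytes.
   Returns the file descriptor, which can be mmap()ed again elsewhere to get
   a second view of the same memory, or -1 on failure. */
static int make_region_shared(void *addr, size_t len, int prot)
{
    assert(addr != NULL && len > 0);

    int fd = memfd_create("shared-copy", MFD_CLOEXEC);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)len) == 0) {
        /* Temporary view used only to copy the current contents. */
        void *tmp = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (tmp != MAP_FAILED) {
            memcpy(tmp, addr, len);
            munmap(tmp, len);
            /* Map the shared copy back over the original address. */
            if (mmap(addr, len, prot, MAP_SHARED | MAP_FIXED, fd, 0) != MAP_FAILED)
                return fd;
        }
    }
    close(fd);
    return -1;
}
```

The returned descriptor is what makes the "mirror" possible: a second mmap() of it with MAP_SHARED at another address yields a window onto the same bytes.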
Posted a patch here: https://sourceware.org/pipermail/libc-alpha/2023-February/145640.html
(In reply to Adhemerval Zanella from comment #3) > We used to have hacks to force mmap to load executables to 32-bits, it was > added to overcome some particular architecture limitations, and it has > caused some issues and it was eventually removed (check > ea5814467a02c9d2d7608b6445c5d60e2a81d3ee). Mm, it was indeed removed in the commit you mention, but re-introduced in 317f1c0a8a7
(In reply to Stas Sergeev from comment #6) > (In reply to Adhemerval Zanella from comment #3) > > We used to have hacks to force mmap to load executables to 32-bits, it was > > added to overcome some particular architecture limitations, and it has > > caused some issues and it was eventually removed (check > > ea5814467a02c9d2d7608b6445c5d60e2a81d3ee). > > Mm, it was indeed removed in the commit you > mention, but re-introduced in 317f1c0a8a7 That was to work around an architecture limitation; a better (and more complex) solution would be to fix it in the kernel, as suggested during patch review.
Hopping over here from a long and winding discussion in https://sourceware.org/bugzilla/show_bug.cgi?id=30127. (In reply to Stas Sergeev from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c46) > So let me summarize that memfd_create() > (shm_open() actually) is not a replacement, > but rather is an essential part of the > scheme. Using it together with la_premap_dlmem() > and la_premap() you can get the desired > picture. Desired picture is 2 identical > mappings of the same lib, one at relolc_addr, > one at mmap_addr=reloc_addr+VM_window_start. > > There is basically nothing else! > That scheme is very simple to describe, > but not that simple to grok from that > description, as no one have tried that > layout yet. I think above and this is a succinct description of Stas's intended use case: having double mappings for solibs allows sharing data between the host and a VM with only address translation at the VM boundary, instead of address translation on every memory access inside the VM. Solutions exist for heap memory and stack memory, leaving primarily the .data/.bss memory allocated as part of an solib. (Correct me if I'm mistaken of course.) The proposed la_premap and la_premap_dlmem (part of the dlmem() patch) collectively "solve" this problem by granting LD_AUDIT some limited control over the object (segment) mapping process. My first impression from reading the test cases, they seem a bit too specific to this use case. IMHO they are also out-of-scope for LD_AUDIT: LD_AUDIT works at the level of symbols and objects, both generic across OSs and even binary formats (ELF + DLL), whereas la_premap* expose an implementation detail of the dynamic linker. Most importantly, we do not yet deeply understand the implications exposing these callbacks can have, security or otherwise. An alternative solution I brought up in the prior discussion is "wrapping" the mmap syscall. In general, any Linux syscall can be wrapped using seccomp (e.g. 
via libseccomp [1]) or more recently with syscall user dispatch [2]. With the wrapper in place, every mmap would be replicated in the VM memory window and update a table used for address translation. Some behavior changes would be needed to appropriately implement MAP_ANONYMOUS and MAP_FIXED, but neither seem particularly problematic. AFAIK, this "wrapping mmap" approach is vastly more powerful and effective than the proposed la_premap{,_dlmem}. It operates at the Linux kernel level, and requires no changes to Glibc to implement nor a bleeding-edge kernel. It is powerful enough to transparently handle heap memory (provided the targeted allocation arena is brand new, i.e. in a newly opened dlmopen namespace). Wrapping and reimplementing syscalls are well-understood and widely used techniques by VM-adjacent tools, e.g. Wine (Windows syscall emulator) [3] and Docker/Podman (container runtimes) [4]. If this well-understood approach solves the problem, IMHO there isn't much point in arguing this RFE further. [1]: https://libseccomp.readthedocs.io/en/latest/ [2]: https://docs.kernel.org/admin-guide/syscall-user-dispatch.html [3]: https://lwn.net/Articles/826313/ [4]: https://docs.docker.com/engine/security/seccomp/ In response to a few other bits of prior discussion about mapping objects: (In reply to Stas Sergeev from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c45) > > > by doing 2 mappings of the same lib. > > ...If all you wanted was to mmap the solib to another address, you can > > already do that using mmap and /proc/self/map_files/. Maybe dl_iterate_phdr. > > That can only work for loadable sections, > I believe. .bss cannot be shared that way, > and likely much more. You're right, neglected .bss when suggesting this idea. This would not be an issue when using an mmap wrapper however, as the region is simply mapped with MAP_ANONYMOUS. 
(In reply to Stas Sergeev from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c48) > (In reply to Jonathon Anderson from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c33) > > The result of the first call to mmap() for an solib decides the base address > > While a bit outdated topic, I don't > think "the first call to mmap()" is a > good or reliable work-around. It may > change with an impl, or because of the > threads. To clarify here, the "first" call to mmap() is the one without MAP_FIXED, and is used to allocate the pages that will later be overwritten by MAP_FIXED. Threads should not become a problem here, just check the flags. Also, this is the pattern heavily recommended in man mmap(2) (NOTES, "Using MAP_FIXED safely"). IMHO it's unlikely that part of the implementation will change drastically, and I'm confident an mmap syscall wrapper could still handle it even if it did. :D > > AFAICT these discussions are all solved by memfd_create. Almost all of the > > complaints revolve around the memory vs. disk performance difference, > > I am getting a bit nervous already when people > mention memfd_create(). :) In what way is it > any better than shm_open(), that I used in my > la_premap_dlmem() example? > Yes, I could also use memfd_create() with > la_premap_dlmem(), but I prefer shm_open(). > Why people think that memfd_create() is the > thing, is unclear to me. :) But it fits my > design very well, as does shm_open(). My understanding is that the "file" created by memfd_create() cannot be shared outside the process and its spawned children, whereas the "file" created by shm_open() can be accessed by any other process with the same path argument. memfd_create() seems to be the more appropriate function when a *private* memory-backed file descriptor is needed, while shm_open() is better suited for shared memory across processes (hence the name).
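The "table used for address translation" mentioned earlier in this comment is not specified further in the thread. As a rough illustration only, it could be as small as a list of (outside, inside, length) ranges with a linear lookup. The type xlat_entry and the function xlat_outside_to_inside below are hypothetical names, and returning 0 for "no mapping" is an assumption that only works if address 0 is never a valid translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mirrored range: len bytes at "outside" (host view) are also visible
   at "inside" (the address as seen through the VM window). */
struct xlat_entry {
    uintptr_t outside;
    uintptr_t inside;
    size_t len;
};

/* Translate a host-side address to its VM-side counterpart.
   Returns 0 if the address is not covered by any mirrored range. */
static uintptr_t xlat_outside_to_inside(const struct xlat_entry *tab, size_t n,
                                        uintptr_t addr)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr >= tab[i].outside && addr - tab[i].outside < tab[i].len)
            return tab[i].inside + (addr - tab[i].outside);
    }
    return 0;
}
```

An mmap wrapper along the lines proposed here would append an entry each time it mirrors a mapping; translation is then needed only when a pointer crosses the VM boundary.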
(In reply to Jonathon Anderson from comment #8) > I think above and this is a succinct description of Stas's intended use > case: having double mappings for solibs allows sharing data between the host > and a VM with only address translation at the VM boundary, instead of > address translation on every memory access inside the VM. Solutions exist > for heap memory and stack memory, leaving primarily the .data/.bss memory > allocated as part of an solib. (Correct me if I'm mistaken of course.) That's correct. > The proposed la_premap and la_premap_dlmem (part of the dlmem() patch) > collectively "solve" this problem by granting LD_AUDIT some limited control > over the object (segment) mapping process. My first impression from reading > the test cases, they seem a bit too specific to this use case. IMHO they are > also out-of-scope for LD_AUDIT: LD_AUDIT works at the level of symbols and > objects, both generic across OSs and even binary formats (ELF + DLL), > whereas la_premap* expose an implementation detail of the dynamic linker. What exactly implementation detail? Its just "here's the length I need to map for solib. if you want, give me a buffer and/or fd for it". To me its quite similar to "here's the name of the solib. if you want, give me a different one to use". > Most importantly, we do not yet deeply understand the implications exposing > these callbacks can have, security or otherwise. Any explanation why there can be any security concerns here? > An alternative solution I brought up in the prior discussion is "wrapping" > the mmap syscall. In general, any Linux syscall can be wrapped using seccomp > (e.g. via libseccomp [1]) or more recently with syscall user dispatch [2]. > With the wrapper in place, every mmap would be replicated in the VM memory > window and update a table used for address translation. Some behavior > changes would be needed to appropriately implement MAP_ANONYMOUS and > MAP_FIXED, but neither seem particularly problematic. 
I don't understand what you mean. Besides the fact that you want to describe something very specific to particular libc (intercepting the particular mmap call, knowing how the particular dynamic loader works), you haven't written the detailed scheme of what you propose. You were referring (in another thread) to trapping only the first mmap() call done by dynamic loader, IIRC. How can that lead to a solution of having 2 identical mappings, is essentially unclear. At best it can solve the problem of specifying the reloc address, by the cost of depending on a particular impl of particular libc, forgetting about any portability to other libces. So please detail your proposal. > If this well-understood approach solves the problem, IMHO there isn't much > point in arguing this RFE further. It doesn't solve anything (except probably the reloc address), and the statements like this, together with the statement that my patch breaks your use-case or raises a security concerns, only suggests that you want to down-play any contributions that you review. In fact, since you never ever said a single word about how any of the multiple proposals (including DT_AUDIT for dlopen()) can be improved, I am quite sure its the case. I hope we can get more constructive. > (In reply to Stas Sergeev from comment > You're right, neglected .bss when suggesting this idea. This would not be an > issue when using an mmap wrapper however, as the region is simply mapped > with MAP_ANONYMOUS. I don't understand how this would not be an issue, please clarify. Region mapped with MAP_ANONYMOUS cannot be shared with VM. > (In reply to Stas Sergeev from comment > To clarify here, the "first" call to mmap() is the one without MAP_FIXED, > and is used to allocate the pages that will later be overwritten by > MAP_FIXED. Threads should not become a problem here, just check the flags. Why any other thread can't do unrelated mmap() without MAP_FIXED?
Created attachment 14753 [details] patches Here is the next impl of dlmem(), this time split into many small patches. I just need to update the log entries and post them to ml. I attach them here for the sake of this discussion. You need to look into patches 2, 3 and 10. 2 and 3 are trivial, and in patch 10 you need to look only at what I do with _dl_map_segments(). In particular, how I add "premap" there, which is actually la_premap_dlmem(). This is the very minimal set of changes you need to see to understand how shm_open() steps into the game. As I believe this part is still not well understood. Don't worry, these changes are really small this time! Please take a look. It will take 5 minutes.
(In reply to Stas Sergeev from comment #9) > What exactly implementation detail? > Its just "here's the length I need to > map for solib. if you want, give me a > buffer I meant to say "address" of course, not "buffer". Buffer approach was already criticized by Adhemerval, so why have I mentioned it again, is unclear. :)
(In reply to Jonathon Anderson from comment #8) > The proposed la_premap and la_premap_dlmem (part of the dlmem() patch) > collectively "solve" this problem by granting LD_AUDIT some limited control > over the object (segment) mapping process. My first impression from reading > the test cases, they seem a bit too specific to this use case. IMHO they are > also out-of-scope for LD_AUDIT: OK, if this is the case (which is entirely possible, even if I don't agree with the provided reasoning), then let's just not use audit. :) I can just add an optional "premap_ops" pointer to dlmem(). Advantages: much, much fewer changes. No need for dlload_audit_module() for that use-case, but I'll keep discussing it as a "bonus", in case someone finds it interesting to load audit modules at run-time. Everything is confined within dlmem(). Perfect for my use-case. Plain perfect. Disadvantages: Well, an extra arg that most people will set to NULL. A very small disadvantage, given that this API is not standard anyway. And it will not be possible to specify the reloc address for pre-existing functions like dlopen()/dlmopen(), or prospective ones like BSD's fdlopen(). I find that a bit of a pity given that people requested that functionality for currently existing APIs, but it is not a problem for my use-case. I only need to control dlmem(), not to help others on stackoverflow. :) So, does that sound better?
OK in fact that approach is so much better, that supporting pre-existing APIs is irrelevant here. Trying to fulfill someone's request on stackoverflow was a huge mis-goal. So... thanks for pointing that I was heading the wrong direction. Will implement a small and simple dlmem() with an extra ops arg, and w/o any audit machinery.
Hi, posted a comment here: https://sourceware.org/bugzilla/show_bug.cgi?id=30100 We can continue the dlmem() discussion there, as for the moment it is no longer relevant to having a way of specifying the reloc address for dlopen(). That was what the audit callbacks were needed for, but they are gone. A custom dlopen() can be trivially implemented on top of dlmem(), so the intention to control dlopen() directly was ill-fated from the beginning. :(
First dealing with a few meta off-topics: (In reply to Stas Sergeev from comment #9) > > If this well-understood approach solves the problem, IMHO there isn't much > > point in arguing this RFE further. > > It doesn't solve anything (except probably > the reloc address), and the statements like > this, together with the statement that my > patch breaks your use-case or raises a > security concerns, only suggests that you > want to down-play any contributions that you > review. > In fact, since you never ever said a single > word about how any of the multiple proposals > (including DT_AUDIT for dlopen()) can be > improved, I am quite sure its the case. > I hope we can get more constructive. As Adhemerval has already mentioned from the very start of this RFE (comment #1): > Any GNU extension requires a specific usercase that can't be easily accomplished with current API. Thus, the first priority for this RFE should be to establish this use case and express the failings of the current technology. A proposed patch series is difficult to review and near-impossible to merge without reaching some kind of consensus on these two points. Until this occurs, all but the most preliminary work on a patch is a waste of many people's time and patience, including yours as the author. I also have limited time to investigate the details in my responses. In an effort to remain useful (and succinct), I prioritize any discussion that will lead closer to this first priority. This of course means I often cannot discuss your contribution at length or make any suggestions; there is simply too much otherwise to discuss with higher priorities, and it takes me multiple hours of my spare time to collect that together in a cohesive reply. I hope you can understand. :) Coming back on topic, comment #8 establishes the succinct and sensible use case for this RFE. This is half the requirement, what remains now is to express the failings of the current technology for this use case. 
Comment #8 also describes the high-level view of an alternative solution available with current technology available in GNU/Linux (Glibc on Linux). The next step then is to discover where this solution fails for your specific use case. It would be very constructive if you could investigate my proposed solution as detailed below, and precisely express what the insurmountable problems with it are. :D (In reply to Stas Sergeev from comment #13) > OK in fact that approach is so much > better, that supporting pre-existing > APIs is irrelevant here. Trying to > fulfill someone's request on stackoverflow > was a huge mis-goal. > So... thanks for pointing that I was > heading the wrong direction. > Will implement a small and simple > dlmem() with an extra ops arg, and w/o > any audit machinery. Although a bit late now, I would advise against pursuing dlmem() unless the extra no-file-descriptor functionality is absolutely required for your use case. There are many open questions about the API, and it is clear dlmem() will have a far larger impact than la_premap* ever would. If you need the no-file-descriptor functionality and do want to continue dlmem(), I would recommend first developing a solid argument to assuage the initial concerns raised by Carlos almost a month ago now (https://sourceware.org/pipermail/libc-alpha/2023-February/145735.html). Namely, establishing the use case in clear terms and expressing why the alternative technology of "dlopenfd()" + memfd_create() fails to meet the use case. Coming back to the topic of the hour: (In reply to Stas Sergeev from comment #9) > > The proposed la_premap and la_premap_dlmem (part of the dlmem() patch) > > collectively "solve" this problem by granting LD_AUDIT some limited control > > over the object (segment) mapping process. My first impression from reading > > the test cases, they seem a bit too specific to this use case. 
IMHO they are > > also out-of-scope for LD_AUDIT: LD_AUDIT works at the level of symbols and > > objects, both generic across OSs and even binary formats (ELF + DLL), > > whereas la_premap* expose an implementation detail of the dynamic linker. > > What exactly implementation detail? > Its just "here's the length I need to > map for solib. if you want, give me a > buffer and/or fd for it". > To me its quite similar to "here's the > name of the solib. if you want, give > me a different one to use". Primarily, I am unclear what mmap flags an la_premap callback should use, or in what order it should mmap to keep the page table consistent with multiple threads (like _dl_map_segments). These are far deeper implementation details than simply "here's the file path to use when looking up an solib," and will depend heavily on the dynamic linker and OS it is compiled for. I do not believe it makes sense to expose details this deep via LD_AUDIT. On the API side, file descriptors are a concept specific to POSIX, and the ELF standard (technically) does not require that the objects be mmap()'d. While I do not believe there will be significant problems, it doesn't hurt to be kinder to our non-Glibc friends, on the off-chance LD_AUDIT becomes significantly more popular than it is today. :D > > Most importantly, we do not yet deeply understand the implications exposing > > these callbacks can have, security or otherwise. > > Any explanation why there can be any > security concerns here? These callbacks allow ASLR to be disabled completely in userspace. If a poorly implemented auditor causes dynamic library loading to become extremely predictable, an attacker may find a way to steal cryptographic secrets stored in the bss segment. High-security container runtimes can't easily protect against this ASLR-loss since the kernel is not involved. Is there a *real* security risk here? No idea, I have no clue if disabling ASLR in non-setuid applications is really a problem. 
LD_PREFER_MAP_32BIT_EXEC exists after all. But I can say there will be implications we do not yet (in this discussion) completely understand. > > An alternative solution I brought up in the prior discussion is "wrapping" > > the mmap syscall. In general, any Linux syscall can be wrapped using seccomp > > (e.g. via libseccomp [1]) or more recently with syscall user dispatch [2]. > > With the wrapper in place, every mmap would be replicated in the VM memory > > window and update a table used for address translation. Some behavior > > changes would be needed to appropriately implement MAP_ANONYMOUS and > > MAP_FIXED, but neither seem particularly problematic. > > I don't understand what you mean. > Besides the fact that you want to describe > something very specific to particular libc > (intercepting the particular mmap call, knowing > how the particular dynamic loader works), > you haven't written the detailed scheme of > what you propose. > ... > So please detail your proposal. Alright, here goes. There are few syscalls on Linux that alter the page table for a process (you can get a rough list by grepping the x86_64 syscall table in strace [1] for "TM"). On x86_64, there are three common ones that add *new* pages to a process: mmap(), mremap() and brk(). brk() and mremap() are most often used through malloc() and realloc(), so your custom libc shim should catch them even if you don't wrap them as syscalls. mmap() is far more common, both in ld.so and in Glibc in general, so that's the main target here. The general idea of the approach is to wrap mmap() and "mirror" the pages it allocates "outside" the VM to pages "inside" the VM. In most cases (~(MAP_ANONYMOUS|MAP_FIXED)), this should boil down to approximately: 1. mmap() the "outside" pages, 2. allocate some pages "inside" the VM to serve as the mirror pages, 3. mmap() the "inside" pages with the same fd, size and offset (+ MAP_FIXED with addr as the "inside" target address), and 4. 
update the address translation table to map "outside" to "inside" pages. MAP_ANONYMOUS doesn't provide an fd to "mirror" the pages through, so the wrapper will need to provide one. This can easily be a private memory-backed file (e.g. memfd_create). Allocate some pages from this file before (1), and use that for the fd and offset in the remaining steps. MAP_FIXED specifies an addr, so instead of allocating pages in (2) the wrapper will need to translate the provided addr from the "outside" to "inside" memory space. Usually the pages affected by an mmap(MAP_FIXED) will have been previously allocated via an mmap(~MAP_FIXED) (recommended practice from man mmap and implemented in _dl_map_segments), so this translation should always succeed (the wrapper could also abort the application if this precondition isn't met). That's the basic approach. This approach wraps mmap() while conforming to the Linux API, so it works for any segments that are mmap()'d. In GNU/Linux, solibs and .bss are included in that set. There are plenty of details that could be added, e.g. brk() could be reimplemented in terms of mmap(memfd_create()), mremap() could be duplicated in much the same way as mmap(), unimplemented/problematic syscalls can be initially replaced with abort(), etc. For a preliminary solution on GNU/Linux though, wrapping mmap() should be enough to create a duplicate mapping of an solib. [1]: https://gitlab.com/strace/strace/-/blob/master/src/linux/64/syscallent.h > You were referring (in another thread) to > trapping only the first mmap() call done by > dynamic loader, IIRC. How can that lead to > a solution of having 2 identical mappings, > is essentially unclear. At best it can solve > the problem of specifying the reloc address, When I suggested that before, I was trying to solve the problem of specifying the reloc address. I thought that was the core use case at the time. That said, the approach can be adjusted with little effort. 
In most cases (~MAP_FIXED) the mmap() wrapper simply needs to: 1. allocate some pages "inside" the VM to place the result, 2. mmap() the pages with the same fd, size and offset (+ MAP_FIXED with addr as the "inside" target address), and 3. return the "inside" target address. MAP_ANONYMOUS requires no special handling, since the pages aren't being mirrored in this case. MAP_FIXED specifies an addr, so instead of (1) just use the given addr instead. The wrapper could also abort the application if this addr is not "inside" the VM. The rest of the basic approach still holds. This approach wraps mmap() while conforming to the Linux API, so it works for any segments that are mmap()'d. In GNU/Linux, solibs and .bss are included in that set. > by the cost of depending on a particular impl > of particular libc, forgetting about any > portability to other libces. The only requirement of this approach is that the solibs (and .bss) are mmap()'d following the recommendation in man mmap. This is true of GNU/Linux (Glibc on Linux). I doubt there is another popular libc that doesn't mmap() the solibs, but is there one you plan to support? (I also doubt Glibc-only is a dealbreaker for you, given la_premap was a viable solution and LD_AUDIT is basically a GNU extension at this point. :P) Obviously this approach only works on Linux. Other OSs have their own syscalls and methods for intercepting them. I only know Linux, was there another OS you plan to support? > > (In reply to Stas Sergeev from comment > > You're right, neglected .bss when suggesting this idea. This would not be an > > issue when using an mmap wrapper however, as the region is simply mapped > > with MAP_ANONYMOUS. > > I don't understand how this would not be > an issue, please clarify. Region mapped > with MAP_ANONYMOUS cannot be shared with VM. See the description above. Pages mmap()'d with MAP_ANONYMOUS are mirrored via a memory-backed file to allow sharing with the VM. 
> > (In reply to Stas Sergeev from comment > > To clarify here, the "first" call to mmap() is the one without MAP_FIXED, > > and is used to allocate the pages that will later be overwritten by > > MAP_FIXED. Threads should not become a problem here, just check the flags. > > Why any other thread can't do unrelated > mmap() without MAP_FIXED? See the description above. No information (except the address translation table) needs to be persisted between mmap() calls, so it doesn't matter which thread invokes mmap() when. The only requirement is that mmap(MAP_FIXED) always overwrites pages previously allocated with mmap(~MAP_FIXED), as recommended in man mmap and implemented in _dl_map_segments.
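The MAP_ANONYMOUS case of the wrapper described in this comment (back the request with a memory-backed file, then map it both "outside" and at the "inside" target with MAP_FIXED) can be sketched as follows. This is a minimal illustration under assumptions: reserve_window and mirrored_anon_map are hypothetical names, the "VM window" is stood in for by an ordinary PROT_NONE reservation made without MAP_FIXED (the pattern from "Using MAP_FIXED safely" in mmap(2)), and the syscall interception itself (seccomp or syscall user dispatch) is not shown. Both views are MAP_SHARED; two MAP_PRIVATE views of the same file would not see each other's writes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve address space for a stand-in "VM window". No MAP_FIXED here, so
   nothing that already exists can be clobbered. Returns NULL on failure. */
static void *reserve_window(size_t size)
{
    void *w = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return w == MAP_FAILED ? NULL : w;
}

/* Hypothetical wrapper body for an anonymous request of len bytes: create a
   memory-backed file, map it "outside" at a kernel-chosen address and again
   at `inside`, which must lie within a range reserved beforehand.
   Returns the outside address, or NULL on failure. */
static void *mirrored_anon_map(size_t len, int prot, void *inside)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(len > 0 && len % page == 0);

    int fd = memfd_create("mirror", MFD_CLOEXEC);
    if (fd < 0)
        return NULL;

    void *outside = NULL;
    if (ftruncate(fd, (off_t)len) == 0) {
        outside = mmap(NULL, len, prot, MAP_SHARED, fd, 0);
        if (outside == MAP_FAILED) {
            outside = NULL;
        } else if (mmap(inside, len, prot, MAP_SHARED | MAP_FIXED, fd, 0)
                   == MAP_FAILED) {
            munmap(outside, len);
            outside = NULL;
        }
    }
    close(fd);   /* the two mappings keep the memory alive */
    return outside;
}
```

A write through either view is then visible through the other, which is the property the double mapping relies on. Whether this composes with the way _dl_map_segments later over-maps its first reservation is exactly the point still under discussion in this thread.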
(In reply to Jonathon Anderson from comment #15) > As Adhemerval has already mentioned from the very start of this RFE (comment > #1): > > Any GNU extension requires a specific usercase that can't be easily accomplished with current API. "easily" is quite important here. Even if your syscall interception approach could work (which I think is not the case), it doesn't fall into an "easy" category. To me, having a good API is also important. Why dlmem() is not the one? But lets deal with the question if your trick works at all first. > Thus, the first priority for this RFE should be to establish this use case > and express the failings of the current technology. A proposed patch series > is difficult to review Even when I split them into 13 nearly trivial patches? Then what else can I do to have it easy for a review? > and near-impossible to merge without reaching some > kind of consensus on these two points. Until this occurs, all but the most > preliminary work on a patch is a waste of many people's time and patience, > including yours as the author. Not necessarily: after all I can try to use the private glibc build for my project. But of course its a PITA so I prefer to not having to do that. > It would be very constructive if you could investigate my proposed solution > as detailed below, and precisely express what the insurmountable problems > with it are. :D I always do. :) > Namely, establishing the use case in clear terms and expressing why the > alternative technology of "dlopenfd()" + memfd_create() fails to meet the > use case. I thought we are already passed that point, and instead are discussing why mmap() intercept fails? dlopenfd()+memfd doesn't give even the possibility of specifying the reloc address, and that's a very minimal, insufficient requirement. > These callbacks allow ASLR to be disabled completely in userspace. 
> If a poorly implemented auditor causes dynamic library loading to become extremely predictable, an attacker may find a way to steal cryptographic secrets stored in the bss segment. High-security container runtimes can't easily protect against this ASLR-loss since the kernel is not involved.
>
> Is there a *real* security risk here? No idea, I have no clue if disabling ASLR in non-setuid applications is really a problem.

I think it's quite similar to allowing MAP_FIXED, which is allowed. One can already map one's secret data at a specific location every time, so there is nothing new here. But in any case, the LD_AUDIT bits are dropped from my patch.

> MAP_ANONYMOUS doesn't provide an fd to "mirror" the pages through, so the wrapper will need to provide one. This can easily be a private memory-backed file (e.g. memfd_create). Allocate some pages from this file before (1), and use that for the fd and offset in the remaining steps.

I still don't understand that part. You propose to intercept the first mmap. In the current impl it's a file-backed private mmap that spans past the end of the file. So basically it is 2 mmaps in one: what goes beyond the file image is similar to anonymous. So do you propose to split that mmap into 2, mirror the file-backed part and then mirror the anonymous part with memfd_create, correct? But the problem is that the first mmap actually only establishes the map address. Subsequent mmaps re-arrange the segments, over-mapping that space. So what you initially thought was anonymous space will be a file-based mapping, and vice versa. You can clearly see that from my patch 2. It converts the first mmap into an anonymous one (fully anonymous, not file-backed past EOF). So the first mmap doesn't have the actual layout, and so it can't establish any mirroring. It just allocates the space, which later gets over-mapped by the loader. Does this clarify?

> That's the basic approach.
This approach wraps mmap() while conforming to > the Linux API, so it works for any segments that are mmap()'d. In GNU/Linux, > solibs and .bss are included in that set. The problem is that first mmap has no idea where the .bss will be mapped. Neither does he know where the code is going to be mapped. The only reason its not anonymous currently in glibc, is because the first elf segment is supposed to be in the beginning. But everything else gets over-mapped. I am not even sure why the first segment should be at the beginning, does it always have the vaddr==0? But so do the comments in the present code suggest. > Obviously this approach only works on Linux. Other OSs have their own > syscalls and methods for intercepting them. I only know Linux, was there > another OS you plan to support? Not in the near future, maybe eventually. But I still fail to see why would the one want to do syscall interceptions instead of adding the right API, even if such interception could work (but it can't).
Note that even the parts that were initially file-backed are later treated as anonymous space. Such pages are re-protected to R/W and zeroed out with memset(). That includes .bss. I believe it's quite a bad solution, as in /proc/self/maps they would remain file-backed. After my patch there are no such discrepancies, as the first mmap is always anonymous. So you don't even need to clear .bss. :) Hope this clarifies why intercepting the first mmap doesn't give you anything.
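The re-protect and memset() behaviour matters for any mirroring scheme, because a store into a MAP_PRIVATE file mapping is copy-on-write. A small sketch, assuming Linux and a writable /tmp (the function name and the temporary file are hypothetical, used only for the demonstration), shows that a second mapping of the same file page does not observe such a store:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write made through a re-protected MAP_PRIVATE mapping
   is visible through a second mapping of the same file page, 0 if it
   is not, and -1 on error.  */
static int
private_write_is_mirrored (void)
{
  char path[] = "/tmp/private-demo-XXXXXX";
  const char pattern[16] = "AAAAAAAAAAAAAAA";
  size_t len = (size_t) sysconf (_SC_PAGESIZE);

  int fd = mkstemp (path);
  if (fd < 0)
    return -1;
  unlink (path);
  if (ftruncate (fd, (off_t) len) != 0
      || pwrite (fd, pattern, sizeof pattern, 0) != (ssize_t) sizeof pattern)
    {
      close (fd);
      return -1;
    }

  /* Two read-only private views of the same file page.  */
  char *a = mmap (NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
  char *b = mmap (NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
  close (fd);
  if (a == MAP_FAILED || b == MAP_FAILED)
    return -1;

  /* Same sequence as described above: re-protect to R/W, then zero.  */
  if (mprotect (a, len, PROT_READ | PROT_WRITE) != 0)
    return -1;
  memset (a, 0, len);

  /* The write went to a private copy of the page, so B still shows
     the original file contents.  */
  int mirrored = (b[0] == 0);
  munmap (a, len);
  munmap (b, len);
  return mirrored;
}
```

On Linux this returns 0: after the memset() the page behind `a` is a private copy, while `b` still shows the file contents.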
(In reply to Stas Sergeev from comment #16)
> I am not even sure why the first segment should be at the beginning, does it always have the vaddr==0? But so do the comments in the present code suggest.

Actually the code doesn't suggest that, and all segments are over-mapped over the initial mapping. I'll need to re-check my changes to _dl_map_segment() to see if I got things right. But given that all segments are over-mapped, there is certainly no reason to trap the first mmap call. Of course now I have a very bad feeling that your next proposal will be "trap all mmaps, not just the first one"... Well, before you do that, consider the following:
1. Some mappings are converted from file-based to anonymous via mprotect+memset.
2. _dl_map_segment() handles the "large alignment" case with 2 mmaps. The first, large one is done only for alignment; should I share that with the VM as well?
3. Do you really think that trapping all mmaps and trying to hack around the aforementioned problems is a good idea?
Actually I see you already suggest trapping all mmaps, fixed and non-fixed ones. You just never mentioned that change explicitly. And of course you will also suggest trapping mprotect(), not setting PROT_WRITE when glibc is trying to, and trapping the memset() to that area via SIGSEGV to find out that it should be converted to an anonymous mapping?
I studied the code a bit more to back up my earlier claims. mprotect()+memset() is applied when a segment's data end is not page-aligned and there are still alloc sections (like .bss) within that segment. They then land on the same page as the file data. This is under the "if (zeropage > zero)" clause of dl-map-segments.h. Subsequent .bss pages are anonymously mapped under the "if (zeroend > zeropage)" clause. So your algo fails if some SHT_NOBITS section is not page-aligned.

Plus I'd say your algo is not a solution. Intercepting all mmap calls from the dynamic loader and applying some weird tricks to them is not any better than writing another loader, for example. :) I am very surprised you make claims like "your patch is very difficult to review" w/o even looking into the very small patches that mostly split the huge multi-thousand-line funcs into reusable parts...
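The two clauses can be restated as plain address arithmetic. This is a sketch reconstructed from the description above rather than a copy of dl-map-segments.h: the variable names zero, zeropage and zeroend come from the comment, while struct zero_plan and plan_zero_fill() are hypothetical, and the load bias is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Which part of the zero-initialised tail of a segment is cleared in
   place and which part is mapped as fresh anonymous pages.  */
struct zero_plan
{
  uintptr_t memset_start, memset_end; /* tail of the last file-backed page */
  uintptr_t anon_start, anon_end;     /* whole pages mapped anonymously */
};

/* DATAEND is where the file-backed data stops, ALLOCEND is where the
   segment's memory image stops (ALLOCEND >= DATAEND).  */
static struct zero_plan
plan_zero_fill (uintptr_t dataend, uintptr_t allocend, uintptr_t pagesize)
{
  assert (allocend >= dataend);
  assert (pagesize != 0 && (pagesize & (pagesize - 1)) == 0);

  uintptr_t zero = dataend;
  uintptr_t zeroend = allocend;
  uintptr_t zeropage = (zero + pagesize - 1) & ~(pagesize - 1);
  if (zeroend < zeropage)
    zeropage = zeroend; /* everything fits in the last data page */

  /* "zeropage > zero":    [zero, zeropage) is non-empty, memset() it.
     "zeroend > zeropage": [zeropage, zeroend) is non-empty, map it
     anonymously.  */
  struct zero_plan plan = { zero, zeropage, zeropage, zeroend };
  return plan;
}
```

With dataend = 0x1234, allocend = 0x3000 and 4 KiB pages, the plan clears 0x1234..0x2000 in place and maps 0x2000..0x3000 anonymously. The in-place range is the one that shares a page with file-backed data, which is the case the comment argues breaks the mirroring.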
Ping, so any idea how to get your technique working with an unaligned SHT_NOBITS section? I spent some time trying to figure out if it's possible to privately use the patched glibc, and unfortunately it seems impossible. :( So either your technique should work, or my patch should be reviewed, because private use of the patched glibc does not seem possible even if you load the patched glibc into a separate NS. Please let me know what needs to be done to my patch to make it reviewable. I split it into 13 nearly trivial pieces, hoping that's enough for a review, but please let me know what else to do here.
Hi guys, so what is the status of all this? If my patches will never be looked into, no matter what, then perhaps you should tell me that right here, so that I stop wasting my time and others'.

Otherwise I have the following questions:
1. Have we passed the stage where my use-case is explained and clarified?
2. Have we passed the stage where I kept being presented with "alternative solutions" like "intercept all mmap (and perhaps also mprotect) syscalls and do some weird thing on them"? My last conclusion was that such a "solution" doesn't work for unaligned SHT_NOBITS sections.
3. If we passed 1 and 2, then I think the next step is to discuss an API, so here's the API:

dlfcn.h addition:

  /* Callback for dlmem. */
  typedef void *
  (dlmem_premap_t) (void *mappref, size_t maplength, size_t mapalign,
                    void *cookie);

  /* Do not replace destination mapping. dlmem() will then use memcpy(). */
  #define DLMEM_DONTREPLACE 1

  struct dlmem_args {
    /* Optional name to associate with the loaded object. */
    const char *soname;
    /* Namespace where to load the object. */
    Lmid_t nsid;
    /* dlmem-specific flags. */
    unsigned flags;
    /* Optional premap callback. */
    dlmem_premap_t *premap;
    /* Optional argument for premap callback. */
    void *cookie;
  };

  /* Like `dlmopen', but loads shared object from memory buffer. */
  extern void *dlmem (const unsigned char *buffer, size_t size, int mode,
                      struct dlmem_args *dlm_args);

Does anyone know if it's a good or bad API, and how it should be improved? It allows implementing dlopen_with_offset() in a couple of lines, it preserves the file-based mappings so that /proc/self/maps or /proc/self/map_files are valid, and it allows specifying the solib name, so it handles file-based mmaps, like dlopen_with_offset(), rather perfectly. I wish I could have a separate libdl, but so far that looks very difficult.
If you have any suggestions on how I can have a separate libdl, then that would indeed be a perfect alternative solution that would eliminate any need to patch the glibc sources. Or maybe some simple hooks can be added to aid a standalone libdl? Let me know and I will work in that direction then. But "no reply" is a bit inconclusive.
Sorry for the long delay in response, it's still a very busy time on my end. :P I'll make up for it with a very long (and probably repetitive) response instead. >.< (In reply to Stas Sergeev from comment #22) > Hi guys, so what is the status of > all this? If my patches would never > be looked into, no matter what, then > perhaps you should tell me that right > here, so that I stop wasting my and > other's time. AFAIK your patches will be looked at once a use case that requires it is solidified, that can't be solved with current tech nor any better proposed API. So far, it has been unclear why the primary function of dlmem() is needed for your use case. Why do you need to load solibs straight from memory at all? > In any other case I have the following > questions: > 1. Have we passed the stage where my > use-case is explained and clarified? Yes. > 2. Have we passed the stage where I > kept presented with an "alternative > solutions" like "intercept all mmap > (and perhaps also mprotect) syscalls > and do some weird thing on them"? My > last conclusion was that such "solution" > doesn't work for unaligned SHT_NOBITS > sections. No. I'm certain it works for unaligned SHT_NOBITS sections, any changes made to one side of the "mirror" are reflected in the other. (Although there is another flaw I missed before, an updated version of the technique is towards the bottom of this message. :P) > 3. If we passed 1 and 2, then I think > the next step is to discuss an API, > so here's the API: > ... > Does anyone know if its a good or bad > API, and how should it be improved? There is not yet a solid use case for the primary function of this API, the fact that it "loads an solib from memory." This primary functionality is the main source of concern originally raised by Carlos O'Donell, and AFAICT hasn't been resolved. The following API is close to your use case but doesn't raise the same concerns as dlmem(). Does this solve your problem, if not what's missing? 
  void *dlopen4(const char *filename, int flags,
                const struct dlopen4_args *ext,
                size_t ext_size /* = sizeof(struct dlopen4_args) */);
  void *dlmopen5(Lmid_t lmid, const char *filename, int flags,
                 const struct dlopen4_args *ext,
                 size_t ext_size /* = sizeof(struct dlopen4_args) */);
  struct dlopen4_args {
    /* If not NULL, function called before mmap when loading the object
       [and its dependencies?]. Returns the base of a mmapped range of
       given length and alignment. This mapping will be overwritten by
       the loaded object. */
    void *(*dla_premap)(void *preferred_addr, size_t length, size_t align,
                        void *userdata);
    /* User data passed to dla_premap. */
    void *dla_premap_userdata;
  };

> It allows to implement dlopen_with_offset() in a couple of lines, it preserves the file-based mappings so that /proc/self/maps or /proc/self/map_files are valid, and it allows to specify the solib name, so it handles the file-based mmaps, like dlopen_with_offset(), rather perfectly.

These are niceties, but I think we can agree a direct implementation of dlopen_with_offset() would be better for the use cases that need it. It would also require far fewer refactors than dlmem().

> I wish I could have a separate libdl, but so far that looks very difficult. If you have any suggestions how can I have the separate libdl, then that would indeed be a perfect alternative solution that will eliminate any need to patch glibc sources. Or maybe some simple hooks can be added to aid a standalone libdl? Let me know and I will work in that direction then.

I don't have any suggestions here, ld.so and libdl and Glibc are all deeply tied together. The best I can recommend is to patch Glibc and base a container around it, if that works for your client(s). :P

> But "no reply" is a bit inconclusive.

You don't need to tell me that I'm slow to respond. :P FWIW, Glibc like many other large OSS projects moves slowly.
Speaking from experience, expect many months before getting a change landed in a Fedora release, and multiple years before it spreads to other Linux distributions like Debian/Ubuntu or OpenSUSE. (In reply to Stas Sergeev from comment #16) > (In reply to Jonathon Anderson from comment #15) > > As Adhemerval has already mentioned from the very start of this RFE (comment > > #1): > > > Any GNU extension requires a specific usercase that can't be easily accomplished with current API. > > "easily" is quite important here. > Even if your syscall interception approach > could work (which I think is not the case), > it doesn't fall into an "easy" category. As I mentioned before, syscall interception is a technique used in many VM-adjacent and widely used technologies, to name a few: containers (Podman/Docker), Windows emulation (Wine), browser sandboxes (Firefox/Chromium), and debuggers (GDB/strace). Many great examples exist in the open-source community suitable for study, IMHO strace and Crun (part of Podman) are good choices to start. Given all this, I consider it much easier to write a syscall interception code than to write a shim library to translate between 32- and 64-bit call ABIs. FWIW. :D > To me, having a good API is also important. > Why dlmem() is not the one? > ... > > > Thus, the first priority for this RFE should be to establish this use case > > and express the failings of the current technology. A proposed patch series > > is difficult to review > > Even when I split them into 13 nearly trivial > patches? Then what else can I do to have it > easy for a review? I don't have many comments about the patch itself. If I find time to write them up I'll direct them to the dlmem() RFE. > dlopenfd()+memfd doesn't give even the > possibility of specifying the reloc address, > and that's a very minimal, insufficient requirement. Because you need the pages to be mirrored? Or is there another requirement here? 
> > It would be very constructive if you could investigate my proposed solution > > as detailed below, and precisely express what the insurmountable problems > > with it are. :D > > I always do. :) So far, there seems to be a lot of confusion about the technique but no objective flaws about the overall approach. I did notice a flaw in the interim that complicates the technique, but again not insurmountable. * * * I'll describe the approach and updated technique verbatim below, in the hopes it will smooth the discussion here, with the goal of understanding the flaws with the overall approach for your use case. The goal of the overall approach is to "mirror" ALL pages mmapped (after the syscall interception is installed) to pages inside the VM. That includes the pages forming a newly loaded solib. This is a very powerful approach that is not limited to the dynamic linker, it can be extended to mirror ANY memory allocated by the userspace code, including malloc()'d memory. "Mirroring" pages here (e.g. page A is mirrored to page A') has three strong criteria that need to be met: a. Any change to the memory in page A is reflected in page A', and vice versa. b. The location of page A' relative to some other mirrored page B' reflects the location of page A relative to page B, if the userspace code requires such (MAP_FIXED). c. A "page translation table" exists that records the mirror relationship from A to A'. The only way to implement criteria (a) on Linux is to propagate memory changes back to the backing fd (MAP_SHARED), so /proc/self/maps will definitely see file-backed mappings even for anonymous pages. On the other hand, (a) also means if a .bss region is cleared with memset(), those changes will be reflected in the mirror pages and so we don't have to intercept those. Criteria (b) only matters for MAP_FIXED calls, in the ~MAP_FIXED case the kernel (syscall interception) is allowed to choose any reasonable address to place the mmap()'d pages. 
The recommendation from man mmap is to (paraphrased): "mmap() without MAP_FIXED first, then overwrite the allocated mapping with MAP_FIXED." This avoids races in multithreaded code. The technique described later presumes this recommendation is followed in all userspace code and will abort() if not. This recommendation is followed by Glibc's dynamic linker, this is the rationale behind the first mmap() call you noticed gets completely over-mapped. Every mmap() syscall is intercepted with this technique (I thought I said that explicitly but maybe it got lost in editing :P). There are other syscalls that alter the page table that could be intercepted for a more complete solution: munmap(), mremap(), mprotect(), brk(). For simplicity I'm only going to discuss the interception for mmap(), other syscalls are left as an exercise to the reader (and should not be necessary for a preliminary implementation, I think). Now for the actual technique. The intercepted wrapper for mmap(addr, length, prot, flags, fd, offset) performs the following operations: 0. Adjust the arguments if flags contains MAP_ANONYMOUS or MAP_PRIVATE (described below), 1. mmap() the original pages (that live "outside" the VM), call them A, 2. allocate the mirror pages (that live "inside" the VM), call them A', 3. mmap() A' as a mirror of A, 4. update the "page translation table" (criteria (c)) with the A -> A' relation, and 5. return the address of A from step (1). There are a number of cases that need to be handled. The "base case" is (MAP_SHARED & ~MAP_ANONYMOUS & ~MAP_FIXED), here step (1) calls mmap(addr, length, prot, flags, fd, offset), and step (3) calls mmap(A'.addr, length, prot, flags | MAP_FIXED, fd, offset). Step (2) allocates any free pages in the VM. This creates a natural mirrored mapping between A and A'. If flags contains MAP_ANONYMOUS, an extra step (0) is added before step (1). 
In step (0), fd is replaced by a file descriptor allocated with memfd_create(), and offset by the offset of some freshly allocated pages in that file. flags has the MAP_ANONYMOUS bit removed, since now it is no longer an anonymous mapping. All cases below then apply.

If flags contains MAP_FIXED, step (2) needs to change. Assuming the man mmap recommendation is followed, there must already be an A -> A' mapping in the "page translation table" in this case. Step (2) reuses this prior mapping and uses this previously-allocated A'; if one doesn't exist it abort()s the entire application. (Note that this reflects the over-mapping done by the dynamic linker in the VM space, so no issues with that.)

If flags contains MAP_PRIVATE, extra steps are once again needed. If this is a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used later to add write access (IIRC I have not observed Glibc's ld.so do so with strace), then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the rest will work.

Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, the mapped portion of the file needs to be copied out to an editable file. I can think of two implementations off the top of my head, others likely exist.

First idea:
0.1. Allocate some pages in an anonymous file as if flags contained MAP_ANONYMOUS, results in fd_a and offset_a.
0.2. orig = mmap(NULL, length, PROT_READ, flags, fd, offset);
0.3. fd = fd_a, offset = offset_a;
1. mmap(..., prot, ..., fd, offset) the original pages (that live "outside" the VM), call them A,
1.1. memcpy(A.addr, orig, length);
1.2. munmap(orig, length);

Second idea:
0.1. Allocate some pages in an anonymous file as if flags contained MAP_ANONYMOUS, results in fd_a and offset_a.
0.2. orig_off = lseek(fd, 0, SEEK_CUR);
0.3. lseek(fd, offset, SEEK_SET);
1. mmap(..., prot, ..., fd_a, offset_a) the original pages (that live "outside" the VM), call them A,
1.1. read(fd, A.addr, length);
1.2. lseek(fd, orig_off, SEEK_SET);
1.3.
fd = fd_a, offset = offset_a; That's it, that's the entire technique. It's a powerful approach reminiscent of container tech, which I find fitting for a use case messing with a VM. It's a straightforward technique with good similar examples in the open-source community, for example strace's --inject= options. It's a small technique, I would budget at around 100-300 lines for a PoC implementation. It's not a performant approach, but presumably your apps aren't dlopen()/dlclose()'ing solibs like there's no tomorrow. What's wrong with it? * * * > Of course now I have some very bad feeling > that your next proposal will be "trap > all mmaps, not just the first one"... > Well, before you do that, consider the > following: > 1. Some mappings are converted from > file-based to anonymous via mprotect+memset. The fact that the pages are mirrored handles this, changes in one are reflected in the other. Note that this trait is required to make shared memory work at all. IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only, so they don't need to be mirrored in a simple PoC implementation. At least for simple cases, YMMV. > 2. _dl_map_segment() handles the "large > alignment" case with 2 mmaps. The first > large one is done only for alignment, and > should I share with VM also that? Yes. It's simpler and more robust if you don't try to be smart about these cases, at least for a PoC. > 3. Do you really think that trapping all > mmaps and trying to hack around the > aforementioned problems, is a good idea? > ... > Plus I'd say your algo is not a solution. > Intercepting all mmap calls from dynamic > loader and provide some weird tricks to them, > is not any better than to write another loader, > for example. :) Yes, I really think syscall interception is a great idea. It's an order of magnitude smaller than your refactoring patches, and works on every GNU/Linux box (possibly every Linux box) updated in the last 5 years. 
It can be extended to be more powerful than any alteration to the dynamic linker. If it works for you, IMHO it is VASTLY better solution than patching Glibc, both for you and for your client(s). :D > I am very surprised you make the claims like > "your patch is very difficult to review" > w/o even looking into the very small patches > that mostly split the huge multi-thousands-line > funcs into a reusable parts... Your patch is difficult to review for reasons that have to do with the API and use case, not the implementation. It's also a refactor touching over a thousand lines, that's enough reason to make it hard to review. :P
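The "page translation table" of criterion (c) is the only state the technique keeps between calls, and the MAP_FIXED case depends on looking up a prior A -> A' entry. A minimal sketch of such a table follows. All names here (xlat_table, xlat_add, xlat_lookup) are hypothetical, the fixed-size array is a simplification, and no locking is shown even though a real interception layer would need it for multithreaded programs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XLAT_MAX 64

/* One mirrored range: OUTSIDE is page set A, INSIDE is its mirror A'.  */
struct xlat_entry
{
  uintptr_t outside;
  uintptr_t inside;
  size_t len;
};

struct xlat_table
{
  struct xlat_entry e[XLAT_MAX];
  size_t n;
};

/* Record that [outside, outside + len) is mirrored at INSIDE.
   Returns 0 on success, -1 if the table is full.  */
static int
xlat_add (struct xlat_table *t, void *outside, void *inside, size_t len)
{
  if (t->n == XLAT_MAX)
    return -1;
  t->e[t->n].outside = (uintptr_t) outside;
  t->e[t->n].inside = (uintptr_t) inside;
  t->e[t->n].len = len;
  t->n++;
  return 0;
}

/* Translate an "outside" address to its mirror, or NULL if the address
   is not covered by any recorded range.  A MAP_FIXED wrapper following
   the description above would abort() on NULL.  */
static void *
xlat_lookup (const struct xlat_table *t, const void *outside)
{
  uintptr_t a = (uintptr_t) outside;
  for (size_t i = 0; i < t->n; i++)
    if (a >= t->e[i].outside && a - t->e[i].outside < t->e[i].len)
      return (void *) (t->e[i].inside + (a - t->e[i].outside));
  return NULL;
}
```

The same lookup also gives the 32-bit side a way to turn a host pointer into an address inside the VM window, which is what the original use case needs.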
(In reply to Jonathon Anderson from comment #23)
> AFAIK your patches will be looked at once a use case that requires it is solidified, that can't be solved with current tech nor any better proposed API. So far, it has been unclear why the primary function of dlmem() is needed for your use case. Why do you need to load solibs straight from memory at all?

While this is quite handy for my use-case (the solib image comes from a VM, so it's already in memory and has no host fd), the primary problem is that any file-based API destroys the existing mapping by definition. So I chose dlmem() because it both suits surprisingly well and has the potential to preserve the user's mapping. Other than that, it's completely agnostic of my use-case. It just allows one to dlmem() into the user's buffer.

> No. I'm certain it works for unaligned SHT_NOBITS sections, any changes made to one side of the "mirror" are reflected in the other. (Although there is another flaw I missed before, an updated version of the technique is towards the bottom of this message. :P)

I think it's the same problem that you are now trying to avoid by introducing the writable file. An unaligned SHT_NOBITS section results in re-protecting the file-backed MAP_PRIVATE page into a writable one.

> There is not yet a solid use case for the primary function of this API, the fact that it "loads an solib from memory." This primary functionality is the main source of concern originally raised by Carlos O'Donell, and AFAICT hasn't been resolved.

Could you please explain the concern itself? I mean, what problem is there with having an API to dlmem() from memory? Is it a security concern, or some other kind? What justifies the straight "no" or "no unless you disprove 1024+ tricks to do the same with unportable syscall-intercepting techniques"?

> The following API is close to your use case but doesn't raise the same concerns as dlmem(). Does this solve your problem, if not what's missing?
> void *dlopen4(const char *filename, int flags, const struct dlopen4_args > *ext, size_t ext_size /* = sizeof(struct dlopen3_args) */); > void *dlmopen5(Lmid_t lmid, const char *filename, int flags, const > struct dlopen4_args *ext, size_t ext_size /* = sizeof(struct dlopen3_args) > */); > struct dlopen4_args { > /* If not NULL, function called before mmap when loading the object > [and its dependencies?]. > Returns the base of a mmapped range of given length and alignment. > This mapping will be > overwritten by the loaded object. */ > void *(*dla_premap)(void *preferred_addr, size_t length, size_t align, > void *userdata); > /* User data passed to dla_premap. */ > void *dla_premap_userdata; > }; The primary problem is that this API doesn't allow to preserve the user's mapping. It is only using that mapping to specify the reloc address, while dlmem() can optionally preserve it (I use the separate flag for that). The secondary problem is "filename", but yes, I know you'll suggest to get it from /proc/self/fd. > These are niceties, but I think we can agree a direct implementation of > dlopen_with_offset() would be better for the use cases that need it. It > would also require far less refactors than dlmem(). I can remove all refactors and replace them with copy/pasts. Much bigger code but no change to existing code. Will that be any better? OTOH all refactors I did, just take some code chunk and move it to a separate func with the different indentation level. These diffs should be looked into with some tool that ignores indentation. Only then it would be clear how small they are. > As I mentioned before, syscall interception is a technique used in many > VM-adjacent and widely used technologies, to name a few: containers > Windows emulation (Wine), browser sandboxes > (Firefox/Chromium), I wonder if the above ones are actually do the syscall interception, or just use the bpf filters to avoid malicious code from using syscalls? 
> Given all this, I consider it much easier to write a syscall interception > code than to write a shim library to translate between 32- and 64-bit call > ABIs. FWIW. :D Its a bit strange to intercept the syscalls of your own code. I am quite sure none of the projects you mentioned, actually do this. They intercept the syscalls of some 3rd-party code running along, but never their own syscalls. gdb/strace definitely intercept the syscalls of the debugee, same with the rest of the projects. Most of dl_audit framework can be implemented with syscall interception, but why don't you want to do that? > > dlopenfd()+memfd doesn't give even the > > possibility of specifying the reloc address, > > and that's a very minimal, insufficient requirement. > Because you need the pages to be mirrored? Or is there another requirement > here? Mirrored and also reloc address specified. AFAICT fdlopen()+memfd gives neither. > There are a number of cases that need to be handled. The "base case" is > (MAP_SHARED & ~MAP_ANONYMOUS & ~MAP_FIXED), Not used by libdl AFAIK, so skipping. > If flags contains MAP_ANONYMOUS, an extra step (0) is added before step (1). That's quite clear. > If flags contains MAP_PRIVATE, extra steps are once again needed. If this is > a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used later > to add write access (IIRC I have not observed Glibc's ld.so do so with > strace), But this is exactly what happens if you have unaligned SHT_NOBITS section. It goes to the same page that used MAP_PRIVATE to load an elf segment. glibc then re-protects and memsets that part. Even if you haven't seen that with strace, I was pointing to the exact code that does this. > then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the > rest will work. If the page is never re-protected, then MAP_SHARED is not even needed. You can just have 2 private mappings from same file. 
> Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, the AFAIK there is no such case. PROT_WRITE is applied later with mprotect() if you have an unaligned SHT_NOBITS section, but is AFAICS never applied initially. > That's it, that's the entire technique. It's a powerful approach reminiscent > of container tech, which I find fitting for a use case messing with a VM. > It's a straightforward technique with good similar examples in the > open-source community, for example strace's --inject= options. It's a small > technique, I would budget at around 100-300 lines for a PoC implementation. > It's not a performant approach, but presumably your apps aren't > dlopen()/dlclose()'ing solibs like there's no tomorrow. What's wrong with it? Contrary to what you say, no one is intercepting his own syscalls. And the SHT_NOBITS section problem is not yet addressed, although of course you will propose to intercept also mprotect() to get it in. > > Of course now I have some very bad feeling > > that your next proposal will be "trap > > all mmaps, not just the first one"... > > Well, before you do that, consider the > > following: > > 1. Some mappings are converted from > > file-based to anonymous via mprotect+memset. > The fact that the pages are mirrored handles this, changes in one are Pages are not mirrored in case of a MAP_PRIVATE mapping that was later re-protected to r/w. Of course you can always use MAP_SHARED beforehand, and do a writable file copy, which will basically mean to just copy the initially memory-based solib into a file on hdd rather than to even properly use memfd. > IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only, so > they don't need to be mirrored in a simple PoC implementation. At least for > simple cases, YMMV. Not sure if the unaligned SHT_NOBITS (that causes re-protect to R/W) is a "simple case" or not. 
> > I am very surprised you make the claims like
> > "your patch is very difficult to review"
> > w/o even looking into the very small patches
> > that mostly split the huge multi-thousands-line
> > funcs into a reusable parts...
> Your patch is difficult to review for reasons that have to do with the API
> and use case, not the implementation. It's also a refactor touching over a
> thousand lines, that's enough reason to make it hard to review. :P

If indentation is ignored, my patches touch about a dozen lines. The rest is just large chunks of code moved into separate functions.
For example, the diffstat of the largest patch, the one that actually implements dlmem, is:

48 files changed, 484 insertions(+), 1 deletion(-)

1 deletion! (in a makefile)
And another patch, which adds the optional part of dlmem, looks like this:

5 files changed, 202 insertions(+), 2 deletions(-)

2 deletions, in a makefile.
You probably can't ask for a better separation of changes: the 2 main patches change no existing code at all.
Yes, there are also 2 patches with diffstats under 200 lines, but if indentation is ignored they come to 20 lines. The rest of the patches are in the range of 10-50 lines. Not sure if any better separation is possible.

> Your patch is difficult to review for reasons that have to do with the API

What does this mean? We can also discuss the API here if it somehow makes the patch difficult to review.
(In reply to Jonathon Anderson from comment #23)
> These are niceties, but I think we can agree a direct implementation of
> dlopen_with_offset() would be better for the use cases that need it. It
> would also require far less refactors than dlmem().

Getting a bit more abstract here: why are refactors that bad? glibc is full of multi-thousand-line functions interspersed with gotos. Is this because refactors are prohibited?
I mean, I was hoping for a "thank you" for a couple of small refactors.
Is the current glibc code style (huge spaghetti functions) intentional and enforced?
(In reply to Stas Sergeev from comment #24)
> Could you please explain the concern
> itself? I mean, what problem is there
> to have an API to dlmem() from memory?
> Is it a security concern, or what kind of?

Briefly summarizing the main points from the original email in the mailing list [1]:

> dlmem() works at a lower level of abstraction than the rest of the dl* APIs, i.e. memory instead of solibs/objects. That has widespread impacts across many users of Glibc, including but not limited to security, LD_AUDIT, and developer tools (GDB). Some reasons follow:
> - dlmem() does not ensure that the passed memory is a correctly mmap()'d object. It would be strongly preferable that the API ensures we CAN'T end up in an inconsistent state, instead of making it UB if the user slips up.
> - dlmem() removes the "file descriptor" abstraction out of the link_map. A lot of tooling has to change to fit this new reality, both inside and outside Glibc: LD_AUDIT, developer tools (e.g. GDB), etc.
> - dlmem() skips many syscalls, removing the kernel-side auditable events required for security tooling. In contrast, "dlopenfd" requires both memfd_create() (or similar) and mmap() of that fd, allowing e.g. FFI/JIT to be locked down by a security seccomp filter.

Adding my own concern as well:
- dlmem() seems to expect the user to parse the program headers and mmap() the binary as required. That requires the application to re-implement a core, delicate piece of ld.so... and do so correctly. From an API design perspective, that seems like a very poor choice of abstraction.

AFAICS none of these issues have been resolved in the latest patches. Some of these issues are intrinsic to the dlmem() semantics. So if another, better API will work for your case, that certainly would be preferred.

[1]: https://sourceware.org/pipermail/libc-alpha/2023-February/145735.html

> > The following API is close to your use case but doesn't raise the same
> > concerns as dlmem().
Does this solve your problem, if not what's missing? > > void *dlopen4(const char *filename, int flags, const struct dlopen4_args > > *ext, size_t ext_size /* = sizeof(struct dlopen3_args) */); > > void *dlmopen5(Lmid_t lmid, const char *filename, int flags, const > > struct dlopen4_args *ext, size_t ext_size /* = sizeof(struct dlopen3_args) > > */); > > struct dlopen4_args { > > /* If not NULL, function called before mmap when loading the object > > [and its dependencies?]. > > Returns the base of a mmapped range of given length and alignment. > > This mapping will be > > overwritten by the loaded object. */ > > void *(*dla_premap)(void *preferred_addr, size_t length, size_t align, > > void *userdata); > > /* User data passed to dla_premap. */ > > void *dla_premap_userdata; > > }; > > The primary problem is that this API > doesn't allow to preserve the user's > mapping. It is only using that mapping > to specify the reloc address, while > dlmem() can optionally preserve it (I > use the separate flag for that). This is precisely one of the concerns with dlmem(). Why must the user's mapping be preserved? So that the mirroring can be set up before the object is loaded? Would replacing the dla_premap hook with some kind of custom-mmap() (dla_mmap()) hook fit your use case better? That could allow you to set up mirroring *as* the object is loaded, instead of before. FWIW, do you need page-mirroring at all if you can just choose the reloc address to be within the VM space? > The secondary problem is "filename", > but yes, I know you'll suggest to get > it from /proc/self/fd. I would prefer /proc/self/fd over dlopenfd4(). But dlopenfd() seems to be of wider interest, so whatever works. > I can remove all refactors and replace > them with copy/pasts. Much bigger code > but no change to existing code. > Will that be any better? > OTOH all refactors I did, just take some > code chunk and move it to a separate func > with the different indentation level. 
> These diffs should be looked into with > some tool that ignores indentation. > Only then it would be clear how small they > are. I wouldn't waste any more time on the dlmem() patch until the concerns above can be addressed. > > As I mentioned before, syscall interception is a technique used in many > > VM-adjacent and widely used technologies, to name a few: containers > > Windows emulation (Wine), browser sandboxes > > (Firefox/Chromium), > > I wonder if the above ones are actually > do the syscall interception, or just use > the bpf filters to avoid malicious code > from using syscalls? None of the examples above do exactly what you're looking for. If I knew of any OSS that did, I would just point you there. AFAIK your use case is very unique. Of the examples I've named: - Wine outright implements Windows syscalls on Linux, by intercepting all syscalls in the running process and performing the translation in userspace (SIGSYS handler). - strace and GDB intercept syscalls "remotely" via ptrace(). IMHO the process of poking the registers and memory via ptrace() is not all that different than doing so from inside a SIGSYS signal handler. - Podman/Docker use libseccomp to filter syscalls with BPF seccomp() filters. BPF isn't powerful enough for the proposed approach, but it is similar in that it can alter the arguments and return values (to a limited extent). - Firefox and Chrom(ium) also use seccomp() filters, but they also register special handlers for SIGSYS. IIRC it's mainly for error reporting and not for interception, but you get the idea. In short, intercepting syscalls is done in multiple OSS projects to varying extents, for security and for profit. Wine is the only one that is as extreme as your use case, but the rest do have some degree of similarity. > > Given all this, I consider it much easier to write a syscall interception > > code than to write a shim library to translate between 32- and 64-bit call > > ABIs. FWIW. 
:D > > Its a bit strange to intercept the syscalls > of your own code. I am quite sure none of > the projects you mentioned, actually do this. > They intercept the syscalls of some 3rd-party > code running along, but never their own syscalls. Presumably you won't intercept (all of) your own syscalls, primarily you're aiming for the syscalls while the 3rd-party "ancient code" is loading. So isn't it pretty much the same? > Most of dl_audit framework can be implemented > with syscall interception, but why don't you > want to do that? Because (1) LD_AUDIT hearkens back to the days of Solaris and so is already on literally every GNU/Linux box in active use, and because (2) symbol binding (la_symbind) is done completely in userspace and can't be intercepted by syscalls. Very different situation. > > > dlopenfd()+memfd doesn't give even the > > > possibility of specifying the reloc address, > > > and that's a very minimal, insufficient requirement. > > Because you need the pages to be mirrored? Or is there another requirement > > here? > > Mirrored and also reloc address specified. > AFAICT fdlopen()+memfd gives neither. And based on prior comments, I assume you also want to preserve user mappings here. > > If flags contains MAP_PRIVATE, extra steps are once again needed. If this is > > a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used later > > to add write access (IIRC I have not observed Glibc's ld.so do so with > > strace), > > But this is exactly what happens if you > have unaligned SHT_NOBITS section. It > goes to the same page that used MAP_PRIVATE > to load an elf segment. glibc then re-protects > and memsets that part. Even if you haven't > seen that with strace, I was pointing to the > exact code that does this. Missed that comment, sorry. Link to the code so we're all on the same page: [2] Note that the mprotect() calls are only if(__glibc_unlikely((c->prot & PROT_WRITE) == 0)). 
It seems that newer ld places .data and small .bss in a RW LOAD segment, which would explain why I've never observed it happen myself with strace and modern software. This makes me curious how old/common binaries are that trip this case. This code (complete with the "Dag nab it" comment) have been present in Glibc since 1995: [3]. So maybe... *really* ancient binaries? :D If it bothers you, this case can be ignored and the following case (that copies the data to a writable anonymous file) used instead. [2]: https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-map-segments.h;hb=07dd75589ecbedec5162a5645d57f8bd093a45db#l165 [3]: https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-load.c;hb=d66e34cd423425c348bcc83df127dd19711b0b9a#l339 > > then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the > > rest will work. > > If the page is never re-protected, then > MAP_SHARED is not even needed. You can > just have 2 private mappings from same file. True! > > Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, the > > AFAIK there is no such case. > PROT_WRITE is applied later with mprotect() > if you have an unaligned SHT_NOBITS section, > but is AFAICS never applied initially. PROT_WRITE is applied initially if the LOAD segment is marked as RW. A quick readelf -l on a few of my system's binaries seems to indicate this is pretty common for .data and .bss in modern software. > Contrary to what you say, no one is > intercepting his own syscalls. I beg to disagree. Many projects filter or intercept their own syscalls. This *specific* approach hasn't been done before (I would point you to it if it was), but intercepting (or at least filtering) syscalls in-process is nothing new. > And the SHT_NOBITS section problem is not > yet addressed, although of course you will > propose to intercept also mprotect() to get > it in. The most I would do in an mprotect() interception is ensure PROT_WRITE doesn't get added to any pages (i.e. 
abort() the application if it does). That doesn't really solve this problem, but it could catch some issues with the mirrored pages. Maybe. :P > Pages are not mirrored in case of a > MAP_PRIVATE mapping that was later > re-protected to r/w. Of course you > can always use MAP_SHARED beforehand, > and do a writable file copy, Indeed! Which is exactly what I suggested. :D > which will > basically mean to just copy the initially > memory-based solib into a file on hdd rather > than to even properly use memfd. Why is the HDD required here, can't you just copy to a memfd file? That's what I suggested above. > > IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only, so > > they don't need to be mirrored in a simple PoC implementation. At least for > > simple cases, YMMV. > > Not sure if the unaligned SHT_NOBITS > (that causes re-protect to R/W) is a > "simple case" or not. Well, it *seems* very uncommon in modern software. Not sure whether it's rare in your "ancient code" case. Either way, solution discussed above. > If indentation is ignored, then my patches > touch a dozen of lines. There are just the > moves of a large chunks of code to a separate > funcs. I more meant that comparing the hundreds of lines that have moved around is time-consuming. There are tools to help, it just takes a lot of time that could be better spent other places. Speaking from experience with my main project. :P But it seems like your latest patches are shorter than I had remembered, I stand corrected. IIRC at one point there was a 1300-addition patch, which is where my comment came from, but that seems to have been cleaned up now. Great! :D (In reply to Stas Sergeev from comment #26) > (In reply to Jonathon Anderson from comment #23) > > These are niceties, but I think we can agree a direct implementation of > > dlopen_with_offset() would be better for the use cases that need it. It > > would also require far less refactors than dlmem(). 
> > Getting a bit more abstract here, > why refactors are that bad? glibc > is full of multi-thousands-line funcs > intersected by gotos. Is this because > the refactors are prohibited? > I mean, I was hoping for a "thank you" > for a couple of small refactors. > Is the current glibc code style (huge > spaghetti funcs) is intentional and > enforced? I don't run the show here... but AFAIK the code here is carefully, heavily, manually optimized to generate the best performance with a wide range of C compilers. Carelessly refactoring it and especially adding additional function calls will destroy a lot of that work. (Although I dislike the spaghetti as much as you do. :P) I've seen other refactors merge from the mailing list, IIRC performance almost always comes up in the leading discussion. But again, the main problem with your patches is the concerns with the dlmem() semantics, not the size nor quality of your patches themselves. So let's fix that first.
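The "readelf -l" observation made earlier in this comment (that RW LOAD segments are common, so PROT_WRITE is often requested on the initial mmap()) can also be checked from inside a running process. The following is only a sketch using glibc's dl_iterate_phdr(); the struct and function names are made up for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* Illustrative counters, not part of any glibc API.  */
struct load_counts {
    size_t total;      /* all PT_LOAD segments seen */
    size_t writable;   /* PT_LOAD segments with PF_W set */
};

/* dl_iterate_phdr() callback: walk the program headers of one loaded
   object and count its PT_LOAD segments.  */
static int count_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    struct load_counts *c = data;
    (void)size;
    assert(c != NULL);
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD)
            continue;
        c->total++;
        if (ph->p_flags & PF_W)
            c->writable++;
    }
    return 0;   /* 0 means: continue with the next object */
}

/* Count PT_LOAD segments over every object loaded in this process.  */
static struct load_counts count_load_segments(void)
{
    struct load_counts c = { 0, 0 };
    dl_iterate_phdr(count_cb, &c);
    return c;
}
```

A non-zero `writable` count would be consistent with the claim that modern binaries carry .data/.bss in an RW LOAD segment. It says nothing about the unaligned SHT_NOBITS case, which depends on a read-only segment having trailing bss.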
(In reply to Jonathon Anderson from comment #27)
> Briefly summarizing the main points from the original email in the mailing
> list [1]:

You are creatively summarizing. :) To me, all of Carlos's concerns were addressed, and yours are completely new to me.

> > dlmem() works at a lower level of abstraction than the rest of the dl* APIs, i.e. memory instead of solibs/objects. That has widespread impacts across many users of Glibc, including but not limited to security, LD_AUDIT, and developer tools (GDB). Some reasons follow:

I think we need _all_ the reasons for such broad claims, not "some".

> > - dlmem() does not ensure that the passed memory is a correctly mmap()'d object. It would be strongly preferable that the API ensures we CAN'T end up in an inconsistent state, instead of making it UB if the user slips up.

That's not a valid assumption. The refactors in my patch were not done for lack of anything better to do, but exactly to have a common path for dlopen() and dlmem(). All elf sanity checks done by dlopen() are also applied to dlmem().

> > - dlmem() removes the "file descriptor" abstraction out of the link_map.

Could you please clarify? I don't remember an fd field in struct link_map, and the object name, which is there, is supported by dlmem().

> A lot of tooling has to change to fit this new reality, both inside and outside Glibc: LD_AUDIT, developer tools (e.g. GDB), etc.

This needs clarification, I don't understand that part. What should they change and why? Maybe gdb needs to be able to trap dlmem() to auto-load debug symbols. Yes, that's what I admitted long ago. But anything other than that?

> > - dlmem() skips many syscalls, removing the kernel-side auditable events required for security tooling.

There are 2 use-cases.
1 is when dlmem() skips nothing, in the sense that you yourself need to mmap() the elf beforehand. So the kernel still sees everything, and even /proc/self/map_files is correct.
2 is when the memory buffer comes out of some other world, like from a VM. In that case it doesn't matter that an extra call like memfd_create() is not done, as verifying the code source is impossible in that case.

> In contrast, "dlopenfd" requires both memfd_create() (or similar) and mmap() of that fd, allowing e.g. FFI/JIT to be locked down by a security seccomp filter.

You can still lock down your jit with a seccomp filter. Not sure why you need memfd_create() to do that.

> Adding my own concern as well:

They were all your own though. :)

> - dlmem() seems to to expect the user to parse the program headers and
> mmap() the binary as required. That requires the application to re-implement
> a core, delicate piece of ld.so...

Not sure what you are talking about. My patch adds quite comprehensive test-cases that try to cover the basic scenarios. So it would help if you referred to a particular test of mine that does something like this, as I don't remember any that did. Like I said before, dlmem() uses essentially the same code path in glibc as dlopen() does. And only a few small refactors were needed to accomplish that.

> and do so correctly. From an API design
> perspective, that seems like a very poor choice of abstraction.

If I knew what you are referring to, maybe I could answer. :)

> AFAICS none of these issues have been resolved in the latest patches.

This is because, as I said above, your summary of Carlos's concerns is "creative". I addressed his concerns: I dropped the LD_AUDIT bits and I showed how to implement fdlopen() and dlopen_with_offset().

> Some
> of these issues are intrinsic to the dlmem() semantics. So if another,
> better API will work for your case, that certainly would be preferred.

I am all for discussing any better API that can work for me.

> > The primary problem is that this API
> > doesn't allow to preserve the user's
> > mapping. It is only using that mapping
> > to specify the reloc address, while
> > dlmem() can optionally preserve it (I
> > use the separate flag for that).
> This is precisely one of the concerns with dlmem().
Why must the user's > mapping be preserved? So that the mirroring can be set up before the object > is loaded? Indeed. This behavior is optional. > Would replacing the dla_premap hook with some kind of custom-mmap() > (dla_mmap()) hook fit your use case better? That could allow you to set up > mirroring *as* the object is loaded, instead of before. With the only difference being to give the user 100 times more work? :) Instead of dealing with mmap flags and file copies, he has 1 small and simple call-back in my impl. > FWIW, do you need page-mirroring at all if you can just choose the reloc > address to be within the VM space? Yes because the VM see the pointers as if VM_window_start==0. So all pointers there will be incorrect and not passable to the 32bit world. Reloc address is planned to be within MAP_32BIT. > Presumably you won't intercept (all of) your own syscalls, primarily you're > aiming for the syscalls while the 3rd-party "ancient code" is loading. So > isn't it pretty much the same? This is where the 64bit library does the loading. The foreign code all runs under KVM, so I don't even need a seccomp filter for it. You propose me to intercept my own syscalls, and this is what no other project does. > > Most of dl_audit framework can be implemented > > with syscall interception, but why don't you > > want to do that? > Because (1) LD_AUDIT hearkens back to the days of Solaris and so is already > on literally every GNU/Linux box in active use, and because (2) symbol > binding (la_symbind) is done completely in userspace and can't be > intercepted by syscalls. > > Very different situation. Which is why I said "most", not "all". You actually can implement most/some parts of LD_AUDIT via a syscall trapping, leaving things like symbind or la_activity in glibc, but you don't want to do that. > > Mirrored and also reloc address specified. > > AFAICT fdlopen()+memfd gives neither. 
> And based on prior comments, I assume you also want to preserve user > mappings here. Only for the sake of mirroring. Its a more broad feature of course, but me - I only need it for mirroring. > Note that the mprotect() calls are only if(__glibc_unlikely((c->prot & > PROT_WRITE) == 0)). Well, and otherwise (when PROT_WRITE is set) I'd need the file copy. Which means I always need. > > Contrary to what you say, no one is > > intercepting his own syscalls. > I beg to disagree. Many projects filter or intercept their own syscalls. > This *specific* approach hasn't been done before (I would point you to it if > it was), but intercepting (or at least filtering) syscalls in-process is > nothing new. I think its only done when that process executes an alien code. And even that is likely wine-specific: I would be very surprised if any other alien code can execute a "syscall" instruction. For example the js code can't execute a syscall, so, as you already confirmed, chromium mostly does filtering to catch occasional bugs of its own. What I don't believe you can ever find, is some project intercepting the syscalls of its own, and "emulating" them as if its an alien code running. More generally, I don't think someone uses that technique to extend the functionality. They either implement that for security reasons (chromium), or for debugging reasons (gdb), or for an emulation of an alien code (wine). Extending the functionality on a syscall level looks like a gross hack, given that a very simple high-level API suits well. > > which will > > basically mean to just copy the initially > > memory-based solib into a file on hdd rather > > than to even properly use memfd. > Why is the HDD required here, can't you just copy to a memfd file? That's > what I suggested above. There are 2 "files" in that picture. One memfd comes from the solib in memory, and another memfd seems to come from your suggestion. 
So I won't be able to even use the solib's memfd properly, and will instead have to copy it to the file on hdd (or to the second memfd). > But it seems like your latest patches are shorter than I had remembered, I > stand corrected. IIRC at one point there was a 1300-addition patch, which is > where my comment came from, but that seems to have been cleaned up now. > Great! :D Thanks! Knowing that the patches are at least looked into, is a big relief. :) > I don't run the show here... but AFAIK the code here is carefully, heavily, > manually optimized to generate the best performance with a wide range of C > compilers. Carelessly refactoring it and especially adding additional > function calls will destroy a lot of that work. (Although I dislike the > spaghetti as much as you do. :P) Well, if not for the musl that demonstrated the possibility of writing a libc without any spaghetti code (or a small and structured, but completely obfuscated code as in uclibc), I would believe that argument. :)
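For reference, the memfd route debated in this comment can be sketched with existing APIs only. This assumes Linux with /proc mounted and glibc 2.27 or later; dlopen_from_buffer() is a made-up name, not fdlopen() or any function from the patches (older glibc may also need -ldl at link time):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy an in-memory shared object into a memfd and
   load it through its /proc/self/fd/N path.
   Returns a dlopen() handle, or NULL on any failure.  */
static void *dlopen_from_buffer(const void *buf, size_t len, int flags)
{
    assert(buf != NULL || len == 0);

    int fd = memfd_create("solib-image", MFD_CLOEXEC);
    if (fd < 0)
        return NULL;

    /* This copy is the "second file" the comment objects to.  */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) {            /* sketch: no EINTR retry */
            close(fd);
            return NULL;
        }
        p += n;
        left -= (size_t)n;
    }

    char path[64];
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    void *handle = dlopen(path, flags);

    close(fd);   /* on success ld.so's mappings keep the file alive */
    return handle;
}
```

Note what the sketch does not offer: ld.so still picks the load address, and the segments are still mapped MAP_PRIVATE, so neither the reloc address nor the mirroring requirement stated in this thread is covered by it.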
In case it wasn't visible, I apologize to Jonathon for a bad joke on the ML. Not the best day actually: I uninstalled firefox from the snap store (ubuntu), and it removed the entire profile, with all passwords, credentials, cookies, histories, everything. Which ended up in jokes like that one, sorry. I wish I could aim all the possible dark humor at the authors of snap instead...
I also need to put this demonstration here, because even Jonathon repeated the "elf parsing" argument:

$ LD_LIBRARY_PATH=..:. ./tst-dlmem-fdlopen
unaligned buf gives buffer not aligned: Invalid argument
7fb413101000-7fb413102000 r--p 00000000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413102000-7fb413103000 r-xp 00001000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413103000-7fb413104000 r--p 00002000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413104000-7fb413105000 r--p 00002000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413105000-7fb413106000 rw-p 00003000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so

As can be seen, dlmem() created 5 references to the solib when laying out segments. And no manual elf parsing was involved; this test-case was in the v9 patch, so anyone can see I am not cheating.

Jonathon, will you allow this false claim about some "elf parsing" to spread so widely that no one even wants to look at my patches any more? I think this is a bit unfair. I was willing to drop my patches if some _valid_ argument was raised...
Created attachment 14795 [details]
API description

Also, here's the API description, with "limitations" and everything that needs describing.
I am shocked to see that no one even believes me that it works, that it can lay out an elf by vaddr's and so on... Is it rocket science to write code that lays out elf segments? No, it's not! It works, it is documented, demonstrated, posted as a patch, and it passed the regression suite, and yet no one believes it? :(
(In reply to Stas Sergeev from comment #30)
> 7fb413103000-7fb413104000 r--p 00002000 00:28 17195405
> /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
> 7fb413104000-7fb413105000 r--p 00002000 00:28 17195405

We can also see here 2 mappings that both have a file offset of 0x2000. Of course their vaddr's are not equal to the file offset. What else can be done to demonstrate the obvious fact that the elf is properly laid out by vaddr's? Come on...
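The observation above (two mappings sharing file offset 0x2000 at different addresses) can be checked mechanically. A minimal sketch follows, assuming the usual /proc/<pid>/maps line layout "start-end perms offset dev inode path"; the struct and function names are made up for illustration:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Illustrative holder for the fields of one /proc/<pid>/maps line.  */
struct map_entry {
    uint64_t start, end, offset;
    char perms[5];               /* e.g. "r--p" plus terminating NUL */
};

/* Parse the leading "start-end perms offset" fields of a maps line.
   Returns 0 on success, -1 if the line does not match.  */
static int parse_maps_line(const char *line, struct map_entry *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64,
                   &out->start, &out->end, out->perms, &out->offset);
    return n == 4 ? 0 : -1;
}

/* Returns 1 if both lines parse, have the same file offset and start at
   different addresses; 0 otherwise.  */
static int same_offset_different_address(const char *l1, const char *l2)
{
    struct map_entry a, b;
    if (parse_maps_line(l1, &a) != 0 || parse_maps_line(l2, &b) != 0)
        return 0;
    return a.offset == b.offset && a.start != b.start;
}
```

Feeding it the two quoted lines should yield 1: both carry offset 00002000 while starting at 7fb413103000 and 7fb413104000 respectively.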
Created attachment 14799 [details]
API description

I am glad to finally present v10, which incorporates work on all the comments I got on v9, and that was a big number. Thanks to all who contributed!
I received a few mails saying that I ignore the comments and therefore my patches should not be looked into. I think this is a contradiction, because the only way to find out whether I ignore any comments is to look into the patches. But, to make that task easier, here's the changelog:

Changes in v10:
- addressed review comments of Adhemerval Zanella
- moved refactor patches to the beginning of the series to simplify review
- fixed a few bugs in the elf relocation machinery after various heated discussions
- added a new test tst-dlmem-extfns that demo-implements dlopen_with_offset4() and fdlopen()
- studied and documented all limitations, most importantly those leading to UB
- better documented the premap callback, as suggested by Szabolcs Nagy
- added the DLMEM_GENBUF_SRC flag for unaligned generic memory buffers

As can be seen, ALL comments were addressed. And at the end of the day it doesn't even matter whether that "elf parsing attack" was malicious or not. The main thing is that the problem is not there in v10, so who cares whether it ever existed before. :) It motivated me to study every corner case where my loader actually failed to lay out elf segments properly, and as a result there is a much better API description (attached here), a "Limitations" section and a new flag, DLMEM_GENBUF_SRC. These are all measures against any possible failure to lay out elf segments. So it can be firmly said that v10 has no such problem, and so the comment was properly addressed and resolved. Thanks!
The URL to the v10: https://sourceware.org/pipermail/libc-alpha/2023-April/146866.html
Created attachment 14827 [details]
demo diff

I am putting the new dlmem() demonstration here, because unfortunately the onslaught continues:
https://sourceware.org/pipermail/libc-alpha/2023-April/147254.html

Demo shows this:

$ cat tst-dlmem-extfns.out
before dlmem
7f5210ca8000-7f5210cad000 r--p 00000000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
after dlmem
7f5210ca3000-7f5210ca4000 r--p 00000000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca4000-7f5210ca5000 r-xp 00001000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca5000-7f5210ca6000 r--p 00002000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca6000-7f5210ca7000 r--p 00002000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca7000-7f5210ca8000 rw-p 00003000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca8000-7f5210cad000 r--p 00000000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
post fdlopen
7f5210ca3000-7f5210ca4000 r--p 00000000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca4000-7f5210ca5000 r-xp 00001000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca5000-7f5210ca6000 r--p 00002000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca6000-7f5210ca7000 r--p 00002000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca7000-7f5210ca8000 rw-p 00003000 00:29 18840304 /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so

When nothing can be changed, at least the truth must be made as visible as possible.
This demo clearly shows the elf loading process of dlmem(), and of course without any "elf parsing" on a user side.
Created attachment 14867 [details]
patches

The RTLD_NORELOCATE API is a proposal that adds fine-grained control over the solib dynamic-load process. It allows the user to load the solib at the particular address he needs, using the mapping type he needs. The basic idea is that after loading the solib with the RTLD_NORELOCATE flag, the user can move the unrelocated object before relocating it.

The API consists of the following elements:

`RTLD_NORELOCATE' - new dlopen() flag. It defers the relocation of an object, allowing the relocation to be performed later. Ctors are delayed, and are called immediately after the relocation is done. Relocation is performed upon the first dlsym() or dlrelocate() call with the obtained handle. This flag doesn't delay the load of the object's deps, but their relocation and ctors are delayed. This flag doesn't delay the LA_ACT_CONSISTENT audit event.

`int dlrelocate(void *handle)' - new function to perform the object relocation if the RTLD_NORELOCATE flag was used. The object itself and all of its dependencies are relocated. Returns EINVAL if already relocated. This function may be omitted even if RTLD_NORELOCATE was used, in which case the relocation will be performed upon the first dlsym() call with the obtained handle, but using dlrelocate() allows the caller to handle relocation errors and run ctors before using the object's handle. If the function returned success, then the ctors of the object and all of its deps were called by it. If it returned an error other than EINVAL (EINVAL means the object is already relocated), then a relocation error happened and the handle should be closed with dlclose().

`RTLD_DI_MAPINFO' - new dlinfo() request that fills in this structure:

typedef struct
{
  void *map_start;    /* Beginning of mapping containing address. */
  size_t map_length;  /* Length of mapping. */
  size_t map_align;   /* Alignment of mapping. */
  int relocated;      /* Indicates whether an object was relocated. */
} Dl_mapinfo;

The user has to check the `relocated' member, and if it is 0 then the object can be moved to the new location. The new location must be aligned according to the `map_align' member, which is usually equal to the page size. One way to move a solib image is to use mmap() to allocate a new memory mapping, then use memcpy() to copy the image, and finally use munmap() to unmap the memory at the old location. This request may fail if the handle used was not obtained from dlopen().

`int dlset_object_base(void *handle, void *addr)' - new function to set the new base address of an unrelocated object, after it was moved. Returns an error if the object is already relocated. The base address set by this function will be used when relocation is performed.

`RTLD_DI_DEPLIST' - new dlinfo() request that fills in this structure:

typedef struct
{
  void **deps;         /* Array of handles for the deps. */
  unsigned int ndeps;  /* Number of entries in the list. */
} Dl_deplist;

It is needed if the user also wants to move the dependencies of the loaded solib. In this case he needs to traverse the `deps' array, make an RTLD_DI_MAPINFO dlinfo() request for each handle in the array, find the object he needs by inspecting the filled-in Dl_mapinfo structure, make sure this object is not relocated yet, and move it, calling dlset_object_base() at the end.

Use-case.

Suppose you have a VM that runs 32bit code. Suppose you wrote a compatibility layer that allows compiling the old 32bit non-unix code under linux, into native 64bit shared libraries. But compiling is not enough, and some calls should still go to the VM. The VM's memory is available in a 4Gb window somewhere in the 64bit address space. In order for the code under the VM to handle calls from a 64bit solib, you need to make sure all pointers that may be passed as call arguments are within 32 bits.
Heap and stack are dealt with by a custom libc, but in order to use
pointers to .bss objects, we need to relocate the solib to a low 32bit
address. But that's not enough, because in order for that lib to be
visible to the code under the VM, it must also be mirrored into the VM
window at map_address = reloc_address + VM_window_start.

RTLD_NORELOCATE solves that problem by allowing the user to mmap the
shared memory into the low 32bit address space and move the object
there. He may want to do so for all of the library's deps as well (using
RTLD_DI_DEPLIST), or only for the ones he is interested in. Then he maps
the shared memory into the VM window and either calls dlrelocate() or
just starts using the solib, in which case it will be relocated on the
first symbol lookup.

Stas Sergeev (14):
  elf: switch _dl_map_segment() to anonymous mapping
  use initial mmap also for ET_EXEC
  rework maphole
  split do_reloc_1() from dl_open_worker_begin()
  split do_reloc_2() out of do_open_worker()
  move relocation into _dl_object_reloc() func
  split out _dl_finalize_segments()
  finalize elf segments on a relocation step
  implement RTLD_NORELOCATE flag
  add test-case for RTLD_NORELOCATE
  implement dlrelocate()
  implement RTLD_DI_MAPINFO
  implement dlset_object_base()
  implement RTLD_DI_DEPLIST

 bits/dlfcn.h                                  |   3 +
 dlfcn/Makefile                                |  11 +-
 dlfcn/Versions                                |   4 +
 dlfcn/ctorlib1.c                              |  39 ++
 dlfcn/dlfcn.h                                 |  34 +-
 dlfcn/dlinfo.c                                |  28 ++
 dlfcn/dlopen.c                                |   2 +-
 dlfcn/dlrelocate.c                            |  68 +++
 dlfcn/dlset_object_base.c                     | 124 ++++++
 dlfcn/tst-noreloc.c                           | 157 +++++++
 elf/dl-close.c                                |   3 +
 elf/dl-load.c                                 |  35 +-
 elf/dl-load.h                                 |   8 +-
 elf/dl-lookup.c                               |   6 +-
 elf/dl-main.h                                 |   2 +
 elf/dl-map-segments.h                         | 169 +++++---
 elf/dl-open.c                                 | 386 +++++++++++------
 elf/rtld.c                                    |   1 +
 include/dlfcn.h                               |  11 +
 include/link.h                                |   6 +
 sysdeps/generic/ldsodefs.h                    |   1 +
 sysdeps/mach/hurd/i386/libc.abilist           |   2 +
 sysdeps/unix/sysv/linux/aarch64/libc.abilist  |   2 +
 sysdeps/unix/sysv/linux/alpha/libc.abilist    |   2 +
 sysdeps/unix/sysv/linux/arc/libc.abilist      |   2 +
 sysdeps/unix/sysv/linux/arm/be/libc.abilist   |   2 +
 sysdeps/unix/sysv/linux/arm/le/libc.abilist   |   2 +
 sysdeps/unix/sysv/linux/csky/libc.abilist     |   2 +
 sysdeps/unix/sysv/linux/hppa/libc.abilist     |   2 +
 sysdeps/unix/sysv/linux/i386/libc.abilist     |   2 +
 sysdeps/unix/sysv/linux/ia64/libc.abilist     |   2 +
 .../sysv/linux/loongarch/lp64/libc.abilist    |   2 +
 .../sysv/linux/m68k/coldfire/libc.abilist     |   2 +
 .../unix/sysv/linux/m68k/m680x0/libc.abilist  |   2 +
 .../sysv/linux/microblaze/be/libc.abilist     |   2 +
 .../sysv/linux/microblaze/le/libc.abilist     |   2 +
 .../sysv/linux/mips/mips32/fpu/libc.abilist   |   2 +
 .../sysv/linux/mips/mips32/nofpu/libc.abilist |   2 +
 .../sysv/linux/mips/mips64/n32/libc.abilist   |   2 +
 .../sysv/linux/mips/mips64/n64/libc.abilist   |   2 +
 sysdeps/unix/sysv/linux/nios2/libc.abilist    |   2 +
 sysdeps/unix/sysv/linux/or1k/libc.abilist     |   2 +
 .../linux/powerpc/powerpc32/fpu/libc.abilist  |   2 +
 .../powerpc/powerpc32/nofpu/libc.abilist      |   2 +
 .../linux/powerpc/powerpc64/be/libc.abilist   |   2 +
 .../linux/powerpc/powerpc64/le/libc.abilist   |   2 +
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   2 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   2 +
 .../unix/sysv/linux/s390/s390-32/libc.abilist |   2 +
 .../unix/sysv/linux/s390/s390-64/libc.abilist |   2 +
 sysdeps/unix/sysv/linux/sh/be/libc.abilist    |   2 +
 sysdeps/unix/sysv/linux/sh/le/libc.abilist    |   2 +
 .../sysv/linux/sparc/sparc32/libc.abilist     |   2 +
 .../sysv/linux/sparc/sparc64/libc.abilist     |   2 +
 .../unix/sysv/linux/x86_64/64/libc.abilist    |   2 +
 .../unix/sysv/linux/x86_64/x32/libc.abilist   |   2 +
 56 files changed, 941 insertions(+), 227 deletions(-)
 create mode 100644 dlfcn/ctorlib1.c
 create mode 100644 dlfcn/dlrelocate.c
 create mode 100644 dlfcn/dlset_object_base.c
 create mode 100644 dlfcn/tst-noreloc.c

--
2.39.2
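
For illustration, the mmap()/memcpy()/munmap() move mentioned in the
RTLD_DI_MAPINFO description could look roughly like the sketch below.
It is only a sketch under assumptions: move_image() is a made-up helper
name, it uses a private anonymous mapping at a kernel-chosen address
(the use-case above would instead map shared memory at a low 32bit
address), and it assumes the whole old range is readable. The proposed
calls appear only in a comment, because they exist in the attached
patches and not in any released glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy LEN bytes from OLD_BASE into a new anonymous
   mapping whose start is aligned to ALIGN (a power of two; values below
   the page size are raised to the page size), then unmap the old range.
   Returns the new base, or NULL on failure with the old mapping intact.  */
static void *
move_image (void *old_base, size_t len, size_t align)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  if (align < page)
    align = page;
  /* Reject empty images, non-power-of-two alignments and overflow.  */
  if (len == 0 || (align & (align - 1)) != 0
      || len > SIZE_MAX - align - page)
    return NULL;

  size_t map_len = (len + page - 1) & ~(page - 1);
  size_t span = map_len + align;   /* Over-allocate to find an aligned start.  */
  char *raw = mmap (NULL, span, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED)
    return NULL;

  uintptr_t start = ((uintptr_t) raw + align - 1) & ~((uintptr_t) align - 1);
  char *new_base = (char *) start;
  size_t head = (size_t) (new_base - raw);
  size_t tail = span - head - map_len;
  assert (head % page == 0 && tail % page == 0);

  /* Give back the unused pages before and after the aligned window.  */
  if (head != 0)
    munmap (raw, head);
  if (tail != 0)
    munmap (new_base + map_len, tail);

  memcpy (new_base, old_base, len);
  munmap (old_base, map_len);
  return new_base;
}

/* Intended call sequence with the proposed API (names taken from the
   description above; not available in released glibc):

     void *h = dlopen (path, RTLD_NOW | RTLD_NORELOCATE);
     Dl_mapinfo mi;
     dlinfo (h, RTLD_DI_MAPINFO, &mi);
     if (!mi.relocated)
       {
         void *nb = move_image (mi.map_start, mi.map_length, mi.map_align);
         if (nb != NULL)
           dlset_object_base (h, nb);
       }
     dlrelocate (h);   // or let the first dlsym() trigger it
*/
```

Over-allocating by `align' and trimming the head and tail is just one
simple way to get an aligned mapping; a caller that already owns a
suitably aligned MAP_SHARED buffer would copy into that buffer instead.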