30007 – rfe: dlopen to specified address

Status:	UNCONFIRMED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	dynamic-link
Version:	unspecified

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-01-16 14:13 UTC by Stas Sergeev
Modified:	2023-05-08 14:51 UTC
CC List:	1 user

See Also:
Host:
Target:
Build:
Last reconfirmed:

Attachments
patches (27.81 KB, application/gzip) 2023-03-15 10:03 UTC, Stas Sergeev
API description (2.96 KB, patch) 2023-03-31 15:53 UTC, Stas Sergeev
API description (3.36 KB, patch) 2023-04-03 09:28 UTC, Stas Sergeev
demo diff (757 bytes, patch) 2023-04-14 19:09 UTC, Stas Sergeev
patches (26.41 KB, application/gzip) 2023-05-08 14:51 UTC, Stas Sergeev

Description Stas Sergeev 2023-01-16 14:13:37 UTC

I need to dlopen a solib into a
previously mmapped buffer.
Currently the load address
is chosen in _dl_map_segments().
The ELF preferred address is taken,
which is usually 0, so any address
is used.

I can think of 2 possible solutions.
One would be to add a new function for
LD_AUDIT which passes the needed
length to the user and expects the
address of a buffer in return.
This would allow the user to mmap
a MAP_SHARED buffer if they want,
but the downside is that ld.so
would then need to use read() instead
of mmap() to avoid trashing the user's
shared mapping. This would likely
also take some effort to implement.

Another solution is trivial: just
add a new function dlopen3(file, flags, addr)
that provides the base address for
dlopen. This will not allow using
a pre-allocated buffer (the user doesn't
know the needed buffer size at that
point) but it's trivial to code up and
will likely also solve my problem.

It was also already requested here
by someone else:
https://stackoverflow.com/questions/62064806/is-there-a-way-to-specify-the-base-address-of-a-shared-library-using-dlopen

What do people think about such an
extension?

Comment 1 Adhemerval Zanella 2023-01-17 14:17:13 UTC

Any GNU extension requires a specific use case that can't be easily accomplished with the current API. What problem are you trying to solve that requires mapping a shared library to a specific pre-allocated address?

Comment 2 Stas Sergeev 2023-01-17 14:35:58 UTC

(In reply to Adhemerval Zanella from comment #1)
> Any GNU extension requires a specific use case that can't be easily
> accomplished with the current API. What problem are you trying to solve that
> requires mapping a shared library to a specific pre-allocated address?

It needs to interact with legacy
32-bit code that is running in a VM.
The memory of the VM is mapped in
the 64-bit space under a particular
address. I need to be able to load
a solib within a 4 GB range from the
aforementioned address, in which
case the 32-bit code will be able
to create pointers to that
solib's objects.
Another way of solving that, is to
put the solib into a MAP_SHARED buffer.
In this case I will be able to create
the "mirror" of that solib under
any address I need, so the 32bit
pointers will likely work in that
case too (I will not execute functions
via pointers to that window). For
that, I'd probably need the following
API:
void *dladdr(void *handle, int *buffer_size_out);
So the ability to get the address
and length is probably already enough
for my needs, as that will allow me
to do the MAP_SHARED trick. And that
can probably be made an LD_AUDIT extension,
instead of a new global function.
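
For reference, the "address and length" query can be approximated today with dl_iterate_phdr(), which glibc already exports. The following is only a minimal sketch under that assumption, not the proposed API: the names obj_span, span_cb and object_span are hypothetical, and the reported span is the raw extent of the PT_LOAD segments, not rounded to page boundaries. Note also that glibc already has a dladdr() with a different signature (int dladdr(const void *, Dl_info *)), so the prototype above would need another name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lookup state: which object to find, and the span that was found. */
struct obj_span {
    const char *needle;  /* substring matched against the object name */
    uintptr_t start;     /* lowest address of any PT_LOAD segment */
    uintptr_t end;       /* one past the highest PT_LOAD address */
    int found;
};

/* dl_iterate_phdr() callback: compute the PT_LOAD extent of the first
   object whose name contains the needle. */
static int span_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    struct obj_span *s = data;
    (void)size;
    if (info->dlpi_name == NULL || strstr(info->dlpi_name, s->needle) == NULL)
        return 0;                       /* not our object, keep iterating */

    uintptr_t lo = UINTPTR_MAX, hi = 0;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD)
            continue;
        uintptr_t a = info->dlpi_addr + ph->p_vaddr;
        if (a < lo)
            lo = a;
        if (a + ph->p_memsz > hi)
            hi = a + ph->p_memsz;
    }
    if (lo >= hi)
        return 0;                       /* no loadable segments, keep going */
    s->start = lo;
    s->end = hi;
    s->found = 1;
    return 1;                           /* non-zero stops the iteration */
}

/* Returns 0 and fills *start / *len on success, -1 if nothing matched. */
int object_span(const char *needle, uintptr_t *start, size_t *len)
{
    assert(needle != NULL && start != NULL && len != NULL);
    struct obj_span s = { needle, 0, 0, 0 };
    dl_iterate_phdr(span_cb, &s);
    if (!s.found)
        return -1;
    *start = s.start;
    *len = (size_t)(s.end - s.start);
    return 0;
}
```

Calling object_span("libfoo.so", &start, &len) after dlopen() would report where a (hypothetical) libfoo.so landed; an empty needle matches the first object in the list, which is the main program.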

Of course I still need to test either
way to make sure it really works. Which
may mean that eventually I'll have to
implement that extension myself. So for
now this is just a query to find out
if it is acceptable, and if so - in what
form.

Comment 3 Adhemerval Zanella 2023-01-19 13:23:38 UTC

This seems a very specific use case, and I am not sure it fits the usual case of dlopen usage.  Mapping to a preferred address basically defeats ASLR and has other issues: what should this dlopen extension do if the required mapping is not large enough, if the address is already occupied, or if the mapping does not have the correct permissions?

If the passed mapping does not suffice and dlopen falls back to normal mapping, the result might not be directly related to the address that was passed, so the user will need to call dladdr to check whether the mapping used the passed buffer or the fallback mechanism.  So it would be a cumbersome interface.

We used to have hacks to force mmap to load executables to 32-bits, it was added to overcome some particular architecture limitations, and it has caused some issues and it was eventually removed (check ea5814467a02c9d2d7608b6445c5d60e2a81d3ee).

So I am not very fond of this extension.

Comment 4 Stas Sergeev 2023-01-19 13:46:43 UTC

(In reply to Adhemerval Zanella from comment #3)
> This seems a very specific use case, and I am not sure it fits the usual
> case of dlopen usage.  Mapping to a preferred address basically defeats ASLR
> and has other issues: what should this dlopen extension do if the required
> mapping is not large enough, if the address is already occupied, or if the
> mapping does not have the correct permissions?

My current plan (which may of course
change) is to have an LD_AUDIT function
that will tell the length and get the
address back. The user has to make sure
there is no mapping at that address
for the specified length (I am looking in
/proc/self/maps to find the needed hole).
If it eventually turns out the hole is not
large enough, dlopen() should just fail.
That basically addresses the aforementioned
concern of yours.
After dlopen() succeeded, knowing the
length I will mmap(MAP_SHARED) another
buffer, memcpy() the solib there and
mmap() it back to its addr, but in a
shared buffer. So basically all the dirty
work is on my side, dlopen() should only
tell me the length and get the address back.
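
The copy-and-remap step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from any patch: it uses memfd_create() as the shared backing (a POSIX shm object would work the same way), the function name share_and_mirror is hypothetical, and it maps everything read/write, whereas a real solib would need the per-segment protections restored. It is also not safe against other threads touching the region during the swap.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Replace the private mapping at [addr, addr + len) with a MAP_SHARED
   mapping of a fresh memory-backed file holding the same bytes, and
   return a second mapping ("mirror") of that file, or MAP_FAILED.
   len must be a multiple of the page size and addr must be readable. */
void *share_and_mirror(void *addr, size_t len)
{
    assert(len > 0 && len % (size_t)sysconf(_SC_PAGESIZE) == 0);

    int fd = memfd_create("solib-mirror", MFD_CLOEXEC); /* name is arbitrary */
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return MAP_FAILED;
    }

    /* Stage the current contents in the shared file. */
    void *stage = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (stage == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }
    memcpy(stage, addr, len);

    /* Map the file back over the original address. MAP_FIXED atomically
       replaces the old private pages with the shared ones. */
    if (mmap(addr, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
        munmap(stage, len);
        close(fd);
        return MAP_FAILED;
    }

    close(fd);      /* both mappings keep the file alive */
    return stage;   /* the staging mapping doubles as the mirror */
}
```

After a successful call, a store through either address is visible through the other, which is the "two identical mappings" picture. To place the mirror at a chosen address, the staging mmap() could take that address with MAP_FIXED instead of NULL.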

> In the case of passing mapping not sufficing

Passing a mapping was just an option to
consider. I already realized it would
be a bad extension, so currently my plan
is to pass only the address in response to
the length, and then, knowing the length,
create that "other mapping" by hand,
then memcpy(), then remap back. More work
for me, but a much smaller extension to glibc.


> We used to have hacks to force mmap to load executables to 32-bits, it was
> added to overcome some particular architecture limitations, and it has
> caused some issues and it was eventually removed (check
> ea5814467a02c9d2d7608b6445c5d60e2a81d3ee).

Wow, just recently removed!
Quite sad I haven't had a chance to try
it out... But I think we can come up with
something much more flexible. Which is why
I created this ticket before actually
prototyping the thing myself. E.g.
LD_PREFER_MAP_32BIT_EXEC wouldn't give me
the length, so I'd have difficulties wrapping
that into a shared mapping.
I really aim for something very small and
simple on the glibc side, with more work on
the user's side. But if glibc isn't helpful,
I'll need to implement/use an alternative
dynamic linker, which would be quite bad.

Comment 5 Stas Sergeev 2023-02-15 16:58:55 UTC

Posted a patch here:
https://sourceware.org/pipermail/libc-alpha/2023-February/145640.html

Comment 6 Stas Sergeev 2023-03-14 03:20:04 UTC

(In reply to Adhemerval Zanella from comment #3)
> We used to have hacks to force mmap to load executables to 32-bits, it was
> added to overcome some particular architecture limitations, and it has
> caused some issues and it was eventually removed (check
> ea5814467a02c9d2d7608b6445c5d60e2a81d3ee).

Mm, it was indeed removed in the commit you
mention, but re-introduced in 317f1c0a8a7

Comment 7 Adhemerval Zanella 2023-03-14 13:21:27 UTC

(In reply to Stas Sergeev from comment #6)
> (In reply to Adhemerval Zanella from comment #3)
> > We used to have hacks to force mmap to load executables to 32-bits, it was
> > added to overcome some particular architecture limitations, and it has
> > caused some issues and it was eventually removed (check
> > ea5814467a02c9d2d7608b6445c5d60e2a81d3ee).
> 
> Mm, it was indeed removed in the commit you
> mention, but re-introduced in 317f1c0a8a7

To work around an architecture limitation, a better (and more complex) solution would be to fix it in the kernel, as suggested during patch review.

Comment 8 Jonathon Anderson 2023-03-15 05:34:59 UTC

Hopping over here from a long and winding discussion in https://sourceware.org/bugzilla/show_bug.cgi?id=30127.

(In reply to Stas Sergeev from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c46)
> So let me summarize that memfd_create()
> (shm_open() actually) is not a replacement,
> but rather is an essential part of the
> scheme. Using it together with la_premap_dlmem()
> and la_premap() you can get the desired
> picture. Desired picture is 2 identical
> mappings of the same lib, one at reloc_addr,
> one at mmap_addr=reloc_addr+VM_window_start.
> 
> There is basically nothing else!
> That scheme is very simple to describe,
> but not that simple to grok from that
> description, as no one has tried that
> layout yet.
I think the above, together with this, is a succinct description of Stas's intended use case: having double mappings for solibs allows sharing data between the host and a VM with only address translation at the VM boundary, instead of address translation on every memory access inside the VM. Solutions exist for heap memory and stack memory, leaving primarily the .data/.bss memory allocated as part of an solib. (Correct me if I'm mistaken of course.)

The proposed la_premap and la_premap_dlmem (part of the dlmem() patch) collectively "solve" this problem by granting LD_AUDIT some limited control over the object (segment) mapping process. My first impression from reading the test cases is that they seem a bit too specific to this use case. IMHO they are also out-of-scope for LD_AUDIT: LD_AUDIT works at the level of symbols and objects, both generic across OSs and even binary formats (ELF + DLL), whereas la_premap* expose an implementation detail of the dynamic linker. Most importantly, we do not yet deeply understand the implications exposing these callbacks can have, security or otherwise.

An alternative solution I brought up in the prior discussion is "wrapping" the mmap syscall. In general, any Linux syscall can be wrapped using seccomp (e.g. via libseccomp [1]) or more recently with syscall user dispatch [2]. With the wrapper in place, every mmap would be replicated in the VM memory window and update a table used for address translation. Some behavior changes would be needed to appropriately implement MAP_ANONYMOUS and MAP_FIXED, but neither seem particularly problematic.

AFAIK, this "wrapping mmap" approach is vastly more powerful and effective than the proposed la_premap{,_dlmem}. It operates at the Linux kernel level, and requires no changes to Glibc to implement nor a bleeding-edge kernel. It is powerful enough to transparently handle heap memory (provided the targeted allocation arena is brand new, i.e. in a newly opened dlmopen namespace). Wrapping and reimplementing syscalls are well-understood and widely used techniques by VM-adjacent tools, e.g. Wine (Windows syscall emulator) [3] and Docker/Podman (container runtimes) [4].

If this well-understood approach solves the problem, IMHO there isn't much point in arguing this RFE further.

[1]: https://libseccomp.readthedocs.io/en/latest/
[2]: https://docs.kernel.org/admin-guide/syscall-user-dispatch.html
[3]: https://lwn.net/Articles/826313/
[4]: https://docs.docker.com/engine/security/seccomp/ 

In response to a few other bits of prior discussion about mapping objects:

(In reply to Stas Sergeev from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c45)
> > > by doing 2 mappings of the same lib. 
> > ...If all you wanted was to mmap the solib to another address, you can
> > already do that using mmap and /proc/self/map_files/. Maybe dl_iterate_phdr.
> 
> That can only work for loadable sections,
> I believe. .bss cannot be shared that way,
> and likely much more.
You're right, I neglected .bss when suggesting this idea. This would not be an issue when using an mmap wrapper however, as the region is simply mapped with MAP_ANONYMOUS.

(In reply to Stas Sergeev from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c48)
> (In reply to Jonathon Anderson from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c33)
> > The result of the first call to mmap() for an solib decides the base address
> 
> While a bit outdated topic, I don't
> think "the first call to mmap()" is a
> good or reliable work-around. It may
> change with an impl, or because of the
> threads.
To clarify here, the "first" call to mmap() is the one without MAP_FIXED, and is used to allocate the pages that will later be overwritten by MAP_FIXED. Threads should not become a problem here, just check the flags.

Also, this is the pattern heavily recommended in man mmap(2) (NOTES, "Using MAP_FIXED safely"). IMHO it's unlikely that part of the implementation will change drastically, and I'm confident an mmap syscall wrapper could still handle it even if it did. :D
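
The reserve-then-overlay pattern referred to here looks roughly like the sketch below. It is an illustration only: the helper names reserve_span and place_segment are made up, and the overlay uses anonymous memory so the example stays self-contained, whereas a dynamic loader would pass the solib's file descriptor and segment offset in the second call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* First call: no MAP_FIXED, so the kernel picks a free range. PROT_NONE
   reserves the address space without making it accessible. */
void *reserve_span(size_t total)
{
    return mmap(NULL, total, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Later calls: MAP_FIXED, but only ever inside the reservation made by
   reserve_span(), so no unrelated mapping can be clobbered.
   off and len are expected to be page-aligned. */
void *place_segment(void *base, size_t total, size_t off, size_t len, int prot)
{
    assert(off % (size_t)sysconf(_SC_PAGESIZE) == 0);
    assert(off <= total && len <= total - off);
    return mmap((char *)base + off, len, prot,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```

A wrapper that watches mmap() calls can use the same distinction: the call without MAP_FIXED fixes the base address, and every MAP_FIXED call that lands inside that range belongs to the same object.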

> > AFAICT these discussions are all solved by memfd_create. Almost all of the
> > complaints revolve around the memory vs. disk performance difference,
> 
> I am getting a bit nervous already when people
> mention memfd_create(). :) In what way is it
> any better than shm_open(), that I used in my
> la_premap_dlmem() example?
> Yes, I could also use memfd_create() with
> la_premap_dlmem(), but I prefer shm_open().
> Why people think that memfd_create() is the
> thing, is unclear to me. :) But it fits my
> design very well, as does shm_open().
My understanding is that the "file" created by memfd_create() cannot be shared outside the process and its spawned children, whereas the "file" created by shm_open() can be accessed by any other process with the same path argument. memfd_create() seems to be the more appropriate function when a *private* memory-backed file descriptor is needed; shm_open() is better suited for shared memory across processes (hence the name).

Comment 9 Stas Sergeev 2023-03-15 06:33:25 UTC

(In reply to Jonathon Anderson from comment #8)
> I think above and this is a succinct description of Stas's intended use
> case: having double mappings for solibs allows sharing data between the host
> and a VM with only address translation at the VM boundary, instead of
> address translation on every memory access inside the VM. Solutions exist
> for heap memory and stack memory, leaving primarily the .data/.bss memory
> allocated as part of an solib. (Correct me if I'm mistaken of course.)

That's correct.


> The proposed la_premap and la_premap_dlmem (part of the dlmem() patch)
> collectively "solve" this problem by granting LD_AUDIT some limited control
> over the object (segment) mapping process. My first impression from reading
> the test cases, they seem a bit too specific to this use case. IMHO they are
> also out-of-scope for LD_AUDIT: LD_AUDIT works at the level of symbols and
> objects, both generic across OSs and even binary formats (ELF + DLL),
> whereas la_premap* expose an implementation detail of the dynamic linker.

Which implementation detail exactly?
It's just "here's the length I need to
map for the solib. If you want, give me a
buffer and/or fd for it".
To me it's quite similar to "here's the
name of the solib. If you want, give
me a different one to use".


> Most importantly, we do not yet deeply understand the implications exposing
> these callbacks can have, security or otherwise.

Any explanation why there can be any
security concerns here?


> An alternative solution I brought up in the prior discussion is "wrapping"
> the mmap syscall. In general, any Linux syscall can be wrapped using seccomp
> (e.g. via libseccomp [1]) or more recently with syscall user dispatch [2].
> With the wrapper in place, every mmap would be replicated in the VM memory
> window and update a table used for address translation. Some behavior
> changes would be needed to appropriately implement MAP_ANONYMOUS and
> MAP_FIXED, but neither seem particularly problematic.

I don't understand what you mean.
Besides the fact that you want to describe
something very specific to a particular libc
(intercepting the particular mmap call, knowing
how the particular dynamic loader works),
you haven't written up a detailed scheme of
what you propose.
You were referring (in another thread) to
trapping only the first mmap() call done by
the dynamic loader, IIRC. How that can lead to
a solution with 2 identical mappings
is essentially unclear. At best it can solve
the problem of specifying the reloc address,
at the cost of depending on a particular impl
of a particular libc, forgetting about any
portability to other libcs.
So please detail your proposal.


> If this well-understood approach solves the problem, IMHO there isn't much
> point in arguing this RFE further.

It doesn't solve anything (except probably
the reloc address), and statements like
this, together with the statement that my
patch breaks your use-case or raises
security concerns, only suggest that you
want to down-play any contributions that you
review.
In fact, since you never said a single
word about how any of the multiple proposals
(including DT_AUDIT for dlopen()) can be
improved, I am quite sure it's the case.
I hope we can get more constructive.


> (In reply to Stas Sergeev from comment
> You're right, neglected .bss when suggesting this idea. This would not be an
> issue when using an mmap wrapper however, as the region is simply mapped
> with MAP_ANONYMOUS.

I don't understand how this would not be
an issue, please clarify. Region mapped
with MAP_ANONYMOUS cannot be shared with VM.


> (In reply to Stas Sergeev from comment
> To clarify here, the "first" call to mmap() is the one without MAP_FIXED,
> and is used to allocate the pages that will later be overwritten by
> MAP_FIXED. Threads should not become a problem here, just check the flags.

Why can't any other thread do an unrelated
mmap() without MAP_FIXED?

Comment 10 Stas Sergeev 2023-03-15 10:03:16 UTC

Created attachment 14753 [details]
patches

Here is the next impl of dlmem(), this
time split into many small patches. I just
need to update the log entries and post
them to ml.
I attach them here for the sake of this
discussion. You need to look into patches
2, 3 and 10. 2 and 3 are trivial, and in
patch 10 you need to look only at what I
do with _dl_map_segments(). In particular,
how I add "premap" there, which is actually
la_premap_dlmem().
This is the very minimal set of changes
you need to see to understand how shm_open()
steps into the game. As I believe this part
is still not well understood.
Don't worry, these changes are really small
this time! Please take a look. It will take
5 minutes.

Comment 11 Stas Sergeev 2023-03-15 11:05:31 UTC

(In reply to Stas Sergeev from comment #9)
> What exactly implementation detail?
> Its just "here's the length I need to
> map for solib. if you want, give me a
> buffer

I meant to say "address" of course, not
"buffer". Buffer approach was already
criticized by Adhemerval, so why have I
mentioned it again, is unclear. :)

Comment 12 Stas Sergeev 2023-03-15 11:41:26 UTC

(In reply to Jonathon Anderson from comment #8)
> The proposed la_premap and la_premap_dlmem (part of the dlmem() patch)
> collectively "solve" this problem by granting LD_AUDIT some limited control
> over the object (segment) mapping process. My first impression from reading
> the test cases, they seem a bit too specific to this use case. IMHO they are
> also out-of-scope for LD_AUDIT:

OK, if this is the case (which is
entirely possible, even if I don't
agree with the provided reasoning),
then let's just not use audit. :)
I can just add the "premap_ops"
optional pointer to dlmem().

Advantages: much, much fewer changes.
No need for dlload_audit_module() for
that use-case, but I'll keep discussing
it as a "bonus", in case someone finds
it interesting to load audit modules at
run-time.
Everything is confined within dlmem().
Perfect for my use-case.
Plain perfect.

Disadvantages:
Well, an extra arg that most people will
set to NULL. Very small disadvantage,
given that this API is not standard anyway.
And it will not be possible to specify the
reloc address for pre-existing functions
like dlopen()/dlmopen(), or prospective ones
like BSD's fdlopen(). I find that a bit of a pity
given that people requested that functionality
for currently existing APIs, but not a
problem for my use-case. I only need to
control dlmem(), not to help others on
stackoverflow. :)

So sounds better?

Comment 13 Stas Sergeev 2023-03-15 12:07:01 UTC

OK, in fact that approach is so much
better that supporting pre-existing
APIs is irrelevant here. Trying to
fulfill someone's request on stackoverflow
was a huge mis-goal.
So... thanks for pointing out that I was
heading in the wrong direction.
Will implement a small and simple
dlmem() with an extra ops arg, and w/o
any audit machinery.

Comment 14 Stas Sergeev 2023-03-17 06:44:55 UTC

Hi, posted a comment here:
https://sourceware.org/bugzilla/show_bug.cgi?id=30100
We can continue the dlmem() discussion there,
as for the moment it is no longer relevant to
having a way of specifying the reloc address
for dlopen(). That was what the audit callbacks
were needed for, but they are gone.
A custom dlopen() can be trivially implemented on
top of dlmem(), so the intention to control
dlopen() directly was ill-fated from the beginning. :(

Comment 15 Jonathon Anderson 2023-03-18 23:28:56 UTC

First dealing with a few meta off-topics:

(In reply to Stas Sergeev from comment #9)
> > If this well-understood approach solves the problem, IMHO there isn't much
> > point in arguing this RFE further.
> 
> It doesn't solve anything (except probably
> the reloc address), and the statements like
> this, together with the statement that my
> patch breaks your use-case or raises a
> security concerns, only suggests that you
> want to down-play any contributions that you
> review.
> In fact, since you never ever said a single
> word about how any of the multiple proposals
> (including DT_AUDIT for dlopen()) can be
> improved, I am quite sure its the case.
> I hope we can get more constructive.
As Adhemerval has already mentioned from the very start of this RFE (comment #1):
> Any GNU extension requires a specific use case that can't be easily accomplished with the current API.
Thus, the first priority for this RFE should be to establish this use case and express the failings of the current technology. A proposed patch series is difficult to review and near-impossible to merge without reaching some kind of consensus on these two points. Until this occurs, all but the most preliminary work on a patch is a waste of many people's time and patience, including yours as the author.

I also have limited time to investigate the details in my responses. In an effort to remain useful (and succinct), I prioritize any discussion that will lead closer to this first priority. This of course means I often cannot discuss your contribution at length or make any suggestions; there is simply too much otherwise to discuss with higher priorities, and it takes me multiple hours of my spare time to collect that together in a cohesive reply. I hope you can understand. :)

Coming back on topic, comment #8 establishes the succinct and sensible use case for this RFE. This is half the requirement, what remains now is to express the failings of the current technology for this use case. Comment #8 also describes the high-level view of an alternative solution available with current technology available in GNU/Linux (Glibc on Linux). The next step then is to discover where this solution fails for your specific use case.

It would be very constructive if you could investigate my proposed solution as detailed below, and precisely express what the insurmountable problems with it are. :D

(In reply to Stas Sergeev from comment #13)
> OK in fact that approach is so much
> better, that supporting pre-existing
> APIs is irrelevant here. Trying to
> fulfill someone's request on stackoverflow
> was a huge mis-goal.
> So... thanks for pointing that I was
> heading the wrong direction.
> Will implement a small and simple
> dlmem() with an extra ops arg, and w/o
> any audit machinery.
Although a bit late now, I would advise against pursuing dlmem() unless the extra no-file-descriptor functionality is absolutely required for your use case. There are many open questions about the API, and it is clear dlmem() will have a far larger impact than la_premap* ever would.

If you need the no-file-descriptor functionality and do want to continue dlmem(), I would recommend first developing a solid argument to assuage the initial concerns raised by Carlos almost a month ago now (https://sourceware.org/pipermail/libc-alpha/2023-February/145735.html). Namely, establishing the use case in clear terms and expressing why the alternative technology of "dlopenfd()" + memfd_create() fails to meet the use case.


Coming back to the topic of the hour:

(In reply to Stas Sergeev from comment #9)
> > The proposed la_premap and la_premap_dlmem (part of the dlmem() patch)
> > collectively "solve" this problem by granting LD_AUDIT some limited control
> > over the object (segment) mapping process. My first impression from reading
> > the test cases, they seem a bit too specific to this use case. IMHO they are
> > also out-of-scope for LD_AUDIT: LD_AUDIT works at the level of symbols and
> > objects, both generic across OSs and even binary formats (ELF + DLL),
> > whereas la_premap* expose an implementation detail of the dynamic linker.
> 
> What exactly implementation detail?
> Its just "here's the length I need to
> map for solib. if you want, give me a
> buffer and/or fd for it".
> To me its quite similar to "here's the
> name of the solib. if you want, give
> me a different one to use".
Primarily, I am unclear what mmap flags an la_premap callback should use, or in what order it should mmap to keep the page table consistent with multiple threads (like _dl_map_segments). These are far deeper implementation details than simply "here's the file path to use when looking up an solib," and will depend heavily on the dynamic linker and OS it is compiled for. I do not believe it makes sense to expose details this deep via LD_AUDIT.

On the API side, file descriptors are a concept specific to POSIX, and the ELF standard (technically) does not require that the objects be mmap()'d. While I do not believe there will be significant problems, it doesn't hurt to be kinder to our non-Glibc friends, on the off-chance LD_AUDIT becomes significantly more popular than it is today. :D

> > Most importantly, we do not yet deeply understand the implications exposing
> > these callbacks can have, security or otherwise.
> 
> Any explanation why there can be any
> security concerns here?
These callbacks allow ASLR to be disabled completely in userspace. If a poorly implemented auditor causes dynamic library loading to become extremely predictable, an attacker may find a way to steal cryptographic secrets stored in the bss segment. High-security container runtimes can't easily protect against this ASLR-loss since the kernel is not involved.

Is there a *real* security risk here? No idea, I have no clue if disabling ASLR in non-setuid applications is really a problem. LD_PREFER_MAP_32BIT_EXEC exists after all. But I can say there will be implications we do not yet (in this discussion) completely understand.

> > An alternative solution I brought up in the prior discussion is "wrapping"
> > the mmap syscall. In general, any Linux syscall can be wrapped using seccomp
> > (e.g. via libseccomp [1]) or more recently with syscall user dispatch [2].
> > With the wrapper in place, every mmap would be replicated in the VM memory
> > window and update a table used for address translation. Some behavior
> > changes would be needed to appropriately implement MAP_ANONYMOUS and
> > MAP_FIXED, but neither seem particularly problematic.
> 
> I don't understand what you mean.
> Besides the fact that you want to describe
> something very specific to particular libc
> (intercepting the particular mmap call, knowing
> how the particular dynamic loader works),
> you haven't written the detailed scheme of
> what you propose.
> ...
> So please detail your proposal.
Alright, here goes.

There are few syscalls on Linux that alter the page table for a process (you can get a rough list by grepping the x86_64 syscall table in strace [1] for "TM"). On x86_64, there are three common ones that add *new* pages to a process: mmap(), mremap() and brk(). brk() and mremap() are most often used through malloc() and realloc(), so your custom libc shim should catch them even if you don't wrap them as syscalls. mmap() is far more common, both in ld.so and in Glibc in general, so that's the main target here.

The general idea of the approach is to wrap mmap() and "mirror" the pages it allocates "outside" the VM to pages "inside" the VM. In most cases (~(MAP_ANONYMOUS|MAP_FIXED)), this should boil down to approximately:
  1. mmap() the "outside" pages,
  2. allocate some pages "inside" the VM to serve as the mirror pages,
  3. mmap() the "inside" pages with the same fd, size and offset (+ MAP_FIXED with addr as the "inside" target address), and
  4. update the address translation table to map "outside" to "inside" pages.
  
MAP_ANONYMOUS doesn't provide an fd to "mirror" the pages through, so the wrapper will need to provide one. This can easily be a private memory-backed file (e.g. memfd_create). Allocate some pages from this file before (1), and use that for the fd and offset in the remaining steps.

MAP_FIXED specifies an addr, so instead of allocating pages in (2) the wrapper will need to translate the provided addr from the "outside" to "inside" memory space. Usually the pages affected by an mmap(MAP_FIXED) will have been previously allocated via an mmap(~MAP_FIXED) (recommended practice from man mmap and implemented in _dl_map_segments), so this translation should always succeed (the wrapper could also abort the application if this precondition isn't met).

That's the basic approach. This approach wraps mmap() while conforming to the Linux API, so it works for any segments that are mmap()'d. In GNU/Linux, solibs and .bss are included in that set.
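
To make steps 1-4 concrete for the MAP_ANONYMOUS case, here is a minimal sketch of what the body of such a wrapper might do once the call has been intercepted. The interception itself (seccomp or syscall user dispatch) is left out, and everything named here is hypothetical: mirror_entry, mirror_table, mirrored_anon_mmap and outside_to_inside are illustration-only names, the table is a fixed-size array with no locking, and the caller is assumed to have chosen a free or previously reserved "inside" address. Note that the pages become MAP_SHARED, so private copy-on-write semantics (for example across fork()) are not preserved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* One row of the address translation table (step 4). */
struct mirror_entry { uintptr_t outside, inside; size_t len; };

#define MAX_MIRRORS 64
static struct mirror_entry mirror_table[MAX_MIRRORS];
static size_t mirror_count;

/* Serve an anonymous mmap request of len bytes and mirror it at
   inside_addr. Returns the "outside" address or MAP_FAILED. */
void *mirrored_anon_mmap(size_t len, int prot, void *inside_addr)
{
    assert(len > 0 && inside_addr != NULL);
    if (mirror_count == MAX_MIRRORS)
        return MAP_FAILED;

    /* MAP_ANONYMOUS has no fd, so supply a private memory-backed file. */
    int fd = memfd_create("mirror", MFD_CLOEXEC);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return MAP_FAILED;
    }

    /* Step 1: the "outside" pages, at a kernel-chosen address. */
    void *outside = mmap(NULL, len, prot, MAP_SHARED, fd, 0);
    if (outside == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }

    /* Steps 2-3: the same file range again, at the "inside" address. */
    void *inside = mmap(inside_addr, len, prot, MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    if (inside == MAP_FAILED) {
        munmap(outside, len);
        return MAP_FAILED;
    }

    /* Step 4: record the pair for address translation. */
    mirror_table[mirror_count++] =
        (struct mirror_entry){ (uintptr_t)outside, (uintptr_t)inside, len };
    return outside;
}

/* Translate an "outside" pointer to its mirror, or NULL if unknown. */
void *outside_to_inside(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    for (size_t i = 0; i < mirror_count; i++) {
        const struct mirror_entry *e = &mirror_table[i];
        if (a >= e->outside && a - e->outside < e->len)
            return (void *)(e->inside + (a - e->outside));
    }
    return NULL;
}
```

File-backed requests would skip the memfd and reuse the caller's fd and offset for both mappings; MAP_FIXED requests would translate the given address through the table instead of adding a row.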

There are plenty of details that could be added, e.g. brk() could be reimplemented in terms of mmap(memfd_create()), mremap() could be duplicated in much the same way as mmap(), unimplemented/problematic syscalls can be initially replaced with abort(), etc. For a preliminary solution on GNU/Linux though, wrapping mmap() should be enough to create a duplicate mapping of an solib.

[1]: https://gitlab.com/strace/strace/-/blob/master/src/linux/64/syscallent.h

> You were referring (in another thread) to
> trapping only the first mmap() call done by
> dynamic loader, IIRC. How can that lead to
> a solution of having 2 identical mappings,
> is essentially unclear. At best it can solve
> the problem of specifying the reloc address,
When I suggested that before, I was trying to solve the problem of specifying the reloc address. I thought that was the core use case at the time.

That said, the approach can be adjusted with little effort. In most cases (~MAP_FIXED) the mmap() wrapper simply needs to:
  1. allocate some pages "inside" the VM to place the result,
  2. mmap() the pages with the same fd, size and offset (+ MAP_FIXED with addr as the "inside" target address), and
  3. return the "inside" target address.
  
MAP_ANONYMOUS requires no special handling, since the pages aren't being mirrored in this case.

MAP_FIXED specifies an addr, so instead of (1) just use the given addr instead. The wrapper could also abort the application if this addr is not "inside" the VM.

The rest of the basic approach still holds. This approach wraps mmap() while conforming to the Linux API, so it works for any segments that are mmap()'d. In GNU/Linux, solibs and .bss are included in that set.

> by the cost of depending on a particular impl
> of particular libc, forgetting about any
> portability to other libces.
The only requirement of this approach is that the solibs (and .bss) are mmap()'d following the recommendation in man mmap. This is true of GNU/Linux (Glibc on Linux). I doubt there is another popular libc that doesn't mmap() the solibs, but is there one you plan to support? (I also doubt Glibc-only is a dealbreaker for you, given la_premap was a viable solution and LD_AUDIT is basically a GNU extension at this point. :P)

Obviously this approach only works on Linux. Other OSs have their own syscalls and methods for intercepting them. I only know Linux, was there another OS you plan to support?

> > (In reply to Stas Sergeev from comment
> > You're right, neglected .bss when suggesting this idea. This would not be an
> > issue when using an mmap wrapper however, as the region is simply mapped
> > with MAP_ANONYMOUS.
> 
> I don't understand how this would not be
> an issue, please clarify. Region mapped
> with MAP_ANONYMOUS cannot be shared with VM.
See the description above. Pages mmap()'d with MAP_ANONYMOUS are mirrored via a memory-backed file to allow sharing with the VM.

> > (In reply to Stas Sergeev from comment
> > To clarify here, the "first" call to mmap() is the one without MAP_FIXED,
> > and is used to allocate the pages that will later be overwritten by
> > MAP_FIXED. Threads should not become a problem here, just check the flags.
> 
> Why any other thread can't do unrelated
> mmap() without MAP_FIXED?
See the description above. No information (except the address translation table) needs to be persisted between mmap() calls, so it doesn't matter which thread invokes mmap() when. The only requirement is that mmap(MAP_FIXED) always overwrites pages previously allocated with mmap(~MAP_FIXED), as recommended in man mmap and implemented in _dl_map_segments.

Comment 16 Stas Sergeev 2023-03-19 02:12:02 UTC

(In reply to Jonathon Anderson from comment #15)
> As Adhemerval has already mentioned from the very start of this RFE (comment
> #1):
> > Any GNU extension requires a specific usercase that can't be easily accomplished with current API.

"easily" is quite important here.
Even if your syscall interception approach
could work (which I think is not the case),
it doesn't fall into an "easy" category.
To me, having a good API is also important.
Why dlmem() is not the one?
But lets deal with the question if your trick
works at all first.


> Thus, the first priority for this RFE should be to establish this use case
> and express the failings of the current technology. A proposed patch series
> is difficult to review

Even when I split them into 13 nearly trivial
patches? Then what else can I do to have it
easy for a review?


> and near-impossible to merge without reaching some
> kind of consensus on these two points. Until this occurs, all but the most
> preliminary work on a patch is a waste of many people's time and patience,
> including yours as the author.

Not necessarily: after all I can try to use
the private glibc build for my project. But
of course it's a PITA so I'd prefer not having
to do that.


> It would be very constructive if you could investigate my proposed solution
> as detailed below, and precisely express what the insurmountable problems
> with it are. :D

I always do. :)


> Namely, establishing the use case in clear terms and expressing why the
> alternative technology of "dlopenfd()" + memfd_create() fails to meet the
> use case.

I thought we were already past that point,
and instead are discussing why mmap() intercept
fails? dlopenfd()+memfd doesn't give even the
possibility of specifying the reloc address,
and that's a very minimal, insufficient requirement.


> These callbacks allow ASLR to be disabled completely in userspace. If a
> poorly implemented auditor causes dynamic library loading to become
> extremely predictable, an attacker may find a way to steal cryptographic
> secrets stored in the bss segment. High-security container runtimes can't
> easily protect against this ASLR-loss since the kernel is not involved.
> 
> Is there a *real* security risk here? No idea, I have no clue if disabling
> ASLR in non-setuid applications is really a problem.

I think it's quite similar to allowing MAP_FIXED,
which is allowed. One can already map his secret
data always to the specific location, so there was
nothing new. But in any case, LD_AUDIT bits are
dropped from my patch.


> MAP_ANONYMOUS doesn't provide an fd to "mirror" the pages through, so the
> wrapper will need to provide one. This can easily be a private memory-backed
> file (e.g. memfd_create). Allocate some pages from this file before (1), and
> use that for the fd and offset in the remaining steps.

I still don't understand that part.
You propose to intercept first mmap.
In the current impl its a file-backed
private mmap that spans past the end
of the file. So basically it is 2 mmaps
in one: what goes beyond the file image,
is similar to anonymous.
So do you propose to split that mmap
into 2, mirror the file-backed part and
then mirror the anonymous part with the
memfd_create, correct?
But the problem is that first mmap
actually only establishes the map address.
Subsequent mmaps re-arrange the segments,
over-mapping that space. So where you
initially thought is an anonymous space,
will be a file-based mapping, and vice-versa.
You can clearly see that from my patch 2.
It converts the first mmap into an
anonymous one (fully anonymous, not
file-backed past eof). So the first mmap
doesn't have the actual layout, and so
it can't establish any mirroring. It
just allocates the space, and gets later
over-mapped by the loader.
Does this clarify?


> That's the basic approach. This approach wraps mmap() while conforming to
> the Linux API, so it works for any segments that are mmap()'d. In GNU/Linux,
> solibs and .bss are included in that set.

The problem is that first mmap has no
idea where the .bss will be mapped.
Neither does he know where the code is
going to be mapped. The only reason its
not anonymous currently in glibc, is
because the first elf segment is supposed
to be in the beginning. But everything
else gets over-mapped.
I am not even sure why the first segment
should be at the beginning, does it
always have the vaddr==0? But so do the
comments in the present code suggest.


> Obviously this approach only works on Linux. Other OSs have their own
> syscalls and methods for intercepting them. I only know Linux, was there
> another OS you plan to support?

Not in the near future, maybe eventually.
But I still fail to see why would the one
want to do syscall interceptions instead
of adding the right API, even if such
interception could work (but it can't).

Comment 17 Stas Sergeev 2023-03-19 02:37:52 UTC

Note that even the parts that were
initially file-backed, are later
re-evaluated as anonymous space.
Such pages are re-protected to
R/W and zeroed out with memset().
That includes .bss.
I believe its a quite bad solution,
as in /proc/self/maps it would remain
file-backed.
After my patch there are no such
discrepancies, as the first mmap
is always anonymous after my patch.
So you don't even need to clear .bss. :)

Hope this clarifies why intercepting
the first mmap doesn't give you anything.

Comment 18 Stas Sergeev 2023-03-19 09:47:02 UTC

(In reply to Stas Sergeev from comment #16)
> I am not even sure why the first segment
> should be at the beginning, does it
> always have the vaddr==0? But so do the
> comments in the present code suggest.

Actually it doesn't suggest that and
all segments are over-mapped over the
initial mapping. I'll need to re-check
my changes to _dl_map_segments() to see
if I got the things right.
But given that all segments are over-mapped,
certainly there is no reason to trap
the first mmap call.
Of course now I have some very bad feeling
that your next proposal will be "trap
all mmaps, not just the first one"...
Well, before you do that, consider the
following:
1. Some mappings are converted from
   file-based to anonymous via mprotect+memset.
2. _dl_map_segments() handles the "large
   alignment" case with 2 mmaps. The first
   large one is done only for alignment, and
   should I share with VM also that?
3. Do you really think that trapping all
   mmaps and trying to hack around the
   aforementioned problems, is a good idea?

Comment 19 Stas Sergeev 2023-03-19 10:14:11 UTC

Actually I see you already suggest to
trap all mmaps, fixed and non-fixed ones. You
just never explicitly mentioned that
change.
And of course you will also suggest to
trap mprotect(), do not set PROT_WRITE
when glibc is trying to, and trap the
memset() to that area via SIGSEGV to
find out that it should be converted
to anonymous mapping?

Comment 20 Stas Sergeev 2023-03-19 13:56:36 UTC

I studied the code a bit more to detail
my former claims. mprotect()+memset() is
applied when segment's data end is not page-aligned,
and there are still the alloc sections (like .bss)
within that segment. Then they go to the same
page. This is under "if (zeropage > zero)"
clause of dl-map-segments.h. Subsequent .bss
pages are anonymously mapped under
"if (zeroend > zeropage)" clause.
So your algo fails if some SHT_NOBITS section
is not page-aligned.
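
As a rough illustration of the arithmetic described above, the sketch below computes the three addresses and the resulting byte counts for a single PT_LOAD segment. It paraphrases the logic of dl-map-segments.h rather than quoting it, and the names bss_split and split_bss are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of how the zero-filled tail of one segment is split. */
struct bss_split {
    uintptr_t zero;       /* first byte past the file-backed data */
    uintptr_t zeropage;   /* zero rounded up to a page, capped at zeroend */
    uintptr_t zeroend;    /* end of the segment in memory */
    size_t memset_bytes;  /* tail of the last file-backed page: mprotect()+memset() */
    size_t anon_bytes;    /* remaining whole pages: anonymous MAP_FIXED mapping */
};

/* pagesize is assumed to be a power of two. */
struct bss_split split_bss(uintptr_t load_bias, uintptr_t p_vaddr,
                           size_t p_filesz, size_t p_memsz, size_t pagesize)
{
    struct bss_split s;

    s.zero = load_bias + p_vaddr + p_filesz;
    s.zeroend = load_bias + p_vaddr + p_memsz;
    s.zeropage = (s.zero + pagesize - 1) & ~(uintptr_t)(pagesize - 1);
    if (s.zeroend < s.zeropage)
        s.zeropage = s.zeroend;

    /* The "if (zeropage > zero)" clause: bytes cleared inside a file-backed page. */
    s.memset_bytes = s.zeropage > s.zero ? s.zeropage - s.zero : 0;
    /* The "if (zeroend > zeropage)" clause: bytes mapped anonymously. */
    s.anon_bytes = s.zeroend > s.zeropage ? s.zeroend - s.zeropage : 0;
    return s;
}
```

With a file size that is not a multiple of the page size, memset_bytes is non-zero, which is the case where a page that started out file-backed gets written to.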

Plus I'd say your algo is not a solution.
Intercepting all mmap calls from dynamic
loader and provide some weird tricks to them,
is not any better than to write another loader,
for example. :)
I am very surprised you make the claims like
"your patch is very difficult to review"
w/o even looking into the very small patches
that mostly split the huge multi-thousands-line
funcs into a reusable parts...

Comment 21 Stas Sergeev 2023-03-23 10:34:52 UTC

Ping, so any idea how to get your
technique working with unaligned
SHT_NOBITS section?
I spent some time trying to figure
out if its possible to privately use
the patched glibc, and unfortunately
it seems impossible. :(
So either your technique should work,
or my patch should be reviewed, because
the private use of the patched glibc
seems not possible even if you load
the patched glibc to the separate NS.

Please let me know what needs to be
done to my patch to make it reviewable.
I split it into 13 nearly trivial pieces,
hoping that's enough for a review, but
please let me know what else to do here.

Comment 22 Stas Sergeev 2023-03-27 07:12:22 UTC

Hi guys, so what is the status of
all this? If my patches would never
be looked into, no matter what, then
perhaps you should tell me that right
here, so that I stop wasting my and
other's time.

In any other case I have the following
questions:
1. Have we passed the stage where my
   use-case is explained and clarified?
2. Have we passed the stage where I
   keep being presented with "alternative
   solutions" like "intercept all mmap
   (and perhaps also mprotect) syscalls
   and do some weird thing on them"? My
   last conclusion was that such "solution"
   doesn't work for unaligned SHT_NOBITS
   sections.
3. If we passed 1 and 2, then I think
   the next step is to discuss an API,
   so here's the API:

dlfcn.h addition:
/* Callback for dlmem. */
typedef void *
(dlmem_premap_t) (void *mappref, size_t maplength, size_t mapalign,
                  void *cookie);
   
/* Do not replace destination mapping. dlmem() will then use memcpy(). */
#define DLMEM_DONTREPLACE 1
   
struct dlmem_args {
  /* Optional name to associate with the loaded object. */
  const char *soname;
  /* Namespace where to load the object. */
  Lmid_t nsid;
  /* dlmem-specific flags. */
  unsigned flags;
  /* Optional premap callback. */
  dlmem_premap_t *premap;
  /* Optional argument for premap callback. */
  void *cookie;
};

/* Like `dlmopen', but loads shared object from memory buffer.  */
extern void *dlmem (const unsigned char *buffer, size_t size, int mode,
                    struct dlmem_args *dlm_args);

Does anyone know if its a good or bad
API, and how should it be improved? It
allows to implement dlopen_with_offset()
in a couple of lines, it preserves the
file-based mappings so that /proc/self/maps
or /proc/self/map_files are valid, and
it allows to specify the solib name, so
it handles the file-based mmaps, like
dlopen_with_offset(), rather perfectly.
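
To make the premap callback more concrete, here is a minimal sketch of a callback that reserves an aligned address range with mmap(). The typedef is copied from the proposal above and is not part of any released glibc. The rest is assumption: that returning NULL signals failure, that mapalign is a power of two no smaller than the page size, and that maplength is a multiple of the page size. The name reserve_premap is made up.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Copied from the proposed dlfcn.h addition; hypothetical outside that patch. */
typedef void *(dlmem_premap_t) (void *mappref, size_t maplength,
                                size_t mapalign, void *cookie);

/* Reserve maplength bytes of PROT_NONE address space aligned to mapalign.
   Over-allocates by mapalign, then trims the unaligned head and the tail. */
void *reserve_premap(void *mappref, size_t maplength, size_t mapalign,
                     void *cookie)
{
    size_t span = maplength + mapalign;
    char *raw, *base;
    size_t tail;

    (void)mappref;  /* the preferred address is ignored in this sketch */
    (void)cookie;

    raw = mmap(NULL, span, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;  /* assumed failure convention */

    base = (char *)(((uintptr_t)raw + mapalign - 1) & ~(uintptr_t)(mapalign - 1));
    if (base > raw)
        munmap(raw, (size_t)(base - raw));
    tail = (size_t)((raw + span) - (base + maplength));
    if (tail > 0)
        munmap(base + maplength, tail);
    return base;
}
```

A caller would store such a function in the premap field of struct dlmem_args; a real callback for the use case in this thread would presumably return an address inside the VM's buffer instead of a fresh reservation.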

I wish I could have a separate libdl, but
so far that looks very difficult. If you
have any suggestions how can I have the
separate libdl, then that would indeed be
a perfect alternative solution that will
eliminate any need to patch glibc sources.
Or maybe some simple hooks can be added to
aid a standalone libdl? Let me know and I
will work in that direction then.
But "no reply" is a bit inconclusive.

Comment 23 Jonathon Anderson 2023-03-27 17:16:38 UTC

Sorry for the long delay in response, it's still a very busy time on my end. :P

I'll make up for it with a very long (and probably repetitive) response instead. >.<

(In reply to Stas Sergeev from comment #22)
> Hi guys, so what is the status of
> all this? If my patches would never
> be looked into, no matter what, then
> perhaps you should tell me that right
> here, so that I stop wasting my and
> other's time.
AFAIK your patches will be looked at once a use case that requires it is solidified, that can't be solved with current tech nor any better proposed API. So far, it has been unclear why the primary function of dlmem() is needed for your use case. Why do you need to load solibs straight from memory at all? 

> In any other case I have the following
> questions:
> 1. Have we passed the stage where my
>    use-case is explained and clarified?
Yes.

> 2. Have we passed the stage where I
>    kept presented with an "alternative
>    solutions" like "intercept all mmap
>    (and perhaps also mprotect) syscalls
>    and do some weird thing on them"? My
>    last conclusion was that such "solution"
>    doesn't work for unaligned SHT_NOBITS
>    sections.
No. I'm certain it works for unaligned SHT_NOBITS sections, any changes made to one side of the "mirror" are reflected in the other. (Although there is another flaw I missed before, an updated version of the technique is towards the bottom of this message. :P)

> 3. If we passed 1 and 2, then I think
>    the next step is to discuss an API,
>    so here's the API:
> ...
> Does anyone know if its a good or bad
> API, and how should it be improved?
There is not yet a solid use case for the primary function of this API, the fact that it "loads an solib from memory." This primary functionality is the main source of concern originally raised by Carlos O'Donell, and AFAICT hasn't been resolved.

The following API is close to your use case but doesn't raise the same concerns as dlmem(). Does this solve your problem, if not what's missing?
    void *dlopen4(const char *filename, int flags, const struct dlopen4_args *ext, size_t ext_size /* = sizeof(struct dlopen4_args) */);
    void *dlmopen5(Lmid_t lmid, const char *filename, int flags, const struct dlopen4_args *ext, size_t ext_size /* = sizeof(struct dlopen4_args) */);
    struct dlopen4_args {
      /* If not NULL, function called before mmap when loading the object [and its dependencies?].
         Returns the base of a mmapped range of given length and alignment. This mapping will be
         overwritten by the loaded object.  */
      void *(*dla_premap)(void *preferred_addr, size_t length, size_t align, void *userdata);
      /* User data passed to dla_premap.  */
      void *dla_premap_userdata;
    };

> It
> allows to implement dlopen_with_offset()
> in a couple of lines, it preserves the
> file-based mappings so that /proc/self/maps
> or /proc/self/map_files are valid, and
> it allows to specify the solib name, so
> it handles the file-based mmaps, like
> dlopen_with_offset(), rather perfectly.
These are niceties, but I think we can agree a direct implementation of dlopen_with_offset() would be better for the use cases that need it. It would also require far less refactors than dlmem().

> I wish I could have a separate libdl, but
> so far that looks very difficult. If you
> have any suggestions how can I have the
> separate libdl, then that would indeed be
> a perfect alternative solution that will
> eliminate any need to patch glibc sources.
> Or maybe some simple hooks can be added to
> aid a standalone libdl? Let me know and I
> will work in that direction then.
I don't have any suggestions here, ld.so and libdl and Glibc are all deeply tied together. The best I can recommend is to patch Glibc and base a container around it, if that works for your client(s). :P

> But "no reply" is a bit inconclusive.
You don't need to tell me that I'm slow to respond. :P

FWIW, Glibc like many other large OSS projects moves slowly. Speaking from experience, expect many months before getting a change landed in a Fedora release, and multiple years before it spreads to other Linux distributions like Debian/Ubuntu or OpenSUSE.

(In reply to Stas Sergeev from comment #16)
> (In reply to Jonathon Anderson from comment #15)
> > As Adhemerval has already mentioned from the very start of this RFE (comment
> > #1):
> > > Any GNU extension requires a specific usercase that can't be easily accomplished with current API.
> 
> "easily" is quite important here.
> Even if your syscall interception approach
> could work (which I think is not the case),
> it doesn't fall into an "easy" category.
As I mentioned before, syscall interception is a technique used in many VM-adjacent and widely used technologies, to name a few: containers (Podman/Docker), Windows emulation (Wine), browser sandboxes (Firefox/Chromium), and debuggers (GDB/strace). Many great examples exist in the open-source community suitable for study, IMHO strace and Crun (part of Podman) are good choices to start.

Given all this, I consider it much easier to write a syscall interception code than to write a shim library to translate between 32- and 64-bit call ABIs. FWIW. :D

> To me, having a good API is also important.
> Why dlmem() is not the one?
> ...
> 
> > Thus, the first priority for this RFE should be to establish this use case
> > and express the failings of the current technology. A proposed patch series
> > is difficult to review
> 
> Even when I split them into 13 nearly trivial
> patches? Then what else can I do to have it
> easy for a review?
I don't have many comments about the patch itself. If I find time to write them up I'll direct them to the dlmem() RFE.

> dlopenfd()+memfd doesn't give even the
> possibility of specifying the reloc address,
> and that's a very minimal, insufficient requirement.
Because you need the pages to be mirrored? Or is there another requirement here?

> > It would be very constructive if you could investigate my proposed solution
> > as detailed below, and precisely express what the insurmountable problems
> > with it are. :D
> 
> I always do. :)
So far, there seems to be a lot of confusion about the technique but no objective flaws about the overall approach. I did notice a flaw in the interim that complicates the technique, but again not insurmountable.

* * *

I'll describe the approach and updated technique verbatim below, in the hopes it will smooth the discussion here, with the goal of understanding the flaws with the overall approach for your use case.

The goal of the overall approach is to "mirror" ALL pages mmapped (after the syscall interception is installed) to pages inside the VM. That includes the pages forming a newly loaded solib. This is a very powerful approach that is not limited to the dynamic linker, it can be extended to mirror ANY memory allocated by the userspace code, including malloc()'d memory.

"Mirroring" pages here (e.g. page A is mirrored to page A') has three strong criteria that need to be met:
  a. Any change to the memory in page A is reflected in page A', and vice versa.
  b. The location of page A' relative to some other mirrored page B' reflects the location of page A relative to page B, if the userspace code requires such (MAP_FIXED).
  c. A "page translation table" exists that records the mirror relationship from A to A'.

The only way to implement criterion (a) on Linux is to propagate memory changes back to the backing fd (MAP_SHARED), so /proc/self/maps will definitely see file-backed mappings even for anonymous pages. On the other hand, (a) also means that if a .bss region is cleared with memset(), those changes will be reflected in the mirror pages, so we don't have to intercept those.
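
A small sketch of what (a) amounts to in practice, assuming a memfd as the backing file: two MAP_SHARED views of the same file offset see each other's writes. The helper name make_mirrored_pair is made up for this example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: back `length` bytes with a fresh memfd and map it
   twice with MAP_SHARED.  Returns 0 on success and stores the views in
   *a and *b; returns -1 on any failure. */
int make_mirrored_pair(size_t length, void **a, void **b)
{
    int fd = memfd_create("mirror-demo", MFD_CLOEXEC);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)length) < 0) {
        close(fd);
        return -1;
    }

    *a = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *b = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mappings keep the file contents alive */

    if (*a == MAP_FAILED || *b == MAP_FAILED) {
        if (*a != MAP_FAILED)
            munmap(*a, length);
        if (*b != MAP_FAILED)
            munmap(*b, length);
        return -1;
    }
    return 0;
}
```

A store through one view is visible through the other without any further syscall, which is why a memset() over such pages needs no interception.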

Criterion (b) only matters for MAP_FIXED calls; in the ~MAP_FIXED case the kernel (syscall interception) is allowed to choose any reasonable address to place the mmap()'d pages. The recommendation from man mmap is to (paraphrased): "mmap() without MAP_FIXED first, then overwrite the allocated mapping with MAP_FIXED." This avoids races in multithreaded code. The technique described later presumes this recommendation is followed in all userspace code and will abort() if not. This recommendation is followed by Glibc's dynamic linker; this is the rationale behind the first mmap() call you noticed gets completely over-mapped.
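
The reserve-then-overwrite pattern can be sketched as follows. This is only an illustration of the man mmap recommendation using anonymous memory, not the loader's actual code; reserve_region and overmap_anon are made-up names.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Step 1: let the kernel pick an address.  PROT_NONE reserves the range
   without making it usable yet.  Returns NULL on failure. */
void *reserve_region(size_t length)
{
    void *base = mmap(NULL, length, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return base == MAP_FAILED ? NULL : base;
}

/* Step 2: replace part of the reservation in place.  MAP_FIXED is safe
   here because the target range is already owned by the caller.
   offset and length are assumed to be page-aligned. */
void *overmap_anon(void *base, size_t offset, size_t length, int prot)
{
    void *p = mmap((char *)base + offset, length, prot,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Because the second call only ever targets addresses inside the first mapping, an unrelated mmap() from another thread cannot land in the range between the two steps.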

Every mmap() syscall is intercepted with this technique (I thought I said that explicitly but maybe it got lost in editing :P). There are other syscalls that alter the page table that could be intercepted for a more complete solution: munmap(), mremap(), mprotect(), brk(). For simplicity I'm only going to discuss the interception for mmap(), other syscalls are left as an exercise to the reader (and should not be necessary for a preliminary implementation, I think). 

Now for the actual technique. The intercepted wrapper for mmap(addr, length, prot, flags, fd, offset) performs the following operations:
 0. Adjust the arguments if flags contains MAP_ANONYMOUS or MAP_PRIVATE (described below),
 1. mmap() the original pages (that live "outside" the VM), call them A,
 2. allocate the mirror pages (that live "inside" the VM), call them A',
 3. mmap() A' as a mirror of A,
 4. update the "page translation table" (criteria (c)) with the A -> A' relation, and
 5. return the address of A from step (1).

There are a number of cases that need to be handled. The "base case" is (MAP_SHARED & ~MAP_ANONYMOUS & ~MAP_FIXED), here step (1) calls mmap(addr, length, prot, flags, fd, offset), and step (3) calls mmap(A'.addr, length, prot, flags | MAP_FIXED, fd, offset). Step (2) allocates any free pages in the VM. This creates a natural mirrored mapping between A and A'.
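
A minimal sketch of steps (1), (3) and (5) for this base case, assuming the caller has already picked the "inside" address in step (2). Step (4), the table update, is left out, and mirror_shared is a made-up name.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical base-case wrapper: map the same file range once wherever
   the kernel chooses (A) and once at `inside` (A').  `inside` must be a
   page-aligned range the caller already owns, since MAP_FIXED replaces
   whatever is mapped there.  Returns A, or MAP_FAILED on error. */
void *mirror_shared(void *inside, size_t length, int prot, int fd, off_t offset)
{
    void *a, *a_mirror;

    /* Step 1: the original pages, "outside" the VM. */
    a = mmap(NULL, length, prot, MAP_SHARED, fd, offset);
    if (a == MAP_FAILED)
        return MAP_FAILED;

    /* Step 3: the mirror, at the chosen "inside" address. */
    a_mirror = mmap(inside, length, prot, MAP_SHARED | MAP_FIXED, fd, offset);
    if (a_mirror == MAP_FAILED) {
        munmap(a, length);
        return MAP_FAILED;
    }

    /* Step 5: the caller of mmap() sees only A. */
    return a;
}
```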

If flags contains MAP_ANONYMOUS, an extra step (0) is added before step (1). In step (0), fd is replaced by a file descriptor allocated with memfd_create(), and offset by the offset of some freshly allocated pages in that file. flags has the MAP_ANONYMOUS bit removed, since now it is no longer an anonymous mapping. All cases below then apply.

If flags contains MAP_FIXED, step (2) needs to change. Assuming the man mmap recommendation is followed, there must already be an A -> A' mapping in the "page translation table" in this case. Step (2) reuses this prior mapping and uses this previously-allocated A', if one doesn't exist it abort()s the entire application. (Note that this reflects the over-mapping done by the dynamic linker in the VM space, so no issues with that.) 
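
The lookup in step (2) for the MAP_FIXED case could look roughly like this. The fixed-size array, the struct layout and the function names are all assumptions made for the sketch; a real implementation would also need removal and locking.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 64

/* One recorded A -> A' relation. */
struct mirror_range {
    uintptr_t outside;  /* start of A */
    uintptr_t inside;   /* start of A' */
    size_t length;
};

static struct mirror_range ranges[MAX_RANGES];
static size_t n_ranges;

/* Record a new relation; returns -1 when the table is full. */
int table_add(void *outside, void *inside, size_t length)
{
    if (n_ranges == MAX_RANGES)
        return -1;
    ranges[n_ranges].outside = (uintptr_t)outside;
    ranges[n_ranges].inside = (uintptr_t)inside;
    ranges[n_ranges].length = length;
    n_ranges++;
    return 0;
}

/* Translate [addr, addr + length) to the "inside" space.  Returns NULL
   when the range is not fully covered by one recorded mapping, which is
   where the wrapper described above would abort(). */
void *table_translate(const void *addr, size_t length)
{
    uintptr_t a = (uintptr_t)addr;
    size_t i;

    for (i = 0; i < n_ranges; i++) {
        const struct mirror_range *r = &ranges[i];

        if (a >= r->outside && length <= r->length
            && a - r->outside <= r->length - length)
            return (void *)(r->inside + (a - r->outside));
    }
    return NULL;
}
```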

If flags contains MAP_PRIVATE, extra steps are once again needed. If this is a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used later to add write access (IIRC I have not observed Glibc's ld.so do so with strace), then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the rest will work.

Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, the mapped portion of the file needs to be copied out to an editable file. I can think of two implementations off the top of my head; others likely exist. First idea:
 0.1. Allocate some pages in an anonymous file as if flags contained MAP_ANONYMOUS, results in fd_a and offset_a.
 0.2. orig = mmap(NULL, length, PROT_READ, flags, fd, offset);
 0.3. fd = fd_a, offset = offset_a;
 1. mmap(..., prot, ..., fd, offset) the original pages (that live "outside" the VM), call them A,
 1.1. memcpy(A.addr, orig, length);
 1.2. munmap(orig, length);
Second idea:
 0.1. Allocate some pages in an anonymous file as if flags contained MAP_ANONYMOUS, results in fd_a and offset_a.
 0.2. orig_off = lseek(fd, 0, SEEK_CUR);
 0.3. lseek(fd, offset, SEEK_SET);
 1. mmap(..., prot, ..., fd_a, offset_a) the original pages (that live "outside" the VM), call them A,
 1.1. read(fd, A.addr, length);
 1.2. lseek(fd, orig_off, SEEK_SET);
 1.3. fd = fd_a, offset = offset_a;
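
A sketch along the lines of the second idea, with one deliberate change: it uses pread() instead of lseek()+read(), so the original fd's file offset is never disturbed and steps 0.2, 0.3 and 1.2 fall away. The helper name copy_to_memfd is made up, and the choice to leave bytes past EOF zero-filled is an assumption.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy [offset, offset + length) of fd into a fresh
   memfd and return the memfd, or -1 on error.  The caller then maps the
   memfd with MAP_SHARED in place of the original MAP_PRIVATE request. */
int copy_to_memfd(int fd, off_t offset, size_t length)
{
    char *dst;
    size_t done = 0;
    int mfd = memfd_create("private-copy", MFD_CLOEXEC);

    if (mfd < 0)
        return -1;
    if (ftruncate(mfd, (off_t)length) < 0) {
        close(mfd);
        return -1;
    }
    dst = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
    if (dst == MAP_FAILED) {
        close(mfd);
        return -1;
    }

    while (done < length) {
        ssize_t n = pread(fd, dst + done, length - done, offset + (off_t)done);

        if (n < 0) {
            munmap(dst, length);
            close(mfd);
            return -1;
        }
        if (n == 0)
            break;  /* hit EOF: the rest stays zero-filled */
        done += (size_t)n;
    }

    munmap(dst, length);
    return mfd;
}
```

Since the copy lives in its own file, later writes through the mirrored mappings never reach the original solib on disk, which preserves the MAP_PRIVATE semantics the caller asked for.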

That's it, that's the entire technique. It's a powerful approach reminiscent of container tech, which I find fitting for a use case messing with a VM. It's a straightforward technique with good similar examples in the open-source community, for example strace's --inject= options. It's a small technique, I would budget at around 100-300 lines for a PoC implementation. It's not a performant approach, but presumably your apps aren't dlopen()/dlclose()'ing solibs like there's no tomorrow. What's wrong with it?

* * *

> Of course now I have some very bad feeling
> that your next proposal will be "trap
> all mmaps, not just the first one"...
> Well, before you do that, consider the
> following:
> 1. Some mappings are converted from
>    file-based to anonymous via mprotect+memset.
The fact that the pages are mirrored handles this, changes in one are reflected in the other. Note that this trait is required to make shared memory work at all.

IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only, so they don't need to be mirrored in a simple PoC implementation. At least for simple cases, YMMV. 

> 2. _dl_map_segments() handles the "large
>    alignment" case with 2 mmaps. The first
>    large one is done only for alignment, and
>    should I share with VM also that?
Yes. It's simpler and more robust if you don't try to be smart about these cases, at least for a PoC.

> 3. Do you really think that trapping all
>    mmaps and trying to hack around the
>    aforementioned problems, is a good idea?
> ...
> Plus I'd say your algo is not a solution.
> Intercepting all mmap calls from dynamic
> loader and provide some weird tricks to them,
> is not any better than to write another loader,
> for example. :)
Yes, I really think syscall interception is a great idea. It's an order of magnitude smaller than your refactoring patches, and works on every GNU/Linux box (possibly every Linux box) updated in the last 5 years. It can be extended to be more powerful than any alteration to the dynamic linker. If it works for you, IMHO it is a VASTLY better solution than patching Glibc, both for you and for your client(s). :D

> I am very surprised you make the claims like
> "your patch is very difficult to review"
> w/o even looking into the very small patches
> that mostly split the huge multi-thousands-line
> funcs into a reusable parts...
Your patch is difficult to review for reasons that have to do with the API and use case, not the implementation. It's also a refactor touching over a thousand lines, that's enough reason to make it hard to review. :P

Comment 24 Stas Sergeev 2023-03-28 01:29:47 UTC

(In reply to Jonathon Anderson from comment #23)
> AFAIK your patches will be looked at once a use case that requires it is
> solidified, that can't be solved with current tech nor any better proposed
> API. So far, it has been unclear why the primary function of dlmem() is
> needed for your use case. Why do you need to load solibs straight from
> memory at all? 

While this is quite handy for my
use-case (solib image comes from a vm,
so its already in memory and has no
host fd), the primary problem is that
any file-based API destroys the existing
mapping by definition.
So I choose dlmem() because it both suits
surprisingly well and has the potential
to preserve the user's mapping.
Other than that, its completely agnostic
of my use-case. It just allows to dlmem()
into the user's buffer.


> No. I'm certain it works for unaligned SHT_NOBITS sections, any changes made
> to one side of the "mirror" are reflected in the other. (Although there is
> another flaw I missed before, an updated version of the technique is towards
> the bottom of this message. :P)

I think its the same problem that you
try to avoid by introducing the writable
file now. Unaligned SHT_NOBITS section
results in re-protecting the file-backed
MAP_PRIVATE page into a writable one.


> There is not yet a solid use case for the primary function of this API, the
> fact that it "loads an solib from memory." This primary functionality is the
> main source of concern originally raised by Carlos O'Donell, and AFAICT
> hasn't been resolved.

Could you please explain the concern
itself? I mean, what problem is there
to have an API to dlmem() from memory?
Is it a security concern, or what kind of?
What justifies the straight "no" or
"no unless you disprove 1024+ tricks
to do the same with unportable syscall-
intercepting techniques"?


> The following API is close to your use case but doesn't raise the same
> concerns as dlmem(). Does this solve your problem, if not what's missing?
>     void *dlopen4(const char *filename, int flags, const struct dlopen4_args *ext, size_t ext_size /* = sizeof(struct dlopen4_args) */);
>     void *dlmopen5(Lmid_t lmid, const char *filename, int flags, const struct dlopen4_args *ext, size_t ext_size /* = sizeof(struct dlopen4_args) */);
>     struct dlopen4_args {
>       /* If not NULL, function called before mmap when loading the object [and its dependencies?].
>          Returns the base of a mmapped range of given length and alignment. This mapping will be
>          overwritten by the loaded object.  */
>       void *(*dla_premap)(void *preferred_addr, size_t length, size_t align, void *userdata);
>       /* User data passed to dla_premap.  */
>       void *dla_premap_userdata;
>     };

The primary problem is that this API
doesn't allow to preserve the user's
mapping. It is only using that mapping
to specify the reloc address, while
dlmem() can optionally preserve it (I
use the separate flag for that).
The secondary problem is "filename",
but yes, I know you'll suggest to get
it from /proc/self/fd.


> These are niceties, but I think we can agree a direct implementation of
> dlopen_with_offset() would be better for the use cases that need it. It
> would also require far less refactors than dlmem().

I can remove all refactors and replace
them with copy/pastes. Much bigger code
but no change to existing code.
Will that be any better?
OTOH all refactors I did, just take some
code chunk and move it to a separate func
with the different indentation level.
These diffs should be looked into with
some tool that ignores indentation.
Only then it would be clear how small they
are.


> As I mentioned before, syscall interception is a technique used in many
> VM-adjacent and widely used technologies, to name a few: containers
> Windows emulation (Wine), browser sandboxes
> (Firefox/Chromium),

I wonder if the above ones actually
do the syscall interception, or just use
the bpf filters to avoid malicious code
from using syscalls?


> Given all this, I consider it much easier to write a syscall interception
> code than to write a shim library to translate between 32- and 64-bit call
> ABIs. FWIW. :D

It's a bit strange to intercept the syscalls
of your own code. I am quite sure none of
the projects you mentioned actually do this.
They intercept the syscalls of some 3rd-party
code running along, but never their own syscalls.
gdb/strace definitely intercept the syscalls
of the debuggee, same with the rest of the projects.
Most of dl_audit framework can be implemented
with syscall interception, but why don't you
want to do that?


> > dlopenfd()+memfd doesn't give even the
> > possibility of specifying the reloc address,
> > and that's a very minimal, insufficient requirement.
> Because you need the pages to be mirrored? Or is there another requirement
> here?

Mirrored and also reloc address specified.
AFAICT fdlopen()+memfd gives neither.


> There are a number of cases that need to be handled. The "base case" is
> (MAP_SHARED & ~MAP_ANONYMOUS & ~MAP_FIXED),

Not used by libdl AFAIK, so skipping.


> If flags contains MAP_ANONYMOUS, an extra step (0) is added before step (1).

That's quite clear.


> If flags contains MAP_PRIVATE, extra steps are once again needed. If this is
> a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used later
> to add write access (IIRC I have not observed Glibc's ld.so do so with
> strace),

But this is exactly what happens if you
have an unaligned SHT_NOBITS section. It
goes to the same page that used MAP_PRIVATE
to load an elf segment. glibc then re-protects
and memsets that part. Even if you haven't
seen that with strace, I was pointing to the
exact code that does this.
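
A simplified sketch of that re-protect-and-memset step, written from the
description above rather than copied from glibc. zero_fill_tail() and
demo_zero_fill() are made-up names, and the demo uses a memfd as a stand-in
for the solib file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Zero the part of the segment's last page that lies past FILESZ.
   SEG is the page-aligned start of a mapping made with protection PROT.
   Returns 0 on success, -1 if mprotect fails.  */
int
zero_fill_tail (char *seg, size_t filesz, size_t page, int prot)
{
  assert (page != 0 && (page & (page - 1)) == 0);
  size_t zero = filesz;                               /* start of .bss  */
  size_t zeropage = (zero + page - 1) & ~(page - 1);  /* next page edge */
  if (zeropage == zero)
    return 0;                                         /* page-aligned   */
  char *last = seg + (zero & ~(page - 1));
  int need_write = (prot & PROT_WRITE) == 0;
  if (need_write && mprotect (last, page, prot | PROT_WRITE) != 0)
    return -1;
  memset (seg + zero, 0, zeropage - zero);
  if (need_write && mprotect (last, page, prot) != 0)
    return -1;
  return 0;
}

/* Demo: a one-page "file" full of 'A', of which only 100 bytes count as
   file content.  Returns 1 if the private view got zeroed while the file
   itself stayed untouched, 0 if not, -1 on setup failure.  */
int
demo_zero_fill (void)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  int fd = memfd_create ("nobits-demo", MFD_CLOEXEC);
  if (fd < 0)
    return -1;
  int result = -1;
  char *file = MAP_FAILED, *seg = MAP_FAILED;
  if (ftruncate (fd, (off_t) page) == 0)
    file = mmap (NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (file != MAP_FAILED)
    {
      memset (file, 'A', page);
      seg = mmap (NULL, page, PROT_READ, MAP_PRIVATE, fd, 0);
    }
  if (seg != MAP_FAILED && zero_fill_tail (seg, 100, page, PROT_READ) == 0)
    result = seg[99] == 'A' && seg[100] == 0 && seg[page - 1] == 0
             && file[100] == 'A';
  if (seg != MAP_FAILED)
    munmap (seg, page);
  if (file != MAP_FAILED)
    munmap (file, page);
  close (fd);
  return result;
}
```

The memset hits a MAP_PRIVATE page, so the kernel makes a private copy and
the page stops following the file.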


> then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the
> rest will work.

If the page is never re-protected, then
MAP_SHARED is not even needed. You can
just have 2 private mappings from same file.


> Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, the

AFAIK there is no such case.
PROT_WRITE is applied later with mprotect()
if you have an unaligned SHT_NOBITS section,
but is AFAICS never applied initially.


> That's it, that's the entire technique. It's a powerful approach reminiscent
> of container tech, which I find fitting for a use case messing with a VM.
> It's a straightforward technique with good similar examples in the
> open-source community, for example strace's --inject= options. It's a small
> technique, I would budget at around 100-300 lines for a PoC implementation.
> It's not a performant approach, but presumably your apps aren't
> dlopen()/dlclose()'ing solibs like there's no tomorrow. What's wrong with it?

Contrary to what you say, no one is
intercepting his own syscalls.
And the SHT_NOBITS section problem is not
yet addressed, although of course you will
propose to intercept also mprotect() to get
it in.


> > Of course now I have some very bad feeling
> > that your next proposal will be "trap
> > all mmaps, not just the first one"...
> > Well, before you do that, consider the
> > following:
> > 1. Some mappings are converted from
> >    file-based to anonymous via mprotect+memset.
> The fact that the pages are mirrored handles this, changes in one are

Pages are not mirrored in case of a
MAP_PRIVATE mapping that was later
re-protected to r/w. Of course you
can always use MAP_SHARED beforehand,
and do a writable file copy, which will
basically mean to just copy the initially
memory-based solib into a file on hdd rather
than to even properly use memfd.
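
A small experiment showing the difference, under the assumption that a
memfd stands in for the solib. write_is_mirrored() is a made-up helper: it
maps the file read-only with the given flags, re-protects it to r/w the
way the loader would, writes one byte, and checks whether a second
MAP_SHARED view of the same file sees the write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through a mapping created with FLAGS is visible
   through a separate MAP_SHARED view of the same memfd, 0 if it is not,
   -1 on setup failure.  */
int
write_is_mirrored (int flags)
{
  assert (flags == MAP_SHARED || flags == MAP_PRIVATE);
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  int fd = memfd_create ("mirror-demo", MFD_CLOEXEC);
  if (fd < 0)
    return -1;
  int result = -1;
  char *obj = MAP_FAILED, *view = MAP_FAILED;
  if (ftruncate (fd, (off_t) page) == 0)
    {
      obj = mmap (NULL, page, PROT_READ, flags, fd, 0);
      view = mmap (NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    }
  /* Re-protect to r/w and write, as happens for the unaligned .bss.  */
  if (obj != MAP_FAILED && view != MAP_FAILED
      && mprotect (obj, page, PROT_READ | PROT_WRITE) == 0)
    {
      obj[0] = 'X';
      result = view[0] == 'X';
    }
  if (obj != MAP_FAILED)
    munmap (obj, page);
  if (view != MAP_FAILED)
    munmap (view, page);
  close (fd);
  return result;
}
```

With MAP_SHARED the write goes to the file and the second view sees it;
with MAP_PRIVATE the written page becomes a private copy and the views
diverge.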


> IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only, so
> they don't need to be mirrored in a simple PoC implementation. At least for
> simple cases, YMMV. 

Not sure if the unaligned SHT_NOBITS
(that causes re-protect to R/W) is a
"simple case" or not.


> > I am very surprised you make the claims like
> > "your patch is very difficult to review"
> > w/o even looking into the very small patches
> > that mostly split the huge multi-thousands-line
> > funcs into a reusable parts...
> Your patch is difficult to review for reasons that have to do with the API
> and use case, not the implementation. It's also a refactor touching over a
> thousand lines, that's enough reason to make it hard to review. :P

If indentation is ignored, then my patches
touch a dozen lines. These are just
moves of large chunks of code into separate
funcs.

Comment 25 Stas Sergeev 2023-03-28 05:02:54 UTC

For example the diffstat of the largest
patch that actually implements dlmem, is:
 48 files changed, 484 insertions(+), 1 deletion(-)

1 deletion! (in a makefile)
And another patch that adds the optional
part of dlmem, looks like this:
 5 files changed, 202 insertions(+), 2 deletions(-)

2 deletions in a makefile.
You probably can't ask for a better
separation of changes: the 2 main patches change
no existing code at all.

Yes, there are also 2 patches with diffstats
under 200 lines, but if the indentation is
ignored, then they are 20 lines.

The rest of the patches are in a range of
10-50 lines. Not sure if any better separation
is possible.


> Your patch is difficult to review for reasons that have to do with the API

What does this mean?
We can discuss API also here if the patch
makes it somehow difficult.

Comment 26 Stas Sergeev 2023-03-28 05:37:40 UTC

(In reply to Jonathon Anderson from comment #23)
> These are niceties, but I think we can agree a direct implementation of
> dlopen_with_offset() would be better for the use cases that need it. It
> would also require far less refactors than dlmem().

Getting a bit more abstract here,
why are refactors that bad? glibc
is full of multi-thousand-line funcs
interspersed with gotos. Is this because
refactors are prohibited?
I mean, I was hoping for a "thank you"
for a couple of small refactors.
Is the current glibc code style (huge
spaghetti funcs) intentional and
enforced?

Comment 27 Jonathon Anderson 2023-03-28 06:17:57 UTC

(In reply to Stas Sergeev from comment #24)
> Could you please explain the concern
> itself? I mean, what problem is there
> to have an API to dlmem() from memory?
> Is it a security concern, or what kind of?
Briefly summarizing the main points from the original email in the mailing list [1]:
> dlmem() works at a lower level of abstraction than the rest of the dl* APIs, i.e. memory instead of solibs/objects. That has widespread impacts across many users of Glibc, including but not limited to security, LD_AUDIT, and developer tools (GDB). Some reasons follow:
>   - dlmem() does not ensure that the passed memory is a correctly mmap()'d object. It would be strongly preferable that the API ensures we CAN'T end up in an inconsistent state, instead of making it UB if the user slips up.
>   - dlmem() removes the "file descriptor" abstraction out of the link_map. A lot of tooling has to change to fit this new reality, both inside and outside Glibc: LD_AUDIT, developer tools (e.g. GDB), etc.
>   - dlmem() skips many syscalls, removing the kernel-side auditable events required for security tooling. In contrast, "dlopenfd" requires both memfd_create() (or similar) and mmap() of that fd, allowing e.g. FFI/JIT to be locked down by a security seccomp filter.

Adding my own concern as well:
  - dlmem() seems to expect the user to parse the program headers and mmap() the binary as required. That requires the application to re-implement a core, delicate piece of ld.so... and do so correctly. From an API design perspective, that seems like a very poor choice of abstraction.
  
AFAICS none of these issues have been resolved in the latest patches. Some of these issues are intrinsic to the dlmem() semantics. So if another, better API will work for your case, that certainly would be preferred.

[1]: https://sourceware.org/pipermail/libc-alpha/2023-February/145735.html

> > The following API is close to your use case but doesn't raise the same
> > concerns as dlmem(). Does this solve your problem, if not what's missing?
> >     void *dlopen4(const char *filename, int flags, const struct dlopen4_args
> > *ext, size_t ext_size /* = sizeof(struct dlopen3_args) */);
> >     void *dlmopen5(Lmid_t lmid, const char *filename, int flags, const
> > struct dlopen4_args *ext, size_t ext_size /* = sizeof(struct dlopen3_args)
> > */);
> >     struct dlopen4_args {
> >       /* If not NULL, function called before mmap when loading the object
> > [and its dependencies?].
> >          Returns the base of a mmapped range of given length and alignment.
> > This mapping will be
> >          overwritten by the loaded object.  */
> >       void *(*dla_premap)(void *preferred_addr, size_t length, size_t align,
> > void *userdata);
> >       /* User data passed to dla_premap.  */
> >       void *dla_premap_userdata;
> >     };
> 
> The primary problem is that this API
> doesn't allow to preserve the user's
> mapping. It is only using that mapping
> to specify the reloc address, while
> dlmem() can optionally preserve it (I
> use the separate flag for that).
This is precisely one of the concerns with dlmem(). Why must the user's mapping be preserved? So that the mirroring can be set up before the object is loaded?

Would replacing the dla_premap hook with some kind of custom-mmap() (dla_mmap()) hook fit your use case better? That could allow you to set up mirroring *as* the object is loaded, instead of before.

FWIW, do you need page-mirroring at all if you can just choose the reloc address to be within the VM space?

> The secondary problem is "filename",
> but yes, I know you'll suggest to get
> it from /proc/self/fd.
I would prefer /proc/self/fd over dlopenfd4(). But dlopenfd() seems to be of wider interest, so whatever works.

> I can remove all refactors and replace
> them with copy/pasts. Much bigger code
> but no change to existing code.
> Will that be any better?
> OTOH all refactors I did, just take some
> code chunk and move it to a separate func
> with the different indentation level.
> These diffs should be looked into with
> some tool that ignores indentation.
> Only then it would be clear how small they
> are.
I wouldn't waste any more time on the dlmem() patch until the concerns above can be addressed.

> > As I mentioned before, syscall interception is a technique used in many
> > VM-adjacent and widely used technologies, to name a few: containers
> > Windows emulation (Wine), browser sandboxes
> > (Firefox/Chromium),
> 
> I wonder if the above ones are actually
> do the syscall interception, or just use
> the bpf filters to avoid malicious code
> from using syscalls?
None of the examples above do exactly what you're looking for. If I knew of any OSS that did, I would just point you there. AFAIK your use case is very unique.

Of the examples I've named:
  - Wine outright implements Windows syscalls on Linux, by intercepting all syscalls in the running process and performing the translation in userspace (SIGSYS handler).
  - strace and GDB intercept syscalls "remotely" via ptrace(). IMHO the process of poking the registers and memory via ptrace() is not all that different than doing so from inside a SIGSYS signal handler.
  - Podman/Docker use libseccomp to filter syscalls with BPF seccomp() filters. BPF isn't powerful enough for the proposed approach, but it is similar in that it can alter the arguments and return values (to a limited extent).
  - Firefox and Chrom(ium) also use seccomp() filters, but they also register special handlers for SIGSYS. IIRC it's mainly for error reporting and not for interception, but you get the idea.
  
In short, intercepting syscalls is done in multiple OSS projects to varying extents, for security and for profit. Wine is the only one that is as extreme as your use case, but the rest do have some degree of similarity.

> > Given all this, I consider it much easier to write a syscall interception
> > code than to write a shim library to translate between 32- and 64-bit call
> > ABIs. FWIW. :D
> 
> Its a bit strange to intercept the syscalls
> of your own code. I am quite sure none of
> the projects you mentioned, actually do this.
> They intercept the syscalls of some 3rd-party
> code running along, but never their own syscalls.
Presumably you won't intercept (all of) your own syscalls, primarily you're aiming for the syscalls while the 3rd-party "ancient code" is loading. So isn't it pretty much the same?

> Most of dl_audit framework can be implemented
> with syscall interception, but why don't you
> want to do that?
Because (1) LD_AUDIT hearkens back to the days of Solaris and so is already on literally every GNU/Linux box in active use, and because (2) symbol binding (la_symbind) is done completely in userspace and can't be intercepted by syscalls.

Very different situation.

> > > dlopenfd()+memfd doesn't give even the
> > > possibility of specifying the reloc address,
> > > and that's a very minimal, insufficient requirement.
> > Because you need the pages to be mirrored? Or is there another requirement
> > here?
> 
> Mirrored and also reloc address specified.
> AFAICT fdlopen()+memfd gives neither.
And based on prior comments, I assume you also want to preserve user mappings here.

> > If flags contains MAP_PRIVATE, extra steps are once again needed. If this is
> > a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used later
> > to add write access (IIRC I have not observed Glibc's ld.so do so with
> > strace),
> 
> But this is exactly what happens if you
> have unaligned SHT_NOBITS section. It
> goes to the same page that used MAP_PRIVATE
> to load an elf segment. glibc then re-protects
> and memsets that part. Even if you haven't
> seen that with strace, I was pointing to the
> exact code that does this.
Missed that comment, sorry. Link to the code so we're all on the same page: [2]

Note that the mprotect() calls are only if(__glibc_unlikely((c->prot & PROT_WRITE) == 0)). It seems that newer ld places .data and small .bss in a RW LOAD segment, which would explain why I've never observed it happen myself with strace and modern software.

This makes me curious how old/common binaries are that trip this case. This code (complete with the "Dag nab it" comment) has been present in Glibc since 1995: [3]. So maybe... *really* ancient binaries? :D

If it bothers you, this case can be ignored and the following case (that copies the data to a writable anonymous file) used instead.

[2]: https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-map-segments.h;hb=07dd75589ecbedec5162a5645d57f8bd093a45db#l165
[3]: https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-load.c;hb=d66e34cd423425c348bcc83df127dd19711b0b9a#l339

> > then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the
> > rest will work.
> 
> If the page is never re-protected, then
> MAP_SHARED is not even needed. You can
> just have 2 private mappings from same file.
True!

> > Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, the
> 
> AFAIK there is no such case.
> PROT_WRITE is applied later with mprotect()
> if you have an unaligned SHT_NOBITS section,
> but is AFAICS never applied initially.
PROT_WRITE is applied initially if the LOAD segment is marked as RW. A quick readelf -l on a few of my system's binaries seems to indicate this is pretty common for .data and .bss in modern software.
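
The same check can be done in-process instead of with readelf. This is a minimal sketch using dl_iterate_phdr() to count PT_LOAD segments and how many of them carry PF_W, across every object loaded into the current process. The names load_stats and count_load_segments() are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>

struct load_stats {
  size_t total;     /* all PT_LOAD segments seen        */
  size_t writable;  /* PT_LOAD segments with PF_W set   */
};

/* Callback for dl_iterate_phdr: walk one object's program headers.  */
static int
count_cb (struct dl_phdr_info *info, size_t size, void *data)
{
  struct load_stats *s = data;
  (void) size;
  for (size_t i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      if (ph->p_type != PT_LOAD)
        continue;
      s->total++;
      if (ph->p_flags & PF_W)
        s->writable++;
    }
  return 0;  /* 0 means: keep iterating over the remaining objects.  */
}

/* Count PT_LOAD segments of every object in this process.  */
struct load_stats
count_load_segments (void)
{
  struct load_stats s = { 0, 0 };
  dl_iterate_phdr (count_cb, &s);
  assert (s.writable <= s.total);
  return s;
}
```

A nonzero writable count corresponds to segments that ld.so maps with PROT_WRITE from the start, i.e. the case discussed here.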

> Contrary to what you say, no one is
> intercepting his own syscalls.
I beg to disagree. Many projects filter or intercept their own syscalls. This *specific* approach hasn't been done before (I would point you to it if it was), but intercepting (or at least filtering) syscalls in-process is nothing new.

> And the SHT_NOBITS section problem is not
> yet addressed, although of course you will
> propose to intercept also mprotect() to get
> it in.
The most I would do in an mprotect() interception is ensure PROT_WRITE doesn't get added to any pages (i.e. abort() the application if it does). That doesn't really solve this problem, but it could catch some issues with the mirrored pages. Maybe. :P

> Pages are not mirrored in case of a
> MAP_PRIVATE mapping that was later
> re-protected to r/w. Of course you
> can always use MAP_SHARED beforehand,
> and do a writable file copy,
Indeed! Which is exactly what I suggested. :D

> which will
> basically mean to just copy the initially
> memory-based solib into a file on hdd rather
> than to even properly use memfd.
Why is the HDD required here, can't you just copy to a memfd file? That's what I suggested above.

> > IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only, so
> > they don't need to be mirrored in a simple PoC implementation. At least for
> > simple cases, YMMV. 
> 
> Not sure if the unaligned SHT_NOBITS
> (that causes re-protect to R/W) is a
> "simple case" or not.
Well, it *seems* very uncommon in modern software. Not sure whether it's rare in your "ancient code" case. Either way, solution discussed above.

> If indentation is ignored, then my patches
> touch a dozen of lines. There are just the
> moves of a large chunks of code to a separate
> funcs.
I more meant that comparing the hundreds of lines that have moved around is time-consuming. There are tools to help, it just takes a lot of time that could be better spent other places. Speaking from experience with my main project. :P

But it seems like your latest patches are shorter than I had remembered, I stand corrected. IIRC at one point there was a 1300-addition patch, which is where my comment came from, but that seems to have been cleaned up now. Great! :D

(In reply to Stas Sergeev from comment #26)
> (In reply to Jonathon Anderson from comment #23)
> > These are niceties, but I think we can agree a direct implementation of
> > dlopen_with_offset() would be better for the use cases that need it. It
> > would also require far less refactors than dlmem().
> 
> Getting a bit more abstract here,
> why refactors are that bad? glibc
> is full of multi-thousands-line funcs
> intersected by gotos. Is this because
> the refactors are prohibited?
> I mean, I was hoping for a "thank you"
> for a couple of small refactors.
> Is the current glibc code style (huge
> spaghetti funcs) is intentional and
> enforced?
I don't run the show here... but AFAIK the code here is carefully, heavily, manually optimized to generate the best performance with a wide range of C compilers. Carelessly refactoring it and especially adding additional function calls will destroy a lot of that work. (Although I dislike the spaghetti as much as you do. :P)

I've seen other refactors merge from the mailing list, IIRC performance almost always comes up in the leading discussion.

But again, the main problem with your patches is the concerns with the dlmem() semantics, not the size nor quality of your patches themselves. So let's fix that first.

Comment 28 Stas Sergeev 2023-03-28 09:14:33 UTC

(In reply to Jonathon Anderson from comment #27)
> Briefly summarizing the main points from the original email in the mailing
> list [1]:

You are creatively summarizing. :)
To me, all Carlos's concerns were addressed
and yours are completely new to me.

> > dlmem() works at a lower level of abstraction than the rest of the dl* APIs, i.e. memory instead of solibs/objects. That has widespread impacts across many users of Glibc, including but not limited to security, LD_AUDIT, and developer tools (GDB). Some reasons follow:

I think we need _all_ reasons for such
broad claims, not "some".


> >   - dlmem() does not ensure that the passed memory is a correctly mmap()'d object. It would be strongly preferable that the API ensures we CAN'T end up in an inconsistent state, instead of making it UB if the user slips up.

That's not a valid assumption.
The refactors in my patch are done not
out of nothing to do, but exactly to have
the common path for dlopen() and dlmem().
All elf sanity checks done by dlopen() are
applied also to dlmem().


> >   - dlmem() removes the "file descriptor" abstraction out of the link_map.

Could you please clarify?
In struct link_map I don't remember the
fd field, and the object name, which is
there, is supported by dlmem().


> A lot of tooling has to change to fit this new reality, both inside and outside Glibc: LD_AUDIT, developer tools (e.g. GDB), etc.

This needs a clarification, I don't
understand that part. What should they
change and why? Maybe gdb needs to be
able to trap dlmem() to auto-load debug
symbols - yes, that's what I admitted
long ago. But anything else than that?


> >   - dlmem() skips many syscalls, removing the kernel-side auditable events required for security tooling.

There are 2 use-cases.
1 is when dlmem() skips nothing, in a
sense that you yourself need to mmap()
an elf beforehand. So kernel still sees
everything, and even /proc/self/map_files
are correct.
2 is when the memory buffer comes out of
some other world, like from VM. In that
case it doesn't matter if the extra call
like memfd_create() is not done, as verifying
the code source is impossible in that case.


> In contrast, "dlopenfd" requires both memfd_create() (or similar) and mmap() of that fd, allowing e.g. FFI/JIT to be locked down by a security seccomp filter.

You can still lock down your jit by a
seccomp filter. Not sure why you need
memfd_create() to do that.


> Adding my own concern as well:

They were all your own though. :)


>   - dlmem() seems to to expect the user to parse the program headers and
> mmap() the binary as required. That requires the application to re-implement
> a core, delicate piece of ld.so...

Not sure what you are talking about.
My patch adds quite comprehensive test-cases
that try to cover the basic scenarios. So it
will help if you refer to a particular test
of mine that does something like this, as I
don't remember that any did.
Like I said before, dlmem() uses essentially
the same code path in glibc as does dlopen().
And only a few small refactors were needed to
accomplish that.


> and do so correctly. From an API design
> perspective, that seems like a very poor choice of abstraction.

If I know what you are referring to, maybe
I'll answer. :)


> AFAICS none of these issues have been resolved in the latest patches.

This is because, as I said above, your summary
of Carlos's concerns is "creative". I addressed
his concerns: I dropped LD_AUDIT bits and I showed
how to implement fdlopen() and dlopen_with_offset().


> Some
> of these issues are intrinsic to the dlmem() semantics. So if another,
> better API will work for your case, that certainly would be preferred.

I am all for discussing any better API that can
work for me.


> > The primary problem is that this API
> > doesn't allow to preserve the user's
> > mapping. It is only using that mapping
> > to specify the reloc address, while
> > dlmem() can optionally preserve it (I
> > use the separate flag for that).
> This is precisely one of the concerns with dlmem(). Why must the user's
> mapping be preserved? So that the mirroring can be set up before the object
> is loaded?

Indeed.
This behavior is optional.


> Would replacing the dla_premap hook with some kind of custom-mmap()
> (dla_mmap()) hook fit your use case better? That could allow you to set up
> mirroring *as* the object is loaded, instead of before.

With the only difference being to give the
user 100 times more work? :) Instead of
dealing with mmap flags and file copies,
he has 1 small and simple call-back in my
impl.


> FWIW, do you need page-mirroring at all if you can just choose the reloc
> address to be within the VM space?

Yes, because the VM sees the pointers as if
VM_window_start==0. So all pointers there
will be incorrect and not passable to the
32bit world. Reloc address is planned to be
within MAP_32BIT.


> Presumably you won't intercept (all of) your own syscalls, primarily you're
> aiming for the syscalls while the 3rd-party "ancient code" is loading. So
> isn't it pretty much the same?

This is where the 64bit library does the
loading. The foreign code all runs under
KVM, so I don't even need a seccomp filter
for it. You propose me to intercept my own
syscalls, and this is what no other project
does.


> > Most of dl_audit framework can be implemented
> > with syscall interception, but why don't you
> > want to do that?
> Because (1) LD_AUDIT hearkens back to the days of Solaris and so is already
> on literally every GNU/Linux box in active use, and because (2) symbol
> binding (la_symbind) is done completely in userspace and can't be
> intercepted by syscalls.
> 
> Very different situation.

Which is why I said "most", not "all".
You actually can implement most/some parts
of LD_AUDIT via a syscall trapping, leaving
things like symbind or la_activity in glibc,
but you don't want to do that.


> > Mirrored and also reloc address specified.
> > AFAICT fdlopen()+memfd gives neither.
> And based on prior comments, I assume you also want to preserve user
> mappings here.

Only for the sake of mirroring.
It's a broader feature of course, but me -
I only need it for mirroring.


> Note that the mprotect() calls are only if(__glibc_unlikely((c->prot &
> PROT_WRITE) == 0)).

Well, and otherwise (when PROT_WRITE is set)
I'd need the file copy. Which means I always
need it.


> > Contrary to what you say, no one is
> > intercepting his own syscalls.
> I beg to disagree. Many projects filter or intercept their own syscalls.
> This *specific* approach hasn't been done before (I would point you to it if
> it was), but intercepting (or at least filtering) syscalls in-process is
> nothing new.

I think it's only done when that process
executes alien code. And even that is
likely wine-specific: I would be very
surprised if any other alien code can
execute a "syscall" instruction. For
example the js code can't execute a syscall,
so, as you already confirmed, chromium
mostly does filtering to catch occasional
bugs of its own.
What I don't believe you can ever find is
some project intercepting the syscalls of
its own, and "emulating" them as if it's
alien code running. More generally, I don't
think anyone uses that technique to extend
the functionality. They either implement that
for security reasons (chromium), or for
debugging reasons (gdb), or for an emulation
of an alien code (wine). Extending the
functionality on a syscall level looks like
a gross hack, given that a very simple
high-level API suits well.


> > which will
> > basically mean to just copy the initially
> > memory-based solib into a file on hdd rather
> > than to even properly use memfd.
> Why is the HDD required here, can't you just copy to a memfd file? That's
> what I suggested above.

There are 2 "files" in that picture.
One memfd comes from the solib in memory,
and another memfd seems to come from your
suggestion. So I won't be able to even use
the solib's memfd properly, and will instead
have to copy it to the file on hdd (or to
the second memfd).


> But it seems like your latest patches are shorter than I had remembered, I
> stand corrected. IIRC at one point there was a 1300-addition patch, which is
> where my comment came from, but that seems to have been cleaned up now.
> Great! :D

Thanks!
Knowing that the patches are at least
looked into, is a big relief. :)


> I don't run the show here... but AFAIK the code here is carefully, heavily,
> manually optimized to generate the best performance with a wide range of C
> compilers. Carelessly refactoring it and especially adding additional
> function calls will destroy a lot of that work. (Although I dislike the
> spaghetti as much as you do. :P)

Well, if not for the musl that demonstrated
the possibility of writing a libc without any
spaghetti code (or a small and structured,
but completely obfuscated code as in uclibc),
I would believe that argument. :)

Comment 29 Stas Sergeev 2023-03-29 15:26:25 UTC

In case it wasn't visible, I apologize
to Jonathon for a bad joke on the ML.
Not the best day actually, I uninstalled
firefox from snapstore (ubuntu), and it
removed the entire profile, with all passwords,
credentials, cookies, histories, everything.
Which ended up in jokes like that one, sorry.
I wish I could target all the possible dark humor
to the authors of snap instead...

Comment 30 Stas Sergeev 2023-03-31 15:43:48 UTC

I need to also put that demonstration here,
because even Jonathon made this "elf parsing"
argument:

$ LD_LIBRARY_PATH=..:. ./tst-dlmem-fdlopen
unaligned buf gives buffer not aligned: Invalid argument
7fb413101000-7fb413102000 r--p 00000000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413102000-7fb413103000 r-xp 00001000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413103000-7fb413104000 r--p 00002000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413104000-7fb413105000 r--p 00002000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
7fb413105000-7fb413106000 rw-p 00003000 00:28 17195405 /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so


As can be seen, dlmem() created 5 references
to the solib when laying out segments. And no
manual elf parsing was involved, this test-case
was in a v9 patch so anyone can see I am not
cheating.
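
For anyone who wants to reproduce such a listing, here is a minimal
sketch (not taken from the test-case itself) that counts the lines of
/proc/self/maps referring to a given object. count_mappings_of() and
count_own_executable_mappings() are made-up helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Count the lines of /proc/self/maps that mention PATH.
   Returns -1 if the maps file cannot be opened.  */
int
count_mappings_of (const char *path)
{
  assert (path != NULL && path[0] != '\0');
  FILE *f = fopen ("/proc/self/maps", "r");
  if (f == NULL)
    return -1;
  char *line = NULL;
  size_t cap = 0;
  int n = 0;
  while (getline (&line, &cap, f) != -1)
    if (strstr (line, path) != NULL)
      n++;
  free (line);
  fclose (f);
  return n;
}

/* Convenience check: how many mappings refer to the running executable.  */
int
count_own_executable_mappings (void)
{
  char exe[4096];
  ssize_t len = readlink ("/proc/self/exe", exe, sizeof exe - 1);
  if (len <= 0)
    return -1;
  exe[len] = '\0';
  return count_mappings_of (exe);
}
```

Calling count_mappings_of() with the solib name after loading would
report one line per mapped segment, like the 5 lines shown above.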

Jonathon, will you allow this false claim about
some "elf parsing" to spread that widely, that
no one even wants to see my patches any more?
I think this is a bit unfair, I wanted to put
my patches down when some _valid_ argument is
raised...

Comment 31 Stas Sergeev 2023-03-31 15:53:19 UTC

Created attachment 14795 [details]
API description

Also here's the API description, with "limitations"
and everything needed to describe.
I am shocked to see no one even believes me that
it works, that it can lay out elf by vaddr's and
so on... Is it rocket science to write code
that lays out elf segments? No, it's not!
It works, is documented, demonstrated, posted as a
patch, passed the regression suite, and yet no one believes? :(

Comment 32 Stas Sergeev 2023-03-31 16:08:54 UTC

(In reply to Stas Sergeev from comment #30)
> 7fb413103000-7fb413104000 r--p 00002000 00:28 17195405
> /home/stas/src/glibc-dev/build/dlfcn/glreflib1.so
> 7fb413104000-7fb413105000 r--p 00002000 00:28 17195405

We can also see here 2 sections
with file offset being 0x2000 for both.
Of course their vaddr's are not equal to
the file offset.
What else can be done to demonstrate the
obvious fact that the elf is properly laid
out by vaddr's?
Come on...

Comment 33 Stas Sergeev 2023-04-03 09:28:16 UTC

Created attachment 14799 [details]
API description

I am glad to finally present v10
which incorporated work on all the
comments I got to v9, and that was
a big number. Thanks to all who
contributed!

I received a few mails saying that I ignore
the comments and therefore my patches
should not be looked into. I think this
is a contradiction, because the only
way to find out if I ignore any comments
or not, is to look into the patches.
But, to make that task easier, here's
the changelog:
Changes in v10:
- addressed review comments of Adhemerval Zanella
- moved refactor patches to the beginning of the series to simplify review
- fixed a few bugs in an elf relocation machinery after various hot discussions
- added a new test tst-dlmem-extfns that demo-implements dlopen_with_offset4()
  and fdlopen()
- studied and documented all limitations, most importantly those leading to UB
- better documented premap callback as suggested by Szabolcs Nagy
- added DLMEM_GENBUF_SRC flag for unaligned generic memory buffers

As can be seen, ALL comments were addressed.
And at the end of the day it doesn't even
matter if that "elf parsing attack" was
malicious or not. The main thing is that
the problem is not there in v10, so who cares
if it ever existed before. :) It motivated
me to study every corner case where my loader
actually failed to lay out elf segments properly,
and as a result, there is a much better
API description (attached here), a "Limitations"
section and a new flag DLMEM_GENBUF_SRC. These
are all measures against any possible failure
to lay out elf segments. So it can be firmly
said that v10 has no such problem, and so,
the comment was properly addressed and resolved.

Thanks!

Comment 34 Stas Sergeev 2023-04-03 09:28:55 UTC

The URL to the v10:
https://sourceware.org/pipermail/libc-alpha/2023-April/146866.html

Comment 35 Stas Sergeev 2023-04-14 19:09:55 UTC

Created attachment 14827 [details]
demo diff

I am putting the new dlmem() demonstration
here, because unfortunately the onslaught
continues:
https://sourceware.org/pipermail/libc-alpha/2023-April/147254.html

Demo shows this:

$ cat tst-dlmem-extfns.out 

before dlmem
7f5210ca8000-7f5210cad000 r--p 00000000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so

after dlmem
7f5210ca3000-7f5210ca4000 r--p 00000000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca4000-7f5210ca5000 r-xp 00001000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca5000-7f5210ca6000 r--p 00002000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca6000-7f5210ca7000 r--p 00002000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca7000-7f5210ca8000 rw-p 00003000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca8000-7f5210cad000 r--p 00000000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so

post fdlopen
7f5210ca3000-7f5210ca4000 r--p 00000000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca4000-7f5210ca5000 r-xp 00001000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca5000-7f5210ca6000 r--p 00002000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca6000-7f5210ca7000 r--p 00002000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so
7f5210ca7000-7f5210ca8000 rw-p 00003000 00:29 18840304                   /home/stas/src/glibc-dlmem/build/dlfcn/glreflib1.so


When nothing can be changed, at least the
truth must be made as visible as possible.

Comment 36 Stas Sergeev 2023-04-14 19:12:13 UTC

This demo clearly shows the elf loading
process of dlmem(), and of course without
any "elf parsing" on a user side.

Comment 37 Stas Sergeev 2023-05-08 14:51:08 UTC

Created attachment 14867 [details]
patches

RTLD_NORELOCATE api is a proposal that adds a fine-grained control
over the solib dynamic-load process. It allows the user to load the
solib to the particular address he needs, using the mapping type he
needs. The basic idea is that after loading the solib with RTLD_NORELOCATE
flag, the user can move an unrelocated object before relocating it.

The API consists of the following elements:

`RTLD_NORELOCATE' - new dlopen() flag.
It defers the relocation of an object, allowing the relocation
to be performed later. Ctors are delayed, and are called immediately
after the relocation is done.
Relocation is performed upon the first dlsym() or dlrelocate()
call with the obtained handle. This flag doesn't delay the
loading of an object's deps, but their relocation and ctors are
delayed. This flag doesn't delay the LA_ACT_CONSISTENT audit event.


`int dlrelocate(void *handle)' - new function to perform the
object relocation if the RTLD_NORELOCATE flag was used. The object
itself and all of its dependencies are relocated.
Returns EINVAL if already relocated. This function may be omitted
even if RTLD_NORELOCATE was used, in which case the relocation will
be performed upon the first dlsym() call with the obtained handle,
but using the dlrelocate() function allows handling relocation errors
and running ctors before using the object's handle. If the function
returns success, then the ctors of the object and of all its deps were
called by it.
If it returns an error other than EINVAL (EINVAL means the object is
already relocated), then a relocation error happened and the
handle should be closed with dlclose().


`RTLD_DI_MAPINFO' - new dlinfo() request that fills in this structure:
typedef struct
{
  void *map_start;		/* Beginning of mapping containing address.  */
  size_t map_length;		/* Length of mapping.  */
  size_t map_align;		/* Alignment of mapping.  */
  int relocated;		/* Indicates whether an object was relocated. */
} Dl_mapinfo;

The user has to check the `relocated' member, and if it is 0
then the object can be moved to the new location. The new location
must be aligned according to the `map_align' member, which is
usually equal to the page size. One way to move a solib image is to
use mmap() for allocating a new memory mapping, then use memcpy()
to copy an image, and finally use munmap() to unmap the memory space
at an old location.
This request may fail if the used handle was not obtained from dlopen().
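
The mmap()/memcpy()/munmap() sequence described above can be sketched
roughly as follows. This is only an illustration and not code from the
patches: move_image() is a hypothetical helper, and it assumes the whole
old image is readable and that the alignment is a power of two no
smaller than the page size. A real image would also need its
per-segment protections taken care of, which this sketch does not do.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not part of the proposed API): copy `length'
   bytes from `old_start' into a fresh anonymous mapping aligned to
   `align', then unmap the old location.
   Returns the new start address, or MAP_FAILED on error.  */
static void *
move_image (void *old_start, size_t length, size_t align)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  assert (align >= page && (align & (align - 1)) == 0);

  /* Mappings are page-granular, so round the length up.  */
  size_t len = (length + page - 1) & ~(page - 1);

  /* Over-allocate so that an aligned start always fits inside.  */
  size_t span = len + align;
  char *raw = mmap (NULL, span, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED)
    return MAP_FAILED;

  char *dst = (char *) (((uintptr_t) raw + align - 1)
                        & ~((uintptr_t) align - 1));

  /* Give back the unused head and tail of the over-allocation.  */
  if (dst > raw)
    munmap (raw, (size_t) (dst - raw));
  size_t tail = (size_t) ((raw + span) - (dst + len));
  if (tail > 0)
    munmap (dst + len, tail);

  /* Copy the image and release the old location.  */
  memcpy (dst, old_start, length);
  munmap (old_start, len);
  return dst;
}
```

With the proposed API, the arguments would presumably come from the
`map_start', `map_length' and `map_align' members of Dl_mapinfo, and the
returned address would then be passed to dlset_object_base().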


`int dlset_object_base(void *handle, void *addr)' - new function to
set the new base address of an unrelocated object after it was moved.
Returns an error if the object is already relocated. The base address
set by this function will be used when relocation is performed.


`RTLD_DI_DEPLIST' is a new dlinfo() request that fills in this structure:
typedef struct
{
  void **deps;			/* Array of handles for the deps.  */
  unsigned int ndeps;		/* Number of entries in the list.  */
} Dl_deplist;

It is needed if the user also wants to move the dependencies of the
loaded solib. In this case he needs to traverse the `deps' array,
make an RTLD_DI_MAPINFO dlinfo() request for each handle from the array,
find the object he needs by inspecting the filled-in Dl_mapinfo structure,
make sure this object is not relocated yet, and move it, calling
dlset_object_base() at the end.


Use-case.

Suppose you have a VM that runs 32bit code. Suppose you wrote a
compatibility layer that allows compiling the old 32bit non-unix code
under linux into native 64bit shared libraries. But compiling is
not enough and some calls should still go to the VM. The VM's memory is
available in a 4Gb window somewhere in the 64bit space. In order for the
code under the VM to handle the calls from a 64bit solib, you need to make
sure all pointers that may be passed as call arguments are within 32 bits.
Heap and stack are dealt with by a custom libc, but in order to use
pointers to .bss objects, we need to relocate the solib to a low 32bit
address. But that's not enough, because in order for that lib to be
visible to the code under the VM, it must also be mirrored to the VM window
under the map_address = reloc_address+VM_window_start.

RTLD_NORELOCATE solves that problem by allowing the user to mmap the
shared memory into the low 32bit address space and move an object there.
He may want to do so for all the library deps as well (using RTLD_DI_DEPLIST),
or only with the ones he is interested in. Then he maps the shared memory
into the VM window and either calls dlrelocate() or just starts using the
solib, in which case it will be relocated on the first symbol lookup.
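
The mirroring part can be illustrated with plain Linux calls, independent
of the proposed API. The sketch below is an assumption-laden example, not
code from the patches: map_mirrored() is a hypothetical helper that backs
the memory with memfd_create() and lets the kernel choose both addresses.
In the real use-case the first view would be requested at a low 32bit
address and the second one at that address plus VM_window_start.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `length' bytes of shared memory and map
   it twice, so that writes through one view are visible in the other.
   Returns 0 on success and -1 on failure.  */
static int
map_mirrored (size_t length, void **low_view, void **window_view)
{
  int fd = memfd_create ("solib-mirror", 0);
  if (fd < 0)
    return -1;
  if (ftruncate (fd, (off_t) length) != 0)
    {
      close (fd);
      return -1;
    }

  /* Two MAP_SHARED views of the same backing object.  */
  void *a = mmap (NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  void *b = mmap (NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close (fd);	/* The mappings keep the memory alive.  */

  if (a == MAP_FAILED || b == MAP_FAILED)
    {
      if (a != MAP_FAILED)
        munmap (a, length);
      if (b != MAP_FAILED)
        munmap (b, length);
      return -1;
    }

  *low_view = a;
  *window_view = b;
  return 0;
}
```

An object copied into the first view is then visible through the second
view at the same offset, which is the property the VM window relies on.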

Stas Sergeev (14):
  elf: switch _dl_map_segment() to anonymous mapping
  use initial mmap also for ET_EXEC
  rework maphole
  split do_reloc_1() from dl_open_worker_begin()
  split do_reloc_2() out of do_open_worker()
  move relocation into _dl_object_reloc() func
  split out _dl_finalize_segments()
  finalize elf segments on a relocation step
  implement RTLD_NORELOCATE flag
  add test-case for RTLD_NORELOCATE
  implement dlrelocate()
  implement RTLD_DI_MAPINFO
  implement dlset_object_base()
  implement RTLD_DI_DEPLIST

 bits/dlfcn.h                                  |   3 +
 dlfcn/Makefile                                |  11 +-
 dlfcn/Versions                                |   4 +
 dlfcn/ctorlib1.c                              |  39 ++
 dlfcn/dlfcn.h                                 |  34 +-
 dlfcn/dlinfo.c                                |  28 ++
 dlfcn/dlopen.c                                |   2 +-
 dlfcn/dlrelocate.c                            |  68 +++
 dlfcn/dlset_object_base.c                     | 124 ++++++
 dlfcn/tst-noreloc.c                           | 157 +++++++
 elf/dl-close.c                                |   3 +
 elf/dl-load.c                                 |  35 +-
 elf/dl-load.h                                 |   8 +-
 elf/dl-lookup.c                               |   6 +-
 elf/dl-main.h                                 |   2 +
 elf/dl-map-segments.h                         | 169 +++++---
 elf/dl-open.c                                 | 386 +++++++++++-------
 elf/rtld.c                                    |   1 +
 include/dlfcn.h                               |  11 +
 include/link.h                                |   6 +
 sysdeps/generic/ldsodefs.h                    |   1 +
 sysdeps/mach/hurd/i386/libc.abilist           |   2 +
 sysdeps/unix/sysv/linux/aarch64/libc.abilist  |   2 +
 sysdeps/unix/sysv/linux/alpha/libc.abilist    |   2 +
 sysdeps/unix/sysv/linux/arc/libc.abilist      |   2 +
 sysdeps/unix/sysv/linux/arm/be/libc.abilist   |   2 +
 sysdeps/unix/sysv/linux/arm/le/libc.abilist   |   2 +
 sysdeps/unix/sysv/linux/csky/libc.abilist     |   2 +
 sysdeps/unix/sysv/linux/hppa/libc.abilist     |   2 +
 sysdeps/unix/sysv/linux/i386/libc.abilist     |   2 +
 sysdeps/unix/sysv/linux/ia64/libc.abilist     |   2 +
 .../sysv/linux/loongarch/lp64/libc.abilist    |   2 +
 .../sysv/linux/m68k/coldfire/libc.abilist     |   2 +
 .../unix/sysv/linux/m68k/m680x0/libc.abilist  |   2 +
 .../sysv/linux/microblaze/be/libc.abilist     |   2 +
 .../sysv/linux/microblaze/le/libc.abilist     |   2 +
 .../sysv/linux/mips/mips32/fpu/libc.abilist   |   2 +
 .../sysv/linux/mips/mips32/nofpu/libc.abilist |   2 +
 .../sysv/linux/mips/mips64/n32/libc.abilist   |   2 +
 .../sysv/linux/mips/mips64/n64/libc.abilist   |   2 +
 sysdeps/unix/sysv/linux/nios2/libc.abilist    |   2 +
 sysdeps/unix/sysv/linux/or1k/libc.abilist     |   2 +
 .../linux/powerpc/powerpc32/fpu/libc.abilist  |   2 +
 .../powerpc/powerpc32/nofpu/libc.abilist      |   2 +
 .../linux/powerpc/powerpc64/be/libc.abilist   |   2 +
 .../linux/powerpc/powerpc64/le/libc.abilist   |   2 +
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   2 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   2 +
 .../unix/sysv/linux/s390/s390-32/libc.abilist |   2 +
 .../unix/sysv/linux/s390/s390-64/libc.abilist |   2 +
 sysdeps/unix/sysv/linux/sh/be/libc.abilist    |   2 +
 sysdeps/unix/sysv/linux/sh/le/libc.abilist    |   2 +
 .../sysv/linux/sparc/sparc32/libc.abilist     |   2 +
 .../sysv/linux/sparc/sparc64/libc.abilist     |   2 +
 .../unix/sysv/linux/x86_64/64/libc.abilist    |   2 +
 .../unix/sysv/linux/x86_64/x32/libc.abilist   |   2 +
 56 files changed, 941 insertions(+), 227 deletions(-)
 create mode 100644 dlfcn/ctorlib1.c
 create mode 100644 dlfcn/dlrelocate.c
 create mode 100644 dlfcn/dlset_object_base.c
 create mode 100644 dlfcn/tst-noreloc.c

-- 
2.39.2