Created attachment 15224 [details]
Contains the C files, headers and Makefile to reproduce the issue.

# The story of the extra struct vm_area_struct and the loader

## Problem

When the page size is `4096` and shared libraries and binaries are `16384`/`65536` ELF aligned with the flag `-Wl,-z,max-page-size=[16384|65536]`, the dynamic linker (loader) creates an extra vm_area_struct for each segment. This happens because the loader allocates a memory area big enough to map the shared library and then mprotects every segment at a page size boundary instead of a `p_align` boundary.

The size of `struct vm_area_struct` is 152 bytes in the `Linux 6.3.13` kernel. The increase in the number of vm_area_structs is an issue for low-end memory devices (mobile devices, etc.). We have seen an increase of 30 MB of unreclaimable kernel memory on those devices due to the extra vm_area_structs.

## Linux and gcc details

```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal

$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
```

## How to reproduce it (See attachments)

In a `4096` page size kernel, compile a shared library with 16384 or 65536 ELF alignment. In the `sample` directory, there are these files:

```
$ tree sample/
sample/
├── arithmetic.c
├── arithmetic.h
├── main.c
└── Makefile
```

To compile the shared library and main, run:

```
$ cd sample

# Build shared library and main
$ make build
```

To load and execute the program, there are two possible ways:

1. Ask the Linux kernel to load the program and the dynamic linker; once the dynamic linker is loaded, it loads the shared libraries, in this case `libarithmetic.so`.

```
$ make run
```

2. Ask the kernel to load the dynamic linker; the dynamic linker then loads the program and the shared libraries, in this case `libarithmetic.so`.
```
$ make runloader
```

### Linux kernel loads the program and dynamic linker

When the `Linux kernel` loads the program and the dynamic linker, and the dynamic linker loads the shared libraries, we can see that an extra vm_area_struct with the permissions `---p` is used for the shared libraries that are 16384/65536 ELF aligned.

```
$ make run
LD_LIBRARY_PATH=./build ./build/main

# In another console, run
$ cat /proc/1717808/maps
564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504  /sample/build/main
564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504  /sample/build/main
564d8e92f000-564d8e930000 r--p 00020000 08:05 2519504  /sample/build/main
564d8e93f000-564d8e940000 r--p 00020000 08:05 2519504  /sample/build/main
564d8e940000-564d8e941000 rw-p 00021000 08:05 2519504  /sample/build/main
564d9055d000-564d9057e000 rw-p 00000000 00:00 0        [heap]
7ff6417ba000-7ff6417bd000 rw-p 00000000 00:00 0
7ff6417bd000-7ff6417df000 r--p 00000000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6417df000-7ff641957000 r-xp 00022000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff641957000-7ff6419a5000 r--p 0019a000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6419a5000-7ff6419a9000 r--p 001e7000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6419a9000-7ff6419ab000 rw-p 001eb000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6419ab000-7ff6419af000 rw-p 00000000 00:00 0
7ff6419c0000-7ff6419c1000 r--p 00000000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419c1000-7ff6419d0000 ---p 00001000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419d0000-7ff6419d1000 r-xp 00010000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419d1000-7ff6419e0000 ---p 00011000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419e0000-7ff6419e1000 r--p 00020000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419e1000-7ff6419f0000 ---p 00021000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419f0000-7ff6419f1000 r--p 00020000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419f1000-7ff6419f2000 rw-p 00021000 08:05 2519503  /sample/build/libarithmetic.so
7ff6419f2000-7ff6419f4000 rw-p 00000000 00:00 0
7ff6419f4000-7ff6419f5000 r--p 00000000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff6419f5000-7ff641a18000 r-xp 00001000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a18000-7ff641a20000 r--p 00024000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a21000-7ff641a22000 r--p 0002c000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a22000-7ff641a23000 rw-p 0002d000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a23000-7ff641a24000 rw-p 00000000 00:00 0
7ffdfd748000-7ffdfd769000 rw-p 00000000 00:00 0        [stack]
7ffdfd77c000-7ffdfd77f000 r--p 00000000 00:00 0        [vvar]
7ffdfd77f000-7ffdfd780000 r-xp 00000000 00:00 0        [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0  [vsyscall]
```

> Note that in `main` there is no vm_area_struct with `---p` permissions,
> because the Linux kernel loads the program.

### Linux kernel ONLY loads the dynamic linker

When the Linux kernel `ONLY` loads the dynamic linker, and the dynamic linker loads the shared libraries and the program, we can see that an extra vm_area_struct with the permissions `---p` is used for the shared libraries and the program that are 16384/65536 ELF aligned.
```
$ make runloader
LD_LIBRARY_PATH=./build /lib64/ld-linux-x86-64.so.2 ./build/main

# In another console, run
$ cat /proc/1711495/maps
555557048000-555557069000 rw-p 00000000 00:00 0        [heap]
7f58a13c0000-7f58a13c3000 rw-p 00000000 00:00 0
7f58a13c3000-7f58a13e5000 r--p 00000000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a13e5000-7f58a155d000 r-xp 00022000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a155d000-7f58a15ab000 r--p 0019a000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a15ab000-7f58a15af000 r--p 001e7000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a15af000-7f58a15b1000 rw-p 001eb000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a15b1000-7f58a15b5000 rw-p 00000000 00:00 0
7f58a15c6000-7f58a15c7000 r--p 00000000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15c7000-7f58a15d6000 ---p 00001000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15d6000-7f58a15d7000 r-xp 00010000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15d7000-7f58a15e6000 ---p 00011000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15e6000-7f58a15e7000 r--p 00020000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15e7000-7f58a15f6000 ---p 00021000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15f6000-7f58a15f7000 r--p 00020000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15f7000-7f58a15f8000 rw-p 00021000 08:05 2519503  /sample/build/libarithmetic.so
7f58a15f8000-7f58a15fa000 rw-p 00000000 00:00 0
7f58a15fa000-7f58a15fb000 r--p 00000000 08:05 2519504  /sample/build/main
7f58a15fb000-7f58a160a000 ---p 00001000 08:05 2519504  /sample/build/main
7f58a160a000-7f58a160b000 r-xp 00010000 08:05 2519504  /sample/build/main
7f58a160b000-7f58a161a000 ---p 00011000 08:05 2519504  /sample/build/main
7f58a161a000-7f58a161b000 r--p 00020000 08:05 2519504  /sample/build/main
7f58a161b000-7f58a162a000 ---p 00021000 08:05 2519504  /sample/build/main
7f58a162a000-7f58a162b000 r--p 00020000 08:05 2519504  /sample/build/main
7f58a162b000-7f58a162c000 rw-p 00021000 08:05 2519504  /sample/build/main
7f58a162c000-7f58a162d000 r--p 00000000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a162d000-7f58a1650000 r-xp 00001000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a1650000-7f58a1658000 r--p 00024000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a1659000-7f58a165a000 r--p 0002c000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a165a000-7f58a165b000 rw-p 0002d000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a165b000-7f58a165c000 rw-p 00000000 00:00 0
7ffe19e5c000-7ffe19e7d000 rw-p 00000000 00:00 0        [stack]
7ffe19f7e000-7ffe19f81000 r--p 00000000 00:00 0        [vvar]
7ffe19f81000-7ffe19f82000 r-xp 00000000 00:00 0        [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0  [vsyscall]
```

### Linux kernel page size is 4096 and the ELF alignment is 4096

When a shared library and executable are linked using the flag `-Wl,-z,max-page-size=4096` and loaded by the dynamic linker, we can see that there is `NOT` an extra vm_area_struct with permissions `---p`.
```
$ make runloader
LD_LIBRARY_PATH=./build /lib64/ld-linux-x86-64.so.2 ./build/main

# In another console, run
$ cat /proc/1721958/maps
555556cef000-555556d10000 rw-p 00000000 00:00 0        [heap]
7f70bb246000-7f70bb249000 rw-p 00000000 00:00 0
7f70bb249000-7f70bb26b000 r--p 00000000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb26b000-7f70bb3e3000 r-xp 00022000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb3e3000-7f70bb431000 r--p 0019a000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb431000-7f70bb435000 r--p 001e7000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb435000-7f70bb437000 rw-p 001eb000 08:05 2623619  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb437000-7f70bb43b000 rw-p 00000000 00:00 0
7f70bb44c000-7f70bb44d000 r--p 00000000 08:05 2519503  /sample/build/libarithmetic.so
7f70bb44d000-7f70bb44e000 r-xp 00001000 08:05 2519503  /sample/build/libarithmetic.so
7f70bb44e000-7f70bb44f000 r--p 00002000 08:05 2519503  /sample/build/libarithmetic.so
7f70bb44f000-7f70bb450000 r--p 00002000 08:05 2519503  /sample/build/libarithmetic.so
7f70bb450000-7f70bb451000 rw-p 00003000 08:05 2519503  /sample/build/libarithmetic.so
7f70bb451000-7f70bb453000 rw-p 00000000 00:00 0
7f70bb453000-7f70bb454000 r--p 00000000 08:05 2519504  /sample/build/main
7f70bb454000-7f70bb455000 r-xp 00001000 08:05 2519504  /sample/build/main
7f70bb455000-7f70bb456000 r--p 00002000 08:05 2519504  /sample/build/main
7f70bb456000-7f70bb457000 r--p 00002000 08:05 2519504  /sample/build/main
7f70bb457000-7f70bb458000 rw-p 00003000 08:05 2519504  /sample/build/main
7f70bb458000-7f70bb459000 r--p 00000000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb459000-7f70bb47c000 r-xp 00001000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb47c000-7f70bb484000 r--p 00024000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb485000-7f70bb486000 r--p 0002c000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb486000-7f70bb487000 rw-p 0002d000 08:05 2623613  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb487000-7f70bb488000 rw-p 00000000 00:00 0
7ffe42e67000-7ffe42e88000 rw-p 00000000 00:00 0        [stack]
7ffe42ebe000-7ffe42ec1000 r--p 00000000 00:00 0        [vvar]
7ffe42ec1000-7ffe42ec2000 r-xp 00000000 00:00 0        [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0  [vsyscall]
```

## Solution 1 - Modify the dynamic linker

When the dynamic linker loads the shared libraries, extend the vm_area_struct to a `p_align` boundary. This is done by passing the right address-space ranges to `mprotect()`.

## Solution 2 - Modify the static linker

When the static linker creates the program segments, make sure that `p_filesz` and `p_memsz` are extended to a `p_align` boundary. This can be done by inserting a new ELF section in every PT_LOAD ELF segment that will serve as padding.
As per my conversation with Carlos O'Donell, he or Siddhesh Poyarekar can take a look at the issue.
Juan,

Thank you for filing this issue! It was great to meet you at LPC 2023 and discuss the problem. I'll have a look at the details and think about possible solutions. I'm not sure yet if this is something the loader can do or if the static linker has to arrange the PT_LOAD segments in the right way.
Thanks Carlos, It was nice to meet you too. I am looking forward to collaborating with you.
In ancient times, ld.so used to unmap the disjoint regions from PT_LOAD. It was changed decades ago (22930c9bf21ea15d0da1477a379029e2de259b69), most likely due to some kernel VMA limitation, since it was done between the Linux 1.3 and 2.0 releases. Changing back to unmap should remove the extra VMA ranges:

```
diff --git a/elf/dl-load.h b/elf/dl-load.h
index 1d5207694b..f53983fd1f 100644
--- a/elf/dl-load.h
+++ b/elf/dl-load.h
@@ -124,6 +124,8 @@ static const char *_dl_map_segments (struct link_map *l, int fd,
    guaranteed to have translations.  */
 #define DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT \
   N_("failed to map segment from shared object")
+#define DL_MAP_SEGMENTS_ERROR_UNMAP_SEGMENT \
+  N_("failed to unmap segment from shared object")
 #define DL_MAP_SEGMENTS_ERROR_MPROTECT \
   N_("cannot change memory protections")
 #define DL_MAP_SEGMENTS_ERROR_MAP_ZERO_FILL \
diff --git a/elf/dl-map-segments.h b/elf/dl-map-segments.h
index ac10182d58..7ecb1d917b 100644
--- a/elf/dl-map-segments.h
+++ b/elf/dl-map-segments.h
@@ -115,11 +115,10 @@ _dl_map_segments (struct link_map *l, int fd,
       if (__glibc_unlikely (loadcmds[nloadcmds - 1].mapstart < c->mapend))
         return N_("ELF load command address/offset not page-aligned");
-      if (__glibc_unlikely
-          (__mprotect ((caddr_t) (l->l_addr + c->mapend),
-                       loadcmds[nloadcmds - 1].mapstart - c->mapend,
-                       PROT_NONE) < 0))
-        return DL_MAP_SEGMENTS_ERROR_MPROTECT;
+      if (__glibc_unlikely
+          (__munmap ((caddr_t) (l->l_addr + c->mapend),
+                     loadcmds[nloadcmds - 1].mapstart - c->mapend) < 0))
+        return DL_MAP_SEGMENTS_ERROR_UNMAP_SEGMENT;
     }
   l->l_contiguous = 1;
```

It would be somewhat more costly, since the kernel will need to remap and split the VMA range. However, it is not really clear to me what the advantage of mprotect over munmap is in the scenario where the PT_LOAD might have 'holes' (essentially where the system page size is lower than the required alignment).
I can understand the munmap, since it might have made some sense back when 32-bit deployments were common and VMA range was scarce in some situations (especially due to fragmentation). But the mprotect does not really bring any advantage here: it consumes more kernel metadata, does not improve the process VMA range usage, nor was it done as a hardening scheme.
I don't see how a 152 byte struct results in 30 MB of unreclaimable kernel memory. Wouldn't that need ~200,000 instances? That seems really large. I've only got ~70,000 lines in /proc/*/maps on this desktop system, and not all these mappings will exhibit this issue.

Would it help if we use MAP_FIXED with PROT_NONE to map over these unused parts? But as far as I understand it, these tails have not been written to, so it shouldn't matter if the underlying memory needs to be preserved by the kernel or not.

Regarding not doing the mprotect altogether, I believe this would result in a loss of functionality. Today, you can use the current behavior to get as few gadgets as possible in the process image on systems with smaller page sizes, while still maintaining run-time compatibility with larger page sizes and avoiding on-disk padding (which would increase file size). If we stop doing the mprotect, then the gadgets would become visible even with smaller page sizes.

At least in principle, it should be possible for a link editor to produce objects that do not require tail adjustment because the load segments are usable as-is (but I understand that there are link editor limitations in this area today).
I forgot to mention that we have assumptions in the code base that shared objects *we* map are contiguous, so Adhemerval's patch isn't correct as-is. (We tolerate non-contiguous mappings created by the kernel, both for the main program and ld.so.)
(In reply to Florian Weimer from comment #5)
> I don't see how a 152 byte struct results in 30 MB of unreclaimable kernel
> memory. Wouldn't that need ~200,000 instances? That seems really large.
> I've only got ~70,000 lines in /proc/*/maps on this desktop system, and not
> all these mappings will exhibit this issue.
>
> Would it help if we use MAP_FIXED with PROT_NONE to map over these unused
> parts? But as far as I understand it, these tails have not been written to,
> so it shouldn't matter if the underlying memory needs to be preserved by the
> kernel or not.
>
> Regarding not doing the mprotect altogether, I believe this would result in
> a loss of functionality. Today, you can use the current behavior to get as
> few gadgets as possible in the process image on systems with smaller page
> sizes, while still maintaining run-time compatibility with larger page sizes
> and avoiding on-disk padding (which would increase file size). If we stop
> doing the mprotect, then the gadgets would become visible even with smaller
> page sizes. At least in principle, it should be possible for a link editor
> to produce objects that do not require tail adjustment because the load
> segments are usable as-is (but I understand that there are link editor
> limitations in this area today).

So the mprotect is essentially a hardening feature, assuming that the dynamic object padding/holes might contain gadgets. It still does not happen for the loader and the main program itself, since normally they are mapped by the kernel, which does not do anything with the holes; and IMHO it should be up to the static linker to fill the padding with NOP/trap instructions to avoid such issues. So I am not fully convinced that the mprotect is really helping much here.
(In reply to Adhemerval Zanella from comment #7)
> So the mprotect is essentially a hardening feature, assuming that the
> dynamic object padding/holes might contain gadgets. It still does not
> happen for loader and main program itself, since normally they would be
> mapped by the kernel and its does do anything with holes,

There's no expectation that these are contiguous in memory, yes. Such an expectation exists for shared objects loaded by glibc.

> and IMHO it should
> up to the static linker to fill the padding with NOP/trap instruction to
> avoid such issues.

That requires padding to the maximum page size on disk, though.
(In reply to Florian Weimer from comment #8)
> (In reply to Adhemerval Zanella from comment #7)
> > So the mprotect is essentially a hardening feature, assuming that the
> > dynamic object padding/holes might contain gadgets. It still does not
> > happen for loader and main program itself, since normally they would be
> > mapped by the kernel and its does do anything with holes,
>
> There's no expectation that these are contiguous in memory, yes. Such an
> expectation exists for shared objects loaded by glibc.

Right, but my point is: if this is a hardening feature that glibc aims to provide, it does not help while the kernel still does not provide it for the loader (if this is built with a different page size) and the main program itself. Should we push for the kernel to implement a similar handling?

And the holes seem to always come from the initial read-only segment, which makes the whole gadget-avoidance argument moot. This would make sense back when RELRO was not widely used/deployed (even with its current somewhat broken status), but it still does not make much sense to me.

> > and IMHO it should
> > up to the static linker to fill the padding with NOP/trap instruction to
> > avoid such issues.
>
> That requires padding to the maximum page size on disk, though.

But this is what binutils does [1], and while not optimal it seems that it would be somewhat hard to fix (at least with Fangrui Song's suggestion).

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=30612
(In reply to Adhemerval Zanella from comment #9)
> (In reply to Florian Weimer from comment #8)
> > (In reply to Adhemerval Zanella from comment #7)
> > > So the mprotect is essentially a hardening feature, assuming that the
> > > dynamic object padding/holes might contain gadgets. It still does not
> > > happen for loader and main program itself, since normally they would be
> > > mapped by the kernel and its does do anything with holes,
> >
> > There's no expectation that these are contiguous in memory, yes. Such an
> > expectation exists for shared objects loaded by glibc.
>
> Right, but my point is if this is a hardening feature that glibc aims to
> provide it does not help if the kernel still does not provide it for the
> loader (if this is built with a different page size) and the main program
> itself. Should we push for kernel to implement a similar handling?

The kernel effectively implements the munmap behavior, I think. It creates unmapped holes, not PROT_NONE reserved holes.
(In reply to Florian Weimer from comment #10)
> (In reply to Adhemerval Zanella from comment #9)
> > Right, but my point is if this is a hardening feature that glibc aims to
> > provide it does not help if the kernel still does not provide it for the
> > loader (if this is built with a different page size) and the main program
> > itself. Should we push for kernel to implement a similar handling?
>
> The kernel effectively implements the munmap behavior, I think. It creates
> unmapped holes, not PROT_MEM reserved holes.

It does indeed, which makes me realize that _dl_find_object does not work correctly if the loader is built with a -Wl,-z,max-page-size= different from the system one.
The resulting address space of the tst-dl_find_object will be:

0x555555554000 0x555555556000 0x2000 0x0 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
0x555555556000 0x55555555a000 0x4000 0x2000 r-xp  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
0x55555555a000 0x55555555c000 0x2000 0x6000 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
0x55555555c000 0x55555555d000 0x1000 0x7000 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
0x55555555d000 0x55555555e000 0x1000 0x8000 rw-p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
0x7ffff7c00000 0x7ffff7c26000 0x26000 0x0 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
0x7ffff7c26000 0x7ffff7d9a000 0x174000 0x26000 r-xp  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
0x7ffff7d9a000 0x7ffff7df1000 0x57000 0x19a000 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
0x7ffff7df1000 0x7ffff7df5000 0x4000 0x1f0000 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
0x7ffff7df5000 0x7ffff7df7000 0x2000 0x1f4000 rw-p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
0x7ffff7df7000 0x7ffff7e04000 0xd000 0x0 rw-p
0x7ffff7f9d000 0x7ffff7f9e000 0x1000 0x0 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
0x7ffff7fad000 0x7ffff7fd3000 0x26000 0x10000 r-xp  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
0x7ffff7fdd000 0x7ffff7fe7000 0xa000 0x40000 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
0x7ffff7fef000 0x7ffff7ff0000 0x1000 0x0 rw-s  /dev/zero (deleted)
0x7ffff7ff0000 0x7ffff7ff5000 0x5000 0x0 rw-p
0x7ffff7ff5000 0x7ffff7ff9000 0x4000 0x0 r--p  [vvar]
0x7ffff7ff9000 0x7ffff7ffb000 0x2000 0x0 r-xp  [vdso]
0x7ffff7ffb000 0x7ffff7ffd000 0x2000 0x4e000 r--p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
0x7ffff7ffd000 0x7ffff7fff000 0x2000 0x50000 rw-p  /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
0x7ffffffdd000 0x7ffffffff000 0x22000 0x0 rw-p  [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp  [vsyscall]
(In reply to Florian Weimer from comment #5)
> I don't see how a 152 byte struct results in 30 MB of unreclaimable kernel
> memory. Wouldn't that need ~200,000 instances? That seems really large.
> I've only got ~70,000 lines in /proc/*/maps on this desktop system, and not
> all these mappings will exhibit this issue.

When we compile all the shared libraries in the system with -Wl,-z,max-page-size=65536 and use a 4k page size kernel, for every PT_LOAD segment (except the last) loaded in memory, there will be a vm_area_struct with ---p permissions.

In mobile devices we have seen an increase of 120K vm_area_structs due to -Wl,-z,max-page-size=65536. There used to be 40K vm_area_structs. This is an increase of ~18 MB (120K * 152).
> > Right, but my point is if this is a hardening feature that glibc aims to
> > provide it does not help if the kernel still does not provide it for the
> > loader (if this is built with a different page size) and the main program
> > itself. Should we push for kernel to implement a similar handling?
>
> The kernel effectively implements the munmap behavior, I think. It creates
> unmapped holes, not PROT_MEM reserved holes.

That is right. The kernel vm_munmaps the holes for the program image and the dynamic linker image:

Code that vm_munmaps the holes:
https://elixir.bootlin.com/linux/v6.6.2/source/fs/binfmt_elf.c#L397

Code that loads the program image segments:
https://elixir.bootlin.com/linux/v6.6.2/source/fs/binfmt_elf.c#L1165

Code that loads the dynamic linker segments:
https://elixir.bootlin.com/linux/v6.6.2/source/fs/binfmt_elf.c#L397
> When we compile all the shared libraries in the system with
> -Wl,-z,max-page-size=65536 and use 4k page size kernel, for every PT_LOAD
> segment (except the last) loaded in memory, there will be a vm_area_struct
> with ---p permissions.
>
> In mobile devices we have seen an increased of 120K vm_area_structs due
> -Wl,-z,max-page-size=65536. There used to be 40K vm_area_structs. This is
> an increase of ~18MB (120K * 152).

I would've expected the number of VMAs to nearly double, but it sounds like they quadrupled instead?
(In reply to Adhemerval Zanella from comment #9)
> (In reply to Florian Weimer from comment #8)
> > (In reply to Adhemerval Zanella from comment #7)
> > > So the mprotect is essentially a hardening feature, assuming that the
> > > dynamic object padding/holes might contain gadgets. It still does not
> > > happen for loader and main program itself, since normally they would be
> > > mapped by the kernel and its does do anything with holes,
> >
> > There's no expectation that these are contiguous in memory, yes. Such an
> > expectation exists for shared objects loaded by glibc.
>
> Right, but my point is if this is a hardening feature that glibc aims to
> provide it does not help if the kernel still does not provide it for the
> loader (if this is built with a different page size) and the main program
> itself. Should we push for kernel to implement a similar handling?
>
> And the holes seem to be always from the initial read-only segment, which
> makes the whole gadget avoidance argument moot. This would make sense back
> when RELRO was not wildly used/deployed (even with the current somewhat
> broken status), but it still does not make much sense to me.
>
> > > and IMHO it should
> > > up to the static linker to fill the padding with NOP/trap instruction to
> > > avoid such issues.

Hi folks,

I was wondering if this is really a security feature or rather a result of how the original VA space is reserved (split VMAs from the original mapping).

AIUI if the runtime page size equals the max-page-size, the holes are also mapped in as part of the segment mapping and share the same permissions. Does this mean that on such systems, any protection it offers becomes void?

Sorry if these are dumb questions (I'm not too familiar with this area).

Thanks,
Kalesh

> > That requires padding to the maximum page size on disk, though.
>
> But this is what binutils does [1], and while not being optimized it seems
> that it would be somewhat hard to fix it (at least with Fangrui Song
> suggestion).
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=30612
(In reply to Ryan Prichard from comment #14)
> > When we compile all the shared libraries in the system with -Wl,-z,max-page-size=65536 and use 4k page size kernel, for every PT_LOAD segment (except the last) loaded in memory, there will be a vm_area_struct with ---p permissions.
>
> > In mobile devices we have seen an increased of 120K vm_area_structs due -Wl,-z,max-page-size=65536. There used to be 40K vm_area_structs. This is an increase of ~18MB (120K * 152).
>
> I would've expected the number of VMAs to nearly double, but it sounds like
> they quadrupled instead?

Sorry for the confusion, I was comparing different devices. These are results where we can see the increase in extra VMAs when the alignment is max-page-size=65536:

VMA size = 200
Loaded ELF segments = 181806
PROT_NONE (---p) gaps = 124297  <- This is the # of extra VMAs
Slab Memory ELF Usage: 34 MB
Slab Memory PROTNONE GAP Usage: 23 MB

The script that was used is below:

```
#!/bin/sh

vma_size=$(grep vm_area_struct /proc/slabinfo | awk '{ print $4 }')
echo "VMA size = $vma_size"

nr_elf_segs=$(cat /proc/*/maps | grep '\.so' | wc -l)
echo "Loaded ELF segments = $nr_elf_segs"

nr_noneprot=$(cat /proc/*/maps | grep -A1 '\.so' | grep -- '---p' | wc -l)
echo "PROT_NONE (---p) gaps = $nr_noneprot"

slab_mem_mb=$(echo "$vma_size * $nr_elf_segs / 1024 / 1024" | bc)
echo "Slab Memory ELF Usage: $slab_mem_mb MB"

slab_mem_mb=$(echo "$vma_size * $nr_noneprot / 1024 / 1024" | bc)
echo "Slab Memory PROTNONE GAP Usage: $slab_mem_mb MB"
```
Let's refer to the first two build/map maps in comment 0 for discussion. # In another console, run $ cat /proc/1717808/maps 564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504 /sample/build/main 564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504 /sample/build/main ... We have three choices for the gap. 1. Do not create a map for the gap (separate mmap using MAP_FIXED, or a large mmap with explicit munmap). Linux kernel fs/binfmt_elf.c employs this approach. However, there is a risk that an unrelated future mmap may be placed in the gap: 564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504 /sample/build/main ================ an unrelated mmap may be placed in the gap 564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504 /sample/build/main If there is no map in between, there is nothing lose. Accessing the gap will cause SIGSEGV. Is the potential occurrence of an unrelated mmap considered a compromise in the hardening features? Personally, I don't think this poses a significant issue. The program does not access the gaps. We can even guarantee this property for direct access when input relocations to the linker use symbols with in-bounds addends (e.g. when x is defined relative to an input section, we know R_X86_64_PC32(x) must be in-bounds). However, some programs may expect contiguous maps of a file (such as when glibc link_map::l_contiguous is set to 1). Does this choice render the program exploitable if an attacker can ensure a map within the gap instead of outside the file? It seems to me that they could achieve everything with a map outside of the file. Having said that, the presence of an unrelated map between maps associated with a single file descriptor remains odd, so it's preferable to avoid it if possible (Choice 3 below). 2. Map the gap with the associated fd This approach is commonly adopted by many dynamic loaders. It involves a large PROT_READ mmap covering the whole file followed by several mmap instances using MAP_FIXED and the relevant flags. 
The gaps may have permissions set as r--p or ---p, resulting in additional instances of the `struct vm_area_struct`. Within glibc/elf/dl-map-segments.h, the `has_holes` code calls mprotect to change r--p permissions to ---p: 564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504 /sample/build/main 564d8e910000-564d8e91f000 ---p 00001000 08:05 2519504 /sample/build/main 564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504 /sample/build/main Although ---p might be seen as a hardening measure, personally, I don't believe it significantly impacts the exploitability. While there might be plenty of gadgets in r-xp sections, reducing gadgets in r--p sections doesn't seem notably useful. (https://isopenbsdsecu.re/mitigations/rop_removal/) 3. Extend the map end to cover the gap 564d8e90f000-**564d8e91f000** r--p 00000000 08:05 2519504 /sample/build/main (the end is extended) 564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504 /sample/build/main The loader code can adjust p_filesz and p_memsz when invoking mmap. By extending the map size to a multiple of the maximum page size, it benefits transparent huge pages or a userspace hugepage remapper. Right now Android Bionic is exploring this choice: https://r.android.com/2796855 --- Personally I favor "3. Extend the map end to cover the gap" the most. I've also pondered whether this falls under the purview of linkers. Such a change seems intrusive and unsightly. If the linker extends the end of p_memsz to cover the gap, should it also extend p_filesz? * If it doesn't, we create a PT_LOAD with p_filesz/p_memsz that is not for BSS, which is weird. * If it does, we have an output file featuring overlapping file offset ranges, which is weird as well. Moreover, a PT_LOAD whose end isn't backed by a section is unusual. I'm concerned that many binary manipulation tools may not handle this case correctly. Utilizing a linker script can intentionally create discontiguous address ranges. 
I'm concerned that the linker cannot reliably discern such cases when deciding how to set p_filesz/p_memsz. This feature request seems to be within the realm of loaders: specific information, such as the runtime page size, is only accessible to loaders. I believe loaders are better equipped to handle this task.
(In reply to Kalesh Singh from comment #15)
> AIUI if the runtime-page-size equals the max-page-size, the holes are also
> mapped in as part of the segment mapping and share the same permissions.
> Does this mean that on such systems, any protection it offers becomes void?

That's my understanding. The current behavior gives programmers the *option* to avoid mapping extra code on small-page systems, while maintaining compatibility with large-page systems. The proposed change to over-map to the next load segment, to fill the gap, would take away that option. Unmapping (to create a gap) has compatibility implications.

Just to clarify, I'm opposed to a glibc change here mainly because it's not necessary to enable the desired behavior (fewer VMAs), and it removes functionality that people might find useful. The link editor merely needs to over-map in the generated LOAD segments; then there won't be any gaps for the loader to process.
(In reply to Florian Weimer from comment #18)
> (In reply to Kalesh Singh from comment #15)
> > AIUI if the runtime-page-size equals the max-page-size, the holes are also
> > mapped in as part of the segment mapping and share the same permissions.
> > Does this mean that on such systems, any protection it offers becomes void?
>
> That's my understanding. The current behavior gives programmers the *option*
> to avoid mapping extra code on small-page systems, while maintaining
> compatibility with large-page systems. The proposed change to over-map to
> the next load-segment, to fill the gap, would take away that option.

Can you elaborate on what the option (and "functionality" below) is? Since 22930c9bf21ea15d0da1477a379029e2de259b69 (1996), rtld just calls `mprotect` (one extra `struct vm_area_struct` instance, see `sudo grep vm_area_struct /proc/slabinfo`), and the user has no option to control this behavior.

> Unmapping (to create a gap) has compatibility implications.

I wonder what the implications are. If they are related to l_contiguous not being accounted for correctly, the user code needs to be fixed.

> Just to clarify, I'm opposed to a glibc change here mainly because it's not
> necessary to enable the desired behavior (fewer VMAs), and it removes
> functionality that people might find useful.

See my question above.

> The link editor merely needs to over-map in the generated LOAD segments, then
> there won't be any gaps for the loader to process.

I have thought about this, but I feel that a linker change would be intrusive and unsightly: "I've also pondered whether this falls under the purview of linkers." in comment 17.
(In reply to Fangrui Song from comment #19)
> (In reply to Florian Weimer from comment #18)
> > (In reply to Kalesh Singh from comment #15)
> > > AIUI if the runtime-page-size equals the max-page-size, the holes are also
> > > mapped in as part of the segment mapping and share the same permissions.
> > > Does this mean that on such systems, any protection it offers becomes void?
> >
> > That's my understanding. The current behavior gives programmers the *option*
> > to avoid mapping extra code on small-page systems, while maintaining
> > compatibility with large-page systems. The proposed change to over-map to
> > the next load-segment, to fill the gap, would take away that option.
>
> Can you elaborate what the option (and "functionality" below) is? Since
> 22930c9bf21ea15d0da1477a379029e2de259b69 (1996) rtld just calls `mprotect`
> (one extra `struct vm_area_struct` instance, see `sudo grep vm_area_struct
> /proc/slabinfo`), and the user has no option to control this behavior.

The developer can produce a binary with a LOAD segment that does not need that mprotect call.

> > Unmapping (to create a gap) has compatibility implications.
>
> I wonder what the implications are. If they are related to l_contiguous not
> accounted for correctly, the user needs to be fixed.

We do not set l_contiguous for the dynamic loader, and it's an internal variable. The bug we have is that we assume (without checking) that the loader is mapped contiguously, although the kernel does not do that.

We have feedback from downstream that some tools break if the gaps are so large that objects are mapped in the gaps of other objects. This is particularly visible on x86-64 when objects linked with different binutils versions/flags are loaded, because the gaps can be quite large due to 2 MiB load-segment alignment in some builds. Other objects do not use that alignment and are much smaller than 2 MiB, and so can fit comfortably into the gaps.
I'm worried that if we change loader behavior to munmap instead of mprotect, we'll retroactively expose these bugs more widely.

We obviously need to fix the l_contiguous assumption for ld.so in glibc, either by making sure that ld.so has contiguous load segments, or by assuming in the code that it does not.
(In reply to Florian Weimer from comment #20)
> (In reply to Fangrui Song from comment #19)
> > (In reply to Florian Weimer from comment #18)
> > > (In reply to Kalesh Singh from comment #15)
> > > > AIUI if the runtime-page-size equals the max-page-size, the holes are also
> > > > mapped in as part of the segment mapping and share the same permissions.
> > > > Does this mean that on such systems, any protection it offers becomes void?
> > >
> > > That's my understanding. The current behavior gives programmers the *option*
> > > to avoid mapping extra code on small-page systems, while maintaining
> > > compatibility with large-page systems. The proposed change to over-map to
> > > the next load-segment, to fill the gap, would take away that option.
> >
> > Can you elaborate what the option (and "functionality" below) is? Since
> > 22930c9bf21ea15d0da1477a379029e2de259b69 (1996) rtld just calls `mprotect`
> > (one extra `struct vm_area_struct` instance, see `sudo grep vm_area_struct
> > /proc/slabinfo`), and the user has no option to control this behavior.
>
> The developer can produce a binary with a LOAD segment that does not need
> that mprotect call.

This is described by "## Solution 2 - Modify the static linker" in comment 1. I find the approach intrusive. ("Moreover, a PT_LOAD whose end isn't backed by a section is unusual." in comment 17.)

A loader alternative is "3. Extend the map end to cover the gap" in comment 17. Instead of

```
mmap(l->l_addr+c->mapstart, c->mapend - c->mapstart, c->prot,
     MAP_FIXED|MAP_COPY|MAP_FILE, fd, c->mapoff)
```

increase the map length:

```
mmap(l->l_addr+c->mapstart, c[1].mapstart - c->mapstart, c->prot,
     MAP_FIXED|MAP_COPY|MAP_FILE, fd, c->mapoff)
```

The existing mprotect in glibc/elf/dl-map-segments.h will not be touched.
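The length change proposed above can be sketched as a small helper. This is a hypothetical simplification: `struct loadcmd` here is a stand-in with only the two fields needed, not glibc's real structure from elf/dl-map-segments.h, and `map_length` is an invented name. The idea is that every segment except the last is mapped up to the next segment's `mapstart`, so the gap stays inside the same VMA.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for glibc's struct loadcmd. */
struct loadcmd {
    size_t mapstart, mapend;  /* page-aligned bounds of the segment mapping */
};

/* Length to pass to mmap for segment i of n ("choice 3"): extend every
 * segment but the last up to the next segment's start, covering the gap. */
static size_t map_length(const struct loadcmd *c, size_t i, size_t n) {
    return (i + 1 < n ? c[i + 1].mapstart : c[i].mapend) - c[i].mapstart;
}
```

For example, with a 4 KiB page size and 64 KiB segment alignment, a first segment occupying [0, 0x1000) followed by a second at 0x10000 would be mapped with length 0x10000 instead of 0x1000, so the 0xf000-byte gap never becomes a separate `struct vm_area_struct`.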
(In reply to Adhemerval Zanella from comment #11)
> It does indeed, which makes me realize that _dl_find_object does not work
> correctly if the loader is built with -Wl,-z,max-page-size= different than
> the system one.

FWIW, I've just hit this issue in practice on off-the-shelf Debian on aarch64: /lib/ld-linux-aarch64.so.1 specifies 64K alignment for its segments, but the system page size is 4K, so the kernel leaves gaps when mapping it in. Other shared objects can then get loaded into those gaps (if they are small and specify only 4K alignment for their segments), and then _dl_find_object against those shared objects will return /lib/ld-linux-aarch64.so.1 rather than the correct shared object. In turn, this causes _Unwind_Backtrace et al. to fail to unwind through these shared objects, as _Unwind_Backtrace relies on _dl_find_object to find the unwind data.