Bug 31076 - Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size
Summary: Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-18 18:48 UTC by Juan Yescas
Modified: 2024-06-23 18:24 UTC (History)
9 users

See Also:
Host:
Target:
Build:
Last reconfirmed: 2023-11-21 00:00:00
fweimer: security-


Attachments
Contains the C files, headers and makefile to reproduce the issue. (1.74 KB, application/zip)
2023-11-18 18:48 UTC, Juan Yescas

Description Juan Yescas 2023-11-18 18:48:04 UTC
Created attachment 15224 [details]
Contains the C files, headers and makefile to reproduce the issue.

# The story of the extra struct vm_area_struct and the loader

## Problem

When the page size is `4096` and shared libraries and binaries are
`16384/65536` ELF aligned with the flag `-Wl,-z,max-page-size=[16384|65536]`,
the dynamic linker (loader) creates an extra vm_area_struct for each segment.

This happens because the loader allocates a memory area big enough to
map the shared library and then calls `mprotect()` on every segment at
page-size boundaries instead of at a `p_align` boundary.

The size of `struct vm_area_struct` is 152 bytes in the `Linux 6.3.13` kernel. The increase in the number of vm_area_structs is an issue for low-memory devices (mobile devices, etc.). We have seen an increase of 30 MB of unreclaimable kernel memory on those devices due to the extra vm_area_structs.

## Linux and gcc details

```
 $ lsb_release  -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal

 $ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
```

## How to reproduce it (See attachments)

In a `4096` page size kernel, compile a shared library with 16384 or 65536 ELF
alignment.

In the `sample` directory, there are these files:

```
 $ tree sample/
sample/
├── arithmetic.c
├── arithmetic.h
├── main.c
└── Makefile
```

To compile the shared library and main, run:

```
$ cd sample

# Build shared library and main
$ make build
```

There are two possible ways to load and execute a program:

1. Ask the Linux kernel to load the program and the dynamic linker; once the
dynamic linker is loaded, it loads the shared libraries, in this case
`libarithmetic.so`.

```
$ make run
```

2. Ask the kernel to load the dynamic linker; the dynamic linker then loads
the program and the shared libraries, in this case `libarithmetic.so`.

```
$ make runloader
```

### Linux kernel loads the program and dynamic linker

When the `Linux kernel` loads the program and the dynamic linker, and the
dynamic linker loads the shared libraries, we can see that an extra
vm_area_struct with `---p` permissions is used for the shared libraries that
are 16384/65536 ELF aligned.

```
 $ make run
LD_LIBRARY_PATH=./build ./build/main

# In another console, run
 $ cat /proc/1717808/maps
564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504        /sample/build/main
564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504        /sample/build/main
564d8e92f000-564d8e930000 r--p 00020000 08:05 2519504        /sample/build/main
564d8e93f000-564d8e940000 r--p 00020000 08:05 2519504        /sample/build/main
564d8e940000-564d8e941000 rw-p 00021000 08:05 2519504        /sample/build/main
564d9055d000-564d9057e000 rw-p 00000000 00:00 0              [heap]
7ff6417ba000-7ff6417bd000 rw-p 00000000 00:00 0
7ff6417bd000-7ff6417df000 r--p 00000000 08:05 2623619        /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6417df000-7ff641957000 r-xp 00022000 08:05 2623619        /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff641957000-7ff6419a5000 r--p 0019a000 08:05 2623619        /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6419a5000-7ff6419a9000 r--p 001e7000 08:05 2623619        /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6419a9000-7ff6419ab000 rw-p 001eb000 08:05 2623619        /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ff6419ab000-7ff6419af000 rw-p 00000000 00:00 0
7ff6419c0000-7ff6419c1000 r--p 00000000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419c1000-7ff6419d0000 ---p 00001000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419d0000-7ff6419d1000 r-xp 00010000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419d1000-7ff6419e0000 ---p 00011000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419e0000-7ff6419e1000 r--p 00020000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419e1000-7ff6419f0000 ---p 00021000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419f0000-7ff6419f1000 r--p 00020000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419f1000-7ff6419f2000 rw-p 00021000 08:05 2519503        /sample/build/libarithmetic.so
7ff6419f2000-7ff6419f4000 rw-p 00000000 00:00 0
7ff6419f4000-7ff6419f5000 r--p 00000000 08:05 2623613        /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff6419f5000-7ff641a18000 r-xp 00001000 08:05 2623613        /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a18000-7ff641a20000 r--p 00024000 08:05 2623613        /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a21000-7ff641a22000 r--p 0002c000 08:05 2623613        /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a22000-7ff641a23000 rw-p 0002d000 08:05 2623613        /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ff641a23000-7ff641a24000 rw-p 00000000 00:00 0
7ffdfd748000-7ffdfd769000 rw-p 00000000 00:00 0              [stack]
7ffdfd77c000-7ffdfd77f000 r--p 00000000 00:00 0              [vvar]
7ffdfd77f000-7ffdfd780000 r-xp 00000000 00:00 0              [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0      [vsyscall]
```

> Note that `main` has no vm_area_struct with `---p` permissions, because
> the Linux kernel loads the program.

### Linux kernel ONLY loads the dynamic linker

When the Linux kernel `ONLY` loads the dynamic linker, and the dynamic linker
loads the shared libraries and the program, we can see that an extra
vm_area_struct with `---p` permissions is used for the shared libraries and the
program that are 16384/65536 ELF aligned.

```
 $ make runloader
LD_LIBRARY_PATH=./build /lib64/ld-linux-x86-64.so.2 ./build/main

# In another console, run
 $ cat /proc/1711495/maps
555557048000-555557069000 rw-p 00000000 00:00 0             [heap]
7f58a13c0000-7f58a13c3000 rw-p 00000000 00:00 0
7f58a13c3000-7f58a13e5000 r--p 00000000 08:05 2623619       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a13e5000-7f58a155d000 r-xp 00022000 08:05 2623619       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a155d000-7f58a15ab000 r--p 0019a000 08:05 2623619       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a15ab000-7f58a15af000 r--p 001e7000 08:05 2623619       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a15af000-7f58a15b1000 rw-p 001eb000 08:05 2623619       /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f58a15b1000-7f58a15b5000 rw-p 00000000 00:00 0
7f58a15c6000-7f58a15c7000 r--p 00000000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15c7000-7f58a15d6000 ---p 00001000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15d6000-7f58a15d7000 r-xp 00010000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15d7000-7f58a15e6000 ---p 00011000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15e6000-7f58a15e7000 r--p 00020000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15e7000-7f58a15f6000 ---p 00021000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15f6000-7f58a15f7000 r--p 00020000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15f7000-7f58a15f8000 rw-p 00021000 08:05 2519503       /sample/build/libarithmetic.so
7f58a15f8000-7f58a15fa000 rw-p 00000000 00:00 0
7f58a15fa000-7f58a15fb000 r--p 00000000 08:05 2519504       /sample/build/main
7f58a15fb000-7f58a160a000 ---p 00001000 08:05 2519504       /sample/build/main
7f58a160a000-7f58a160b000 r-xp 00010000 08:05 2519504       /sample/build/main
7f58a160b000-7f58a161a000 ---p 00011000 08:05 2519504       /sample/build/main
7f58a161a000-7f58a161b000 r--p 00020000 08:05 2519504       /sample/build/main
7f58a161b000-7f58a162a000 ---p 00021000 08:05 2519504       /sample/build/main
7f58a162a000-7f58a162b000 r--p 00020000 08:05 2519504       /sample/build/main
7f58a162b000-7f58a162c000 rw-p 00021000 08:05 2519504       /sample/build/main
7f58a162c000-7f58a162d000 r--p 00000000 08:05 2623613       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a162d000-7f58a1650000 r-xp 00001000 08:05 2623613       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a1650000-7f58a1658000 r--p 00024000 08:05 2623613       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a1659000-7f58a165a000 r--p 0002c000 08:05 2623613       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a165a000-7f58a165b000 rw-p 0002d000 08:05 2623613       /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f58a165b000-7f58a165c000 rw-p 00000000 00:00 0
7ffe19e5c000-7ffe19e7d000 rw-p 00000000 00:00 0             [stack]
7ffe19f7e000-7ffe19f81000 r--p 00000000 00:00 0             [vvar]
7ffe19f81000-7ffe19f82000 r-xp 00000000 00:00 0             [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0     [vsyscall]
```

### Linux kernel page size is 4096 and the elf alignment is 4096

When a shared library and executable are linked with the flag `-Wl,-z,max-page-size=4096`
and loaded by the dynamic linker, we can see that there is `NOT` an extra
vm_area_struct with `---p` permissions.

```
 $ make runloader
LD_LIBRARY_PATH=./build /lib64/ld-linux-x86-64.so.2 ./build/main

# In another console, run
 $ cat /proc/1721958/maps
555556cef000-555556d10000 rw-p 00000000 00:00 0            [heap]
7f70bb246000-7f70bb249000 rw-p 00000000 00:00 0
7f70bb249000-7f70bb26b000 r--p 00000000 08:05 2623619      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb26b000-7f70bb3e3000 r-xp 00022000 08:05 2623619      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb3e3000-7f70bb431000 r--p 0019a000 08:05 2623619      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb431000-7f70bb435000 r--p 001e7000 08:05 2623619      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb435000-7f70bb437000 rw-p 001eb000 08:05 2623619      /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f70bb437000-7f70bb43b000 rw-p 00000000 00:00 0
7f70bb44c000-7f70bb44d000 r--p 00000000 08:05 2519503      /sample/build/libarithmetic.so
7f70bb44d000-7f70bb44e000 r-xp 00001000 08:05 2519503      /sample/build/libarithmetic.so
7f70bb44e000-7f70bb44f000 r--p 00002000 08:05 2519503      /sample/build/libarithmetic.so
7f70bb44f000-7f70bb450000 r--p 00002000 08:05 2519503      /sample/build/libarithmetic.so
7f70bb450000-7f70bb451000 rw-p 00003000 08:05 2519503      /sample/build/libarithmetic.so
7f70bb451000-7f70bb453000 rw-p 00000000 00:00 0
7f70bb453000-7f70bb454000 r--p 00000000 08:05 2519504      /sample/build/main
7f70bb454000-7f70bb455000 r-xp 00001000 08:05 2519504      /sample/build/main
7f70bb455000-7f70bb456000 r--p 00002000 08:05 2519504      /sample/build/main
7f70bb456000-7f70bb457000 r--p 00002000 08:05 2519504      /sample/build/main
7f70bb457000-7f70bb458000 rw-p 00003000 08:05 2519504      /sample/build/main
7f70bb458000-7f70bb459000 r--p 00000000 08:05 2623613      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb459000-7f70bb47c000 r-xp 00001000 08:05 2623613      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb47c000-7f70bb484000 r--p 00024000 08:05 2623613      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb485000-7f70bb486000 r--p 0002c000 08:05 2623613      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb486000-7f70bb487000 rw-p 0002d000 08:05 2623613      /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f70bb487000-7f70bb488000 rw-p 00000000 00:00 0
7ffe42e67000-7ffe42e88000 rw-p 00000000 00:00 0            [stack]
7ffe42ebe000-7ffe42ec1000 r--p 00000000 00:00 0            [vvar]
7ffe42ec1000-7ffe42ec2000 r-xp 00000000 00:00 0            [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0    [vsyscall]
```

## Solution 1 - Modify the dynamic linker

When the dynamic linker loads the shared libraries, extend each segment's
`vm_area_struct` to a `p_align` boundary. This is done by passing the right
address-space ranges to `mprotect()`.

## Solution 2 - Modify the static linker

When the static linker creates the program segments, make sure that
`p_filesz` and `p_memsz` are extended to a `p_align` boundary. This can be done by inserting into every PT_LOAD segment a new ELF section that serves as padding.
Comment 1 Juan Yescas 2023-11-18 18:50:36 UTC
As per my conversation with Carlos O'Donell, he or Siddhesh Poyarekar can take a look at the issue.
Comment 2 Carlos O'Donell 2023-11-21 13:42:17 UTC
Juan, thank you for filing this issue! It was great to meet you at LPC 2023 and discuss the problem. I'll have a look at the details and think about possible solutions. I'm not sure yet whether this is something the loader can do or whether the static linker has to arrange the PT_LOAD segments in the right way.
Comment 3 Juan Yescas 2023-11-22 18:19:38 UTC
Thanks Carlos, it was nice to meet you too. I am looking forward to collaborating with you.
Comment 4 Adhemerval Zanella 2023-11-24 17:40:07 UTC
In ancient times, ld.so used to unmap the disjoint regions from PT_LOAD.  That was changed decades ago (22930c9bf21ea15d0da1477a379029e2de259b69), most likely due to some kernel VMA limitation, since it was done between the Linux 1.3 and 2.0 releases.  Changing back to unmapping should remove the extra VMA ranges:

diff --git a/elf/dl-load.h b/elf/dl-load.h
index 1d5207694b..f53983fd1f 100644
--- a/elf/dl-load.h
+++ b/elf/dl-load.h
@@ -124,6 +124,8 @@ static const char *_dl_map_segments (struct link_map *l, int fd,
    guaranteed to have translations.  */
 #define DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT \
   N_("failed to map segment from shared object")
+#define DL_MAP_SEGMENTS_ERROR_UNMAP_SEGMENT \
+  N_("failed to unmap segment from shared object")
 #define DL_MAP_SEGMENTS_ERROR_MPROTECT \
   N_("cannot change memory protections")
 #define DL_MAP_SEGMENTS_ERROR_MAP_ZERO_FILL \
diff --git a/elf/dl-map-segments.h b/elf/dl-map-segments.h
index ac10182d58..7ecb1d917b 100644
--- a/elf/dl-map-segments.h
+++ b/elf/dl-map-segments.h
@@ -115,11 +115,10 @@ _dl_map_segments (struct link_map *l, int fd,
          if (__glibc_unlikely (loadcmds[nloadcmds - 1].mapstart <
                                c->mapend))
            return N_("ELF load command address/offset not page-aligned");
-          if (__glibc_unlikely
-              (__mprotect ((caddr_t) (l->l_addr + c->mapend),
-                           loadcmds[nloadcmds - 1].mapstart - c->mapend,
-                           PROT_NONE) < 0))
-            return DL_MAP_SEGMENTS_ERROR_MPROTECT;
+         if (__glibc_unlikely
+             (__munmap ((caddr_t) (l->l_addr + c->mapend),
+                        loadcmds[nloadcmds - 1].mapstart - c->mapend) < 0))
+            return DL_MAP_SEGMENTS_ERROR_UNMAP_SEGMENT;
         }

       l->l_contiguous = 1;

It would be somewhat more costly, since the kernel will need to remap and split the VMA range.

However, it is not really clear to me what the advantage of mprotect vs. munmap is in the scenario where the PT_LOAD segments might have 'holes' (essentially where the system page size is lower than the required alignment).  I can understand the munmap, since it might have made sense back when 32-bit deployments were common and VMA range was scarce in some situations (especially due to fragmentation).

But the mprotect does not really bring any advantage here: it consumes more kernel metadata, does not improve the process's VMA range usage, and was not done as a hardening scheme.
Comment 5 Florian Weimer 2023-11-27 15:11:07 UTC
I don't see how a 152 byte struct results in 30 MB of unreclaimable kernel memory.  Wouldn't that need ~200,000 instances? That seems really large. I've only got ~70,000 lines in /proc/*/maps on this desktop system, and not all these mappings will exhibit this issue.

Would it help if we use MAP_FIXED with PROT_NONE to map over these unused parts? But as far as I understand it, these tails have not been written to, so it shouldn't matter if the underlying memory needs to be preserved by the kernel or not.

Regarding not doing the mprotect altogether, I believe this would result in a loss of functionality. Today, you can use the current behavior to get as few gadgets as possible in the process image on systems with smaller page sizes, while still maintaining run-time compatibility with larger page sizes and avoiding on-disk padding (which would increase file size). If we stop doing the mprotect, then the gadgets would become visible even with smaller page sizes. At least in principle, it should be possible for a link editor to produce objects that do not require tail adjustment because the load segments are usable as-is (but I understand that there are link editor limitations in this area today).
Comment 6 Florian Weimer 2023-11-27 15:22:09 UTC
I forgot to mention that we have assumptions in the code base that shared objects *we* map are contiguous, so Adhemerval's patch isn't correct as-is. (We tolerate non-contiguous mappings created by the kernel, both for the main program and ld.so.)
Comment 7 Adhemerval Zanella 2023-11-27 16:27:23 UTC
(In reply to Florian Weimer from comment #5)
> I don't see how a 152 byte struct results in 30 MB of unreclaimable kernel
> memory.  Wouldn't that need ~200,000 instances? That seems really large.
> I've only got ~70,000 lines in /proc/*/maps on this desktop system, and not
> all these mappings will exhibit this issue.
> 
> Would it help if we use MAP_FIXED with PROT_NONE to map over these unused
> parts? But as far as I understand it, these tails have not been written to,
> so it shouldn't matter if the underlying memory needs to be preserved by the
> kernel or not.
> 
> Regarding not doing the mprotect altogether, I believe this would result in
> a loss of functionality. Today, you can use the current behavior to get as
> few gadgets as possible in the process image on systems with smaller page
> sizes, while still maintaining run-time compatibility with larger page sizes
> and avoiding on-disk padding (which would increase file size). If we stop
> doing the mprotect, then the gadgets would become visible even with smaller
> page sizes. At least in principle, it should be possible for a link editor
> to produce objects that do not require tail adjustment because the load
> segments are usable as-is (but I understand that there are link editor
> limitations in this area today).

So the mprotect is essentially a hardening feature, assuming that the dynamic object padding/holes might contain gadgets.  It still does not happen for the loader and the main program itself, since normally they are mapped by the kernel, which does not do anything with the holes; IMHO it should be up to the static linker to fill the padding with NOP/trap instructions to avoid such issues.

So I am not fully convinced that the mprotect is really helping much here.
Comment 8 Florian Weimer 2023-11-27 17:19:25 UTC
(In reply to Adhemerval Zanella from comment #7)
> So the mprotect is essentially a hardening feature, assuming that the
> dynamic object padding/holes might contain gadgets.  It still does not
> happen for loader and main program itself, since normally they would be
> mapped by the kernel and it does not do anything with the holes,

There's no expectation that these are contiguous in memory, yes. Such an expectation exists for shared objects loaded by glibc.

> and IMHO it should
> up to the static linker to fill the padding with NOP/trap instruction to
> avoid such issues. 

That requires padding to the maximum page size on disk, though.
Comment 9 Adhemerval Zanella 2023-11-27 17:39:48 UTC
(In reply to Florian Weimer from comment #8)
> (In reply to Adhemerval Zanella from comment #7)
> > So the mprotect is essentially a hardening feature, assuming that the
> > dynamic object padding/holes might contain gadgets.  It still does not
> > happen for loader and main program itself, since normally they would be
> > mapped by the kernel and its does do anything with holes,
> 
> There's no expectation that these are contiguous in memory, yes. Such an
> expectation exists for shared objects loaded by glibc.

Right, but my point is that if this is a hardening feature that glibc aims to provide, it does not help while the kernel still does not provide it for the loader (if the loader is built with a different page size) and for the main program itself. Should we push for the kernel to implement similar handling?

And the holes seem to always come from the initial read-only segment, which makes the whole gadget-avoidance argument moot. This would have made sense back when RELRO was not widely used/deployed (even with its current somewhat broken status), but it still does not make much sense to me.

> 
> > and IMHO it should
> > up to the static linker to fill the padding with NOP/trap instruction to
> > avoid such issues. 
> 
> That requires padding to the maximum page size on disk, though.

But this is what binutils does [1], and while it is not optimal, it seems that it would be somewhat hard to fix (at least with Fangrui Song's suggestion).

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=30612
Comment 10 Florian Weimer 2023-11-27 17:45:01 UTC
(In reply to Adhemerval Zanella from comment #9)
> (In reply to Florian Weimer from comment #8)
> > (In reply to Adhemerval Zanella from comment #7)
> > > So the mprotect is essentially a hardening feature, assuming that the
> > > dynamic object padding/holes might contain gadgets.  It still does not
> > > happen for loader and main program itself, since normally they would be
> > > mapped by the kernel and it does not do anything with the holes,
> > 
> > There's no expectation that these are contiguous in memory, yes. Such an
> > expectation exists for shared objects loaded by glibc.
> 
> Right, but my point is if this is a hardening feature that glibc aims to
> provide it does not help if the kernel still does not provide it for the
> loader (if this is built with a different page size) and the main program
> itself. Should we push for kernel to implement a similar handling?

The kernel effectively implements the munmap behavior, I think. It creates unmapped holes, not PROT_NONE reserved holes.
Comment 11 Adhemerval Zanella 2023-11-27 17:58:59 UTC
(In reply to Florian Weimer from comment #10)
> (In reply to Adhemerval Zanella from comment #9)
> > (In reply to Florian Weimer from comment #8)
> > > (In reply to Adhemerval Zanella from comment #7)
> > > > So the mprotect is essentially a hardening feature, assuming that the
> > > > dynamic object padding/holes might contain gadgets.  It still does not
> > > > happen for loader and main program itself, since normally they would be
> > > > mapped by the kernel and it does not do anything with the holes,
> > > 
> > > There's no expectation that these are contiguous in memory, yes. Such an
> > > expectation exists for shared objects loaded by glibc.
> > 
> > Right, but my point is if this is a hardening feature that glibc aims to
> > provide it does not help if the kernel still does not provide it for the
> > loader (if this is built with a different page size) and the main program
> > itself. Should we push for kernel to implement a similar handling?
> 
> The kernel effectively implements the munmap behavior, I think. It creates
> unmapped holes, not PROT_MEM reserved holes.

It does indeed, which makes me realize that _dl_find_object does not work correctly if the loader is built with a -Wl,-z,max-page-size= different from the system page size.  The resulting address layout of tst-dl_find_object will be:

      0x555555554000     0x555555556000     0x2000        0x0  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
      0x555555556000     0x55555555a000     0x4000     0x2000  r-xp   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
      0x55555555a000     0x55555555c000     0x2000     0x6000  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
      0x55555555c000     0x55555555d000     0x1000     0x7000  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
      0x55555555d000     0x55555555e000     0x1000     0x8000  rw-p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/tst-dl_find_object
      0x7ffff7c00000     0x7ffff7c26000    0x26000        0x0  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
      0x7ffff7c26000     0x7ffff7d9a000   0x174000    0x26000  r-xp   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
      0x7ffff7d9a000     0x7ffff7df1000    0x57000   0x19a000  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
      0x7ffff7df1000     0x7ffff7df5000     0x4000   0x1f0000  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
      0x7ffff7df5000     0x7ffff7df7000     0x2000   0x1f4000  rw-p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc.so
      0x7ffff7df7000     0x7ffff7e04000     0xd000        0x0  rw-p
      0x7ffff7f9d000     0x7ffff7f9e000     0x1000        0x0  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
      0x7ffff7fad000     0x7ffff7fd3000    0x26000    0x10000  r-xp   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
      0x7ffff7fdd000     0x7ffff7fe7000     0xa000    0x40000  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
      0x7ffff7fef000     0x7ffff7ff0000     0x1000        0x0  rw-s   /dev/zero (deleted)
      0x7ffff7ff0000     0x7ffff7ff5000     0x5000        0x0  rw-p
      0x7ffff7ff5000     0x7ffff7ff9000     0x4000        0x0  r--p   [vvar]
      0x7ffff7ff9000     0x7ffff7ffb000     0x2000        0x0  r-xp   [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x4e000  r--p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
      0x7ffff7ffd000     0x7ffff7fff000     0x2000    0x50000  rw-p   /home/azanella/Projects/glibc/build/x86_64-linux-gnu/elf/ld.so
      0x7ffffffdd000     0x7ffffffff000    0x22000        0x0  rw-p   [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0  --xp   [vsyscall]
Comment 12 Juan Yescas 2023-11-27 19:47:05 UTC
(In reply to Florian Weimer from comment #5)
> I don't see how a 152 byte struct results in 30 MB of unreclaimable kernel
> memory.  Wouldn't that need ~200,000 instances? That seems really large.
> I've only got ~70,000 lines in /proc/*/maps on this desktop system, and not
> all these mappings will exhibit this issue.
> 

When we compile all the shared libraries in the system with -Wl,-z,max-page-size=65536 and use a 4k page size kernel, then for every PT_LOAD segment (except the last) loaded in memory there will be a vm_area_struct with ---p permissions.

On mobile devices we have seen an increase of 120K vm_area_structs due to -Wl,-z,max-page-size=65536. There used to be 40K vm_area_structs. This is an increase of ~18 MB (120K * 152).
Comment 13 Juan Yescas 2023-11-27 19:55:18 UTC
> > Right, but my point is if this is a hardening feature that glibc aims to
> > provide it does not help if the kernel still does not provide it for the
> > loader (if this is built with a different page size) and the main program
> > itself. Should we push for kernel to implement a similar handling?
> 
> The kernel effectively implements the munmap behavior, I think. It creates
> unmapped holes, not PROT_NONE reserved holes.

That is right. The kernel vm_munmap()s the holes for the program image and the dynamic linker image:

Code that vm_munmap the holes.
https://elixir.bootlin.com/linux/v6.6.2/source/fs/binfmt_elf.c#L397

Code that loads the program image segments.
https://elixir.bootlin.com/linux/v6.6.2/source/fs/binfmt_elf.c#L1165

Code that loads the dynamic linker segments.
https://elixir.bootlin.com/linux/v6.6.2/source/fs/binfmt_elf.c#L397
Comment 14 Ryan Prichard 2023-11-28 08:48:13 UTC
> When we compile all the shared libraries in the system with -Wl,-z,max-page-size=65536 and use 4k page size kernel, for every PT_LOAD segment (except the last) loaded in memory, there will be a vm_area_struct with ---p permissions.

> On mobile devices we have seen an increase of 120K vm_area_structs due to -Wl,-z,max-page-size=65536. There used to be 40K vm_area_structs. This is an increase of ~18 MB (120K * 152).

I would've expected the number of VMAs to nearly double, but it sounds like they quadrupled instead?
Comment 15 Kalesh Singh 2023-11-28 18:59:33 UTC
(In reply to Adhemerval Zanella from comment #9)
> (In reply to Florian Weimer from comment #8)
> > (In reply to Adhemerval Zanella from comment #7)
> > > So the mprotect is essentially a hardening feature, assuming that the
> > > dynamic object padding/holes might contain gadgets.  It still does not
> > > happen for loader and main program itself, since normally they would be
> > > mapped by the kernel and it does not do anything with the holes,
> > 
> > There's no expectation that these are contiguous in memory, yes. Such an
> > expectation exists for shared objects loaded by glibc.
> 
> Right, but my point is if this is a hardening feature that glibc aims to
> provide it does not help if the kernel still does not provide it for the
> loader (if this is built with a different page size) and the main program
> itself. Should we push for kernel to implement a similar handling?
> 
> And the holes seem to be always from the initial read-only segment, which
> makes the whole gadget avoidance argument moot. This would make sense back
> when RELRO was not wildly used/deployed (even with the current somewhat
> broken status), but it still does not make much sense to me.
> 
> > 
> > > and IMHO it should
> > > up to the static linker to fill the padding with NOP/trap instruction to
> > > avoid such issues. 


Hi folks,

I was wondering if this is really a security feature or rather a result of how the original VA space is reserved (split VMAs from the original mapping).

AIUI, if the runtime page size equals the max-page-size, the holes are also mapped in as part of the segment mapping and share the same permissions. Does this mean that on such systems any protection it offers becomes void?

Sorry if these are dumb questions (I'm not too familiar with this area).

Thanks,
Kalesh


> > 
> > That requires padding to the maximum page size on disk, though.
> 
> But this is what binutils does [1], and while not being optimized it seems
> that it would be somewhat hard to fix it (at least with Fangrui Song
> suggestion).
> 
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=30612
Comment 16 Juan Yescas 2023-11-28 23:58:54 UTC
(In reply to Ryan Prichard from comment #14)
> > When we compile all the shared libraries in the system with -Wl,-z,max-page-size=65536 and use 4k page size kernel, for every PT_LOAD segment (except the last) loaded in memory, there will be a vm_area_struct with ---p permissions.
> 
> > On mobile devices we have seen an increase of 120K vm_area_structs due to -Wl,-z,max-page-size=65536. There used to be 40K vm_area_structs. This is an increase of ~18 MB (120K * 152).
> 
> I would've expected the number of VMAs to nearly double, but it sounds like
> they quadrupled instead?


Sorry for the confusion; I was comparing different devices.

These are the results where we can see the increase in extra VMAs when the alignment is max-page-size=65536:

VMA size = 200
Loaded ELF segments = 181806
PROT_NONE (---p) gaps = 124297    <- This is the # of extra vmas
Slab Memory ELF Usage: 34 MB
Slab Memory PROTNONE GAP Usage: 23 MB

The script that was used is below:

```
#!/bin/sh

# objsize (bytes) of vm_area_struct: column 4 of /proc/slabinfo
vma_size=$(grep vm_area_struct /proc/slabinfo | awk '{ print $4 }')
echo "VMA size = $vma_size"

# every mapped segment of a shared object, across all processes
nr_elf_segs=$(cat /proc/*/maps | grep -c '\.so')
echo "Loaded ELF segments = $nr_elf_segs"

# ---p mappings that immediately follow a .so mapping
nr_noneprot=$(cat /proc/*/maps | grep -A1 '\.so' | grep -c -- '---p')
echo "PROT_NONE (---p) gaps = $nr_noneprot"

slab_mem_mb=$(echo "$vma_size * $nr_elf_segs / 1024 / 1024" | bc)
echo "Slab Memory ELF Usage: $slab_mem_mb MB"

slab_mem_mb=$(echo "$vma_size * $nr_noneprot / 1024 / 1024" | bc)
echo "Slab Memory PROTNONE GAP Usage: $slab_mem_mb MB"
```
Comment 17 Fangrui Song 2023-12-02 17:08:12 UTC
Let's refer to the first two maps of /sample/build/main in comment 0 for discussion.

    # In another console, run
     $ cat /proc/1717808/maps
    564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504        /sample/build/main
    564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504        /sample/build/main
    ...

We have three choices for the gap.

1. Do not create a map for the gap (separate mmap using MAP_FIXED, or a large mmap with explicit munmap).

Linux kernel fs/binfmt_elf.c employs this approach. However, there is a risk that an unrelated future mmap may be placed in the gap:

    564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504        /sample/build/main
       ================ an unrelated mmap may be placed in the gap
    564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504        /sample/build/main

If there is no map in between, there is nothing to lose. Accessing the gap will cause SIGSEGV.

Is the potential occurrence of an unrelated mmap considered a compromise in the hardening features?
Personally, I don't think this poses a significant issue. The program does not access the gaps.
We can even guarantee this property for direct access when input relocations to the linker use symbols with in-bounds addends (e.g. when x is defined relative to an input section, we know R_X86_64_PC32(x) must be in-bounds).

However, some programs may expect contiguous maps of a file (such as when glibc link_map::l_contiguous is set to 1).
Does this choice render the program exploitable if an attacker can place a mapping within the gap instead of outside the file?
It seems to me that they could achieve everything with a map outside of the file.

Having said that, the presence of an unrelated map between maps associated with a single file descriptor remains odd, so it's preferable to avoid it if possible (Choice 3 below).

2. Map the gap with the associated fd

This approach is commonly adopted by many dynamic loaders.
It involves a large PROT_READ mmap covering the whole file followed by several mmap instances using MAP_FIXED and the relevant flags.

The gaps may have permissions set as r--p or ---p, resulting in additional instances of the `struct vm_area_struct`.
Within glibc/elf/dl-map-segments.h, the `has_holes` code calls mprotect to change r--p permissions to ---p:

    564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504        /sample/build/main
    564d8e910000-564d8e91f000 ---p 00001000 08:05 2519504        /sample/build/main
    564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504        /sample/build/main

Although ---p might be seen as a hardening measure, personally, I don't believe it significantly impacts the exploitability.
While there might be plenty of gadgets in r-xp sections, reducing gadgets in r--p sections doesn't seem notably useful.
(https://isopenbsdsecu.re/mitigations/rop_removal/)

3. Extend the map end to cover the gap

    564d8e90f000-**564d8e91f000** r--p 00000000 08:05 2519504        /sample/build/main  (the end is extended)
    564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504        /sample/build/main

The loader code can adjust p_filesz and p_memsz when invoking mmap.

Extending the map size to a multiple of the maximum page size also benefits transparent huge pages and userspace hugepage remappers.

Right now Android Bionic is exploring this choice: https://r.android.com/2796855

---

Personally I favor "3. Extend the map end to cover the gap" the most.
I've also pondered whether this falls under the purview of linkers. Such a change seems intrusive and unsightly.
If the linker extends the end of p_memsz to cover the gap, should it also extend p_filesz?

* If it doesn't, we create a PT_LOAD with p_filesz < p_memsz that is not for BSS, which is weird.
* If it does, we have an output file featuring overlapping file offset ranges, which is weird as well.

Moreover, a PT_LOAD whose end isn't backed by a section is unusual. I'm concerned that many binary manipulation tools may not handle this case correctly.
A linker script can intentionally create discontiguous address ranges, and I'm concerned that the linker would need fairly intelligent p_filesz/p_memsz logic to discern such cases.

This feature request seems to be within the realm of loaders, and specific information, such as the runtime page size, is only accessible to loaders.
I believe loaders are better equipped to handle this task.
Comment 18 Florian Weimer 2023-12-06 11:57:13 UTC
(In reply to Kalesh Singh from comment #15)
> AIUI if the runtime-page-size equals the max-page-size, the holes are also
> mapped in as part of the segment mapping and share the same permissions.
> Does this mean that on such systems, any protection it offers becomes void?

That's my understanding. The current behavior gives programmers the *option* to avoid mapping extra code on small-page systems, while maintaining compatibility with large-page systems. The proposed change to over-map to the next load-segment, to fill the gap, would take away that option. Unmapping (to create a gap) has compatibility implications.

Just to clarify, I'm opposed to a glibc change here mainly because it's not necessary to enable the desired behavior (fewer VMAs), and it removes functionality that people might find useful. The link editor merely needs to over-map in the generated LOAD segments, then there won't be any gaps for the loader to process.
Comment 19 Fangrui Song 2023-12-07 05:11:25 UTC
(In reply to Florian Weimer from comment #18)
> (In reply to Kalesh Singh from comment #15)
> > AIUI if the runtime-page-size equals the max-page-size, the holes are also
> > mapped in as part of the segment mapping and share the same permissions.
> > Does this mean that on such systems, any protection it offers becomes void?
> 
> That's my understanding. The current behavior gives programmers the *option*
> to avoid mapping extra code on small-page systems, while maintaining
> compatibility with large-page systems. The proposed change to over-map to
> the next load-segment, to fill the gap, would take away that option.

Can you elaborate what the option (and "functionality" below) is? Since
22930c9bf21ea15d0da1477a379029e2de259b69 (1996) rtld just calls `mprotect` (one
extra `struct vm_area_struct` instance, see `sudo grep vm_area_struct
/proc/slabinfo`), and the user has no option to control this behavior.

> Unmapping (to create a gap) has compatibility implications.

I wonder what the implications are. If they are related to l_contiguous not
being accounted for correctly, the affected user code needs to be fixed.

> Just to clarify, I'm opposed to a glibc change here mainly because it's not
> necessary to enable the desired behavior (fewer VMAs), and it removes
> functionality that people might find useful.

See my question above.

> The link editor merely needs to over-map in the generated LOAD segments, then
> there won't be any gaps for the loader to process.

I have thought about this but I feel that a linker change would be intrusive and unsightly: "I've also pondered whether this falls under the purview of linkers." in comment 17.
Comment 20 Florian Weimer 2023-12-07 09:30:01 UTC
(In reply to Fangrui Song from comment #19)
> (In reply to Florian Weimer from comment #18)
> > (In reply to Kalesh Singh from comment #15)
> > > AIUI if the runtime-page-size equals the max-page-size, the holes are also
> > > mapped in as part of the segment mapping and share the same permissions.
> > > Does this mean that on such systems, any protection it offers becomes void?
> > 
> > That's my understanding. The current behavior gives programmers the *option*
> > to avoid mapping extra code on small-page systems, while maintaining
> > compatibility with large-page systems. The proposed change to over-map to
> > the next load-segment, to fill the gap, would take away that option.
> 
> Can you elaborate what the option (and "functionality" below) is? Since
> 22930c9bf21ea15d0da1477a379029e2de259b69 (1996) rtld just calls `mprotect`
> (one
> extra `struct vm_area_struct` instance, see `sudo grep vm_area_struct
> /proc/slabinfo`), and the user has no option to control this behavior.

The developer can produce a binary with a LOAD segment that does not need that mprotect call.

> > Unmapping (to create a gap) has compatibility implications.
> 
> I wonder the implications are. If they are related to l_contiguous not
> accounted for correctly, the user needs to be fixed.

We do not set l_contiguous for the dynamic loader, and it's an internal variable. The bug we have is that we assume (without checking) that the loader is mapped contiguously, although the kernel does not guarantee that.

We have feedback from downstream that some tools break if the gaps are so large that objects are mapped in the gaps of other objects. This is particularly visible on x86-64 when objects linked with different binutils versions/flags are loaded, because the gaps can be quite large due to 2 MiB load segment alignment in some builds. Other objects do not use that alignment and are much smaller than 2 MiB, and so can fit comfortably into the gaps. I'm worried that if we change the loader behavior to munmap instead of mprotect, we'll retroactively expose these bugs more widely.

We obviously need to fix the l_contiguous assumption for ld.so in glibc, either by making sure that ld.so has contiguous load segments, or assuming in the code that it does not.
Comment 21 Fangrui Song 2023-12-08 03:22:29 UTC
(In reply to Florian Weimer from comment #20)
> (In reply to Fangrui Song from comment #19)
> > (In reply to Florian Weimer from comment #18)
> > > (In reply to Kalesh Singh from comment #15)
> > > > AIUI if the runtime-page-size equals the max-page-size, the holes are also
> > > > mapped in as part of the segment mapping and share the same permissions.
> > > > Does this mean that on such systems, any protection it offers becomes void?
> > > 
> > > That's my understanding. The current behavior gives programmers the *option*
> > > to avoid mapping extra code on small-page systems, while maintaining
> > > compatibility with large-page systems. The proposed change to over-map to
> > > the next load-segment, to fill the gap, would take away that option.
> > 
> > Can you elaborate what the option (and "functionality" below) is? Since
> > 22930c9bf21ea15d0da1477a379029e2de259b69 (1996) rtld just calls `mprotect`
> > (one
> > extra `struct vm_area_struct` instance, see `sudo grep vm_area_struct
> > /proc/slabinfo`), and the user has no option to control this behavior.
> 
> The developer can produce a binary with a LOAD segment that does not need
> that mprotect call.

This is described by "## Solution 2 - Modify the static linker" in comment 1. I find the approach intrusive.
("Moreover, a PT_LOAD whose end isn't backed by a section is unusual." in comment 17.)

A loader alternative is "3. Extend the map end to cover the gap" in comment 17.
Instead of 

    mmap(l->l_addr+c->mapstart, c->mapend - c->mapstart, c->prot, MAP_FIXED|MAP_COPY|MAP_FILE, fd, c->mapoff)

increase the map length

    mmap(l->l_addr+c->mapstart, c[1].mapstart - c->mapstart, c->prot, MAP_FIXED|MAP_COPY|MAP_FILE, fd, c->mapoff)

The existing mprotect in glibc/elf/dl-map-segments.h will not be touched.
Comment 22 Peter Cawley 2024-06-23 18:24:21 UTC
(In reply to Adhemerval Zanella from comment #11)
> It does indeed, which makes me realize that _dl_find_object does not work correctly if the loader is built with -Wl,-z,max-page-size= different than the system one.

FWIW, I've just hit this issue in practice on off-the-shelf Debian on aarch64: /lib/ld-linux-aarch64.so.1 specifies 64K alignment for its segments, but the system page size is 4K, so the kernel leaves gaps when mapping it in. Other shared objects can then get loaded into those gaps (if they are small and specify only 4K alignment for their segments), and then _dl_find_object against those shared objects will return /lib/ld-linux-aarch64.so.1 rather than returning the correct shared object. In turn, this causes _Unwind_Backtrace et. al. to fail to unwind through these shared objects, as _Unwind_Backtrace relies on _dl_find_object to find the unwind data.