Bug 31924 - aarch64 kernels built with binutils 2.42.50.20240618 and later fail to boot
Status: NEW
Alias: None
Product: binutils
Classification: Unclassified
Component: binutils
Version: 2.43 (HEAD)
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2024-06-24 18:11 UTC by Emanuele Rocca
Modified: 2024-07-01 07:11 UTC
CC List: 7 users

Target: aarch64


Description Emanuele Rocca 2024-06-24 18:11:30 UTC
Debian kernels built with binutils 2.42.50.20240618 on aarch64 are smaller than usual (~26M instead of ~31M) and fail to boot. The issue was reported in Debian here: https://bugs.debian.org/1074112

To rule out any possible bad interaction with shim/grub, I've tried booting the kernel from an EFI shell. Right after the EFI stub output (exiting boot services) the system reboots.

I have tried the latest binutils trunk snapshot from today, and that is affected too: https://snapshots.sourceware.org/binutils/trunk/2024-06-24_16-56_1719248161/src/binutils-2.42.50-a6e529673a9.tar.xz

The latest known working version is 2.42.
Comment 1 Sam James 2024-06-24 23:12:52 UTC
If I had to make a complete guess, DT_RELR support, but I'm not convinced.

Can you upload a good and bad kernel?
Comment 2 Sam James 2024-06-24 23:22:59 UTC
There's a lot of information missing here:
* bisect?
* bad / good files? (ideally with purely the binutils version differing between them)
* kernel version
* kernel configuration
* compiler version
Comment 3 Emanuele Rocca 2024-06-26 09:01:47 UTC
(In reply to Sam James from comment #2)
> * bisect?

We don't have one yet, but I'll get back to you when we do

> * bad / good files? (ideally with purely the binutils version differing
> between them)

https://mister-muffin.de/bug1074111/broken/vmlinuz-6.8.12-mnt-reform-arm64
https://mister-muffin.de/bug1074111/working/vmlinuz-6.8.12-mnt-reform-arm64

> * kernel version

6.8.12 in the examples above, but the exact version does not seem to matter. We have seen the issue with 6.9.2 as well.

> * kernel configuration

https://mister-muffin.de/bug1074111/config

> * compiler version

gcc 13.3.0
Comment 4 Szabolcs Nagy 2024-06-26 13:50:34 UTC
i can confirm that boot fails depending on if vmlinux is linked with -z pack-relative-relocs or not, so this is DT_RELR related.

i will try to debug this further.
Comment 5 Emanuele Rocca 2024-06-26 14:59:27 UTC
(In reply to Szabolcs Nagy from comment #4)
> i can confirm that boot fails depending on if vmlinux is linked with -z
> pack-relative-relocs or not, so this is DT_RELR related.
> 
> i will try to debug this further.

Thanks. I've tried building a kernel with CONFIG_RELR explicitly set to 'n', and it boots fine. With it set to 'y', just to double-check, the resulting kernel does not boot.
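
For anyone repeating this check, the following is a minimal sketch of how the CONFIG_RELR setting could be read from a kernel .config. The helper name config_value and the sample text are illustrative assumptions, not part of any kernel tooling; only the usual "CONFIG_X=y" and "# CONFIG_X is not set" line forms are handled.

```python
def config_value(config_text, option):
    """Return the value of a Kconfig option, 'n' if explicitly unset, None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # Kconfig records disabled options as a comment line.
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Hypothetical excerpt of a .config, for illustration only.
sample = """\
CONFIG_ARM64=y
# CONFIG_RELR is not set
CONFIG_RELOCATABLE=y
"""

print("CONFIG_RELR =", config_value(sample, "CONFIG_RELR"))
```

Running the same helper on the build-time generated config of a good and a bad build is a quick way to confirm that only the intended option differs.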
Comment 6 H.J. Lu 2024-06-26 23:52:41 UTC
Does DT_RELR work in aarch64 glibc?
Comment 7 Szabolcs Nagy 2024-06-27 09:27:32 UTC
(In reply to H.J. Lu from comment #6)
> Does DT_RELR work in aarch64 glibc?

yes
Comment 8 Tj 2024-06-27 14:22:26 UTC
I've been working on diagnosis in Debian for the last few days and have collected useful information from building mainline kernels with different binutils versions. I've shared all the info on my workstation at:

http://[2a0d:3344:11e:1ff0::ff]/binutils-aarch64/

See the included "index.txt" for info; which is copied here:

Debian bug 1074111: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074111
binutils bug 31924: https://sourceware.org/bugzilla/show_bug.cgi?id=31924

This directory contains the product of three mainline kernel builds using the Debian
aarch64 config taken from 6.9.2-1~exp and found here as ./config-aarch64 and,
for each kernel, the actual build-time generated config.

builds have these suffixes:

Good: -1 built using binutils v2.40 (from Debian Bookworm)
Bad:  -3 built using binutils v2.42 20240625
Good: -4 built using binutils v2.42 20240625 with 3 aarch64 DT_RELR patches reverted

List of reverted commits in ./binutils-gdb.git-revert.commits.log.txt

Kernel image for each is extracted from the Debian package and then ungzipped,
resulting in files with names ./vmlinuz* (compressed) and ./vmlinux* (uncompressed).

Most visible difference for the Bad kernel is the size:

$ ls -lh vmlinux*
-rw-r--r-- 1 tj tj 37M 2024-06-26 17:48 vmlinux-6.10.0-rc5-1.binutils_2.40_good
-rw-r--r-- 1 tj tj 32M 2024-06-26 17:48 vmlinux-6.10.0-rc5-3.binutils_2.42.20240625_bad
-rw-r--r-- 1 tj tj 37M 2024-06-26 17:44 vmlinux-6.10.0-rc5-4.binutils_2.42.20240625_good_revert_DT_RELR

However, all these files are wrapped in the EFI libstub and there is no extractor to get to the included vmlinux.

For that reason the kernel build product directories containing all build product and 'hidden'
".*.cmd" files (and the original ./vmlinux in their base) are included as:

./linux-aarch64.GOOD/
./linux-aarch64.BAD/

What is interesting here is the sizes of the pre- and post- libstub images:

ls -l linux-aarch64.*/vmlinux linux-aarch64.*/arch/arm64/boot/Image
-rw-r--r-- 1 vu-linux-builder-0 vg-linux-builder-0  32619008 2024-06-26 19:50 linux-aarch64.BAD/arch/arm64/boot/Image
-rwxr-xr-x 1 vu-linux-builder-0 vg-linux-builder-0 367529680 2024-06-26 19:50 linux-aarch64.BAD/vmlinux
-rw-r--r-- 1 vu-linux-builder-0 vg-linux-builder-0  38189568 2024-06-27 12:27 linux-aarch64.GOOD/arch/arm64/boot/Image
-rwxr-xr-x 1 vu-linux-builder-0 vg-linux-builder-0 373099856 2024-06-27 12:27 linux-aarch64.GOOD/vmlinux

So I focused on libstub and found a DISCARD section in the linker script
drivers/firmware/efi/libstub/zboot.lds. I commented it out and rebuilt just "Image":

/*
/DISCARD/ : {
*(.discard .discard.*)
*(.modinfo .init.modinfo)
}
*/

This results in "Image" being the same size in the BAD build as in the GOOD:

$ ls -l linux-aarch64.*/arch/arm64/boot/
linux-aarch64.BAD/arch/arm64/boot/:
total 90460
drwxr-xr-x 36 vu-linux-builder-0 vg-linux-builder-0     4096 2024-06-26 19:27 dts
-rw-r--r--  1 vu-linux-builder-0 vg-linux-builder-0 38189568 2024-06-27 13:14 Image
-rw-r--r--  1 vu-linux-builder-0 vg-linux-builder-0 12471370 2024-06-27 13:25 Image.gz
-rw-r--r--  1 vu-linux-builder-0 vg-linux-builder-0 32619008 2024-06-26 19:50 orig.Image
-rw-r--r--  1 vu-linux-builder-0 vg-linux-builder-0 11500440 2024-06-26 19:50 orig.Image.gz

linux-aarch64.GOOD/arch/arm64/boot/:
total 49304
drwxr-xr-x 36 vu-linux-builder-0 vg-linux-builder-0     4096 2024-06-27 12:04 dts
-rw-r--r--  1 vu-linux-builder-0 vg-linux-builder-0 38189568 2024-06-27 12:27 Image
-rw-r--r--  1 vu-linux-builder-0 vg-linux-builder-0 12467722 2024-06-27 12:27 Image.gz

Installing the resulting Image.gz in the OS test image results in a correct boot.

Investigating the 'BAD' build logs (./build-*-3*.log), I see that even though the kernel build
passes --pack-relative-relocs to ld, the kernel executes correctly without the DISCARD section.
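
As a small worked check of the sizes listed in this comment, the sketch below subtracts the BAD sizes from the GOOD sizes for the original Image and vmlinux files. The dictionary layout and the function name size_deltas are assumptions made for illustration; the byte counts are copied from the ls output above. The arithmetic only shows how large the gap is and does not by itself identify the cause.

```python
# Byte sizes copied from the `ls -l` listing in this comment.
sizes = {
    "Image": {"bad": 32619008, "good": 38189568},
    "vmlinux": {"bad": 367529680, "good": 373099856},
}


def size_deltas(table):
    """Return good-minus-bad size in bytes for each file name."""
    return {name: s["good"] - s["bad"] for name, s in table.items()}


deltas = size_deltas(sizes)
for name, delta in deltas.items():
    print(f"{name}: good - bad = {delta} bytes ({delta / 2**20:.2f} MiB)")
```

Both files shrink by nearly the same amount (about 5.3 MiB), which suggests the difference is already present in vmlinux before the Image is produced.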
Comment 9 Szabolcs Nagy 2024-06-28 12:52:29 UTC
it seems arm64 linux passes --no-apply-dynamic-relocs which means
the relative reloc addend is not stored to the referenced location
(0 is stored) and since -z pack-relative-relocs does not have the
addend stored elsewhere, the linux self-relocation code can't work.

so either linux is wrong for passing

   --no-apply-dynamic-relocs -z pack-relative-relocs

together or ld should ignore --no-apply-dynamic-relocs in this case.
i think linux is wrong here.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Makefile#n15

if i edit that out then i get a bootable Image.
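
The failure mode described in this comment can be illustrated with a simplified model of RELR processing. This is a sketch under stated assumptions, not the kernel's self-relocation code: memory is modelled as a dict of 64-bit slots, and apply_relr, the addresses, and the load bias are made up for the example. An even RELR entry names one slot; an odd entry is a bitmap covering the next 63 slots. In both cases the relocated value is the slot's current content plus the load bias, so the content must already hold the addend.

```python
WORD = 8  # size of one relocated slot on a 64-bit target


def apply_relr(memory, relr_entries, load_bias):
    """Apply RELR relocations in place: each covered slot becomes slot + load_bias."""
    where = 0
    for entry in relr_entries:
        if entry & 1 == 0:
            # Even entry: address of a single slot to relocate.
            memory[entry] = (memory[entry] + load_bias) % 2**64
            where = entry + WORD
        else:
            # Odd entry: bitmap for the slots following the last address.
            bitmap = entry >> 1
            offset = 0
            while bitmap:
                if bitmap & 1:
                    addr = where + offset * WORD
                    memory[addr] = (memory[addr] + load_bias) % 2**64
                bitmap >>= 1
                offset += 1
            where += 63 * WORD
    return memory


BIAS = 0x40000000          # hypothetical load offset
relr = [0x1000, 0b1011]    # slot 0x1000, then bitmap selecting 0x1008 and 0x1018

# Addends stored in place (what RELR requires).
linked = {0x1000: 0x2000, 0x1008: 0x2010, 0x1010: 0xDEAD, 0x1018: 0x2020}
relocated = apply_relr(dict(linked), relr, BIAS)

# Zero stored in place (the effect described for --no-apply-dynamic-relocs).
zeroed = {0x1000: 0, 0x1008: 0, 0x1010: 0xDEAD, 0x1018: 0}
broken = apply_relr(dict(zeroed), relr, BIAS)

for addr in sorted(linked):
    print(hex(addr), hex(relocated[addr]), hex(broken[addr]))
```

With the addends in place, each pointer lands at its own target; with zeros in place, every relocated slot collapses to the load bias, because RELR has no other copy of the addend to fall back on.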
Comment 10 Tj 2024-06-28 15:07:55 UTC
On Friday, 28 June 2024 at 13:52, nsz at gcc dot gnu.org <sourceware-bugzilla@sourceware.org> wrote:

> 
> so either linux is wrong for passing
> 
> --no-apply-dynamic-relocs -z pack-relative-relocs
> 
> together or ld should ignore --no-apply-dynamic-relocs in this case.
> i think linux is wrong here.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Makefile#n15
> 
> if i edit that out then i get a bootable Image.
> 

Did you notice my finding about zboot.lds and removing the DISCARD ?

That was only added in v6.9 with commit 5134acb15d9ef27aa2b90aad46d and is targeted at loongarch so I suspect that commit needs reverting and redoing to be conditional on the architecture.
Comment 11 Ard Biesheuvel 2024-06-28 17:54:21 UTC
(In reply to Tj from comment #10)
> On Friday, 28 June 2024 at 13:52, nsz at gcc dot gnu.org
> <sourceware-bugzilla@sourceware.org> wrote:
> 
> > 
> > so either linux is wrong for passing
> > 
> > --no-apply-dynamic-relocs -z pack-relative-relocs
> > 
> > together or ld should ignore --no-apply-dynamic-relocs in this case.
> > i think linux is wrong here.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Makefile#n15
> > 
> > if i edit that out then i get a bootable Image.
> > 
> 
> Did you notice my finding about zboot.lds and removing the DISCARD ?
> 
> That was only added in v6.9 with commit 5134acb15d9ef27aa2b90aad46d and is
> targeted at loongarch so I suspect that commit needs reverting and redoing
> to be conditional on the architecture.

That seems spurious to me - zboot.lds is not used for building Image, only for building vmlinuz.efi, which incorporates a compressed copy of Image.

(In reply to Szabolcs Nagy from comment #9)
> it seems arm64 linux passes --no-apply-dynamic-relocs which means
> the relative reloc addend is not stored to the referenced location
> (0 is stored) and since -z pack-relative-relocs does not have the
> addend stored elsewhere, the linux self-relocation code can't work.
> 
> so either linux is wrong for passing
> 
>    --no-apply-dynamic-relocs -z pack-relative-relocs
> 
> together or ld should ignore --no-apply-dynamic-relocs in this case.
> i think linux is wrong here.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Makefile#n15
> 
> if i edit that out then i get a bootable Image.

Agree that this combination makes no sense, although LLD does the right thing here.

So the Makefile logic should be updated to only pass --no-apply-dynamic-relocs if CONFIG_RELR is not set. That would result in all locations having the addend stored, even the ones that are covered by RELA rather than RELR (assuming ld.bfd could emit both just like LLD), but that shouldn't matter - RELR saves so much space that the overhead of a handful of less compressible statically initialized pointers should be negligible.
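
The Makefile change proposed in this comment can be sketched as a flag-selection function. This is only an illustration of the proposed condition: the function name arm64_vmlinux_ldflags is hypothetical, the real logic lives in the kernel's arch/arm64/Makefile, and only the two flags discussed in this thread are modelled.

```python
def arm64_vmlinux_ldflags(config_relr):
    """Model the proposal: never combine the two flags in one link."""
    flags = []
    if config_relr:
        # RELR needs the addend stored in place, so do not suppress it.
        flags += ["-z", "pack-relative-relocs"]
    else:
        # Without RELR, RELA entries carry the addend themselves.
        flags.append("--no-apply-dynamic-relocs")
    return flags


print(arm64_vmlinux_ldflags(config_relr=True))
print(arm64_vmlinux_ldflags(config_relr=False))
```

The point of the condition is that the problematic pair from comment 9 can no longer appear together on the link line.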
Comment 12 Ard Biesheuvel 2024-06-29 09:19:54 UTC
Actually, I changed my mind.

--no-apply-dynamic-relocs counters an optimization that removes the need to process RELA entries of type R_AARCH64_RELATIVE by copying the addend into the executable.


RELR relocations fundamentally rely on the addend being present in the executable, as it is not stored anywhere else. This means --no-apply-dynamic-relocs must only apply to RELA relocations, and should be ignored for RELR relocations, as the resulting binary will always be broken otherwise.

IOW, copying the addend into the executable is optional for RELA (and has little value unless the relocation type is R_AARCH64_RELATIVE), but it is required for RELR so disabling it can never make sense.

So this should be fixed in ld.bfd.
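
The distinction drawn here between RELA and RELR can be sketched as follows. This is a model of the argument, not ld.bfd's implementation: apply_rela_relative and must_store_addend_in_place are hypothetical names, memory is a dict of 64-bit slots, and the addresses are invented. A RELA entry carries its own addend, so the slot's previous content is irrelevant; a RELR entry carries none, so the in-place addend is mandatory regardless of --no-apply-dynamic-relocs.

```python
def apply_rela_relative(memory, rela_entries, load_bias):
    """Process relative RELA entries given as (offset, addend) pairs."""
    for offset, addend in rela_entries:
        # The addend comes from the table; the old slot content is ignored.
        memory[offset] = (load_bias + addend) % 2**64
    return memory


def must_store_addend_in_place(encoding, no_apply_dynamic_relocs):
    """Model the rule argued for in this comment."""
    if encoding == "RELR":
        # No other copy of the addend exists, so the flag cannot apply.
        return True
    if encoding == "RELA":
        # Optional for RELA: the flag may suppress the in-place store.
        return not no_apply_dynamic_relocs
    raise ValueError(f"unknown relocation encoding: {encoding}")


BIAS = 0x40000000                  # hypothetical load offset
rela = [(0x1000, 0x2000)]          # one relative relocation

# The result is the same whether the slot held the addend or zero.
from_addend = apply_rela_relative({0x1000: 0x2000}, rela, BIAS)
from_zero = apply_rela_relative({0x1000: 0}, rela, BIAS)
print(hex(from_addend[0x1000]), hex(from_zero[0x1000]))

for enc in ("RELA", "RELR"):
    print(enc, must_store_addend_in_place(enc, no_apply_dynamic_relocs=True))
```

Under this model, honoring --no-apply-dynamic-relocs for RELA is harmless, while honoring it for RELR always yields a broken binary, which is the reason given for fixing the behavior in the linker.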
Comment 13 Szabolcs Nagy 2024-07-01 07:11:33 UTC
(In reply to Ard Biesheuvel from comment #12)
> RELR relocations fundamentally rely on the addend being present in the
> executable, as it is not stored anywhere else. This means
> --no-apply-dynamic-relocs must only apply to RELA relocations, and should be
> ignored for RELR relocations, as the resulting binary will always be broken
> otherwise.

ok, this makes sense

but the name and the documentation for the option are misleading.