This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.



Re: Re: Help needed to track down bug: linking Linux kernel with gold creates unbootable kernel


On 01/-10/-28163 09:59 PM, John Reiser wrote:
Here you go, looks like gold dropped 1/3 of the kernel:
-rwxr-xr-x 1 edwin edwin 11M Apr 10 15:38 vmlinux.gold
-rwxr-xr-x 1 edwin edwin 17M Apr 10 15:37 vmlinux.bfd

Not necessarily. The 6M difference might be explained by a combination of alignment in the file and/or symbol tables and/or debugging information that need not be relevant to execution, particularly for a kernel.

OK, let's compare bzImage then:
3446848 arch/x86/boot/bzImage.gold
3475136 /boot/vmlinuz-2.6.34-rc3-00138-gecb385a

The difference is not that big here.


[Nr] Name Type Addr Off Size ES Flags Lk Inf Al
-[ 1] .text PROGBITS ffffffff81000000 00001000 003d1c55 0 AX 0 0 4096
+[ 1] .text PROGBITS ffffffff81000000 00200000 003d1c55 0 AX 0 0 4096

The difference in file offset between 0x1000 (4 KiB) and 0x200000 (2 MiB) probably accounts for just under 2 MiB of zeroes in the file. Such a difference for file offset of Elf64_Shdr need not be relevant to execution of a Linux kernel, as long as the 0x3d1c55 bytes of content are identical.

Not identical, difference starts at byte 230:
-vmlinux.gold: file format elf64-x86-64
+vmlinux.bfd: file format elf64-x86-64
....
-ffffffff810000e1: 48 01 2d b8 c1 46 00 add %rbp,0x46c1b8(%rip) # ffffffff8146c2a0 <trampoline_level4_pgt>
-ffffffff810000e8: 48 01 2d a9 d1 46 00 add %rbp,0x46d1a9(%rip) # ffffffff8146d298 <trampoline_level4_pgt+0xff8>
+ffffffff810000e1: 48 01 2d b8 74 40 00 add %rbp,0x4074b8(%rip) # ffffffff814075a0 <trampoline_level4_pgt>
+ffffffff810000e8: 48 01 2d a9 84 40 00 add %rbp,0x4084a9(%rip) # ffffffff81408598 <trampoline_level4_pgt+0xff8>
ffffffff810000ef: e9 0c 00 00 00 jmpq ffffffff81000100 <secondary_startup_64>


It appears to be a difference in the address chosen for that global (and other globals later on).

Other than the differences in addresses, there is also a difference in padding:
gold uses 00 00 90 90 (add %al,(%rax); nop; nop), while BFD uses 90 90 90 90 (4 nops).


That address *ff814075a0 is in .rodata:
[Nr] Name Type Addr Off Size ES Flags Lk Inf Al
-[ 4] .rodata PROGBITS ffffffff81400000 00401000 001b7262 0 A 0 0 64
+[ 4] .rodata PROGBITS ffffffff81400000 00600000 001b4cb2 0 A 0 0 64


Which has the difference in alignment you noticed below.


alignment differences: [KEY: -: gold; +: ld]
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
- LOAD 0x001000 0xffffffff81000000 0x0000000001000000 0x5f2000 0x5f2000 R E 0x1000
- LOAD 0x5f3000 0xffffffff81600000 0x0000000001600000 0x183220 0x183220 RWE 0x1000
- LOAD 0x777000 0xffffffffff600000 0x0000000001784000 0x000888 0x000888 R E 0x1000
- LOAD 0x778000 0x0000000000000000 0x0000000001785000 0x014628 0x014628 RW 0x1000
- LOAD 0x78d000 0xffffffff8179a000 0x000000000179a000 0x071000 0x456d000 RWE 0x1000
- NOTE 0x3d2c58 0xffffffff813d1c58 0x00000000013d1c58 0x00003c 0x00003c 0x4
+ LOAD 0x200000 0xffffffff81000000 0x0000000001000000 0x5ef000 0x5ef000 R E 0x200000
+ LOAD 0x800000 0xffffffff81600000 0x0000000001600000 0x183220 0x183220 RWE 0x200000
+ LOAD 0xa00000 0xffffffffff600000 0x0000000001784000 0x000888 0x000888 R E 0x200000
+ LOAD 0xc00000 0x0000000000000000 0x0000000001785000 0x014628 0x014628 RW 0x200000
+ LOAD 0xd9a000 0xffffffff8179a000 0x000000000179a000 0x071000 0x456d000 RWE 0x200000
+ NOTE 0x5d1c58 0xffffffff813d1c58 0x00000000013d1c58 0x000024 0x000024 0x4
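
As a rough cross-check of the earlier point that the 6M size difference need not be missing code, the end of the loadable data in each file can be computed from the last PT_LOAD entry in the table above. This is only a sketch: the offsets and sizes are copied from the table, the helper name is made up for illustration, and whatever follows the last segment in the file (symbol tables, section headers) is ignored.

```python
# (p_offset, p_filesz) of the last PT_LOAD, copied from the table above.
GOLD_LAST_LOAD = (0x78d000, 0x071000)
LD_LAST_LOAD = (0xd9a000, 0x071000)


def load_end(offset, filesz):
    """File offset just past the bytes covered by a PT_LOAD segment."""
    return offset + filesz


gold_end = load_end(*GOLD_LAST_LOAD)
ld_end = load_end(*LD_LAST_LOAD)
extra = ld_end - gold_end  # extra file bytes in the ld output up to this point

print(hex(gold_end), hex(ld_end), hex(extra), round(extra / 2**20, 2), "MiB")
```

The loadable part of the ld output ends about 6 MiB later in the file than gold's, which is in line with the 11M vs 17M file sizes being mostly a matter of file padding.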

The differing .p_align values of 0x1000 vs 0x200000 indicate that gold has a bug interpreting the commands from the linker script for alignment of Elf64_Phdr.

vmlinux.lds has this comment:
/*
* On 64-bit, align RODATA to 2MB so that even with CONFIG_DEBUG_RODATA
* we retain large page mappings for boundaries spanning kernel text, rodata
* and data sections.
*
* However, kernel identity mappings will have different RWX permissions
* to the pages mapping to text and to the pages padding (which are freed) the
* text section. Hence kernel identity mappings will be broken to smaller
* pages. For 64-bit, kernel text and kernel identity mappings are different,
* so we can enable protection checks that come with CONFIG_DEBUG_RODATA,
* as well as retain 2MB large page mappings for kernel text.
*/


If I read that correctly, it means the kernel uses hardware pages with a page size of 2MB for kernel text.
Since gold aligns only to 0x1000, perhaps the .rodata ends up in the same hardware page as the .text.


I think these are the relevant align commands from the vmlinux.lds for .text and .rodata:
.text : AT(ADDR(.text) - 0xffffffff80000000) {
...
} :text=0x9090
. = ALIGN(16); __ex_table : AT(ADDR(__ex_table) - 0xffffffff80000000) { __start___ex_table = .; *(__ex_table) __stop___ex_table = .; } :text = 0x9090
. = ALIGN((1 << 21));
. = ALIGN(((1 << 12))); .rodata : AT(ADDR(.rodata) - 0xffffffff80000000) { __start_rodata = .;
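
To make the effect of those ALIGN commands concrete, here is a small sketch of the rounding they perform on the location counter. The .text address and size are taken from the section table earlier in this message; align_up is a hypothetical helper, and the sketch ignores the small sections (such as __ex_table and the notes) that sit between .text and .rodata.

```python
def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


# End of .text: Addr + Size from the section table.
text_end = 0xffffffff81000000 + 0x003d1c55

# . = ALIGN((1 << 21)); followed by . = ALIGN(((1 << 12)));
rodata_start = align_up(align_up(text_end, 1 << 21), 1 << 12)

print(hex(text_end), hex(rodata_start))
```

The result agrees with the .rodata address of ffffffff81400000 shown for both linkers, so the virtual address of .rodata is the same in both outputs; what differs is the file offset and the p_align recorded in the program headers.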


This also suggests that I should try with DEBUG_RODATA turned off.
Indeed without DEBUG_RODATA the kernel starts booting in KVM.

This is the description of DEBUG_RODATA:
"Mark the kernel read-only data as write-protected in the pagetables,
in order to catch accidental (and incorrect) writes to such const data. This is recommended so that we can catch kernel bugs sooner.
If in doubt, say "Y"."


I think the problem is that the kernel tries to write-protect the .rodata, thinking that it is aligned to 2MB, when in fact it's not.


The .p_align applies to the .p_vaddr
and .p_paddr (which are related to address space at execution),
and not necessarily to .p_offset (which is related to file storage.)
For an ordinary ET_EXEC application or ET_DYN shared lib these are
connected by mmap(): the actual hardware page size must divide all
three of (.p_vaddr - .p_offset), (.p_paddr - .p_offset), .p_align.
For a Linux kernel they are connected by the boot loader: .p_vaddr
and .p_paddr must be divisible by the actual hardware page size,
but it could be OK for .p_offset to be anything as long as the
content (the interval of loaded bytes) was identical. This looser
restriction requires a boot loader that processes each PT_LOAD
independently as "real bytes." If the boot loader tries to do more
than one PT_LOAD at a time without carefully checking that the adjacency
in the input stream is compliant with the adjacency in the address space,
then that is a problem. Also, if the boot loader is loading the
address space of some virtual machine by using mmap() on the host
machine, without a fallback for the case when mmap() fails, then
the more-stringent restrictions of "ordinary ET_EXEC" are relevant,
and thus gold's differing .p_align is an underlying bug.
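
The divisibility rule stated above for the mmap() case can be tried on the first PT_LOAD of each output. This is a minimal sketch, not a statement about what any particular boot loader does: the segment values are copied from the program header table, and mmap_compatible is a hypothetical helper that checks only the three conditions listed in the paragraph.

```python
PAGE_4K = 0x1000
PAGE_2M = 0x200000

# (p_offset, p_vaddr, p_paddr, p_align) of the first PT_LOAD of each output.
GOLD_LOAD = (0x001000, 0xffffffff81000000, 0x0000000001000000, 0x1000)
LD_LOAD = (0x200000, 0xffffffff81000000, 0x0000000001000000, 0x200000)


def mmap_compatible(segment, page_size):
    """True if page_size divides (vaddr - offset), (paddr - offset) and p_align."""
    offset, vaddr, paddr, align = segment
    return ((vaddr - offset) % page_size == 0
            and (paddr - offset) % page_size == 0
            and align % page_size == 0)


for name, seg in (("gold", GOLD_LOAD), ("ld", LD_LOAD)):
    for page in (PAGE_4K, PAGE_2M):
        print(name, hex(page), mmap_compatible(seg, page))
```

With 4 KiB pages both segments satisfy the rule; with 2 MiB pages only the ld segment does, because gold's file offset of 0x1000 is not congruent to the virtual address modulo 2 MiB.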

If the hardware page size is 2MB, then it's not divisible, so it's a bug.
Should I open a bug report, or are there some patches to gold that I could try?



The differing .p_filesz and .p_memsz of 0x3c vs 0x24 for the PT_NOTE indicate that gold may have a bug there. Examine the contents (see Elf64_Nhdr in /usr/include/elf.h) to determine the added/omitted/merged/changed content.

The differing .p_filesz and .p_memsz of 0x5f2000 vs 0x5ef000 for
the first PT_LOAD is a bug in gold.

I think the .note difference is just due to gold embedding its version:
-Note section [ 2] '.notes' of 60 bytes at offset 0x3d2c58:
+Note section [ 2] '.notes' of 36 bytes at offset 0x5d1c58:
 Owner Data size Type
- GNU 8 GNU_GOLD_VERSION
- Linker version: gold 1.9
 GNU 20 GNU_BUILD_ID
- Build ID: a865af685f5222cdc17a28ea4e49d58b2185bc05
+ Build ID: 07b53da4e169ad1079080043ad72384fb80d0ea3
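
For anyone who wants to check the note sizes by hand, here is a sketch of walking the note records (three 32-bit words, then name and descriptor, each padded to 4 bytes). The blobs are rebuilt synthetically from the values listed above rather than read from the real vmlinux files; make_note and parse_notes are hypothetical helpers, little-endian layout and 4-byte padding are assumed, and the note type numbers are the ones defined in elf.h.

```python
import struct

NT_GNU_BUILD_ID = 3      # from elf.h
NT_GNU_GOLD_VERSION = 4  # from elf.h


def align4(n):
    """Round n up to a multiple of 4 (assumed note padding)."""
    return (n + 3) & ~3


def make_note(name, ntype, desc):
    """Build one note record: header, padded name, padded descriptor."""
    raw_name = name.encode("ascii") + b"\0"
    header = struct.pack("<III", len(raw_name), len(desc), ntype)
    return (header
            + raw_name.ljust(align4(len(raw_name)), b"\0")
            + desc.ljust(align4(len(desc)), b"\0"))


def parse_notes(data):
    """Return a list of (owner, type, descriptor bytes) for each note in data."""
    notes = []
    pos = 0
    while pos + 12 <= len(data):
        namesz, descsz, ntype = struct.unpack_from("<III", data, pos)
        pos += 12
        name = data[pos:pos + namesz].rstrip(b"\0").decode("ascii")
        pos += align4(namesz)
        desc = data[pos:pos + descsz]
        pos += align4(descsz)
        notes.append((name, ntype, desc))
    return notes


gold_id = bytes.fromhex("a865af685f5222cdc17a28ea4e49d58b2185bc05")
bfd_id = bytes.fromhex("07b53da4e169ad1079080043ad72384fb80d0ea3")

gold_notes = (make_note("GNU", NT_GNU_GOLD_VERSION, b"gold 1.9")
              + make_note("GNU", NT_GNU_BUILD_ID, gold_id))
bfd_notes = make_note("GNU", NT_GNU_BUILD_ID, bfd_id)

print(len(gold_notes), len(bfd_notes))
for owner, ntype, desc in parse_notes(gold_notes):
    print(owner, ntype, len(desc))
```

Under these assumptions the version note takes 12 + 4 + 8 = 24 bytes, which is exactly the difference between the 60-byte and 36-byte note sections.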

Thanks for the help,

Best regards,
--Edwin

