Bug 32491 - [2.44 Regression] ld-elf/compress.exp test failures on arm-linux-gnueabi*
Summary: [2.44 Regression] ld-elf/compress.exp test failures on arm-linux-gnueabi*
Status: NEW
Alias: None
Product: binutils
Classification: Unclassified
Component: ld (show other bugs)
Version: 2.44 (HEAD)
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-12-21 10:59 UTC by Matthias Klose
Modified: 2024-12-24 06:57 UTC (History)
5 users (show)

See Also:
Host:
Target: arm-linux-gnueabi arm-linux-gnueabihf
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Klose 2024-12-21 10:59:37 UTC
these fail with trunk 20241221 on arm-linux-gnueabi*, it looks like in some cases these succeed, then fail in subsequent runs.

W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Build zlibbegin.o with zlib compressed debug sections
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Build gnubegin.o with zlib-gnu compressed debug sections
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Build gabiend.o with zlib-gabi compressed debug sections
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Run gnunormal with libgnufoo.so with zlib-gnu compressed debug sections
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Run gnunormal with libfoozlib.so with zlib-gnu compressed debug sections
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Run gabinormal with libgabifoo.so with zlib-gabi compressed debug sections
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Run gabinormal with libfoozlib.so with zlib-gabi compressed debug sections
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Link -r with zlib compressed debug output
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Link -r with zlib-gnu compressed debug output
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Link -r with zlib-gabi compressed debug output
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Link with zlib-gnu compressed debug output 1
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Link with zlib-gnu compressed debug output 2
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Link with zlib-gabi compressed debug output 1
W: [ld-elf/compress.exp] REGRESSION (PASS -> FAIL): Link with zlib-gabi compressed debug output 2

all fail with the same pattern:

spawn [open ...]
/home/doko/binutils-2.43.50.20241221/builddir-single/ld/.libs/ld-new: tmpdir/zlibbegin.o: final close failed: invalid operation
collect2: error: ld returned 1 exit status
/home/doko/binutils-2.43.50.20241221/builddir-single/ld/.libs/ld-new: tmpdir/zlibbegin.o: final close failed: invalid operation
collect2: error: ld returned 1 exit status

Unexpected linker warning or error
FAIL: Build zlibbegin.o with zlib compressed debug sections
Comment 2 Sam James 2024-12-22 02:36:46 UTC
(In reply to Matthias Klose from comment #0)
> it looks like in some cases these succeed, then fail in subsequent runs.

Given this, and the contents of Jan's commit, I suspect the bisect done by the CI bot was unreliable. We need to try bisect again in a setup which can reproduce, with several tries per commit before declaring it good.
Comment 3 Thiago Jung Bauermann 2024-12-22 03:57:22 UTC
(In reply to Sam James from comment #2)
> (In reply to Matthias Klose from comment #0)
> > it looks like in some cases these succeed, then fail in subsequent runs.
> 
> Given this, and the contents of Jan's commit, I suspect the bisect done by
> the CI bot was unreliable. We need to try bisect again in a setup which can
> reproduce, with several tries per commit before declaring it good.

I reverted the binutils commit that the CI bisected and these compress.exp regressions were fixed. Though if they succeed sometimes I'll try it again and run the testcase a number of times to be sure.

Also, I'm in the middle of debugging ld to see if I can figure out what exactly it's complaining about.

The error it gives is this:

/home/thiago.bauermann/.cache/builds/abe/builds/armv8l-unknown-linux-gnueabihf/armv8l-unknown-linux-gnueabihf/binutils-binutils-gdb.git~master/ld/tmpdir/ld/collect-ld: tmpdir/zlibbegin.o: final close failed: invalid operation
collect2: error: ld returned 1 exit status
Comment 4 Sam James 2024-12-22 04:15:35 UTC
Thanks (also, of course, no blame intended there, I surely would've made the same mistake if it even was one).

I know newer Valgrind can warn on invalid FD use (like closing a closed FD etc). The error seems really odd and almost sounds like threading or something if it's not consistent, but there shouldn't be any in use here.
Comment 5 Sam James 2024-12-22 04:15:53 UTC
Huh: PR20006.
Comment 6 Thiago Jung Bauermann 2024-12-24 03:58:32 UTC
Here's what I found out so far:

The "invalid operation" error is set by bfd_compress_section () when called for the .debug_str section because its uncompressed_size (i.e. sec->size) is 0. I'm not familiar with ld internals so I'm still digging into where or why this happens. 

This is the backtrace at that point:

(gdb) bt
#0  0x0044ac06 in bfd_set_error (error_tag=bfd_error_invalid_operation)
    at /path/to/binutils-gdb.git/bfd/bfd.c:837
#1  0x0044f01a in bfd_compress_section (abfd=0x61ed58, sec=0x620548,
    uncompressed_buffer=0x642330 "")
    at /path/to/binutils-gdb.git/bfd/compress.c:1132
#2  0x0049036a in _bfd_elf_assign_file_positions_for_non_load (abfd=0x61ed58)
    at /path/to/binutils-gdb.git/bfd/elf.c:7056
#3  0x00490624 in _bfd_elf_write_object_contents (abfd=0x61ed58)
    at /path/to/binutils-gdb.git/bfd/elf.c:7148
#4  0x0045844e in bfd_close (abfd=0x61ed58)
    at /path/to/binutils-gdb.git/bfd/opncls.c:908
#5  0x0042be38 in main (argc=31, argv=0xfffee944)
    at /path/to/binutils-gdb.git/ld/ldmain.c:597

If I revert just the gas part of commit d5cbf916be4a ("gas/ELF: also reject merge entity size being zero") and leave the testcase fix in place (henceforth referred to as "reverted version"), then all tests in compress.exp pass.

Looking at the begin.o file that is used by the test, the difference between trunk and the reverted versions are:

1. .debug_str is empty in trunk but has 495 bytes in the reverted version.
2. The trunk version has 3 .debug_macro sections and the reverted version has none.
3. The trunk version has two COMDAT group sections (containing one .debug_macro section each), and the reverted version has none.

I don't know yet if differences 2 and 3 are relevant.

(In reply to Sam James from comment #4)
> Thanks (also, of course, no blame intended there, I surely would've made the
> same mistake if it even was one).
 
I didn't have the impression that there was any blame intended or implied, but thank you for your concern.

> I know newer Valgrind can warn on invalid FD use (like closing a closed FD
> etc). The error seems really odd and almost sounds like threading or
> something if it's not consistent, but there shouldn't be any in use here.

I left compress.exp running in a loop with the reverted version for more than 1000 iterations and the test didn't fail once. So I still think the bisection is correct.

Maybe the gas change surfaced a latent bug in ld?
Comment 7 Thiago Jung Bauermann 2024-12-24 04:17:32 UTC
(In reply to Sam James from comment #5)
> Huh: PR20006.

Thanks for the reference. Looking at that bug and the commits that caused and fixed the problem there, I don't think it's a similar situation.

In the case of PR20006, sec->compressed_size was being used to store estimated branch distances, which caused problems for debug sections, which actually have a need for that field during compression.

In this case, when bfd_compress_section () sets the invalid operation error, sec->compressed_size is 0 so it's not being used for anything untoward.
Comment 8 Thiago Jung Bauermann 2024-12-24 06:57:34 UTC
(In reply to Thiago Jung Bauermann from comment #6)
> Looking at the begin.o file that is used by the test, the difference between
> trunk and the reverted versions are:
> 
> 1. .debug_str is empty in trunk but has 495 bytes in the reverted version.
> 2. The trunk version has 3 .debug_macro sections and the reverted version
> has none.
> 3. The trunk version has two COMDAT group sections (containing one
> .debug_macro section each), and the reverted version has none.
> 
> I don't know yet if differences 2 and 3 are relevant.

It turned out I was comparing the reverted version with a slightly different trunk revision than I should. After correcting that, differences 2 and 3 vanished. There's only difference 1.
 
> I left compress.exp running in a loop with the reverted version for more
> than 1000 iterations and the test didn't fail once. So I still think the
> bisection is correct.

It occurred to me that I should also run compress.exp in a loop with the trunk version to check that the failures are stable so I did that, and they are. compress.exp failed with all 1600 iterations of the loop.