Bug 23254 - ld.bfd mishandles file pointers while scanning archive with LTO members, causing "malformed archive" errors for a well-formed archive
Summary: ld.bfd mishandles file pointers while scanning archive with LTO members, caus...
Status: VERIFIED FIXED
Alias: None
Product: binutils
Classification: Unclassified
Component: ld (show other bugs)
Version: 2.31
: P2 critical
Target Milestone: ---
Assignee: Alan Modra
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-02 10:39 UTC by zenith432
Modified: 2018-06-05 16:09 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Last reconfirmed: 2018-06-05 00:00:00


Attachments
dup -> open (762 bytes, patch)
2018-06-04 01:47 UTC, Alan Modra
Details | Diff
code to reproduce error (1.45 KB, application/octet-stream)
2018-06-04 14:58 UTC, zenith432
Details

Note You need to log in before you can comment on or make changes to this bug.
Description zenith432 2018-06-02 10:39:04 UTC
When processing an archive with LTO object members, ld.bfd mismanages the file pointer, occasionally causing "malformed archive" errors even though the archive is well-formed.

The cause of this problem is as follows:
An archive is processed by the function elf_link_add_archive_symbols (bfd/elflink.c), which calls for each member the functions
_bfd_get_elt_at_filepos
bfd_check_format
add_archive_element

The first two functions (_bfd_get_elt_at_filepos, bfd_check_format) process the archive element using the I/O functions found in bfd/bfdio.c and bfd/cache.c.  These I/O functions use stdio FILE*.  They store the FILE* in abfd->iostream for the parent archive, and use fseek/fread to scan the archive file.

After this is done, if an archive element is loaded, add_archive_element (in ld/ldmain.c) is called, which eventually calls plugin_maybe_claim (in ld/plugin.c) to handle LTO elements.  This calls plugin_object_p (in ld/plugin.c) which eventually calls the plugin's claim_file_handler method (see plugin_call_claim_file).  The plugin's claim_file_handler function processes the archive file using system calls lseek()/read() using a file descriptor set up in plugin_object_p.

Description of The Problem:
Prior to commit 7d0b9ebc1e0079271a7c7737b53bc026525eab64 (by Alan Modra), plugin_object_p would obtain a file descriptor to the archive file based on its name using system call open(), then pass the file descriptor downward to the plugin's claim_file_handler for it to process the file.

In the commit mentioned above, plugin_object_p was changed so that instead of opening the archive by name, it obtains the file descriptor from (FILE*) abfd->iostream using fileno(), then duplicates it with systemcall dup(), then passes it down to the plugin's claim_file_handler.

*The documentation for system call dup() very clearly explains that the dup'd file descriptor shares a file pointer with the original descriptor*.

As a result of this change, the plugin's claim_file_handler clobbers the file pointer using calls to lseek()/read().  HOWEVER, the archive is later continued to be accessed using fseek()/fread() on abfd->iostream when processing the next archive element !!

This is WRONG.  Because fseek/fread *do not know* that the file pointer was changed outside the stdio FILE* framework.  The FILE structure caches a buffer for the file, and if fseek() gets a position that is within the loaded buffer, it may not call lseek().  Moreover, if FILE struct caches a position for the file, this potentially causes fseek() and fread() to believe the file pointer is in a different place than it really is and they read the wrong part of the file and eventually cause a "malformed archive" error.

A later commit ace667e59aede65c400381f1cff704b61e8ccb0b by Andrew Burgess added a small fix that wraps the call to plugin's claim_file_handler in code saving and restoring the file pointer.  However, restoration of the file pointer is only done if the plugin does not "claim" the archive element.  It seems this was done to prevent bugs if the archive consists of non-LTO elements.  However, if the archive has LTO elements and the element is "claimed", the file pointer remains clobbered and the problem persists.

This problem started with 2.28 official release of binutils (in line with dates on above commits), and continues to this very day.
I am able to consistently reproduce the "malformed archive" errors on large EDK2-based build on macOS.  This is with self-built gcc 8.1 and binutils 2.30 or binutils git master.  I've tried tweaking configure options, but none of them made any difference.
For some reason I cannot reproduce this problem with EDK2 build on Fedora 28 with GCC 8.1.1.  I've examined the patches Redhat makes to gcc and binutils from their source RPMs, but there doesn't seem to be anything there that fixes this.  Another possibility is that fseek() on Fedora works differently than on macOS, so it always calls lseek() and fixes the file pointer that's been clobbered, so the issue never surfaces.  I can't be sure because the development tools are all pre-built binaries and difficult to debug and see what's going on.

Due to the complexity of this problem I cannot attach a simple example to reproduce it, and like I said - I'm not sure it reproduces on Fedora at all.  However, I spent time debugging it and the cause is what is explained above and can be checked in source code.  At any rate, mixing use of fseek()/fread() with out-of-line use of lseek()/read() on the same (or dup) file descriptor is obviously a programming error - as stdio was not meant to be used this way.
Comment 1 Alan Modra 2018-06-04 01:47:09 UTC
Created attachment 11051 [details]
dup -> open

Does this patch, which goes back to using open, fix your problem?  I'm just trying to get a data point.  I'm not convinced it is the right patch..
Comment 2 zenith432 2018-06-04 14:58:52 UTC
Created attachment 11052 [details]
code to reproduce error
Comment 3 zenith432 2018-06-04 15:16:57 UTC
- Your attached patch fixes the problem.

- For the record, I did a workaround with the patch

========
--- orig_plugin.c	2018-01-13 13:31:16.000000000 +0000
+++ plugin.c	2018-06-04 14:41:38.000000000 +0000
@@ -1059,7 +1059,9 @@ plugin_call_claim_file (const struct ld_
 	  called_plugin = curplug;
 	  cur_offset = lseek (file->fd, 0, SEEK_CUR);
 	  rv = (*curplug->claim_file_handler) (file, claimed);
+#if 0
 	  if (!*claimed)
+#endif
 	    lseek (file->fd, cur_offset, SEEK_SET);
 	  called_plugin = NULL;
 	  if (rv != LDPS_OK)
========

I do not believe this workaround is formally correct - because the plugin is allowed to continue moving the file pointer via the file descriptor it got.  As a practical matter, I found by setting debug breakpoints that the current plugin only moves the file pointer in claim_file_handler, so protecting the pointer around that call resolved the issue.

- I attached above a standalone example to reproduce the problem, however I can only get it to happen on macOS.  On Fedora everything runs fine.

gcc is 8.1, binutils built from 2.30 (same happens on git master)

On macOS the output is
/usr/local/bin/gcc -flto -c -o a.o a.c
/usr/local/bin/gcc -flto -c -o b.o b.c
/usr/local/bin/gcc -flto -c -o c.o c.c
/usr/local/bin/gcc-ar cr libarch.a a.o b.o c.o
/usr/local/bin/gcc -nostdlib -flto -o main main.c -L . -larch
./libarch.a: error adding symbols: Malformed archive
collect2: error: ld returned 1 exit status
make: *** [all] Error 1

I had to use -nostdib on macOS because it's a cross compiler and I don't have the runtime, but it doesn't matter because the "Malformed archive" error is encountered *before* other errors about unfound entry point or undefined printf.

The size of stdio FILE buffer is 4096 (it's the same on Fedora)
a.c is cooked up so inside libarch.a the ar_hdr for b.o is split across the 4096 position boundary.
Here is what happens exactly:
- First LD scans and processes a.o inside libarch.a using fseek/fread.  To do this, stdio reads a 4096 buffer managed through FILE and caches 4096 as the file pointer.
- Eventually a.o is passed down to claim_file_handler.  The plugin re-processes a.o using lseek/read, with the file pointer ending up at the end of a.o inside libarch.a instead of at 4096.
- Then it goes back to process b.o.  At this point stdio FILE has a 4096 byte buffer and believes the file pointer is at 4096.
- b.o ar_hdr starts a few bytes below 4096, which is in the cached buffer.  fseek is called to move to b.o header - Since it's in the buffer, fseek does not lseek, but positions its internal pointer.
- then fread is called to read the ar_hdr for b.o.  The few bytes in the cached buffer are copied.  Then fread, believing the file pointer to be at 4096, reads another 4096, but from the wrong location in the file.
- Because the wrong location was read, the ar_hdr for b.o ends up being read corrupted which is why it breaks with "Malformed archive".

So even if you can't reproduce it, this is what happens and you can follow it by adding printfs to ld code (or running in debugger).

It doesn't break on Fedora and I'm not sure why.  One possibility is that the pre-canned binaries are patched somehow.  The other possibility is that Fedora fseek always calls lseek when moving to b.o ar_hdr which prevents the glitch.

Regards.
Comment 4 Sourceware Commits 2018-06-05 13:10:47 UTC
The master branch has been updated by Alan Modra <amodra@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=27b0767593284f97384b3597ebd211164f8c8b47

commit 27b0767593284f97384b3597ebd211164f8c8b47
Author: Alan Modra <amodra@gmail.com>
Date:   Tue Jun 5 21:04:00 2018 +0930

    PR23254, ld.bfd mishandles file pointers while scanning archive
    
    Best practice is to not mix lseek/read with fseek/fread on the same
    underlying file descriptor, as not all stdio implementations will cope.
    Since the plugin uses lseek/read while bfd uses fseek/fread this patch
    reopens the file for exclusive use by the plugin rather than trying to
    restore the file descriptor.  That allows the plugin to read the file
    after plugin_call_claim_file too.
    
    bfd/
    	PR 23254
    	* plugin.c (bfd_plugin_open_input): Allow for possibility of
    	nested archives.  Open file again for plugin.
    	(try_claim): Don't save and restore file position.  Close file
    	if not claimed.
    	* sysdep.h (O_BINARY): Define.
    ld/
    	PR 23254
    	* plugin.c (plugin_call_claim_file): Revert 2016-07-19 patch.
    	(plugin_object_p): Don't dup file descriptor.
Comment 5 Alan Modra 2018-06-05 13:30:55 UTC
I believe the patch I've just committed should fix your problem, but the testcase doesn't fail for me even though the archive I construct does see a header crossing over a 4k boundary.
Comment 6 zenith432 2018-06-05 16:09:07 UTC
I built from latest head commit 5c4ce239a (one after 27b076759) and verified on macOS that it fixes the problem and ld works.