Bug 10708 - "out of file descriptors and couldn't close any" -- probably fd leak
Status: RESOLVED FIXED
Alias: None
Product: binutils
Classification: Unclassified
Component: gold
Version: 2.21
Importance: P2 normal
Target Milestone: ---
Assignee: Ian Lance Taylor
Reported: 2009-09-30 08:34 UTC by Bernhard Rosenkraenzer
Modified: 2014-05-28 19:45 UTC

Description Bernhard Rosenkraenzer 2009-09-30 08:34:22 UTC
QtWebKit builds fine with "normal" ld, but trying to build it with gold fails,
resulting in

/usr/bin/ld: fatal error: out of file descriptors and couldn't close any

QtWebKit links 1623 object files plus 12 shared libraries. According to
/proc/sys/fs/file-max, it should be possible to have 370804 fds open
Comment 1 Bernhard Rosenkraenzer 2009-09-30 09:18:15 UTC
Actually this is not caused by a shortage of system FDs, so something else is
causing gold to believe it's out of FDs:

# cat /proc/sys/fs/file-nr
6496    1037       370804
Comment 2 Ian Lance Taylor 2009-09-30 13:20:01 UTC
Can you confirm that this was with the development version of gold?  There were
some bugs in this area fixed back in February.

Otherwise, as far as I can see, this can only happen if open returns -1 with
errno set to ENFILE or EMFILE.  Please check ulimit -n.
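
For readers following along, here is a minimal C++ sketch of the two checks mentioned above. It is an illustration only, not code from gold: the helper names is_out_of_descriptors and soft_descriptor_limit are made up for this example, and it assumes a POSIX system where getrlimit(RLIMIT_NOFILE) reports the same soft limit that ulimit -n prints.

```cpp
#include <cerrno>
#include <climits>
#include <sys/resource.h>

// Hypothetical helper: true if an open() failure means "no descriptors left".
// EMFILE is the per-process limit, ENFILE the system-wide limit.
bool is_out_of_descriptors(int err)
{
  return err == EMFILE || err == ENFILE;
}

// Hypothetical helper: the soft per-process descriptor limit, i.e. the value
// `ulimit -n` reports.  Returns -1 if the limit cannot be queried and
// LONG_MAX if the limit is unlimited or does not fit in a long.
long soft_descriptor_limit()
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
    return -1;
  if (rl.rlim_cur == RLIM_INFINITY
      || rl.rlim_cur > static_cast<rlim_t>(LONG_MAX))
    return LONG_MAX;
  return static_cast<long>(rl.rlim_cur);
}
```

A caller would test is_out_of_descriptors(errno) immediately after a failed open(), before any other call can overwrite errno.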

If this is repeatable, is there any chance that you can debug it a bit?
Comment 3 Bernhard Rosenkraenzer 2009-09-30 13:43:44 UTC
Yes, this is on a fairly current build:

$ ld --version
GNU gold (Linux/GNU Binutils 2.20.51.0.1.20090905) 1.9
Copyright 2008 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.

ulimit -n is 1024, will try again after increasing it to 65536.

It is 100% reproducible. I can do a bit of debugging but don't have a lot of
time at the moment (insane day job schedule).
Comment 4 Bernhard Rosenkraenzer 2009-10-01 08:31:15 UTC
ulimit -n 65536 fixes it, but using > 1024 FDs still seems somewhat excessive
Comment 5 Ian Lance Taylor 2009-11-07 02:27:48 UTC
I have tried to recreate this problem, but failed.  As far as I can tell, gold 
will react correctly to a lack of file descriptors.  I will need more 
information on what could be causing the issue for you.

gold will try to open as many file descriptors as it needs.  If you give it more 
than 1024 input files, then it will open more than 1024 descriptors.  However, 
if an open fails with ENFILE or EMFILE it will close some descriptors, and will 
not try to open that many again.

The error you are getting is the error that gold gives if it runs out of file 
descriptors but can not find any to close.  It's not reasonable that it would 
need to keep 1024 descriptors open--unless perhaps you are running with a very 
large number of threads.  Are you passing any --thread option to gold?
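
The behaviour described above (open on demand, close an idle descriptor when open fails with ENFILE or EMFILE, and remember the limit) can be sketched roughly as follows. This is a simplified model under stated assumptions, not gold's implementation: Descriptor_pool, acquire, release and the injectable opener/closer callbacks are hypothetical names introduced for this example, and thread safety is ignored.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <functional>
#include <limits>
#include <list>
#include <set>
#include <string>
#include <unistd.h>
#include <utility>

// Hypothetical pool: descriptors are either in use or idle (still open, but
// safe to close if we run short).
class Descriptor_pool
{
 public:
  using Opener = std::function<int(const char*)>;
  using Closer = std::function<void(int)>;

  explicit Descriptor_pool(
      Opener opener = [](const char* path) { return ::open(path, O_RDONLY); },
      Closer closer = [](int fd) { ::close(fd); })
    : opener_(std::move(opener)), closer_(std::move(closer))
  { }

  Descriptor_pool(const Descriptor_pool&) = delete;
  Descriptor_pool& operator=(const Descriptor_pool&) = delete;

  ~Descriptor_pool()
  {
    for (int fd : in_use_) closer_(fd);
    for (int fd : idle_) closer_(fd);
  }

  // Returns a descriptor marked in use, or -1 with errno set.  A return of
  // -1 with EMFILE/ENFILE means nothing idle was left to close, which is
  // the situation the fatal error in this report describes.
  int acquire(const std::string& path)
  {
    // Stay under a limit learned from an earlier failure when possible.
    if (open_count() >= limit_)
      close_one_idle();
    for (;;)
      {
        int fd = opener_(path.c_str());
        if (fd >= 0)
          {
            in_use_.insert(fd);
            return fd;
          }
        if (errno != EMFILE && errno != ENFILE)
          return -1;                 // unrelated error
        limit_ = open_count();       // do not try to hold this many again
        if (!close_one_idle())
          return -1;                 // out of descriptors, none to close
      }
  }

  // Marks a descriptor idle: it stays open but may be closed on demand.
  void release(int fd)
  {
    if (in_use_.erase(fd) != 0)
      idle_.push_back(fd);
  }

  std::size_t open_count() const { return in_use_.size() + idle_.size(); }

 private:
  bool close_one_idle()
  {
    if (idle_.empty())
      return false;
    closer_(idle_.front());
    idle_.pop_front();
    return true;
  }

  Opener opener_;
  Closer closer_;
  std::set<int> in_use_;
  std::list<int> idle_;
  std::size_t limit_ = std::numeric_limits<std::size_t>::max();
};
```

In this model the fatal case arises only when every open descriptor is still marked in use, so a descriptor that is never released (a leak) eventually makes acquire fail even though the number of files needed at any one moment is small.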
Comment 6 Andreas Hartmetz 2010-02-03 15:46:13 UTC
I could reproduce the problem (also when linking QtWebKit) until a few months
ago. I'm always using the latest CVS version. At the time I simply used the old
ld when linking Qt instead of increasing the appropriate ulimit.
So it looks like this is fixed now.
Comment 7 Ian Lance Taylor 2010-02-04 03:45:07 UTC
Bernhard, are you still seeing the problem?
Comment 8 Ismail Donmez 2010-07-21 09:11:42 UTC
This is still reproducible while linking QtWebKit:

GNU gold (GNU Binutils for Ubuntu 2.20.51-system.20100710) 1.9
  Supported targets:
   elf32-i386
   elf32-i386-freebsd
   elf64-x86-64
   elf64-x86-64-freebsd

It tries to link ~1800 object files btw.
Comment 9 Cary Coutant 2010-10-23 01:05:41 UTC
I've been trying to track down possible sources of file descriptor leakage. I've found one:

In copy-relocs.cc, Copy_relocs::emit_copy_reloc():

  typename elfcpp::Elf_types<size>::Elf_WXword addralign =
    sym->object()->section_addralign(shndx);

This, and probably other similar places where we go back to an ELF file for some info, seems to be leaking file descriptors. The call to section_addralign() creates an Object::View, and reopens the file descriptor, but never releases it. Also, at least in this particular case, we're accessing a different file from the one we currently have locked (the shared library that contains the definition of the symbol), and we haven't locked the file. If we had locked the file here, the descriptor would have been released, but I'm not sure it's safe to lock the shared library at this point -- we're in a Scan_relocs task, which isn't necessarily single threaded.

I'm wondering whether it would be better to just find and eradicate places where we need to read a file outside of the times we normally have the file open.

I have no idea whether this is the cause of the problem reported here, but a good way to tell is if you can rerun the link with -Wl,--debug=task. That would give us an idea of where it is when you finally run out of file descriptors. For this leakage to cause real problems, you'll need lots of shared libraries, and COPY relocations into lots of them. It seems unlikely, but it's worth a shot. It's also possible that there are other leakages similar to this that would trigger under different conditions.
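
As a complementary diagnostic to --debug=task, one way to confirm that a code path leaks descriptors is to count the process's open descriptors before and after it runs. The sketch below is an assumption-laden illustration, not part of gold: it relies on the Linux-specific /proc/self/fd directory, and count_open_descriptors and descriptors_leaked_by are hypothetical helper names.

```cpp
#include <cstdlib>
#include <dirent.h>
#include <fcntl.h>
#include <functional>
#include <unistd.h>

// Hypothetical helper: counts this process's open descriptors by listing
// /proc/self/fd (Linux only).  Returns -1 if the directory cannot be read.
long count_open_descriptors()
{
  DIR* dir = opendir("/proc/self/fd");
  if (dir == nullptr)
    return -1;
  const int self = dirfd(dir);
  long count = 0;
  while (const dirent* entry = readdir(dir))
    {
      if (entry->d_name[0] == '.')
        continue;                       // skip "." and ".."
      if (std::atoi(entry->d_name) == self)
        continue;                       // skip the descriptor used for listing
      ++count;
    }
  closedir(dir);
  return count;
}

// Hypothetical helper: how many descriptors `work` left open.
long descriptors_leaked_by(const std::function<void()>& work)
{
  const long before = count_open_descriptors();
  work();
  return count_open_descriptors() - before;
}
```

For example, descriptors_leaked_by([] { open("/dev/null", O_RDONLY); }) would report one leaked descriptor, while the same lambda with a matching close() would report zero.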

-cary
Comment 10 Cary Coutant 2010-11-04 00:08:46 UTC
I found another leak that will explain the problem -- if you're using the --no-keep-files-mapped option (or a 32-bit build of gold, for which that's the default) and --gc-sections and/or --icf.

Can you try the patch below and let me know if it fixes the problem for you?

-cary


Index: gold.cc
===================================================================
RCS file: /cvs/src/src/gold/gold.cc,v
retrieving revision 1.85
diff -u -p -r1.85 gold.cc
--- gold.cc     14 Oct 2010 22:10:22 -0000      1.85
+++ gold.cc     3 Nov 2010 23:39:44 -0000
@@ -359,6 +359,7 @@ queue_middle_tasks(const General_options
           p != input_objects->relobj_end();
           ++p)
        {
+          Task_lock_obj<Object> tlo(task, *p);
          (*p)->layout(symtab, layout, NULL);
        }
    }
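
The idea behind the one-line patch, as described in comment 9, is that holding a lock for the duration of the read lets the descriptor be released when the lock goes away. A toy C++ model of that scope-based pattern is sketched below. It is not gold's Task_lock_obj or Object class: Input_file and Scoped_lock are hypothetical stand-ins, and the "descriptor" is just a boolean flag.

```cpp
#include <cassert>

// Hypothetical stand-in for an input file.  Reading opens the descriptor on
// demand; the descriptor is released only when the last lock is dropped.
class Input_file
{
 public:
  void lock() { ++lock_count_; }

  void unlock()
  {
    assert(lock_count_ > 0);
    if (--lock_count_ == 0)
      descriptor_open_ = false;     // last unlock releases the descriptor
  }

  void read() { descriptor_open_ = true; }

  bool descriptor_open() const { return descriptor_open_; }

 private:
  int lock_count_ = 0;
  bool descriptor_open_ = false;
};

// Hypothetical scope guard: locks in the constructor, unlocks in the
// destructor, so every exit path from the scope releases the file.
class Scoped_lock
{
 public:
  explicit Scoped_lock(Input_file& file) : file_(file) { file_.lock(); }
  ~Scoped_lock() { file_.unlock(); }

  Scoped_lock(const Scoped_lock&) = delete;
  Scoped_lock& operator=(const Scoped_lock&) = delete;

 private:
  Input_file& file_;
};
```

In this model, calling read() without a Scoped_lock in scope leaves descriptor_open() true indefinitely, which mirrors the leak; wrapping the read in a block that declares a Scoped_lock first leaves it false once the block exits.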
Comment 11 Bengt-Erik Soderstrom 2011-02-26 17:40:37 UTC
(In reply to comment #10)
> I found another leak that will explain the problem -- if you're using the
> --no-keep-files-mapped option (or a 32-bit build of gold, for which that's the
> default) and --gc-sections and/or --icf.
> 
> Can you try the patch below and let me know if it fixes the problem for you?
> 
> -cary
> 
> 
> Index: gold.cc
> ===================================================================
> RCS file: /cvs/src/src/gold/gold.cc,v
> retrieving revision 1.85
> diff -u -p -r1.85 gold.cc
> --- gold.cc     14 Oct 2010 22:10:22 -0000      1.85
> +++ gold.cc     3 Nov 2010 23:39:44 -0000
> @@ -359,6 +359,7 @@ queue_middle_tasks(const General_options
>            p != input_objects->relobj_end();
>            ++p)
>         {
> +          Task_lock_obj<Object> tlo(task, *p);
>           (*p)->layout(symtab, layout, NULL);
>         }
>     }


I just ran into this behaviour when building Chromium on a 32-bit machine using gold (binutils 2.21) on Fedora 14. Building from the same source in a 64-bit environment (Ubuntu 10.10, binutils 2.21) was OK.
Chromium linked fine with gold on the 32-bit Fedora 14 machine about a week ago.
I tried again, this time linking with the normal ld instead of gold, and the build was successful.

I did not yet try your patch, but I will, and let you know.
Comment 12 Bengt-Erik Soderstrom 2011-02-26 18:58:27 UTC
The patch as proposed in comment #10 works. I can now again build Chromium on my 32-bit machine.
Comment 13 cvs-commit@gcc.gnu.org 2011-02-27 15:17:32 UTC
CVSROOT:	/cvs/src
Module name:	src
Branch: 	binutils-2_21-branch
Changes by:	ian@sourceware.org	2011-02-27 15:17:29

Modified files:
	gold           : ChangeLog copy-relocs.cc gold.cc icf.cc 
	                 mapfile.cc plugin.cc 

Log message:
	Backport from mainline:
	2010-11-05  Cary Coutant  <ccoutant@google.com>
	PR gold/10708
	* copy-relocs.cc (Copy_relocs::emit_copy_reloc): Hold a lock on the
	object when reading from the file.
	* gold.cc (queue_middle_tasks): Hold a lock on the object when doing
	second layout pass.
	* icf.cc (preprocess_for_unique_sections): Hold a lock on the object
	when reading section contents.
	(get_section_contents): Likewise.
	(icf::find_identical_sections): Likewise.
	* mapfile.cc (Mapfile::print_discarded_sections): Hold a lock on the
	object when reading from the file.
	* plugin.cc (Plugin_manager::layout_deferred_objects): Hold a lock on
	the object when doing deferred section layout.

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/src/gold/ChangeLog.diff?cvsroot=src&only_with_tag=binutils-2_21-branch&r1=1.664.2.15&r2=1.664.2.16
http://sourceware.org/cgi-bin/cvsweb.cgi/src/gold/copy-relocs.cc.diff?cvsroot=src&only_with_tag=binutils-2_21-branch&r1=1.11&r2=1.11.2.1
http://sourceware.org/cgi-bin/cvsweb.cgi/src/gold/gold.cc.diff?cvsroot=src&only_with_tag=binutils-2_21-branch&r1=1.85&r2=1.85.2.1
http://sourceware.org/cgi-bin/cvsweb.cgi/src/gold/icf.cc.diff?cvsroot=src&only_with_tag=binutils-2_21-branch&r1=1.15.2.1&r2=1.15.2.2
http://sourceware.org/cgi-bin/cvsweb.cgi/src/gold/mapfile.cc.diff?cvsroot=src&only_with_tag=binutils-2_21-branch&r1=1.6&r2=1.6.2.1
http://sourceware.org/cgi-bin/cvsweb.cgi/src/gold/plugin.cc.diff?cvsroot=src&only_with_tag=binutils-2_21-branch&r1=1.40.2.1&r2=1.40.2.2
Comment 14 Ian Lance Taylor 2011-02-27 15:23:15 UTC
Seems to be fixed on mainline and in the upcoming 2.21.1 release.