Bug 16592 - crash in startup
Summary: crash in startup
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: 2.18
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-15 01:42 UTC by Stefan Seefeld
Modified: 2016-05-16 17:23 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
test case demonstrating the bug (2.54 KB, application/x-gtar)
2014-02-15 01:42 UTC, Stefan Seefeld
test case demonstrating the bug (2.65 KB, application/x-gtar)
2014-02-18 03:56 UTC, Stefan Seefeld
stacktrace from gdb (920 bytes, text/plain)
2014-02-18 04:08 UTC, Stefan Seefeld
glibc-ldaudit-tls-segv.diff (2.13 KB, patch)
2014-02-19 05:42 UTC, Carlos O'Donell
test case demonstrating the bug (128.17 KB, application/gzip)
2014-03-20 21:49 UTC, Stefan Seefeld
LD_DEBUG.out (32.34 KB, text/plain)
2014-06-13 15:01 UTC, Paul Woegerer

Description Stefan Seefeld 2014-02-15 01:42:16 UTC
Created attachment 7420 [details]
test case demonstrating the bug

The attached directory contains a testcase for a crash during program startup when an audit library is used.

To reproduce, run 'make' in the directory to build a small probe application and the audit library. Then add '.' to LD_LIBRARY_PATH and run 'make run', which invokes the probe application with the audit library set and triggers the crash. I debugged this by running:

.../ld-linux-x86-64.so.2 --audit ldaudit.so ./probe

I could prevent the crash by removing the -llttng-ust argument on the link command. (In reality I would actually like to use that library. In this test case I have merely removed any actual use as the crash happens even if the library is never used at runtime.)

Are there any limitations on what an audit library may link to?

I'm using gcc 4.8.2 on a Fedora 20 platform (using the system glibc 2.18).
Comment 1 Andreas Schwab 2014-02-15 11:59:21 UTC
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: cannot find -llttng-ust
Comment 2 Stefan Seefeld 2014-02-15 12:53:21 UTC
On 02/15/2014 06:59 AM, schwab@linux-m68k.org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=16592
> 
> --- Comment #1 from Andreas Schwab <schwab@linux-m68k.org> ---
> /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld:
> cannot find -llttng-ust

You need a suitable package of lttng-ust (from http://lttng.org/)
installed. It's available in most Linux distributions. (Sorry I wasn't
able to narrow it down to a simpler test case not requiring extra
prerequisites.)
Comment 3 Andreas Schwab 2014-02-15 14:25:48 UTC
Please create a self-contained test case.
Comment 4 Stefan Seefeld 2014-02-15 17:30:56 UTC
On 02/15/2014 09:25 AM, schwab@linux-m68k.org wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=16592
> 
> Andreas Schwab <schwab@linux-m68k.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |WAITING
> 
> --- Comment #3 from Andreas Schwab <schwab@linux-m68k.org> ---
> Please create a self-contained test case.

Well, the problem seems to be related to loading that particular library
with an auditor lib. I have already tried to reproduce the issue with
other libs, but failed.
Comment 5 Stefan Seefeld 2014-02-16 18:10:46 UTC
I have continued trying to debug this myself, but without much luck.
I'm running `/usr/lib64/ld-linux-x86-64.so.2 --audit ./ldaudit.so ./probe` in a debugger, which tells me the crash happens in dl_open_worker (dl-open.c:343) during program startup.

Please let me know if there is anything else I can provide (or do to help).
Comment 6 Carlos O'Donell 2014-02-17 04:03:34 UTC
(In reply to Stefan Seefeld from comment #5)
> I have continued trying to debug this myself, but without much luck.
> I'm running `/usr/lib64/ld-linux-x86-64.so.2 --audit ./ldaudit.so ./probe`
> in a debugger, which tells me the crash happens in dl_open_worker
> (dl-open.c:343) during program startup.
> 
> Please let me know if there is anything else I can provide (or do to help).

Does the audit library use TLS?
Comment 7 Stefan Seefeld 2014-02-17 11:56:15 UTC
On 02/16/2014 11:03 PM, carlos at redhat dot com wrote:

> Does the audit library use TLS?

One of its dependencies probably does, yes.

	Stefan
Comment 8 Stefan Seefeld 2014-02-17 22:16:10 UTC
I have managed to isolate the problem to a library constructor (function marked as __attribute__((constructor)) ) causing the crash.

What I don't understand is why this constructor fails in this situation (of being part of an audit library), when it doesn't fail during normal linking & loading.

Shouldn't the loader take care of initializing the libraries in the proper order (as determined by symbol dependency analysis)? Or is there in fact no guarantee about the order of initialization, and the library was just lucky enough to always be initialized "late enough" until I started using it as part of an auditor?
Comment 9 Stefan Seefeld 2014-02-18 03:56:49 UTC
Created attachment 7422 [details]
test case demonstrating the bug

This is a (somewhat) simplified version of my previous test case. It still relies on a pre-installed lttng-ust package, unfortunately. See next attachment for debugging details...
Comment 10 Stefan Seefeld 2014-02-18 04:08:37 UTC
Created attachment 7423 [details]
stacktrace from gdb

The attached stacktrace is seen in gdb when run as

  gdb .../ld-2.18.90.so --audit ./ldaudit.so ./probe

The crash happens at 

Program received signal SIGSEGV, Segmentation fault.
0x00005555555657a0 in add_to_global (new=new@entry=0x7ffff78509f0) at dl-open.c:94
94              = ns->_ns_main_searchlist->r_nlist + to_add + 8;

(and `where` prints the attached stacktrace).

The stacktrace suggests that the ldaudit.so constructor enters the call to
dlopen("liblttng-ust-tracepoint.so.0",...), which eventually triggers a call to add_to_global() in dl-open.c (in ld.so), where the crash happens. Initialization of the liblttng-ust-tracepoint.so.0 library (i.e. the execution of any constructor functions) hasn't even started yet, meaning this is a genuine ld.so bug. (However, the crash is specific to this particular library. I wasn't able to reproduce it when dlopen'ing a different library.)

Let me know if there is any other info I should supply.
Comment 11 Carlos O'Donell 2014-02-19 05:29:00 UTC
(In reply to Stefan Seefeld from comment #8)
> I have managed to isolate the problem to a library constructor (function
> marked as __attribute__((constructor)) ) causing the crash.
> 
> What I don't understand is why this constructor fails in this situation (of
> being part of an audit library), when it doesn't fail during normal linking
> & loading.
> 
> Shouldn't the loader take care of initializing the libraries in proper order
> (as determined by symbol dependency analysis) ? Or is there in fact no
> guarantee of order of initialization, and the library was just lucky enough
> to always be initialized "late enough" until I started using it as part of
> an auditor ?

Multiple constructors in one library run in the order in which they are declared and consequently seen by the static linker and added to the .ctors section. The same applies to constructors for static objects, in that the order of declaration is important. The ordering among constructors can be modified by assigning a priority, e.g. __attribute__((constructor(N))).
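
The priority mechanism mentioned above can be illustrated with a minimal sketch. It assumes GCC or Clang on an ELF target; the function and variable names are hypothetical and are not taken from the test case attached to this bug. Priorities 0 to 100 are reserved for the implementation, and a lower number runs earlier.

```c
#include <stdio.h>

/* Records the order in which the constructors below actually ran. */
int ctor_order[2];
int ctor_count;

/* Priority 101 is the lowest value available to user code, so this
   constructor runs before the one with priority 102. */
__attribute__((constructor(101)))
static void init_first(void)
{
    ctor_order[ctor_count++] = 101;
}

__attribute__((constructor(102)))
static void init_second(void)
{
    ctor_order[ctor_count++] = 102;
}

/* Hypothetical helper: print the recorded order, e.g. from main(). */
void report_ctor_order(void)
{
    for (int i = 0; i < ctor_count; i++)
        printf("position %d: constructor with priority %d\n", i, ctor_order[i]);
}
```

Priorities only order constructors within a single linked object. They do not affect the order in which different shared objects are initialized.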

The constructor ordering between libraries is specified by a breadth first search of DT_NEEDED entries. This ensures required libraries are initialized first before they are used. Symbol dependencies are not used at runtime to determine the constructor ordering.

If you have a circular dependency then no order is guaranteed for the portion of the graph that has the circular dependency.

We should provide some ld.so tooling to help find circular dependencies, detect them, and diagnose them, but we don't. Patches welcome.

Is it possible you have a circular dependency? Can you look into that please?

The test case you provided does not crash for me on Fedora 19 which is glibc-2.17 based.

We really need a self-contained reproducible test case.
Comment 12 Carlos O'Donell 2014-02-19 05:29:33 UTC
(In reply to Stefan Seefeld from comment #7)
> On 02/16/2014 11:03 PM, carlos at redhat dot com wrote:
> 
> > Does the audit library use TLS?
> 
> One of its dependencies probably does, yes.
> 
> 	Stefan

I know of one bug which is not yet fixed upstream where an LD_AUDIT library that uses TLS can cause a segfault. I have the patch in my tree and should push it out shortly. Do you have a way to test a patch? Can you rebuild your distro glibc with a patch?
Comment 13 Stefan Seefeld 2014-02-19 05:33:44 UTC
On 02/19/2014 12:29 AM, carlos at redhat dot com wrote:

> I know of one bug which is not yet fixed upstream where an LD_AUDIT library
> that uses TLS can cause a segfault. I have the patch in my tree and should push
> it out shortly. Do you have a way to test a patch? Can you rebuild your distro
> glibc with a patch?

I can reproduce the error with a custom build of glibc-2.18.90, and thus
would be able to test your patch.

Thanks,
		Stefan
Comment 14 Carlos O'Donell 2014-02-19 05:42:42 UTC
Created attachment 7427 [details]
glibc-ldaudit-tls-segv.diff

This patch should fix the case where the audit library uses TLS.
Comment 15 Stefan Seefeld 2014-02-19 13:40:34 UTC
I applied the patch to my local glibc-2.18.90 tree. Unfortunately, the error persists.

As mentioned in a recent message, gdb reports the error in 

Program received signal SIGSEGV, Segmentation fault.
0x0000555555565834 in add_to_global (new=new@entry=0x7ffff78509f0) at dl-open.c:94
94              = ns->_ns_main_searchlist->r_nlist + to_add + 8;


The crash occurs because ns->_ns_main_searchlist is 0x0. Any idea how this may happen? Has 'ns' not been initialized properly?

Any suggestion on how to debug this further would be much appreciated.
Comment 16 Carlos O'Donell 2014-02-19 13:45:31 UTC
> as ns->_ns_main_searchlist is 0x0. Any idea how this may happen ? has 'ns'
> not been initialized properly ?
> 
> Any suggestion on how to debug this further would be very appreciated.

Bugs in the compiler or linker?

You're on your own until you find a way to reproduce this for us here.
Comment 17 Andreas Schwab 2014-02-19 13:51:46 UTC
_ns_main_searchlist is set up in elf/rtld.c.  Try setting LD_DEBUG=all to find out why it isn't initialized.
Comment 18 Stefan Seefeld 2014-03-20 21:49:31 UTC
Created attachment 7485 [details]
test case demonstrating the bug

Here is a new and self-contained test-case. The tarball contains some pre-processed files to reduce external dependencies. It was produced on x86_64 with gcc 4.8.2.

The error is caused by an audit library which itself dlopens a shared object in one of its constructor functions (in ldaudit_tp.c).

Please note that it is quite sensitive to the exact way this is built. For example, if I remove '-lpthread' from the link command of tracepoint.so, the crash will disappear. Likewise if I remove some of the compilation units.
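
For readers without the tarball, the pattern described in this comment can be sketched roughly as follows. This is not the code from ldaudit_tp.c: the helper name load_helper is hypothetical, and the dlopen() flags (RTLD_NOW | RTLD_GLOBAL) are an assumption, since the thread does not state which flags the test case passes. The file name "tracepoint.so" is the one mentioned elsewhere in this report.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>

/* Entry point every audit library must provide. Returning the version
   passed in tells ld.so that this auditor accepts that interface version. */
unsigned int la_version(unsigned int version)
{
    return version;
}

/* Hypothetical helper: try to load a shared object and report failures.
   The flags are an assumption, not taken from the attached test case. */
void *load_helper(const char *name)
{
    void *handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL)
        fprintf(stderr, "dlopen(%s) failed: %s\n", name, dlerror());
    return handle;
}

/* Runs when the audit library itself is initialized by ld.so,
   i.e. before the audited program has started. */
__attribute__((constructor))
static void audit_init(void)
{
    load_helper("tracepoint.so");
}
```

Built as a shared object and passed via LD_AUDIT, a library of this shape calls dlopen() from a constructor while the audited program is still starting up, which is the situation in which the crash was reported.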
Comment 19 Stefan Seefeld 2014-03-26 14:50:38 UTC
Can you please confirm that you can reproduce the crash with this latest reduced test case ?

Thanks,
Comment 20 Stefan Seefeld 2014-04-10 20:32:40 UTC
ping ?
Comment 21 Paul Woegerer 2014-06-13 14:59:21 UTC
I'm facing the same problem as Stefan. Interestingly, it does not matter at which point liblttng-ust.so gets opened from an ldaudit shared object. In my example I dlopen liblttng-ust.so from the la_preinit() callback in my ldaudit.so. The result is the same:

Program received signal SIGSEGV, Segmentation fault.
0x00005555555657a0 in add_to_global (new=new@entry=0x7ffff78509f0) at dl-open.c:94
94              = ns->_ns_main_searchlist->r_nlist + to_add + 8;


Opening any other shared object from la_preinit() works just fine. Also, applying Carlos's patch glibc-ldaudit-tls-segv.diff unfortunately does not fix the problem.

I have attached a dump that I created with:
LD_DEBUG=all LD_AUDIT=$PWD/ldaudit.so ./gmontest 2> LD_DEBUG.out
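
The la_preinit() variant described above might look roughly like the sketch below. It is an illustration only, not the ldaudit.so used to produce the dump: the bookkeeping variables are hypothetical, and the dlopen() flags are an assumption. The library name is the one given in this comment.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bookkeeping, useful when inspecting the auditor in gdb. */
void *preinit_handle;   /* result of the dlopen() attempted in la_preinit() */
int preinit_calls;      /* how many times la_preinit() was invoked */

/* Required by the rtld-audit interface. */
unsigned int la_version(unsigned int version)
{
    return version;
}

/* ld.so calls this after all shared objects have been loaded and before
   control is passed to the application. */
void la_preinit(uintptr_t *cookie)
{
    (void)cookie;   /* the cookie is not needed for this sketch */
    preinit_calls++;
    /* Flags are an assumption; the thread does not say which were used. */
    preinit_handle = dlopen("liblttng-ust.so", RTLD_NOW | RTLD_GLOBAL);
    if (preinit_handle == NULL)
        fprintf(stderr, "la_preinit: dlopen failed: %s\n", dlerror());
}
```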
Comment 22 Paul Woegerer 2014-06-13 15:01:13 UTC
Created attachment 7636 [details]
LD_DEBUG.out
Comment 23 Stefan Seefeld 2014-06-20 15:12:04 UTC
I can confirm Paul's failure mode with the above self-contained testcase by moving the call to 'dlopen("tracepoint.so")' from the constructor function into the call to la_preinit().

I would *really* appreciate it if someone could have a look at the testcase, which has no dependency other than glibc itself. At the very least, please confirm that you can reproduce the failure.
While it originally seemed like an initialization ordering problem, it now looks as if the initialization of an audit library is missing something that would be done for "normal" DSOs.

With the test case above, this works:

    gcc -I. -Itracepoint -ggdb  -L. -o probe main.c foo.so ldaudit.so
    LD_LIBRARY_PATH=`pwd` ./probe


while this segfaults:

    gcc -I. -Itracepoint -ggdb  -L. -o probe main.c foo.so
    LD_LIBRARY_PATH=`pwd` LD_AUDIT=./ldaudit.so ./probe