Bug 10652 - getaddrinfo causes segfault if multithreaded and linked statically
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: network
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2009-09-17 11:34 UTC by Marius Heuler
Modified: 2014-07-01 06:50 UTC
fweimer: security-


Attachments
Example module which calls getaddrinfo() from many threads. (272 bytes, text/x-csrc)
2011-03-26 07:15 UTC, Robert G. Jakabosky
Simple host program to dynamically load a module with dlopen(). (323 bytes, text/x-csrc)
2011-03-26 07:24 UTC, Robert G. Jakabosky
Valgrind output shows some invalid reads into freed memory before the program crashes on a NULL pointer. (1.18 KB, application/octet-stream)
2011-03-26 07:58 UTC, Robert G. Jakabosky

Description Marius Heuler 2009-09-17 11:34:39 UTC
The getaddrinfo call causes an internal segmentation fault when it is called from
multiple threads and the binary is linked with "-static". The documentation says the
function is thread safe. This should also be the case when linked with "-static",
since no exception is mentioned.
The crash only occurs if the binary is executed on a multi-core system; on a
single-core system it does not crash. This seems to be a synchronization problem
inside the library, but somehow only in the static version.

To reproduce just use this small test program:
#include <stdio.h>
#include <netdb.h>
#include <pthread.h>
#include <unistd.h>

void *test(void *)
{
        struct addrinfo *res = NULL;
        fprintf(stderr, "x=");
        int ret = getaddrinfo("localhost", NULL, NULL, &res);
        fprintf(stderr, "%d ", ret);
        return NULL;
}

int main()
{
        for (int i = 0; i < 512; i++)
        {
                pthread_t thr;
                pthread_create(&thr, NULL, test, NULL);
        }
        sleep(5);
        return 0;
}

Compile with "g++ -o dnstest -static dnstest.cpp -lpthread" and then run the binary.
When linked with "-static" it usually crashes immediately; without "-static" it works fine.
This was verified with the glibc versions shipped in Fedora 7, 11, CentOS 5.3,
Ubuntu 8.x and 9.x, SuSE 11.1 32-bit and 64-bit.
The glibc versions tested range from 2.6 to 2.10.

I see no reason why this only works if dynamically linked. The documentation
also does not mention any restrictions if linked statically.
Comment 1 Marius Heuler 2009-09-17 12:13:28 UTC
Here is the output with debug info:
gdb ./dnstest

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffefdf2910 (LWP 8287)]
*__GI_fgets_unlocked (buf=0x7fffefdf17a0 "", n=992, fp=0x0) at iofgets_u.c:54
54        old_error = fp->_IO_file_flags & _IO_ERR_SEEN;
Current language:  auto; currently minimal
(gdb) bt
#0  *__GI_fgets_unlocked (buf=0x7fffefdf17a0 "", n=992, fp=0x0) at iofgets_u.c:54
#1  0x00007fffefdf68c7 in internal_getent (result=<value optimized out>, buffer=0x7fffefdf1780 "", buflen=<value optimized out>, errnop=0x7fffefdf1d4c, herrnop=0x7fffefdf1d48, af=0, flags=<value optimized out>) at nss_files/files-XXX.c:208
#2  0x00007fffefdf6e52 in _nss_files_gethostbyname4_r (name=0x4923b3 "localhost", pat=0x7fffefdf1d38, buffer=0x7fffefdf1780 "", buflen=1024, errnop=<value optimized out>, herrnop=<value optimized out>, ttlp=0x0) at nss_files/files-hosts.c:347
#3  0x0000000000435f86 in gaih_inet ()
#4  0x0000000000437c7f in getaddrinfo ()
#5  0x0000000000000000 in ?? ()
(gdb) print fp
$1 = (_IO_FILE *) 0x0
(gdb)
Comment 2 Marius Heuler 2009-09-17 12:31:27 UTC
The segmentation fault happens at different addresses below
_nss_files_gethostbyname4_r. The source of this function shows a call to
__libc_lock_lock, but the lock probably does not take effect!?

The assembler code shows calls to the pthread_mutex_lock() function:

Dump of assembler code for function _nss_files_gethostbyname4_r:
0x00007ffff55d8d70 <_nss_files_gethostbyname4_r+0>:     push   %r15
0x00007ffff55d8d72 <_nss_files_gethostbyname4_r+2>:     push   %r14
0x00007ffff55d8d74 <_nss_files_gethostbyname4_r+4>:     push   %r13
0x00007ffff55d8d76 <_nss_files_gethostbyname4_r+6>:     mov    %rsi,%r13
0x00007ffff55d8d79 <_nss_files_gethostbyname4_r+9>:     push   %r12
0x00007ffff55d8d7b <_nss_files_gethostbyname4_r+11>:    mov    %rdi,%r12
0x00007ffff55d8d7e <_nss_files_gethostbyname4_r+14>:    push   %rbp
0x00007ffff55d8d7f <_nss_files_gethostbyname4_r+15>:    mov    %rdx,%rbp
0x00007ffff55d8d82 <_nss_files_gethostbyname4_r+18>:    push   %rbx
0x00007ffff55d8d83 <_nss_files_gethostbyname4_r+19>:    mov    %rcx,%rbx
0x00007ffff55d8d86 <_nss_files_gethostbyname4_r+22>:    sub    $0x88,%rsp
0x00007ffff55d8d8d <_nss_files_gethostbyname4_r+29>:    cmpq   $0x0,0x20823b(%rip)        # 0x7ffff57e0fd0 <fgetpos+2137728>
0x00007ffff55d8d95 <_nss_files_gethostbyname4_r+37>:    mov    %r8,0x30(%rsp)
0x00007ffff55d8d9a <_nss_files_gethostbyname4_r+42>:    mov    %r9,0x38(%rsp)
0x00007ffff55d8d9f <_nss_files_gethostbyname4_r+47>:    je     0x7ffff55d8dad <_nss_files_gethostbyname4_r+61>
0x00007ffff55d8da1 <_nss_files_gethostbyname4_r+49>:    lea    0x208498(%rip),%rdi        # 0x7ffff57e1240 <lock>
0x00007ffff55d8da8 <_nss_files_gethostbyname4_r+56>:    callq  0x7ffff55d7020 <__pthread_mutex_lock@plt>
0x00007ffff55d8dad <_nss_files_gethostbyname4_r+61>:    mov    0x2084d1(%rip),%edi        # 0x7ffff57e1284 <keep_stream>
0x00007ffff55d8db3 <_nss_files_gethostbyname4_r+67>:    callq  0x7ffff55d8700 <internal_setent>
0x00007ffff55d8db8 <_nss_files_gethostbyname4_r+72>:    cmp    $0x1,%eax
0x00007ffff55d8dbb <_nss_files_gethostbyname4_r+75>:    mov    %eax,0x5c(%rsp)
0x00007ffff55d8dbf <_nss_files_gethostbyname4_r+79>:    je     0x7ffff55d8ded <_nss_files_gethostbyname4_r+125>
0x00007ffff55d8dc1 <_nss_files_gethostbyname4_r+81>:    cmpq   $0x0,0x20820f(%rip)        # 0x7ffff57e0fd8 <fgetpos+2137736>
0x00007ffff55d8dc9 <_nss_files_gethostbyname4_r+89>:    je     0x7ffff55d8dd7 <_nss_files_gethostbyname4_r+103>
0x00007ffff55d8dcb <_nss_files_gethostbyname4_r+91>:    lea    0x20846e(%rip),%rdi        # 0x7ffff57e1240 <lock>
0x00007ffff55d8dd2 <_nss_files_gethostbyname4_r+98>:    callq  0x7ffff55d7040 <__pthread_mutex_unlock@plt>
0x00007ffff55d8dd7 <_nss_files_gethostbyname4_r+103>:   mov    0x5c(%rsp),%eax
0x00007ffff55d8ddb <_nss_files_gethostbyname4_r+107>:   add    $0x88,%rsp
Comment 3 Marius Heuler 2009-09-17 12:44:30 UTC
When debugging the _nss_files_gethostbyname4_r function in the dynamically linked
binary, the pthread_mutex_lock function is executed and can be stepped into. In the
statically linked binary, stepping shows that the function is not called at all,
even though the disassembly looks like it should be!?
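The "cmpq $0x0,..." / "je" pair in front of each __pthread_mutex_lock@plt call in the disassembly from Comment 2 is what a null test on a weakly referenced symbol typically compiles to. As a minimal sketch of that pattern (this is not glibc's actual macro; guarded_increment, counter and lock are hypothetical names), the lock is only taken if the symbol resolved to a non-null address:

```c
#include <pthread.h>
#include <stddef.h>

/* Weak references: if no loaded object provides these symbols, their
   addresses compare equal to NULL instead of causing a link error. */
#pragma weak pthread_mutex_lock
#pragma weak pthread_mutex_unlock

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* Increments the shared counter and returns the new value.
   The mutex is used only when the weak symbols resolved. */
int guarded_increment(void)
{
    int value;

    if (pthread_mutex_lock != NULL)      /* the "cmpq $0x0 / je" test */
        pthread_mutex_lock(&lock);

    value = ++counter;                   /* critical section */

    if (pthread_mutex_unlock != NULL)
        pthread_mutex_unlock(&lock);

    return value;
}
```

If the test fails, the call is silently skipped and the critical section runs without any mutual exclusion, which would fit a debugger never stepping into pthread_mutex_lock.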
Comment 4 Jakub Jelinek 2009-09-17 12:47:28 UTC
You shouldn't link statically, there are many reasons why it is a bad idea.
If you for whatever strange reason still need it, you need to make sure you link
all of libpthread.a into your application (e.g. using -Wl,--whole-archive around
-lpthread), otherwise many things won't work as expected.
Comment 5 Marius Heuler 2009-09-17 13:30:38 UTC
OK, I will try that. But why is there no warning or information when statically
linking the pthread library? The linker warns that it would need the library for
lookups, but there is no warning at all about the pthread library.
The reason we link statically is that the binary should run on different
Linux versions, including versions which use older libraries.
Is there another way, e.g. to link dynamically with glibc-2.10 and run on systems
with only glibc-2.6?
Comment 6 Jakub Jelinek 2009-09-17 13:39:47 UTC
Please read http://people.redhat.com/drepper/no_static_linking.html, by linking
statically you make the portability far worse.  Unless you are creating a system
recovery tool that needs to work when shared libraries are hosed up, you should
link at least glibc libraries dynamically.
Comment 7 Marius Heuler 2009-09-17 13:47:45 UTC
Ok, thank you for that information!

My problem with dynamic linking is that when I build on a new Linux system, e.g.
one using glibc-2.11, the binary won't start on older Linux systems. It says:
/lib64/libc.so.6: version `GLIBC_2.7' not found. The application does not need any
functions of that new library; it would work fine with e.g. glibc-2.6. Is there a
way to change the minimum dependency of the library? When I compile on an old
Linux system, the binary does run on new systems.
When compiling the application on Windows I can define the minimum needed
version in a define, and then I can only use functions available in that version
and not newer functions. Can this be done with glibc, so that the binary still
works with libraries defining e.g. GLIBC_2.6?

Thank you very much for the help so far!
Marius
Comment 8 Ulrich Drepper 2009-10-30 03:08:24 UTC
This is no place to ask questions.

On the other hand you haven't responded to the question whether linking in the
entire libpthread helps.  I assume it does.
Comment 9 Marius Heuler 2009-10-30 14:53:03 UTC
Hello!

I included the whole pthread library:
Using "-static -lpthread" or "-static /usr/lib64/libpthread.a" creates the same
binary. The pthread library is also used by our code, so it should be included in
the binary.
But the binary created as above still has the problem!

Our solution is to link libc, libm and libpthread dynamically on an Ubuntu 8.04 LTS
system. This binary also works on most other systems (with a reasonably new glibc).

One strange thing is: if I compile the same thing dynamically on a Fedora 7 system
and run it on e.g. SLES 10, the binary already breaks in the loader with a Floating
Exception. The binary compiled on Ubuntu with the same setup works fine. That
is strange (probably SLES has no standard glibc).
Comment 10 Greg Alexander 2009-11-19 15:18:12 UTC
I have not bothered to actually trace this but I have a likely suspect.
As I understand it, resolution is handled by libnss_*.so, which are still
dynamically linked even if the executable is statically linked.  They
presumably feature weak extern references to various pthread functions.
If pthreads is dynamically linked, these references succeed.  If
pthreads is statically linked then the pthread symbols are not reexported
to things loaded with dlopen() like the libnss libraries.

I don't know a good solution but perhaps -rdynamic has some role to play?

Or perhaps a less bloated libc than glibc could be used, one which has a
number of simple resolvers built in?  The libnss resolvers on my Linux
system take up 275kB which is enough space for many other unixes to
implement an entire libc....
Comment 11 Vital Pisaryk 2010-03-17 11:49:04 UTC
I have the same problem.
Sometimes a call to getaddrinfo in one of the threads causes a segfault.
The application is linked with the -static flag. It has to be linked statically
because I use it on systems that do not have the pthread libraries installed, and
I am not able to install them.

I didn't find any helpful suggestions in the thread. So, what should I do to fix
this problem? 
Comment 12 Robert G. Jakabosky 2011-03-26 07:09:49 UTC
The same crash happens if the host program is not compiled with "-pthread" and dynamically loads a module which is linked to libpthread.so and calls getaddrinfo() from multiple threads.

I will attach two example C files that showcase this problem.
Comment 13 Robert G. Jakabosky 2011-03-26 07:15:24 UTC
Created attachment 5325
Example module which calls getaddrinfo() from many threads.

Compile this example module with:
gcc -o crash_getaddrinfo.so -Wall -fPIC -shared -pthread crash_getaddrinfo.c
Comment 14 Robert G. Jakabosky 2011-03-26 07:24:33 UTC
Created attachment 5326
Simple host program to dynamically load a module with dlopen().

Compile without -pthread:
gcc -ldl -Wall -o crash_main_no_pthread crash_main.c

Compile with -pthread:
gcc -ldl -Wall -o crash_main_pthread crash_main.c -pthread

By default the program will try to load a module named: /tmp/crash_getaddrinfo.so
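
The attachments themselves are not reproduced on this page. As a rough sketch of the kind of host described here (this is not the attached crash_main.c; run_module, module_entry_fn and the entry-point symbol name are hypothetical), a program can load a module with dlopen(), look up one function and call it:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical entry-point type: a function taking no arguments. */
typedef int (*module_entry_fn)(void);

/* Loads the shared object at `path` and calls `symbol` in it.
   Returns 0 and stores the function's result in *result on success,
   -1 if the module cannot be loaded, -2 if the symbol is missing. */
int run_module(const char *path, const char *symbol, int *result)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    /* Store the void * from dlsym() into a function pointer
       without a direct object-to-function pointer cast. */
    module_entry_fn entry;
    *(void **)(&entry) = dlsym(handle, symbol);
    if (entry == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -2;
    }

    *result = entry();
    dlclose(handle);
    return 0;
}
```

On glibc versions where dlopen() lives in libdl, this needs -ldl as in the compile lines above. A host built this way without -pthread only gets libpthread into the process through the module's own dependencies, which is the situation this comment describes.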
Comment 15 Robert G. Jakabosky 2011-03-26 07:36:11 UTC
I first ran into this problem when using a Lua C module (ZeroMQ bindings for Lua) that uses IO threads in the background.  The only workaround is either to compile the Lua VM with -pthread (this shouldn't be required, since not all Lua scripts need pthread support) or to use "LD_PRELOAD=/lib/libpthread.so host_program".

I would prefer an option where the host program (Lua VM) didn't have to either be compiled with -pthread or wrapped in a script to preload libpthread.so.

Also, the example program will even crash on a single-CPU (single-core) computer running Debian 6.0, glibc 2.11.2.
Comment 16 Robert G. Jakabosky 2011-03-26 07:58:37 UTC
Created attachment 5327
Valgrind output shows some invalid reads into freed memory before the program crashes on a NULL pointer.

This problem seems to be caused by a race condition between the threads calling getaddrinfo().  With a small number of threads it doesn't always happen.  At least the backtrace has always been the same.
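
One way to picture how such a race could end in the fp=0x0 seen in Comment 1: the disassembly in Comment 2 references a module-level lock and keep_stream, which suggests a single stream shared by all callers. Assuming that layout, the sketch below replays one bad interleaving of two callers inside a single thread so that the outcome is deterministic. All model_* names are hypothetical and this is not the nss_files source.

```c
#include <stdio.h>

static FILE *stream;            /* one stream shared by every caller */

/* Open the shared stream, or rewind it if a caller already opened it. */
static void model_setent(void)
{
    if (stream == NULL)
        stream = fopen("/etc/hosts", "r");
    else
        rewind(stream);
}

/* Close the shared stream and clear the pointer. */
static void model_endent(void)
{
    if (stream != NULL) {
        fclose(stream);
        stream = NULL;
    }
}

/* Read one line. Returns -1 at the point where unguarded code
   would pass a NULL stream to fgets_unlocked(). */
static int model_getent(char *buf, int len)
{
    if (stream == NULL)
        return -1;
    return fgets(buf, len, stream) != NULL ? 1 : 0;
}

/* Caller A and caller B interleaved with no lock between them. */
int replay_unlocked_interleaving(void)
{
    char buf[256];

    model_setent();                        /* A: opens the stream        */
    model_setent();                        /* B: finds it open, rewinds  */
    model_endent();                        /* A: done, closes and clears */
    return model_getent(buf, sizeof buf);  /* B: reads, stream is gone   */
}
```

With real threads the read can also land between the fclose() and the pointer being cleared, which would be consistent with the reads into freed memory in the Valgrind log before the NULL dereference.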
Comment 17 Andreas Jaeger 2013-01-03 12:13:25 UTC
(In reply to comment #12)
> The same crash happens if the host program is not compiled with "-pthread" and
> dynamically loads a module which is linked to libpthread.so and calls
> getaddrinfo() from multiple threads.
> 
> I will attach two example C files that showcase this problem.

A comment on this case is at:
http://sourceware.org/ml/libc-alpha/2012-10/msg00361.html

My advice for now is to link your application against libpthread until somebody really digs into this and figures out what is supposed to work and how.
Comment 18 Carlos O'Donell 2013-11-25 05:22:24 UTC
This bug is still reproducible in 2.18.90.
Comment 19 Carlos O'Donell 2013-11-26 23:02:49 UTC
In a test case where the application doesn't link against libpthread, but a dlopen'd library does, parallel calls to getaddrinfo cause corruption in the IO layers and eventually a crash.

Even though libpthread.so.0 has been loaded, the weak-ref-and-check idiom in the NSS code isn't working. The GOT entry stays zero, so the NSS code skips all locking and we get serious corruption via get_contents->__GI_fgets_unlocked (doing unlocked file IO with multiple threads causes data races and corruption).

The skipped locks are in _nss_files_gethostbyname4_r (libnss_files.so). When the application is compiled with -lpthread the GOT entry has a non-zero value of 0x00007ffff77bc460 which is "0x7ffff77bc460 <__GI___pthread_mutex_lock>:	sub    $0x8,%rsp" and therefore correct. That entry is the GOT entry #40 with relocation: 000000000020bfd8  0000001a00000006 R_X86_64_GLOB_DAT      0000000000000000 __pthread_mutex_lock + 0.

If libpthread is loaded *after* libnss_files.so is loaded I don't see that there is anything you can do to make the NSS code use locks since the GOT relocation has already been processed. However in this case libpthread is loaded *before* libnss_files.so, but it appears as if the resolution scope prevents the symbols from libpthread being made available to libnss_files.so?

e.g.
     20987:     object=/home/carlos/build/glibc/nss/libnss_files.so.2 [0]
     20987:      scope 0: ./crash_main_no_pthread /home/carlos/build/glibc/dlfcn/libdl.so.2 /home/carlos/build/glibc/libc.so.6 /home/carlos/build/glibc/elf/ld.so
     20987:      scope 1: /home/carlos/build/glibc/nss/libnss_files.so.2 /home/carlos/build/glibc/libc.so.6 /home/carlos/build/glibc/elf/ld.so

Notice that libnss_files.so.2 is in its own scope without libpthread, as opposed to crash_getaddrinfo.so's scope, which has libpthread in it.

e.g.
     20987:     object=/home/carlos/support/2013-11-22/crash_getaddrinfo.so [0]
     20987:      scope 0: ./crash_main_no_pthread /home/carlos/build/glibc/dlfcn/libdl.so.2 /home/carlos/build/glibc/libc.so.6 /home/carlos/build/glibc/elf/ld.so
     20987:      scope 1: /home/carlos/support/2013-11-22/crash_getaddrinfo.so /home/carlos/build/glibc/nptl/libpthread.so.0 /home/carlos/build/glibc/libc.so.6 /home/carlos/build/glibc/elf/ld.so

I don't know what the right answer is here. There are really only two resolution scopes, global and local; the scopes listed above are internal details of glibc's dynamic loader. Why libpthread's symbols wouldn't be used for the relocation in libnss_files.so is what baffles me. One would have to track down the exact relocation and determine why the libpthread symbol isn't used.

I'm not working on this so I'm flipping this to NEW, but I thought I'd post what I saw during my analysis of a similar internal Red Hat bug.
Comment 20 Rich Felker 2013-11-27 03:05:46 UTC
Why is getaddrinfo trying to "optimize" out the locking for single-threaded programs anyway? Certainly the time spent in getaddrinfo is dominated by actual lookups, not by locking overhead.
Comment 21 Carlos O'Donell 2013-11-27 05:04:03 UTC
(In reply to Rich Felker from comment #20)
> Why is getaddrinfo trying to "optimize" out the locking for single-threaded
> programs anyway? Certainly the time spent in getaddrinfo is dominated by
> actual lookups, not by locking overhead.

I can only assume it does this to avoid requiring libpthread. The actual lookups might also be very fast if they are resolved by /etc/hosts or some other local file-based NSS backend. Requiring the thread library would have a non-zero impact on performance for single-threaded applications. What other reason could there be for using the weak-ref-and-check idiom (which I know you don't like)?
Comment 22 Rich Felker 2013-11-27 05:19:17 UTC
As long as libpthread is a separate DSO, avoiding loading it makes sense, yes. However it seems that all the internal locking in glibc components (including nss modules) could be done with lock functions available unconditionally in libc rather than needing the pthread lock functions. I'm not familiar enough with the glibc internals to know whether such functions are already available, but it would certainly make for a cleaner solution to this and many other problems if they are. Note that the locking requirements for internal use are much simpler than pthread requirements; there are no difficult issues like different mutex types, self-synchronized destruction, etc.
Comment 23 Carlos O'Donell 2013-11-27 05:29:26 UTC
(In reply to Rich Felker from comment #22)
> As long as libpthread is a separate DSO, avoiding loading it makes sense,
> yes. However it seems that all the internal locking in glibc components
> (including nss modules) could be done with lock functions available
> unconditionally in libc rather than needing the pthread lock functions. I'm
> not familiar enough with the glibc internals to know whether such functions
> are already available, but it would certainly make for a cleaner solution to
> this and many other problems if they are. Note that the locking requirements
> for internal use are much simpler than pthread requirements; there are no
> difficult issues like different mutex types, self-synchronized destruction,
> etc.

No, you make a good point, and internally glibc already uses just plain futexes for __libc_lock_lock, but for non-libc modules like libnss_files.so.2 (loaded as part of the NSS plugin mechanism) the __libc_lock_lock defines redirect to __pthread_mutex_lock. I see no reason at the moment why they couldn't just use futexes for serializing threaded access. There was certainly no futex support when these NSS modules were written, so it might be a legacy issue. Switching them over to futex locking would solve this problem, and the uncontended lock case is an atomic operation that should always succeed.
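
To make the last point concrete, here is a minimal sketch of a futex-based lock that does not depend on libpthread. It uses the well-known three-state scheme (0 unlocked, 1 locked, 2 locked with possible waiters) and is Linux-specific. It is not glibc's internal lock implementation; simple_lock, simple_unlock and futex_call are hypothetical names.

```c
#define _GNU_SOURCE             /* for the syscall() declaration */
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw futex system call. */
static long futex_call(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void simple_lock(atomic_int *m)
{
    int c = 0;

    /* Fast path: uncontended, one atomic compare-and-swap 0 -> 1. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;

    /* Slow path: mark the lock contended and sleep until it is free. */
    if (c != 2)
        c = atomic_exchange(m, 2);
    while (c != 0) {
        futex_call(m, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(m, 2);
    }
}

void simple_unlock(atomic_int *m)
{
    /* Old value 2 means there may be waiters: reset and wake one. */
    if (atomic_fetch_sub(m, 1) != 1) {
        atomic_store(m, 0);
        futex_call(m, FUTEX_WAKE_PRIVATE, 1);
    }
}
```

The uncontended case never enters the kernel: it is the single compare-and-swap at the top of simple_lock, and the matching unlock is a single atomic decrement.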