This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



Re: Indefinite hang in getaddrinfo / check_pf / make_request



On 22-09-2015 20:53, Steven Schlansker wrote:
> Hello,
> 
> We are having issues where our applications (so far reproduced both on Node.js and Mono applications)
> enter a completely stuck state inside of libc calling into the kernel netlink interface.
> 
> When they get stuck, they have a characteristic stack trace.  Taken from a Node process:
> 
> #0  0x00007fd7d8d214ad in recvmsg () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007fd7d8d3e44d in make_request (fd=fd@entry=13, pid=1) at ../sysdeps/unix/sysv/linux/check_pf.c:177
> #2  0x00007fd7d8d3e9a4 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7fd7d37fdd00, seen_ipv6=seen_ipv6@entry=0x7fd7d37fdd10, 
>     in6ai=in6ai@entry=0x7fd7d37fdd40, in6ailen=in6ailen@entry=0x7fd7d37fdd50) at ../sysdeps/unix/sysv/linux/check_pf.c:341
> #3  0x00007fd7d8cf64e1 in __GI_getaddrinfo (name=0x31216e0 "mesos-slave4-prod-uswest2.otsql.opentable.com", service=0x0, 
>     hints=0x31216b0, pai=0x31f09e8) at ../sysdeps/posix/getaddrinfo.c:2355
> #4  0x0000000000e101c8 in uv__getaddrinfo_work (w=0x31f09a0) at ../deps/uv/src/unix/getaddrinfo.c:102
> #5  0x0000000000e09179 in worker (arg=<optimized out>) at ../deps/uv/src/threadpool.c:91
> #6  0x0000000000e16eb1 in uv__thread_start (arg=<optimized out>) at ../deps/uv/src/unix/thread.c:49
> #7  0x00007fd7d8ff3182 in start_thread (arg=0x7fd7d37fe700) at pthread_create.c:312
> #8  0x00007fd7d8d2047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> The recvmsg call never returns.
> 
> We found the following issue:
> https://sourceware.org/bugzilla/show_bug.cgi?id=15946
> 
> This bug matches our symptoms perfectly.
> 
> However, we are running a libc that has the patch applied!
> 
> ii  libc6:amd64  2.19-0ubuntu6.6  amd64  Embedded GNU C Library: Shared libraries
> 
> https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/1328975
> 
> So now I'm confused, as we are still seeing the symptoms, but have the patch applied.
> 
> Once this hang happens, eventually all threads in the process end up blocked trying to take the check_pf lock,
> and there is no recourse but to kill the process.
> 
> Is it possible there is another race condition or other error here?  How could we have so many processes
> getting stuck here?  What diagnostics might I run to get a better fix on the problem?
> 
> We run vanilla kernel 4.0.4.  These processes are inside of a Docker container (1.7.1), but with network isolation in "host" mode which hopefully means that there is no separate network namespace that might be interfering.
> 
> Thank you for any advice, this issue is driving us crazy!
> 

Hi, I believe the first thing to try is the reproducer in the bug
report, to check whether you are hitting the same issue.  The bug report
itself has more than one confirmation that the patch does fix the issue.
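
If the reproducer from the bug report is awkward to run inside the container, a rough stress test along the following lines may also help narrow things down. This is only a sketch and not the reproducer from BZ #15946: the function names, the default host, and the thread and iteration counts are all illustrative assumptions, and passing AI_ADDRCONFIG is an assumption about which getaddrinfo path reaches __check_pf on your system. It calls getaddrinfo from several threads at once and reports any worker that has not finished within a timeout.

```python
import socket
import threading
import time


def resolve_many(host, iterations, progress, index):
    """Call getaddrinfo repeatedly and record how many calls returned."""
    for count in range(1, iterations + 1):
        try:
            # AI_ADDRCONFIG is assumed here to exercise the address
            # configuration check; adjust to match the real application.
            socket.getaddrinfo(host, None, flags=socket.AI_ADDRCONFIG)
        except OSError:
            # A failed lookup is not a hang, so keep going.
            pass
        progress[index] = count


def stress(host="localhost", threads=8, iterations=200, timeout=30.0):
    """Return the indices of workers that did not finish within `timeout`."""
    progress = [0] * threads
    workers = [
        threading.Thread(
            target=resolve_many,
            args=(host, iterations, progress, i),
            daemon=True,  # a stuck worker must not block interpreter exit
        )
        for i in range(threads)
    ]
    for worker in workers:
        worker.start()

    deadline = time.monotonic() + timeout
    stuck = []
    for i, worker in enumerate(workers):
        worker.join(max(0.0, deadline - time.monotonic()))
        if worker.is_alive():
            stuck.append(i)
    return stuck
```

Calling `stress()` returns an empty list when every worker completes in time. A non-empty list only means those workers were slow or blocked, so it is worth attaching gdb to the process at that point to see whether the threads sit in recvmsg under make_request as in the trace above.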

Another option is to check whether you can reproduce the issue with either
a more recent glibc or with master.  I noticed there is a recent (2.21) fix
that might be related:

fda389c8f0311dd5786be91a7b54b9f935fcafa1 - Fix infinite loop in check_pf (BZ #12926)
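
When comparing against a newer glibc, it can help to confirm which version the process actually loads at run time, especially inside a container whose image may differ from the host. The sketch below is an illustration only: the helper names are made up for this example, and it relies on `os.confstr("CS_GNU_LIBC_VERSION")`, which is available on glibc-based Linux systems. Keep in mind that distributions backport patches, so an upstream version below 2.21 does not by itself prove that a given fix is missing.

```python
import os
import re


def parse_version(text):
    """Turn a string such as '2.19' into a comparable tuple like (2, 19)."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def glibc_version():
    """Return the run-time glibc version as a tuple, or None if unknown."""
    try:
        value = os.confstr("CS_GNU_LIBC_VERSION")  # e.g. "glibc 2.19"
    except (AttributeError, ValueError, OSError):
        # Not a glibc system, or the name is not supported here.
        return None
    if not value:
        return None
    name, _, version = value.partition(" ")
    if name != "glibc":
        return None
    return parse_version(version)


def at_least(version, minimum):
    """True if `version` is known and not older than `minimum`."""
    return version is not None and version >= minimum
```

For example, `at_least(glibc_version(), (2, 21))` indicates whether the running library is at or beyond the upstream release mentioned above.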

