This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.
Indefinite hang in getaddrinfo / check_pf / make_request
- From: Steven Schlansker <stevenschlansker at gmail dot com>
- To: libc-help at sourceware dot org
- Date: Tue, 22 Sep 2015 20:53:03 -0700
- Subject: Indefinite hang in getaddrinfo / check_pf / make_request
- Authentication-results: sourceware.org; auth=none
We are having issues where our applications (so far reproduced with both Node.js and Mono)
enter a completely stuck state inside libc, in a call into the kernel's netlink interface.
When they get stuck, they show a characteristic stack trace. Here is one taken from a Node process:
#0 0x00007fd7d8d214ad in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fd7d8d3e44d in make_request (fd=fd@entry=13, pid=1) at ../sysdeps/unix/sysv/linux/check_pf.c:177
#2 0x00007fd7d8d3e9a4 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7fd7d37fdd00, seen_ipv6=seen_ipv6@entry=0x7fd7d37fdd10,
in6ai=in6ai@entry=0x7fd7d37fdd40, in6ailen=in6ailen@entry=0x7fd7d37fdd50) at ../sysdeps/unix/sysv/linux/check_pf.c:341
#3 0x00007fd7d8cf64e1 in __GI_getaddrinfo (name=0x31216e0 "mesos-slave4-prod-uswest2.otsql.opentable.com", service=0x0,
hints=0x31216b0, pai=0x31f09e8) at ../sysdeps/posix/getaddrinfo.c:2355
#4 0x0000000000e101c8 in uv__getaddrinfo_work (w=0x31f09a0) at ../deps/uv/src/unix/getaddrinfo.c:102
#5 0x0000000000e09179 in worker (arg=<optimized out>) at ../deps/uv/src/threadpool.c:91
#6 0x0000000000e16eb1 in uv__thread_start (arg=<optimized out>) at ../deps/uv/src/unix/thread.c:49
#7 0x00007fd7d8ff3182 in start_thread (arg=0x7fd7d37fe700) at pthread_create.c:312
#8 0x00007fd7d8d2047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
The recvmsg call never returns.
We found an existing bug report that matches our symptoms perfectly.
However, we are running a libc that already has the corresponding patch applied:
ii libc6:amd64 2.19-0ubuntu6.6 amd64 Embedded GNU C Library: Shared libraries
So now I'm confused: we are still seeing the symptoms even though the patch is applied.
Once this hang happens, every thread in the process eventually blocks waiting for the check_pf lock,
and there is no recourse but to kill the process.
Is it possible there is another race condition or other error here? How could so many of our processes
be getting stuck in the same place? What diagnostics might I run to get a better fix on the problem?
We run a vanilla 4.0.4 kernel. These processes run inside a Docker container (1.7.1), but with network isolation in "host" mode, which should mean there is no separate network namespace that might be interfering.
Thank you for any advice, this issue is driving us crazy!