I have a production deployment. resolv.conf contains:

    nameserver A
    nameserver B

gai.conf has only comments (default Ubuntu config). nsswitch.conf is the default, which contains:

    hosts: files mdns4_minimal [NOTFOUND=return] dns

I have long-running programs that use getaddrinfo to resolve the same hostname over and over. It's using Python:

    socket.getaddrinfo('host.name.com', '22', socket.AF_UNSPEC, socket.SOCK_STREAM)

When nameserver A goes down, which happens every now and then, frequently one or more of the long-running services will have that getaddrinfo call fail every time. When this is happening, a packet capture shows this sequence of events:

    t = 0: query to A
    t = 5.005 seconds: query to B
    t = 5.053153 seconds: response from B (looks valid to me)
    t = 5.053208 seconds: new query to A
    t = 10.058251 seconds: new query to B
    t = 10.076853 seconds: response from B
    t = 10.076908 seconds: query for host.name.com.my.domain.name to A
    t = 15.080450 seconds: query for host.name.com.my.domain.name to B
    t = 15.099330 seconds: NXDOMAIN from B
    t = 15.099387 seconds: query for host.name.com.my.domain.name to A
    t = 20.104418 seconds: query for host.name.com.my.domain.name to B
    t = 20.123345 seconds: NXDOMAIN from B

At this point, I get "gaierror: [Errno -2] Name or service not known" back from Python.

From the timing, it looks like glibc is considering the actual valid responses from B to be failures.

If I restart the service with nameserver A still down, everything works (it tries A, then tries B 5 seconds later and accepts the answer).

The system in question does not use iptables.

I can't consistently reproduce this, unfortunately. I can try to run different diagnostics the next time it happens, though.
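One diagnostic that could be left running alongside the affected services is a small probe that repeats the same getaddrinfo call and logs how long each attempt took and whether it failed. The following is only a minimal sketch and not part of the original report: the names timed_getaddrinfo, describe and monitor, and the injectable resolve parameter, are hypothetical. Since restarting a service clears the problem, a freshly started probe may well succeed while the long-running services keep failing, so it would presumably need to be running before nameserver A goes down.

```python
import socket
import time


def timed_getaddrinfo(host, port, resolve=socket.getaddrinfo):
    """Run one lookup; return (elapsed_seconds, results_or_None, error_or_None)."""
    start = time.monotonic()
    try:
        # Same arguments as the call in the report.
        results = resolve(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        return time.monotonic() - start, results, None
    except socket.gaierror as exc:
        return time.monotonic() - start, None, exc


def describe(elapsed, results, error):
    """Format one lookup outcome as a single log line."""
    if error is not None:
        return "FAIL after %.3fs: errno=%s %s" % (elapsed, error.errno, error.strerror)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    addresses = sorted({info[4][0] for info in results})
    return "OK after %.3fs: %s" % (elapsed, ", ".join(addresses))


def monitor(host, port, interval=60.0, count=None, resolve=socket.getaddrinfo):
    """Repeat the lookup every `interval` seconds; run forever if count is None."""
    lines = []
    done = 0
    while count is None or done < count:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        line = "%s %s" % (stamp, describe(*timed_getaddrinfo(host, port, resolve)))
        print(line, flush=True)
        lines.append(line)
        done += 1
        if count is None or done < count:
            time.sleep(interval)
    return lines
```

Calling monitor('host.name.com', '22') in a process started while both nameservers are healthy would give a timestamped record to compare against the packet capture. A failure that takes roughly 20 seconds would match the sequence above.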
Please try this patch:

diff --git a/resolv/res_send.c b/resolv/res_send.c
index c35fb66..d6b99ba 100644
--- a/resolv/res_send.c
+++ b/resolv/res_send.c
@@ -448,7 +448,7 @@ __libc_res_nsend(res_state statp, const u_char *buf, int buflen,
                         malloc(sizeof (struct sockaddr_in6));
                 if (EXT(statp).nsaddrs[n] != NULL) {
                     memset (mempcpy(EXT(statp).nsaddrs[n],
-                            &statp->nsaddr_list[ns],
+                            &statp->nsaddr_list[n],
                             sizeof (struct sockaddr_in)),
                         '\0',
                         sizeof (struct sockaddr_in6)
Unfortunately, I don't know how to test this, given that I was never able to trigger it intentionally.
Based on comment 1, this could be bug 13028, fixed by this commit (which went into 2.19):

commit cabba9343c8bd99e4aea66aa1e0ec7d93aa18a7e
Author: Ondřej Bílka <neleai@seznam.cz>
Date:   Sun Oct 13 23:03:28 2013 +0200

    Correctly copy resolver address. Fixes bug #13028.

Closing due to insufficient data.