Bug 17802 - The DNS resolver gets stuck when one nameserver is down
Status: RESOLVED WORKSFORME
Alias: None
Product: glibc
Classification: Unclassified
Component: network
Version: 2.19
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 13028
Reported: 2015-01-06 01:56 UTC by Andy Lutomirski
Modified: 2016-02-09 18:38 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Andy Lutomirski 2015-01-06 01:56:09 UTC
I have a production deployment.  resolv.conf contains:

nameserver A
nameserver B

gai.conf has only comments (default Ubuntu config).  nsswitch.conf is the default, which contains:

hosts:          files mdns4_minimal [NOTFOUND=return] dns

I have long-running programs that use getaddrinfo to resolve the same hostname over and over.  They are written in Python and make this call:

socket.getaddrinfo('host.name.com', '22', socket.AF_UNSPEC, socket.SOCK_STREAM)

Nameserver A goes down every now and then.  When it does, one or more of the long-running services will frequently start failing that getaddrinfo call every time.  While this is happening, a packet capture shows the following sequence of events:

t = 0 seconds: query to A
t = 5.005 seconds: query to B
t = 5.053153 seconds: response from B (looks valid to me)
t = 5.053208 seconds: new query to A
t = 10.058251 seconds: new query to B
t = 10.076853 seconds: response from B
t = 10.076908 seconds: query for host.name.com.my.domain.name to A
t = 15.080450 seconds: query for host.name.com.my.domain.name to B
t = 15.099330 seconds: NXDOMAIN from B
t = 15.099387 seconds: query for host.name.com.my.domain.name to A
t = 20.104418 seconds: query for host.name.com.my.domain.name to B
t = 20.123345 seconds: NXDOMAIN from B

At this point, I get "gaierror: [Errno -2] Name or service not known" back from Python.

From the timing, it looks like glibc is considering the actual valid responses from B to be failures.
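
The timing argument can be checked with a little arithmetic.  The sketch below hard-codes the timestamps from the capture above and computes, for each round, how long the resolver waited on A before falling back to B, how quickly B answered, and how soon the next query to A followed B's answer.  The names ROUNDS and summarize are illustrative only.  The waits on A all come out at roughly 5 seconds, which matches the resolver's default per-query timeout, assuming resolv.conf carries no "options timeout:" override.

```python
# Timestamps in seconds, copied from the packet capture in the description.
# Each tuple is (query sent to A, fallback query sent to B, response from B).
ROUNDS = [
    (0.0, 5.005, 5.053153),
    (5.053208, 10.058251, 10.076853),
    (10.076908, 15.080450, 15.099330),
    (15.099387, 20.104418, 20.123345),
]


def summarize(rounds):
    """Return one (wait_on_a, b_latency, gap_to_next_query) tuple per round.

    gap_to_next_query is None for the last round, because the capture
    shows no further query after the final response from B.
    """
    rows = []
    for i, (query_a, query_b, response_b) in enumerate(rounds):
        wait_on_a = query_b - query_a      # time spent waiting on the dead server
        b_latency = response_b - query_b   # how fast B actually answered
        if i + 1 < len(rounds):
            gap = rounds[i + 1][0] - response_b  # delay before the next query to A
        else:
            gap = None
        rows.append((wait_on_a, b_latency, gap))
    return rows


if __name__ == "__main__":
    for wait_on_a, b_latency, gap in summarize(ROUNDS):
        print(wait_on_a, b_latency, gap)
```

Every response from B is followed by a fresh query to A within well under a millisecond, so the roughly 20 seconds before the error is almost entirely time spent waiting on A.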

If I restart the service with nameserver A still down, everything works (it tries A, then tries B 5 seconds later and accepts the answer).

The system in question does not use iptables.

I can't consistently reproduce this, unfortunately.  I can try to run different diagnostics the next time it happens, though.
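
One diagnostic that could be left running until the problem recurs is a small loop that repeats the same lookup and logs how long each call takes and whether it fails.  The following is only a sketch: timed_lookup and monitor are hypothetical helper names, the 60-second interval is arbitrary, and the resolver parameter exists purely so that the function can be exercised without network access.

```python
import socket
import time


def timed_lookup(host, port, resolver=socket.getaddrinfo):
    """Run one lookup and return (elapsed_seconds, addresses, error)."""
    start = time.monotonic()
    try:
        results = resolver(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        error = None
    except socket.gaierror as exc:
        results = []
        error = exc
    elapsed = time.monotonic() - start
    # Each getaddrinfo result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    addresses = sorted({entry[4][0] for entry in results})
    return elapsed, addresses, error


def monitor(host, port, interval=60.0, iterations=None):
    """Log one line per lookup; loop forever when iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        elapsed, addresses, error = timed_lookup(host, port)
        status = "FAIL %r" % (error,) if error else "ok %s" % (addresses,)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        print("%s %.3fs %s" % (stamp, elapsed, status), flush=True)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

Calling monitor('host.name.com', '22') would then produce one log line per minute.  Since restarting a service clears the problem, the bad state appears to be tied to an individual process, so a loop like this would have to be running already when nameserver A goes down to have a chance of entering the same state.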
Comment 1 Andreas Schwab 2015-02-17 12:54:24 UTC
Please try this patch:

diff --git a/resolv/res_send.c b/resolv/res_send.c
index c35fb66..d6b99ba 100644
--- a/resolv/res_send.c
+++ b/resolv/res_send.c
@@ -448,7 +448,7 @@ __libc_res_nsend(res_state statp, const u_char *buf, int buflen,
 				    malloc(sizeof (struct sockaddr_in6));
 			if (EXT(statp).nsaddrs[n] != NULL) {
 				memset (mempcpy(EXT(statp).nsaddrs[n],
-						&statp->nsaddr_list[ns],
+						&statp->nsaddr_list[n],
 						sizeof (struct sockaddr_in)),
 					'\0',
 					sizeof (struct sockaddr_in6)
Comment 2 Andy Lutomirski 2015-02-19 00:20:33 UTC
Unfortunately, I don't know how to test this, given that I was never able to trigger it intentionally.
Comment 3 Florian Weimer 2016-02-09 18:38:51 UTC
Based on comment 1, this could be bug 13028, fixed by this commit (which went into 2.19):

commit cabba9343c8bd99e4aea66aa1e0ec7d93aa18a7e
Author: Ondřej Bílka <neleai@seznam.cz>
Date:   Sun Oct 13 23:03:28 2013 +0200

    Correctly copy resolver address. Fixes bug #13028.

Closing due to insufficient data.