Sources Bugzilla – Bug 12994
getaddrinfo fails if response records returned in wrong order and one of them is server failure
Last modified: 2012-12-16 15:54:57 UTC
Created attachment 5848 [details] tcpdump capture from getaddrinfo en.wikipedia.org

A program calls getaddrinfo. Deep within the bowels of the resolver library, __libc_res_nquery in res_query.c creates two queries, an A query and an AAAA query. Deeper within the bowels of the resolver library, send_dg in res_send.c sends both queries and waits for responses.

My name server sends the response to the *second* query *first*, and it's a server failure. I'm pretty sure that if the responses were sent in the reverse order, the problem would not occur.

At this point things get all screwed up. I'm not sure whether the problem is in send_dg, __libc_res_nsend, or __libc_res_nquery. I've spent hours poring over the code trying to figure out who is at fault. I can't, because this is some of the most poorly written code I've looked at in a very long time. It's completely incomprehensible and most of its "cleverness" is inadequately documented.

Anyway, by the time status results bubble back up to getaddrinfo, the code has decided that it was unable to resolve the host name to an address, even though one of the two responses that came back from the DNS server had a valid A record in it.

Test case? Run getaddrinfo on en.wikipedia.org immediately after restarting your name server. I'm using BIND 9.8.0-7.P4.fc15.x86_64; I don't know how universal this behavior is.

I am attaching a Wireshark dump from the virtual interface that captures both my loopback interface (on which my client is making its queries) and the queries my DNS server is making to try to satisfy the local queries.

And here's what my test program (which I will also attach) prints as output:

  Wed Jul 13 00:14:18 2011: getaddrinfo: Name or service not known

Note that if you run the exact same getaddrinfo call a second time immediately afterwards, it works, because the previous successful query response, which is a CNAME, is cached and gets returned in response to both the A and AAAA queries.
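The behaviour this report expects can be written down as a small decision table. The following is only a sketch of that expectation, not glibc's actual logic: the outcome labels and the function name are hypothetical, and mapping a temporary failure to EAI_AGAIN is an assumption. Python is used for brevity; its socket module exposes the same EAI_* constants as <netdb.h>.

```python
import socket

# Hypothetical labels for the outcome of a single DNS query.
# These are illustrative names, not glibc identifiers.
ANSWER = "answer"      # response carried at least one address record
NODATA = "nodata"      # name exists, but has no record of this type
NXDOMAIN = "nxdomain"  # name does not exist
SERVFAIL = "servfail"  # server failure response
TIMEOUT = "timeout"    # no response arrived at all

TEMPORARY = {SERVFAIL, TIMEOUT}


def expected_result(a_outcome, aaaa_outcome):
    """Return the getaddrinfo result a caller would expect (0 = success).

    Assumption: the order in which the two responses arrive must not
    matter, so the function only looks at the pair of outcomes.
    """
    outcomes = (a_outcome, aaaa_outcome)
    if ANSWER in outcomes:
        # One usable address is enough, even if the other query failed.
        return 0
    if any(outcome in TEMPORARY for outcome in outcomes):
        # Nothing usable yet, but a retry might succeed.
        return socket.EAI_AGAIN
    return socket.EAI_NONAME
```

In the scenario above, the A query returns a valid record and the AAAA query returns a server failure, so expected_result(ANSWER, SERVFAIL) is 0 under this model, whereas the observed result was "Name or service not known".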
Since this bug causes DNS queries that should succeed to fail in a very user-visible way, I'm tempted to set it to critical, but I suppose since there's no permanent loss of data it isn't actually. I don't know, tough call.
Created attachment 5849 [details] test program
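The attachment itself is not reproduced in this thread. A minimal sketch of an equivalent check follows, assuming Python's socket.getaddrinfo (which calls the C library's getaddrinfo) is an acceptable stand-in for the attached program. The function names resolve and report are hypothetical, and the message format only imitates the output line quoted in the description.

```python
import socket
import time


def resolve(host, family=socket.AF_UNSPEC, flags=0):
    """Return (error_code, message, addresses); error_code is 0 on success."""
    try:
        infos = socket.getaddrinfo(host, None, family,
                                   socket.SOCK_STREAM, 0, flags)
    except socket.gaierror as exc:
        # exc.errno holds the EAI_* code, exc.strerror the text that
        # gai_strerror() would produce.
        return exc.errno, exc.strerror, []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return 0, "", sorted({info[4][0] for info in infos})


def report(host):
    """Format one result line, timestamped like the output quoted above."""
    code, message, addresses = resolve(host)
    stamp = time.ctime()
    if code != 0:
        return f"{stamp}: getaddrinfo: {message}"
    return f"{stamp}: {host}: {', '.join(addresses)}"
```

To follow the test case from the description, restart the name server and then call print(report("en.wikipedia.org")) twice in a row. According to the report, the first call fails and the second succeeds.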
By the way, a workaround for the problem is putting "options single-request" in /etc/resolv.conf.
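To check whether the workaround is already in place on a machine, something like the following could be used. It is a simplified sketch: the function name is hypothetical, and the parsing assumes only that resolv.conf lines are whitespace-separated keywords and values, with comment lines starting with "#" or ";".

```python
def has_single_request(resolv_conf_text):
    """Return True if an options line contains the single-request option."""
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        # Compare whole tokens so that "single-request-reopen",
        # a different option, does not count as a match.
        if fields[0] == "options" and "single-request" in fields[1:]:
            return True
    return False
```

Typical use would be has_single_request(open("/etc/resolv.conf").read()).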
First: I didn't test with the latest glibc because I failed to compile it, but I am quite sure the bug is still present, and it is quite severe. It happens not only if the order is wrong; it also happens if there is no answer to the A record request (the request or response is lost, the DNS server of a typical home router is too slow, ...). I will attach a test with some comments.
Created attachment 6714 [details] another test
I now believe packet re-ordering is not enough to reproduce the problem. I have written a small DNS proxy for better testing. The simplest scenario to reproduce the problem is to drop all A record requests and just answer the AAAA request. The answer from getaddrinfo with hints.ai_family = AF_UNSPEC is then the error: r=-2 Name or service not known.

Traffic looks like this:

  12.786202 127.0.0.1 -> 127.0.0.1 DNS 68 Standard query 0x9f5f A karme.de
  12.786962 127.0.0.1 -> 127.0.0.1 DNS 68 Standard query 0x77c8 AAAA karme.de
  14.896700 127.0.0.1 -> 127.0.0.1 DNS 119 Standard query response 0x77c8
  17.788941 127.0.0.1 -> 127.0.0.1 DNS 68 Standard query 0x9f5f A karme.de
  22.794223 127.0.0.1 -> 127.0.0.1 DNS 68 Standard query 0x9f5f A karme.de
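The proxy used for this test is not attached, so the following is only a sketch of how such a proxy might work, not the one actually used. It forwards UDP DNS queries to an upstream server but silently drops every query whose type is A. All names are hypothetical, and the parser assumes well-formed queries with a single question and no name compression.

```python
import socket
import struct

QTYPE_A = 1
QTYPE_AAAA = 28


def build_query(txid, name, qtype):
    """Build a minimal DNS query packet (used here to exercise the parser)."""
    # Header: id, flags (RD set), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.split(".")) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def query_type(packet):
    """Return the QTYPE of the first question in a DNS query packet."""
    pos = 12  # the DNS header is 12 bytes long
    while packet[pos] != 0:
        pos += 1 + packet[pos]  # skip one length-prefixed label
    (qtype,) = struct.unpack_from("!H", packet, pos + 1)
    return qtype


def should_drop(packet):
    """Drop A queries so that only the AAAA query gets an answer."""
    return query_type(packet) == QTYPE_A


def run_proxy(listen_addr, upstream_addr):
    """Forward queries to upstream_addr, except those that should_drop."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(listen_addr)
    while True:
        packet, client = server.recvfrom(4096)
        if should_drop(packet):
            continue  # the client never sees a reply for this query
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as upstream:
            upstream.settimeout(5)
            upstream.sendto(packet, upstream_addr)
            try:
                reply, _ = upstream.recvfrom(4096)
            except socket.timeout:
                continue
        server.sendto(reply, client)
```

To use it, call run_proxy with a local listen address and the address of a real name server, then point a nameserver line in /etc/resolv.conf at the listen address. Binding to port 53 normally requires root privileges.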
Just one question (because I believe this bug is ultimately a duplicate of another existing issue), what is the failure mode you're seeing? i.e., do you hit an assert, abort, segfault, error code, whatever.
(In reply to comment #6)
> Just one question (because I believe this bug is ultimately a duplicate of
> another existing issue), what is the failure mode you're seeing? i.e., do you
> hit an assert, abort, segfault, error code, whatever.

getaddrinfo returns error code EAI_NONAME when it should return EAI_AGAIN.
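The distinction matters to callers because the temporary-failure code (EAI_AGAIN in <netdb.h>) tells them a retry may help, while EAI_NONAME tells them to give up. A sketch of a caller that relies on this, with a hypothetical function name and with the resolver and sleep functions passed in so the logic can be exercised without a network:

```python
import socket
import time


def resolve_with_retry(host, attempts=3, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Call resolver(host, None), retrying only on temporary failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return resolver(host, None)
        except socket.gaierror as exc:
            is_last = attempt == attempts - 1
            if exc.errno != socket.EAI_AGAIN or is_last:
                # Permanent errors such as EAI_NONAME are not retried.
                raise
            sleep(delay)
```

With the behaviour reported here, a caller like this gives up after the first attempt, because the lost A response is reported as EAI_NONAME rather than as a temporary failure.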
(In reply to comment #6)
> Just one question (because I believe this bug is ultimately a duplicate of
> another existing issue), what is the failure mode you're seeing? i.e., do you
> hit an assert, abort, segfault, error code, whatever.

Which bug number do you think this is a duplicate of?
I thought it might be a duplicate of 13013, 13651 or another (# escapes me) in the Red Hat bugzilla database. Based on the information you provided in c#7 I believe this is a separate issue.
Splitting this bug report so that it only refers to literal address translation. For /etc/hosts resolutions, see: http://sourceware.org/bugzilla/show_bug.cgi?id=14966 For other name resolution problems, search for the bug report and file a new one if you don't find it. That means that if Tore's patch works, this bug can be closed and the other one would be tracked separately.