Bug 12994 - getaddrinfo fails if response records returned in wrong order and one of them is server failure
Status: NEW
Product: glibc
Classification: Unclassified
Component: network
Version: 2.14
Importance: P2 normal
Assigned To: Not yet assigned to anyone
Reported: 2011-07-13 04:22 UTC by Jonathan Kamens
Modified: 2012-12-16 15:54 UTC

Attachments
tcpdump capture from getaddrinfo en.wikipedia.org (12.94 KB, application/x-pcap)
2011-07-13 04:22 UTC, Jonathan Kamens
test program (802 bytes, text/plain)
2011-07-13 04:22 UTC, Jonathan Kamens
another test (2.09 KB, text/x-csrc)
2012-11-04 10:53 UTC, karme

Description Jonathan Kamens 2011-07-13 04:22:14 UTC
Created attachment 5848 [details]
tcpdump capture from getaddrinfo en.wikipedia.org

A program calls getaddrinfo.

Deep within the bowels of the resolver library, __libc_res_nquery in
res_query.c creates two queries, an A query and an AAAA query.

Deeper within the bowels of the resolver library, send_dg in res_send.c sends
both queries and waits for responses. My name server sends the response to the
*second* query *first*, and it's a server failure. I'm pretty sure that if the
responses were sent in the reverse order, the problem would not occur.

At this point things get all screwed up. I'm not sure whether the problem is in
send_dg, __libc_res_nsend, or __libc_res_nquery. I've spent hours poring over
the code trying to figure out who is at fault. I can't, because this is some of
the most poorly written code I've looked at in a very long time. It's
completely incomprehensible and most of its "cleverness" is inadequately
documented.

Anyway, by the time status results bubble back up to getaddrinfo, the code has
decided that it was unable to resolve the host name to an address, even though
one of the two responses that came back from the DNS server had a valid A
record in it.

Test case? Run getaddrinfo on en.wikipedia.org immediately after restarting
your name server. I'm using BIND 9.8.0-7.P4.fc15.x86_64; I don't know how
universal this behavior is. I am attaching a wireshark dump from the virtual
interface that captures both my loopback interface (on which my client is
making its queries) and the queries my DNS server is making to try to satisfy
the local queries. And here's what my test program (which I will also attach)
prints as output:

Wed Jul 13 00:14:18 2011: getaddrinfo: Name or service not known

Note that if you run the exact same getaddrinfo call a second time immediately
afterwards it works, because the previous successful query response, which is a
CNAME, is cached and gets returned in response to both the A and AAAA queries.

Since this bug causes DNS queries that should succeed to fail in a very
user-visible way, I'm tempted to set it to critical, but I suppose since
there's no permanent loss of data it isn't actually. I don't know, tough call.
Comment 1 Jonathan Kamens 2011-07-13 04:22:57 UTC
Created attachment 5849 [details]
test program
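
The attachment itself is not reproduced on this page. As a rough illustration only, a test of this kind might look like the following C sketch. The function name resolve_and_report, the AF_UNSPEC/SOCK_STREAM hints, and the timestamp format are assumptions chosen to match the output quoted in the description; they are not taken from attachment 5849.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Resolve `host` once with AF_UNSPEC, so that the resolver may issue both
 * an A and an AAAA query. Returns the getaddrinfo() result code. On failure,
 * a timestamped line with the gai_strerror() text is written to `out`. */
int resolve_and_report(const char *host, FILE *out)
{
    struct addrinfo hints;
    struct addrinfo *result = NULL;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* assumption: no address family preference */
    hints.ai_socktype = SOCK_STREAM;  /* assumption: avoids one entry per socket type */

    int rc = getaddrinfo(host, NULL, &hints, &result);
    if (rc != 0) {
        char stamp[64];
        time_t now = time(NULL);
        strftime(stamp, sizeof stamp, "%a %b %d %H:%M:%S %Y", localtime(&now));
        fprintf(out, "%s: getaddrinfo: %s\n", stamp, gai_strerror(rc));
        return rc;
    }

    freeaddrinfo(result);
    return 0;
}
```

Calling resolve_and_report("en.wikipedia.org", stderr) immediately after restarting the local name server would correspond to the scenario in the description. Whether it fails depends on the resolver version and on how the name server orders its responses.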
Comment 2 Jonathan Kamens 2011-07-13 04:24:43 UTC
By the way, a workaround for the problem is putting "options single-request" in
/etc/resolv.conf.
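
For a single process, the same option can probably be applied without editing /etc/resolv.conf, because glibc documents a RES_OPTIONS environment variable that amends the resolv.conf options. The sketch below is an assumption-laden illustration, not something tested against this bug: the helper name enable_single_request is hypothetical, and it overwrites any RES_OPTIONS value already set.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdlib.h>

/* Ask the glibc resolver to send the A and AAAA queries sequentially.
 * Assumption: this runs before the process makes its first resolver call,
 * since the variable is consulted when the resolver state is initialized.
 * Returns 0 on success, -1 on failure (as setenv() does). */
int enable_single_request(void)
{
    /* Third argument 1: replace any existing RES_OPTIONS value. */
    return setenv("RES_OPTIONS", "single-request", 1);
}
```

Setting RES_OPTIONS=single-request in the shell before starting the program should have the same effect.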
Comment 3 karme 2012-11-04 10:49:15 UTC
First: I didn't test with the latest glibc because I failed to compile it.
But I am quite sure the bug is still present and quite severe. It happens not
only if the order is wrong; it also happens if there is no answer to the A
record request (for example, the request or response is lost, or the DNS server
of a typical home router is too slow).

I will attach a test with some comments.
Comment 4 karme 2012-11-04 10:53:17 UTC
Created attachment 6714 [details]
another test
Comment 5 karme 2012-11-06 11:49:19 UTC
I now believe packet reordering is not enough to reproduce the problem. I have
written a small DNS proxy for better testing. The simplest scenario that
reproduces the problem is to drop all A record requests and answer only the
AAAA request.

The result of getaddrinfo with hints.ai_family = AF_UNSPEC is then an error:
r=-2, Name or service not known.

Traffic is like:

12.786202    127.0.0.1 -> 127.0.0.1    DNS 68 Standard query 0x9f5f  A karme.de
12.786962    127.0.0.1 -> 127.0.0.1    DNS 68 Standard query 0x77c8  AAAA karme.de
14.896700    127.0.0.1 -> 127.0.0.1    DNS 119 Standard query response 0x77c8 
17.788941    127.0.0.1 -> 127.0.0.1    DNS 68 Standard query 0x9f5f  A karme.de
22.794223    127.0.0.1 -> 127.0.0.1    DNS 68 Standard query 0x9f5f  A karme.de
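
The proxy used for this test was not posted. Its core decision (drop queries for A records, let everything else through) could be sketched roughly as follows in C. The names dns_first_qtype and should_drop_query are hypothetical, the parser assumes an uncompressed question name (which is what queries normally carry), and the socket forwarding loop is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
    DNS_HEADER_LEN = 12,  /* fixed DNS message header size (RFC 1035) */
    DNS_TYPE_A     = 1,
    DNS_TYPE_AAAA  = 28
};

/* Return the QTYPE of the first question in a DNS message,
 * or -1 if the message is truncated or malformed. */
int dns_first_qtype(const unsigned char *pkt, size_t len)
{
    if (len < DNS_HEADER_LEN)
        return -1;

    /* QDCOUNT is the 16-bit big-endian field at offset 4. */
    unsigned qdcount = ((unsigned)pkt[4] << 8) | pkt[5];
    if (qdcount == 0)
        return -1;

    /* Skip the question name: length-prefixed labels ending in a zero byte. */
    size_t pos = DNS_HEADER_LEN;
    for (;;) {
        if (pos >= len)
            return -1;
        unsigned label_len = pkt[pos];
        if (label_len == 0) {
            pos++;
            break;
        }
        if (label_len > 63)  /* compression pointer or invalid label */
            return -1;
        pos += 1 + label_len;
    }

    /* QTYPE and QCLASS follow the name, two bytes each. */
    if (len - pos < 4)
        return -1;
    return (int)(((unsigned)pkt[pos] << 8) | pkt[pos + 1]);
}

/* Proxy policy for this scenario: silently drop A queries only. */
bool should_drop_query(const unsigned char *pkt, size_t len)
{
    return dns_first_qtype(pkt, len) == DNS_TYPE_A;
}
```

With this policy the A query (0x9f5f in the trace above) would never be answered and would be retransmitted, while the AAAA query (0x77c8) would get its response.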
Comment 6 law 2012-11-06 13:22:26 UTC
Just one question (because I believe this bug is ultimately a duplicate of
another existing issue), what is the failure mode you're seeing?  ie, do you
hit an assert, abort, segfault, error code, whatever.
Comment 7 karme 2012-11-07 17:32:42 UTC
(In reply to comment #6)
> Just one question (because I believe this bug is ultimately a duplicate of
> another existing issue), what is the failure mode you're seeing?  ie, do you
> hit an assert, abort, segfault, error code, whatever.

getaddrinfo returns the error code EAI_NONAME when it should return EAI_AGAIN.
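
The distinction matters to callers because the POSIX code for a temporary failure, spelled EAI_AGAIN, is the one that signals a retry may help, while EAI_NONAME says the name does not exist. A minimal sketch of a caller that relies on this, with hypothetical helper names is_transient_failure and resolve_with_retry and assumed AF_UNSPEC/SOCK_STREAM hints:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Nonzero if the getaddrinfo() code indicates a temporary failure. */
int is_transient_failure(int rc)
{
    return rc == EAI_AGAIN;
}

/* Call getaddrinfo() up to `max_attempts` times, retrying only while the
 * resolver reports a temporary failure. Returns the last result code, or
 * EAI_AGAIN if `max_attempts` is not positive. On success the caller owns
 * `*result` and must release it with freeaddrinfo(). */
int resolve_with_retry(const char *host, int max_attempts,
                       struct addrinfo **result)
{
    struct addrinfo hints;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    *result = NULL;
    int rc = EAI_AGAIN;
    for (int attempt = 0;
         attempt < max_attempts && is_transient_failure(rc);
         attempt++) {
        rc = getaddrinfo(host, NULL, &hints, result);
    }
    return rc;
}
```

A caller written this way gives up immediately when it receives EAI_NONAME, so a lost or delayed A response that is reported as EAI_NONAME turns a recoverable condition into a hard failure.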
Comment 8 karme 2012-11-27 12:02:30 UTC
(In reply to comment #6)
> Just one question (because I believe this bug is ultimately a duplicate of
> another existing issue), what is the failure mode you're seeing?  ie, do you
> hit an assert, abort, segfault, error code, whatever.

Which bug number do you think this is a duplicate of?
Comment 9 law 2012-11-27 16:51:41 UTC
I thought it might be a duplicate of 13013, 13651, or another one (the number
escapes me) in the Red Hat Bugzilla database. Based on the information you
provided in comment #7, I believe this is a separate issue.
Comment 10 Pavel Šimerda 2012-12-16 15:54:57 UTC
Splitting this bug report so that it only refers to literal address
translation.

For /etc/hosts resolutions, see:

http://sourceware.org/bugzilla/show_bug.cgi?id=14966

For other name resolution problems, search for the bug report and file a new
one if you don't find it.

That means that if Tore's patch works, this bug can be closed and the other one
would be tracked separately.