Bug 13904 - getaddrinfo does two identical queries, breaks dns round robin with two hosts
Status: RESOLVED DUPLICATE of bug 14307
Alias: None
Product: glibc
Classification: Unclassified
Component: network
Version: 2.11
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2012-03-26 19:33 UTC by kjp
Modified: 2014-06-18 04:30 UTC
fweimer: security-
Description kjp 2012-03-26 19:33:42 UTC
The host 'gridftp.nautilus.nics.xsede.org' has two IPv4 addresses. When using ping, each execution alternates between the two hosts, and strace shows that only one DNS query is sent.

However, another program using getaddrinfo() does not get the same rotation. It appears that getaddrinfo() inexplicably sends a redundant DNS query (the only difference is the transaction ID), which causes an entire rotation through both hosts on each call. The returned IP is therefore nearly constant: it is always the last IP that 'ping' got. So, if a host has a power-of-2 number of IPs, rotation is broken, and particularly badly if there are only two IPs in total.

Note: nscd is not running, and starting it does not appear to fix the problem.

test program:

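#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    struct addrinfo *result;
    int error;

    /* nautilus has two hosts; we want to pick a random one.
     * Note: the problem occurs with or without a final trailing '.' */
    error = getaddrinfo("gridftp.nautilus.nics.xsede.org",
            NULL, NULL, &result);
    /* This sends two DNS queries, which breaks rotation of two IPs:
     * the first IP in the result is always the same, even though the
     * nameserver is reordering each response. */
    if (error != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(error));
        return 1;
    }
    freeaddrinfo(result);
    return 0;
}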
    
strace of the C program:

strace -s 100 -e sendto ./a.out 
sendto(3, "\24\0\0\0\26\0\1\3\276\301pO\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
sendto(3, "\360\342\1\0\0\1\0\0\0\0\0\0\7gridftp\10nautilus\4nics\5xsede\3org\0\0\1\0\1", 49, MSG_NOSIGNAL, NULL, 0) = 49
sendto(3, "\2654\1\0\0\1\0\0\0\0\0\0\7gridftp\10nautilus\4nics\5xsede\3org\0\0\1\0\1", 49, MSG_NOSIGNAL, NULL, 0) = 49
Comment 1 kjp 2012-03-26 20:00:54 UTC
I'm also getting the same behavior on the latest Ubuntu 12.04 beta:

GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu6) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.12 system on 2012-03-21.
Available extensions:
	crypt add-on version 2.1 by Michael Glad and others
	GNU Libidn by Simon Josefsson
	Native POSIX Threads Library by Ulrich Drepper et al
	BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
Comment 2 kjp 2012-03-27 16:28:06 UTC
This doesn't seem to happen on Fedora 15, so maybe it's eglibc-specific. I'm reporting it there.
Comment 3 Jeroen van Bemmel 2012-06-29 06:17:15 UTC
Possibly related to bug #14307? I found an issue that occurs specifically on x86_64 hosts.

If you are running on an x86_64 machine, try compiling your test program with "-m32" and see if that fixes it.
Comment 4 kjp 2012-06-29 15:10:03 UTC
You are the man. Compiling with -m32 changes the behavior: it sends just one query, using send(). 64-bit mode does two sendto() calls.

So 32-bit mode has working IP round robin with 2 hosts; 64-bit mode doesn't.
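
To compare the two builds by result rather than by strace output, a variant of the test program can print the first address that getaddrinfo() returns. This is only a sketch: first_address is a hypothetical helper, the hints argument is left NULL to match the test program in the description, and the fallback to "localhost" when no argument is given is an assumption made so the program also runs without network access.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper: resolve host and write the first returned address,
 * in numeric form, into buf. Returns 0 on success, or the error code from
 * getaddrinfo()/getnameinfo() on failure. */
static int first_address(const char *host, char *buf, size_t buflen)
{
    struct addrinfo *result;
    int error;

    /* NULL hints, as in the test program from the description. */
    error = getaddrinfo(host, NULL, NULL, &result);
    if (error != 0)
        return error;

    /* NI_NUMERICHOST formats the address without a reverse lookup. */
    error = getnameinfo(result->ai_addr, result->ai_addrlen,
                        buf, buflen, NULL, 0, NI_NUMERICHOST);
    freeaddrinfo(result);
    return error;
}

int main(int argc, char **argv)
{
    char addr[64];  /* large enough for a numeric IPv4 or IPv6 address */
    const char *host = argc > 1 ? argv[1] : "localhost";
    int error = first_address(host, addr, sizeof addr);

    if (error != 0) {
        fprintf(stderr, "lookup of %s failed: %s\n", host, gai_strerror(error));
        return 1;
    }
    printf("%s\n", addr);
    return 0;
}
```

Building this once with -m32 and once with -m64 and running each binary several times against a round-robin name should show whether the first address alternates or stays fixed.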


koa@mydev ~$ gcc t.c -m32
koa@mydev ~$ strace -tt -s 100 -e send ./a.out
[ Process PID=14183 runs in 32 bit mode. ]
15:09:32.195930 send(3, "\307\354\1\0\0\1\0\0\0\0\0\0\7gridftp\10nautilus\4nics\5xsede\3org\0\0\1\0\1", 49, MSG_NOSIGNAL) = 49




koa@mydev ~$ strace -tt -s 100 -e sendto ./a.out
15:09:45.178425 sendto(3, "\24\0\0\0\26\0\1\0039\305\355O\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
15:09:45.182789 sendto(3, "a]\1\0\0\1\0\0\0\0\0\0\7gridftp\10nautilus\4nics\5xsede\3org\0\0\1\0\1", 49, MSG_NOSIGNAL, NULL, 0) = 49
15:09:45.183953 sendto(3, "]\35\1\0\0\1\0\0\0\0\0\0\7gridftp\10nautilus\4nics\5xsede\3org\0\0\1\0\1", 49, MSG_NOSIGNAL, NULL, 0) = 49
Comment 5 kjp 2012-06-29 15:13:39 UTC
Just to clarify, my second snippet above was built with -m64:


koa@mydev ~$ gcc t.c  -m64
koa@mydev ~$ strace -tt -s 100 -e send,sendto ./a.out
15:13:10.946056 sendto(3, "\24\0\0\0\26\0\1\3\6\306\355O\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
15:13:10.950462 sendto(3, "x\227\1\0\0\1\0\0\0\0\0\0\7gridftp\10nautilus\4nics\5xsede\3org\0\0\1\0\1", 49, MSG_NOSIGNAL, NULL, 0) = 49
15:13:10.951656 sendto(3, "\244\211\1\0\0\1\0\0\0\0\0\0\7gridftp\10nautilus\4nics\5xsede\3org\0\0\1\0\1", 49, MSG_NOSIGNAL, NULL, 0) = 49

So 64-bit mode is definitely the trigger.
Comment 6 Jeroen van Bemmel 2012-06-30 15:08:51 UTC
Additional testing confirms that the root cause is the same as bug #14307. A patch is available for the latter.

*** This bug has been marked as a duplicate of bug 14307 ***