Bug 20874 - getaddrinfo_a segfault
Summary: getaddrinfo_a segfault
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: nss (show other bugs)
Version: 2.24
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-27 09:45 UTC by Matt Corallo
Modified: 2021-06-21 12:45 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security?


Attachments
Patch for uninitialized pointers (752 bytes, patch)
2018-10-25 14:37 UTC, George Pee
Details | Diff
Fix getaddrinfo_a / gai_suspend race condition (1.03 KB, patch)
2021-06-21 12:45 UTC, Rainhard Driessler
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matt Corallo 2016-11-27 09:45:41 UTC
The following code segfaults after running for a while (a few minutes on some systems I've tested on), or very quickly if you allow it to run with more threads by uncommenting the thread starting. I could be calling it wrong, but it seems to mesh with the man page to me.


#define _GNU_SOURCE
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
 
#include <string>
#include <thread>
 
const char *host = "bluematt.me";
 
void beatit() {
	struct gaicb gai;
	struct gaicb *gaip = &gai;
 
	while (1) {
		memset(&gai, 0, sizeof(gai));
 
		fprintf(stderr, "OK, cleared gai, now calling gai_a\n");
		std::string hostname(host);
		gai.ar_name = hostname.c_str();
 
		assert(!getaddrinfo_a(GAI_NOWAIT, &gaip, 1, NULL));
 
		struct timespec ts = { 0, 42 };
		gai_suspend(&gaip, 1, &ts);
 
		while (gai_cancel(gaip) == EAI_NOTCANCELED)
			gai_suspend(&gaip, 1, &ts);
	}
 
}
 
int main(int argc, char *argv[]) {
	std::thread t1(beatit);
	/*std::thread t2(beatit);
	std::thread t3(beatit);
	std::thread t4(beatit);
	std::thread t5(beatit);
	std::thread t6(beatit);*/
 
	while (true) {}
}
Comment 1 Florian Weimer 2016-11-29 08:02:27 UTC
(In reply to Matt Corallo from comment #0)

> void beatit() {
> 	struct gaicb gai;
> 	struct gaicb *gaip = &gai;
>  
> 	while (1) {
> 		memset(&gai, 0, sizeof(gai));
>  
> 		fprintf(stderr, "OK, cleared gai, now calling gai_a\n");
> 		std::string hostname(host);
> 		gai.ar_name = hostname.c_str();
>  
> 		assert(!getaddrinfo_a(GAI_NOWAIT, &gaip, 1, NULL));

getaddrinfo_a does not make copy of the submitted requests, so this code has multiple use-after-free issues (the gai local variable, and the backing string for hostname).
Comment 2 Matt Corallo 2016-11-29 08:23:26 UTC
The way I read the man page, after gai_cancel returns anything other than EAI_NOTCANCELED (ie it has returned EAI_CANCELED or EAI_ALLDONE), the memory is again the user's to do with as the user pleases. If that is not the case, when is the gaicb struct again free for the user to release?
Comment 3 Matt Corallo 2016-11-29 08:28:33 UTC
Changing back to UNCONFIRMED, as the loop as written clearly waits for gai_cancel to return non-NOTCANCELED before allowing std::string to release the backing memory, or wiping the gaicb struct.
Comment 4 Florian Weimer 2016-11-29 08:43:30 UTC
(In reply to Matt Corallo from comment #2)
> The way I read the man page, after gai_cancel returns anything other than
> EAI_NOTCANCELED (ie it has returned EAI_CANCELED or EAI_ALLDONE), the memory
> is again the user's to do with as the user pleases. If that is not the case,
> when is the gaicb struct again free for the user to release?

You are right, I was confused.  This does look like a bug.
Comment 5 Wladimir van der Laan 2016-12-02 04:44:30 UTC
Can confirm that this sample (even without uncommenting the threads) crashes quickly with the following traceback:

(gdb) bt
#0  0x00007ffff79b77f0 in __gai_notify (req=req@entry=0x7ffff00648a0) at gai_notify.c:122
#1  0x00007ffff79b707c in handle_requests (arg=<optimized out>) at gai_misc.c:328
#2  0x00007ffff7bc170a in start_thread (arg=0x7ffff4d11700) at pthread_create.c:333
#3  0x00007ffff715b82d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Comment 6 Wladimir van der Laan 2016-12-02 04:47:10 UTC
(above with libc6 Version: 2.23-0ubuntu4)
Comment 7 Phil Blundell 2017-06-19 12:12:27 UTC
I think the bug is in gai_suspend(), here:

      /* Now remove the entry in the waiting list for all requests
	 which didn't terminate.  */
      for (cnt = 0; cnt < ent; ++cnt)
	if (list[cnt] != NULL && list[cnt]->__return == EAI_INPROGRESS
	    && requestlist[cnt] != NULL)
	  {
	    struct waitlist **listp = &requestlist[cnt]->waiting;

	    /* There is the chance that we cannot find our entry anymore.
	       This could happen if the request terminated and restarted
	       again.  */
	    while (*listp != NULL && *listp != &waitlist[cnt])
	      listp = &(*listp)->next;

	    if (*listp != NULL)
	      *listp = (*listp)->next;
	  }

The check for (list[cnt]->__return == EAI_INPROGRESS) here is incorrect because handle_request() may have assigned a new return value while we were sleeping (or in fact at any other time, since this particular operation in handle_request() is not guarded by __gai_requests_mutex).

I think the fix is probably to remove that condition and allow the loop to search all lists for an entry that we added.

diff --git a/resolv/gai_suspend.c b/resolv/gai_suspend.c
index a86bd4360d..139d636c78 100644
--- a/resolv/gai_suspend.c
+++ b/resolv/gai_suspend.c
@@ -111,8 +111,7 @@ gai_suspend (const struct gaicb *const list[], int ent,
       /* Now remove the entry in the waiting list for all requests
         which didn't terminate.  */
       for (cnt = 0; cnt < ent; ++cnt)
-       if (list[cnt] != NULL && list[cnt]->__return == EAI_INPROGRESS
-           && requestlist[cnt] != NULL)
+       if (list[cnt] != NULL && requestlist[cnt] != NULL)
          {
            struct waitlist **listp = &requestlist[cnt]->waiting;
Comment 8 George Pee 2018-10-25 14:35:50 UTC
Running the example code with valgrind shows this:
Compiling the test app on my desktop and running valgrind points to the issue:
==14188== Thread 4:
==14188== Conditional jump or move depends on uninitialised value(s)
==14188== at 0x4E3B7EA: __gai_notify (gai_notify.c:108)
==14188== by 0x4E3B07B: handle_requests (gai_misc.c:328)
==14188== by 0x55DD6B9: start_thread (pthread_create.c:333)
==14188== 
==14188== Use of uninitialised value of size 8
==14188== at 0x4E3B81C: __gai_notify (gai_notify.c:111)
==14188== by 0x4E3B07B: handle_requests (gai_misc.c:328)
==14188== by 0x55DD6B9: start_thread (pthread_create.c:333)
==14188== 
==14188== Conditional jump or move depends on uninitialised value(s)
==14188== at 0x4E3B820: __gai_notify (gai_notify.c:111)
==14188== by 0x4E3B07B: handle_requests (gai_misc.c:328)
==14188== by 0x55DD6B9: start_thread (pthread_create.c:333)
==14188== 
==14188== Use of uninitialised value of size 8
==14188== at 0x4E3B822: __gai_notify (gai_notify.c:111)
==14188== by 0x4E3B07B: handle_requests (gai_misc.c:328)
==14188== by 0x55DD6B9: start_thread (pthread_create.c:333)
Comment 9 George Pee 2018-10-25 14:37:27 UTC
Created attachment 11365 [details]
Patch for uninitialized pointers
Comment 10 Rainhard Driessler 2021-06-21 12:45:34 UTC
Created attachment 13505 [details]
Fix getaddrinfo_a / gai_suspend race condition

I worked on this bug in 2.23 and 2.26 - after checking the current version I'm confident this bug should still be around up to now.

From my point of view, there's two issues here:
* The async_waitlist allocated in getaddrinfo_a is freed in gai_notify, but not removed from the requests waitlist, leaving a dangling pointer behind
* Pointers to a gai_suspend calls' local waitlist structures may persist in requests waitlists after gai_suspend returned, leaving invalid pointers into the stack behind

Although the combination of getaddrinfo_a (without signals) + gai_suspend usually works, there seems to be a thread execution order that causes invalid waitlist pointers to be dereferenced in gai_notify, specifically in (--*waitlist->counterp).

I don't have definitive proof so far, but I could reliably trigger the segmentation fault as follows:
- Turn off local DNS cache
- Use a fast, local DNS for resolution
- Simulate delay using "tc qdisc add dev <ethX> root netem delay Xms 10ms 25%"
- Loop calls of getaddrinfo_a, followed gai_suspend with timeout Xms
- The more instances of the looping binary run, the faster the segfault happens

The attached patch fixes the issue for me: I unconditionally remove the waitlist entries added within gai_suspend and furthermore remove the async_waitlist entry freed within gai_notify from the list.

Let me know what you think.