getaddrinfo thread-safe?

Mon Mar 23 19:19:00 GMT 2015

On 03/23/2015 01:16 PM, Joshua Rogers wrote:
> On 24/03/15 03:53, Carlos O'Donell wrote:
>>>>>>                                         __have_o_nonblock
>>>>>>                                           = (EXT(statp).nssocks[ns] == -1
>>>>>>                                              && errno == EINVAL ? -1 : 1);
>>>> And that is why it complains, probably.
>>>>
>>>> Please check the vanilla glibc code.
>> The code is still there, but you still need to show what is happening.
>> Why does hellgrind think there is a race condition there, and does
>> it make sense.
> The reason helgrind is complaining is because __have_o_nonblock is being
> set, but __have_o_nonblock is a global variable. So, if
> __have_o_nonblock is overwritten by a different thread, the running
> thread will use whatever the __have_o_nonblock is set to in the
> "overwriting thread."
> My guess is this will affect the connection that reopen makes, and will
> make non-blocking a problem.
> 
> Another guess as to how to create a 'proof of concept', is by having the
> threaded function randomly choose 1 or 0, and if 1, set nonblocking, and
> if 0, don't choose nonblocking. Then do the getaddrinfo(), then monitor
> if it sets it to non-blocking or not, and confirm that it does what it's
> supposed to do.
> However, I don't know how to do this.

You're doing a great job so far.

The next step is actually to read the glibc getaddrinfo code and see how
__have_o_nonblock works and why it's set the way it's being set.

The SOCK_NONBLOCK flag is linux specific and used to save the extra call
to fcntl to make the socket non-blocking.

Therefore the flag is global because it applies to the running kernel not
the running thread, and any thread that attempts to use it can determine
if indeed the kernel supports it. Thus it's likely the case that multiple
threads write to it as they make their socket calls, and either they all
succeed or they all fail. Then they all probably set __have_o_nonblock to
1 (success), and that is indeed a race, but it doesn't matter, they all always see
the same result. Any thread reading __have_o_nonblock reads either 0 (don't
know), 1 (SOCK_NONBLOCK supported by kernel), or -1 (SOCK_NONBLOCK not supported
by kernel). If the thread reads any of those values, it knows how to proceed.
Just because it reads 0 when it should or could have read 1 or -1 doesn't
make the operation unsafe since the thread succeeds, and sets __have_o_nonblock
to the result it got (1 for success, or -1 for failure).

It *is* still a data race though, but it happens to benign. However, in glibc
we have taken the consensus position to stamp out benign data races where
possible and where reasonable. This is a reasonable case where we should use
atomic operations since the atomic ops are likely to be cheaper than letting
other threads call into the kernel to do the same operation.

"9. Standards we use" - "C11 (internally, not exposed via headers)"
https://sourceware.org/glibc/wiki/Consensus#Standards_we_use

Could you do me a favour please? Look for a matching bug in the bugzilla and
file one if it doesn't exist? Despite the fact that this is a benign race
it should still be fixed so it doesn't race. The bug should say that this
is a benign race that needs fixing.

Cheers,
Carlos.