Issue with stale resolv.conf state
John Levon
levon@movementarian.org
Mon Mar 11 09:08:26 GMT 2024
I have an intermittent issue where a getaddrinfo()-using application uses stale
nameservers. That is, /etc/resolv.conf has been updated, the original
nameservers are not reachable at all, but the application doesn't ever notice.
Note that this only reproduces very occasionally, so it is difficult for me to
distill into a simple test case.
This is with glibc 2.35 but from a quick look I didn't see any changes in master
that would help.
I confirmed that glibc never stat()s the file, and this is because we are here:
    /* Initialize *RESP if RES_INIT is not yet set in RESP->options, or if
       res_init in some other thread requested re-initializing.  */
    static __attribute__ ((warn_unused_result)) bool
    maybe_init (struct resolv_context *ctx, bool preinit)
    {
      struct __res_state *resp = ctx->resp;
      if (resp->options & RES_INIT)
        {
          if (resp->options & RES_NORELOAD)
            /* Configuration reloading was explicitly disabled.  */
            return true;

          /* If there is no associated resolv_conf object despite the
             initialization, something modified *ctx->resp.  Do not
             override those changes.  */
          if (ctx->conf != NULL && replicated_configuration_matches (ctx))
And replicated_configuration_matches() returns false, so the branch is skipped.
Thus we never examine the file for any changes and continue using the old
version indefinitely.
I don't understand the first part of the comment, but indeed, ctx->resp doesn't
match ctx->conf. In particular:
    return ctx->resp->options == ctx->conf->options
and ctx->resp (aka _resp) has 0x47002c1 whereas ctx->conf has 0x41002c1.
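To make the mismatch concrete, XORing the two option words isolates the bits
that differ. The sketch below is only an illustration: the OPT_* names and both
functions are hypothetical, and the two constants copy the values that glibc's
<resolv.h> assigns to RES_SNGLKUP and RES_SNGLKUPREOP so that it builds without
glibc.

```c
#include <stdio.h>

/* Hypothetical local names; the values mirror RES_SNGLKUP and
   RES_SNGLKUPREOP from glibc's <resolv.h>.  */
#define OPT_SNGLKUP     0x00200000u
#define OPT_SNGLKUPREOP 0x00400000u

/* Return the bits that differ between the two option words.  */
unsigned int
options_diff (unsigned int resp_options, unsigned int conf_options)
{
  return resp_options ^ conf_options;
}

/* Print the differing bits and name the single-request flags among them.  */
void
describe_options_diff (unsigned int resp_options, unsigned int conf_options)
{
  unsigned int diff = options_diff (resp_options, conf_options);

  printf ("differing bits: 0x%x\n", diff);
  if (diff & OPT_SNGLKUP)
    puts ("  RES_SNGLKUP");
  if (diff & OPT_SNGLKUPREOP)
    puts ("  RES_SNGLKUPREOP");
}
```

Calling describe_options_diff (0x47002c1, 0x41002c1) reports a difference of
0x600000, which is exactly those two bits.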
I'm not sure, but I suspect the additional RES_SNGLKUP|RES_SNGLKUPREOP bits
(0x600000) may be due to this code:
    /* There are quite a few broken name servers out
       there which don't handle two outstanding
       requests from the same source.  There are also
       broken firewall settings.  If we time out after
       having received one answer switch to the mode
       where we send the second request only once we
       have received the first answer.  */
    if (!single_request)
      {
        statp->options |= RES_SNGLKUP;
        single_request = true;
        *gotsomewhere = save_gotsomewhere;
        goto retry;
      }
    else if (!single_request_reopen)
      {
        statp->options |= RES_SNGLKUPREOP;
        single_request_reopen = true;
        *gotsomewhere = save_gotsomewhere;
        __res_iclose (statp, false);
        goto retry_reopen;
      }
I'm guessing these got set when the VPN dropped routing to the old nameservers,
but before the next getaddrinfo() came in, thus leading to the match failing.
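The suspected sequence can be modelled in a few lines. This is a toy model
under stated assumptions, not glibc code: every toy_* and OPT_* name is
hypothetical, only the options field is compared (the one comparison quoted
above), the state is assumed to be already initialized (RES_INIT set), and the
two constants copy the values of RES_SNGLKUP and RES_NORELOAD from glibc's
<resolv.h>.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical local names; values mirror glibc's <resolv.h>.  */
#define OPT_SNGLKUP  0x00200000u
#define OPT_NORELOAD 0x02000000u

/* Stand-in for the option word of struct __res_state / struct resolv_conf.  */
struct toy_state
{
  unsigned int options;
};

struct toy_context
{
  struct toy_state *resp;        /* Mutable resolver state.  */
  const struct toy_state *conf;  /* Snapshot taken when resolv.conf was parsed.  */
};

/* Model of the guard in maybe_init: true if the code that would look for a
   changed configuration is reached at all.  */
bool
toy_reload_check_reached (const struct toy_context *ctx)
{
  if (ctx->resp->options & OPT_NORELOAD)
    return false;               /* Reloading explicitly disabled.  */
  if (ctx->conf == NULL)
    return false;               /* No snapshot to compare against.  */
  return ctx->resp->options == ctx->conf->options;
}

/* Model of the send path's fallback after a timeout: the flag is written
   into the mutable state only, never into the snapshot.  */
void
toy_timeout_fallback (struct toy_state *resp)
{
  resp->options |= OPT_SNGLKUP;
}
```

With resp and conf both starting at 0x41002c1, toy_reload_check_reached() is
true. After one toy_timeout_fallback() call it is false and stays false, which
matches the behaviour I am seeing.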
I can't see where the application code itself can be at fault here, but I'm not
100% confident about the above analysis either. Any thoughts?
thanks
john