Gentoo Linux recently enabled use of glibc-2.36 and I quickly found a repeatable crashing regression with nscd, which was rock-solid reliable with glibc-2.35. Initial analysis of the bug is at: https://bugs.gentoo.org/872401 The bug does not occur with hosts cache disabled. Reducing the number of threads does not help, i.e. it also crashes when run with a single thread and in -d mode, e.g. in gdb. With enabled hosts cache nscd quickly crashes, repeatably so with a quick series of requests which happens e.g. when using mtr (multi-trace-route). Initial analysis points to aicache:153 being passed a NULL value; this theory seems to have merit because the crash also happens also on a different platform where it crashes in __strlen_sse2 - pointing to the same pattern. The NULL value seems to originate from nss's file-hosts.c:459. A quick-fix attempt at checking for NULL and using 0 as value for strlen (like in aicache.c:324) did not help; instead nscd returns odd results (e.g. mtr says: Packet type unsupported: Invalid argument) and still crashes, so the NULL pointer being passed to strlen() just seems to be the messenger. A quick check of the nscd tree shows no major changes recently, so the real problem is likely somewhere else (nss?). Unfortunately I do not know enough about glibc's resolver internals to go on a hunt, and can therefore only report. I can however gladly and easily test patches.
It's probably important to mention that Gentoo's glibc carries additional patches from glibc after a release. This list of patches can be found here: https://gitweb.gentoo.org/fork/glibc.git/log/?h=gentoo/2.36
Created attachment 14358 [details] Add null check to strlen
A sucessful lookup cannot have a NULL name. Where does it come from?
(In reply to Andreas Schwab from comment #3) > A sucessful lookup cannot have a NULL name. Where does it come from? That's a good question - I understand that sprinkling NULL checks for supposedly-non-NULL values can mask unrelated problems. Two more observations as they just happened a minute ago: 1) the real bug seems to be related to #29605 2) even with this patch I still see segfaults in strlen() at that call site (though not right away), so maybe it's not a NULL value at all.
Turns out the crash on strlen() is something else: (gdb) bt full #0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76 No locals. #1 0x00005555555672bd in addhstaiX (db=db@entry=0x555555577340 <dbs+704>, fd=fd@entry=17, req=req@entry=0x7fffecdf9804, key=key@entry=0x7fffecdf9a90, uid=uid@entry=4294967295, he=he@entry=0x0, dh=<optimized out>) at aicache.c:153 atmem = {next = 0x0, name = 0x99c369cec67a4600 <error: Cannot access memory at address 0x99c369cec67a4600>, family = -402650048, addr = {32767, 4160332864, 32767, 0}, scopeid = 0} at = 0x7fffecdf8af0 addrs = <optimized out> family = <optimized out> status = {-1, -1} naddrs = 1 canon = 0x99c369cec67a4600 <error: Cannot access memory at address 0x99c369cec67a4600> canonlen = <optimized out> cp = <optimized out> addrslen = 0 fct4 = 0x7ffff7ef0730 <__GI__nss_dns_gethostbyname4_r> dataset = 0x0 nip = 0x55555557c600 no_more = 254 rc6 = 0 rc4 = 0 herrno = 1 ctx = 0x7fffe8000bb0 tmpbuf6 = {data = 0x7fffecdf8b40, length = 1024, __space = {__align = {__max_align_ll = 0, __max_align_ld = 1.05759510034850465873e-4932}, __c = "\000\000\000\000\000\000\000\000h\213\337\354\377\177\000\000\002\000\000\000P\236C(\000\000\000\000\000\000\000\000\n\000\062.\000\000\000\000www.telekom.de\000lied-asynchrony.com\000lex\000\000\213\213\337\354\377\177\000\000\000\000\000\000\000\000\000\000\220\213\337\354\377\177\000\000\000\000\000\000\000\000\000\000\022\000\000\000\000\000\000\000\200\377\377\377\377\377\377\377", '\000' <repeats 16 times>, "Haven't found \"27\" in group cache!\000\367\377\177\000\000\250\245\371\367\377\177\000\000P\214\337\354\377\177\000\000\000"...}} tmpbuf4 = {data = 0x7fffecdf8f50, length = 1024, __space = {__align = {__max_align_ll = 140737353737120, __max_align_ld = <invalid float value>}, __c = "\240\343\371\367\377\177\000\000\340\220\337\354\377\177\000\000\351\032WUUU\000\000\340\222\337\354\377\177\000\000\340\217\337\354\377\177\000\000\337\063VUUU\000\000\340qWUUU\000\000+=0c\000\000\000\000\v\000\000\000$\000\000\000\r\000\000\000\031\000\000\000\b\000\000\000z\000\000\000\000\000\000\000\v\001\000\000\001\000\000\000\000\000\000\000 \034\000\000\000\000\000\000\240\270WUUU\000\000\030\000\000\000\060\000\000\000З\337\354\377\177\000\000\360\226\337\354\377\177\000\000Sun Sep 25 13:36:11 2022\000Fz\306\316iÙ\000\000\000\000\000\000\000\000\a\t\200\363\377\177\000\000\220\222\337\354\377\177\000\000\270\b"...}} canonbuf = {data = 0x7fffecdf9360, length = 1024, __space = {__align = {__max_align_ll = 0, __max_align_ld = 0}, __c = '\000' <repeats 216 times>...}} ttl = 14400 total = 0 key_copy = 0x0 alloca_used = false timeout = 9223372036854775807 __PRETTY_FUNCTION__ = "addhstaiX" The address of "canon" looks suspicious, which means "at" is probably garbage: (gdb) print at $20 = (struct gaih_addrtuple *) 0x7fffecdf8af0 (gdb) print at.addr $21 = {32767, 4160332864, 32767, 0} (gdb) print at.family $22 = -402650048 (gdb) print at.name $23 = 0x99c369cec67a4600 <error: Cannot access memory at address 0x99c369cec67a4600> (gdb) print at.next $24 = (struct gaih_addrtuple *) 0x0 (gdb) print at.scopeid $25 = 0
(In reply to Holger Hoffstätte from comment #5) > Turns out the crash on strlen() is something else: More precisely, it's garbage that is sometimes NULL - e.g. after a fresh start.
So this crash indeed turned out to be caused by post-2.36 release patches from the backport branch, presumably the resolver rewrite. Building a completely vanilla 2.36 made everything work again, and nscd runs just fine with enabled host cache.
As suspected the attached null check patch is also a dud and not necessary.
Closing since it's not a problem with the 2.36 release.
(In reply to Holger Hoffstätte from comment #9) > Closing since it's not a problem with the 2.36 release. If it's an issue on the backport branch, it's still valid.
Reopening so that we can figure out what's going on with the backports branch.
i just built glibc HEAD (22f4ab2d200f605441c) and the problem reproduces.
It would be nice if someone who can reproduce this can point out to the commit that actually breaks this. Given that it's likely not nscd, identifying the offending patch could maybe even help come up with an independent reproducer that doesn't need nscd.
(In reply to Siddhesh Poyarekar from comment #13) > It would be nice if someone who can reproduce this can point out to the > commit that actually breaks this. Given that it's likely not nscd, > identifying the offending patch could maybe even help come up with an > independent reproducer that doesn't need nscd. I removed "nss_dns: Rewrite _nss_dns_gethostbyname4_r using current interfaces" (https://sourceware.org/git/?p=glibc.git;a=commit;h=1d495912a746e2a1ffb780c9a81fd234ec2464e8) from my set of patches and the crash is gone.
Thanks for the pointer. I was beating around the bush in Fedora because the systemd resolve module overshadows the nss_dns module, thus masking the crash in Fedora. I can reproduce the crash reliably now.
The master branch has been updated by Siddhesh Poyarekar <siddhesh@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e33e5c4b73cea7b8aa3de0947123db16200fb65 commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65 Author: Siddhesh Poyarekar <siddhesh@sourceware.org> Date: Tue Oct 4 18:40:25 2022 -0400 nscd: Drop local address tuple variable [BZ #29607] When a request needs to be resent (e.g. due to insufficient buffer space), the references to subsequent tuples in the local variable are stale and should not be used. This used to work by accident before, but since 1d495912a it no longer does. Instead of trying to reset it, just let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a consistent state at all times. This is now consistent with what is done in gaih_inet for getaddrinfo. Resolves: BZ #29607 Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The release/2.36/master branch has been updated by Siddhesh Poyarekar <siddhesh@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2bd815d8347851212b9a91dbdca8053f4dbdac87 commit 2bd815d8347851212b9a91dbdca8053f4dbdac87 Author: Siddhesh Poyarekar <siddhesh@sourceware.org> Date: Tue Oct 4 18:43:50 2022 -0400 nscd: Drop local address tuple variable [BZ #29607] When a request needs to be resent (e.g. due to insufficient buffer space), the references to subsequent tuples in the local variable are stale and should not be used. This used to work by accident before, but since 1d495912a it no longer does. Instead of trying to reset it, just let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a consistent state at all times. This is now consistent with what is done in gaih_inet for getaddrinfo. Resolves: BZ #29607 Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65)
Fixed on main and 2.36 branches.
The release/2.35/master branch has been updated by Arjun Shankar <arjun@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bca80a916e1a7fda51d0f30e9cfb5b111f8a2a7a commit bca80a916e1a7fda51d0f30e9cfb5b111f8a2a7a Author: Siddhesh Poyarekar <siddhesh@sourceware.org> Date: Tue Oct 4 18:40:25 2022 -0400 nscd: Drop local address tuple variable [BZ #29607] When a request needs to be resent (e.g. due to insufficient buffer space), the references to subsequent tuples in the local variable are stale and should not be used. This used to work by accident before, but since 1d495912a it no longer does. Instead of trying to reset it, just let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a consistent state at all times. This is now consistent with what is done in gaih_inet for getaddrinfo. Resolves: BZ #29607 Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65)
The release/2.34/master branch has been updated by Arjun Shankar <arjun@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e3976287b22422787f3cc6fc9adda58304b55bd9 commit e3976287b22422787f3cc6fc9adda58304b55bd9 Author: Siddhesh Poyarekar <siddhesh@sourceware.org> Date: Tue Oct 4 18:40:25 2022 -0400 nscd: Drop local address tuple variable [BZ #29607] When a request needs to be resent (e.g. due to insufficient buffer space), the references to subsequent tuples in the local variable are stale and should not be used. This used to work by accident before, but since 1d495912a it no longer does. Instead of trying to reset it, just let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a consistent state at all times. This is now consistent with what is done in gaih_inet for getaddrinfo. Resolves: BZ #29607 Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65)