Bug 29607 - nscd repeatably crashes calling __strlen_avx2 when hosts cache is enabled
Summary: nscd repeatably crashes calling __strlen_avx2 when hosts cache is enabled
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: nscd (show other bugs)
Version: 2.36
: P2 critical
Target Milestone: 2.37
Assignee: Siddhesh Poyarekar
URL: https://bugs.gentoo.org/872401
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-23 15:39 UTC by Holger Hoffstätte
Modified: 2022-10-07 14:34 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2022-09-30 00:00:00


Attachments
Add null check to strlen (295 bytes, patch)
2022-09-25 10:50 UTC, Holger Hoffstätte
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Holger Hoffstätte 2022-09-23 15:39:25 UTC
Gentoo Linux recently enabled use of glibc-2.36 and I quickly found a repeatable crashing regression with nscd, which was rock-solid reliable with glibc-2.35.

Initial analysis of the bug is at: https://bugs.gentoo.org/872401

The bug does not occur with hosts cache disabled. Reducing the number of threads does not help, i.e. it also crashes when run with a single thread and in -d mode, e.g. in gdb.

With enabled hosts cache nscd quickly crashes, repeatably so with a quick series of requests which happens e.g. when using mtr (multi-trace-route).

Initial analysis points to aicache:153 being passed a NULL value; this theory seems to have merit because the crash also happens also on a different platform where it crashes in __strlen_sse2 - pointing to the same pattern.

The NULL value seems to originate from nss's file-hosts.c:459.

A quick-fix attempt at checking for NULL and using 0 as value for strlen (like in aicache.c:324) did not help; instead nscd returns odd results (e.g. mtr says: Packet type unsupported: Invalid argument) and still crashes, so the NULL pointer being passed to strlen() just seems to be the messenger.

A quick check of the nscd tree shows no major changes recently, so the real problem is likely somewhere else (nss?). Unfortunately I do not know enough about glibc's resolver internals to go on a hunt, and can therefore only report.
I can however gladly and easily test patches.
Comment 1 Holger Hoffstätte 2022-09-23 17:19:28 UTC
It's probably important to mention that Gentoo's glibc carries additional patches from glibc after a release. This list of patches can be found here:
https://gitweb.gentoo.org/fork/glibc.git/log/?h=gentoo/2.36
Comment 2 Holger Hoffstätte 2022-09-25 10:50:51 UTC
Created attachment 14358 [details]
Add null check to strlen
Comment 3 Andreas Schwab 2022-09-25 11:09:57 UTC
A sucessful lookup cannot have a NULL name.  Where does it come from?
Comment 4 Holger Hoffstätte 2022-09-25 11:13:54 UTC
(In reply to Andreas Schwab from comment #3)
> A sucessful lookup cannot have a NULL name.  Where does it come from?

That's a good question - I understand that sprinkling NULL checks for supposedly-non-NULL values can mask unrelated problems. Two more observations as they just happened a minute ago:

1) the real bug seems to be related to #29605
2) even with this patch I still see segfaults in strlen() at that call site (though not right away), so maybe it's not a NULL value at all.
Comment 5 Holger Hoffstätte 2022-09-25 11:46:57 UTC
Turns out the crash on strlen() is something else:

(gdb) bt full
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
No locals.
#1  0x00005555555672bd in addhstaiX (db=db@entry=0x555555577340 <dbs+704>, fd=fd@entry=17, req=req@entry=0x7fffecdf9804, key=key@entry=0x7fffecdf9a90, 
    uid=uid@entry=4294967295, he=he@entry=0x0, dh=<optimized out>) at aicache.c:153
        atmem = {next = 0x0, name = 0x99c369cec67a4600 <error: Cannot access memory at address 0x99c369cec67a4600>, family = -402650048, addr = {32767, 4160332864, 32767, 0}, 
          scopeid = 0}
        at = 0x7fffecdf8af0
        addrs = <optimized out>
        family = <optimized out>
        status = {-1, -1}
        naddrs = 1
        canon = 0x99c369cec67a4600 <error: Cannot access memory at address 0x99c369cec67a4600>
        canonlen = <optimized out>
        cp = <optimized out>
        addrslen = 0
        fct4 = 0x7ffff7ef0730 <__GI__nss_dns_gethostbyname4_r>
        dataset = 0x0
        nip = 0x55555557c600
        no_more = 254
        rc6 = 0
        rc4 = 0
        herrno = 1
        ctx = 0x7fffe8000bb0
        tmpbuf6 = {data = 0x7fffecdf8b40, length = 1024, __space = {__align = {__max_align_ll = 0, __max_align_ld = 1.05759510034850465873e-4932}, 
            __c = "\000\000\000\000\000\000\000\000h\213\337\354\377\177\000\000\002\000\000\000P\236C(\000\000\000\000\000\000\000\000\n\000\062.\000\000\000\000www.telekom.de\000lied-asynchrony.com\000lex\000\000\213\213\337\354\377\177\000\000\000\000\000\000\000\000\000\000\220\213\337\354\377\177\000\000\000\000\000\000\000\000\000\000\022\000\000\000\000\000\000\000\200\377\377\377\377\377\377\377", '\000' <repeats 16 times>, "Haven't found \"27\" in group cache!\000\367\377\177\000\000\250\245\371\367\377\177\000\000P\214\337\354\377\177\000\000\000"...}}
        tmpbuf4 = {data = 0x7fffecdf8f50, length = 1024, __space = {__align = {__max_align_ll = 140737353737120, __max_align_ld = <invalid float value>}, 
            __c = "\240\343\371\367\377\177\000\000\340\220\337\354\377\177\000\000\351\032WUUU\000\000\340\222\337\354\377\177\000\000\340\217\337\354\377\177\000\000\337\063VUUU\000\000\340qWUUU\000\000+=0c\000\000\000\000\v\000\000\000$\000\000\000\r\000\000\000\031\000\000\000\b\000\000\000z\000\000\000\000\000\000\000\v\001\000\000\001\000\000\000\000\000\000\000 \034\000\000\000\000\000\000\240\270WUUU\000\000\030\000\000\000\060\000\000\000З\337\354\377\177\000\000\360\226\337\354\377\177\000\000Sun Sep 25 13:36:11 2022\000Fz\306\316iÙ\000\000\000\000\000\000\000\000\a\t\200\363\377\177\000\000\220\222\337\354\377\177\000\000\270\b"...}}
        canonbuf = {data = 0x7fffecdf9360, length = 1024, __space = {__align = {__max_align_ll = 0, __max_align_ld = 0}, __c = '\000' <repeats 216 times>...}}
        ttl = 14400
        total = 0
        key_copy = 0x0
        alloca_used = false
        timeout = 9223372036854775807
        __PRETTY_FUNCTION__ = "addhstaiX"

The address of "canon" looks suspicious, which means "at" is probably garbage:

(gdb) print at
$20 = (struct gaih_addrtuple *) 0x7fffecdf8af0
(gdb) print at.addr 
$21 = {32767, 4160332864, 32767, 0}
(gdb) print at.family 
$22 = -402650048
(gdb) print at.name 
$23 = 0x99c369cec67a4600 <error: Cannot access memory at address 0x99c369cec67a4600>
(gdb) print at.next 
$24 = (struct gaih_addrtuple *) 0x0
(gdb) print at.scopeid 
$25 = 0
Comment 6 Holger Hoffstätte 2022-09-25 12:16:11 UTC
(In reply to Holger Hoffstätte from comment #5)
> Turns out the crash on strlen() is something else:

More precisely, it's garbage that is sometimes NULL - e.g. after a fresh start.
Comment 7 Holger Hoffstätte 2022-09-26 19:13:12 UTC
So this crash indeed turned out to be caused by post-2.36 release patches from the backport branch, presumably the resolver rewrite. Building a completely vanilla 2.36 made everything work again, and nscd runs just fine with enabled host cache.
Comment 8 Holger Hoffstätte 2022-09-26 19:20:01 UTC
As suspected the attached null check patch is also a dud and not necessary.
Comment 9 Holger Hoffstätte 2022-09-26 19:34:25 UTC
Closing since it's not a problem with the 2.36 release.
Comment 10 Sam James 2022-09-26 19:34:48 UTC
(In reply to Holger Hoffstätte from comment #9)
> Closing since it's not a problem with the 2.36 release.

If it's an issue on the backport branch, it's still valid.
Comment 11 Holger Hoffstätte 2022-09-26 19:49:28 UTC
Reopening so that we can figure out what's going on with the backports branch.
Comment 12 Holger Hoffstätte 2022-09-26 20:11:52 UTC
i just built glibc HEAD (22f4ab2d200f605441c) and the problem reproduces.
Comment 13 Siddhesh Poyarekar 2022-09-26 20:22:11 UTC
It would be nice if someone who can reproduce this can point out to the commit that actually breaks this.  Given that it's likely not nscd, identifying the offending patch could maybe even help come up with an independent reproducer that doesn't need nscd.
Comment 14 Holger Hoffstätte 2022-09-27 05:33:54 UTC
(In reply to Siddhesh Poyarekar from comment #13)
> It would be nice if someone who can reproduce this can point out to the
> commit that actually breaks this.  Given that it's likely not nscd,
> identifying the offending patch could maybe even help come up with an
> independent reproducer that doesn't need nscd.

I removed "nss_dns: Rewrite _nss_dns_gethostbyname4_r using current interfaces" (https://sourceware.org/git/?p=glibc.git;a=commit;h=1d495912a746e2a1ffb780c9a81fd234ec2464e8)

from my set of patches and the crash is gone.
Comment 15 Siddhesh Poyarekar 2022-09-30 18:03:50 UTC
Thanks for the pointer.  I was beating around the bush in Fedora because the systemd resolve module overshadows the nss_dns module, thus masking the crash in Fedora.  I can reproduce the crash reliably now.
Comment 16 Sourceware Commits 2022-10-04 22:40:59 UTC
The master branch has been updated by Siddhesh Poyarekar <siddhesh@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e33e5c4b73cea7b8aa3de0947123db16200fb65

commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65
Author: Siddhesh Poyarekar <siddhesh@sourceware.org>
Date:   Tue Oct 4 18:40:25 2022 -0400

    nscd: Drop local address tuple variable [BZ #29607]
    
    When a request needs to be resent (e.g. due to insufficient buffer
    space), the references to subsequent tuples in the local variable are
    stale and should not be used.  This used to work by accident before, but
    since 1d495912a it no longer does.  Instead of trying to reset it, just
    let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a
    consistent state at all times.  This is now consistent with what is done
    in gaih_inet for getaddrinfo.
    
    Resolves: BZ #29607
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Comment 17 Sourceware Commits 2022-10-04 22:44:07 UTC
The release/2.36/master branch has been updated by Siddhesh Poyarekar <siddhesh@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2bd815d8347851212b9a91dbdca8053f4dbdac87

commit 2bd815d8347851212b9a91dbdca8053f4dbdac87
Author: Siddhesh Poyarekar <siddhesh@sourceware.org>
Date:   Tue Oct 4 18:43:50 2022 -0400

    nscd: Drop local address tuple variable [BZ #29607]
    
    When a request needs to be resent (e.g. due to insufficient buffer
    space), the references to subsequent tuples in the local variable are
    stale and should not be used.  This used to work by accident before, but
    since 1d495912a it no longer does.  Instead of trying to reset it, just
    let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a
    consistent state at all times.  This is now consistent with what is done
    in gaih_inet for getaddrinfo.
    
    Resolves: BZ #29607
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>
    (cherry picked from commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65)
Comment 18 Siddhesh Poyarekar 2022-10-04 22:45:08 UTC
Fixed on main and 2.36 branches.
Comment 19 Sourceware Commits 2022-10-07 14:34:19 UTC
The release/2.35/master branch has been updated by Arjun Shankar <arjun@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bca80a916e1a7fda51d0f30e9cfb5b111f8a2a7a

commit bca80a916e1a7fda51d0f30e9cfb5b111f8a2a7a
Author: Siddhesh Poyarekar <siddhesh@sourceware.org>
Date:   Tue Oct 4 18:40:25 2022 -0400

    nscd: Drop local address tuple variable [BZ #29607]
    
    When a request needs to be resent (e.g. due to insufficient buffer
    space), the references to subsequent tuples in the local variable are
    stale and should not be used.  This used to work by accident before, but
    since 1d495912a it no longer does.  Instead of trying to reset it, just
    let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a
    consistent state at all times.  This is now consistent with what is done
    in gaih_inet for getaddrinfo.
    
    Resolves: BZ #29607
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>
    (cherry picked from commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65)
Comment 20 Sourceware Commits 2022-10-07 14:34:41 UTC
The release/2.34/master branch has been updated by Arjun Shankar <arjun@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e3976287b22422787f3cc6fc9adda58304b55bd9

commit e3976287b22422787f3cc6fc9adda58304b55bd9
Author: Siddhesh Poyarekar <siddhesh@sourceware.org>
Date:   Tue Oct 4 18:40:25 2022 -0400

    nscd: Drop local address tuple variable [BZ #29607]
    
    When a request needs to be resent (e.g. due to insufficient buffer
    space), the references to subsequent tuples in the local variable are
    stale and should not be used.  This used to work by accident before, but
    since 1d495912a it no longer does.  Instead of trying to reset it, just
    let gethostbyname4_r write into TUMPBUF6 for us, thus maintaining a
    consistent state at all times.  This is now consistent with what is done
    in gaih_inet for getaddrinfo.
    
    Resolves: BZ #29607
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>
    (cherry picked from commit 6e33e5c4b73cea7b8aa3de0947123db16200fb65)