When walking a DIE tree getting any attribute values that come from an alt file (or DWARF5 supplemental file) is not thread-safe. The dwarf_getalt () function does the following (without any locking): Dwarf * dwarf_getalt (Dwarf *main) { /* Only try once. */ if (main == NULL || main->alt_dwarf == (void *) -1) return NULL; if (main->alt_dwarf != NULL) return main->alt_dwarf; find_debug_altlink (main); /* If we found nothing, make sure we don't try again. */ if (main->alt_dwarf == NULL) { main->alt_dwarf = (void *) -1; return NULL; } return main->alt_dwarf; } find_debug_altlink will search for the alt file (which is a normal ELF file, that will be opened with dwarf_begin) if it can be found (and no error occurs). The rest of the code (except for dwarf_end and an explicit dwarf_setalt call) doesn't access dwarf->alt_dwarf directly, but all call dwarf_getalt to access it.
Created attachment 12982 [details] example backtrace tsearch.c:243
Created attachment 12983 [details] example backtrace tsearch.c:229
Created attachment 12984 [details] example backtrace tsearch.c:269
Created attachment 12985 [details] example backtrace tsearch.c:214
There were 214 crashes in my test that ended up in tseach.c those 4 back traces represent the 4 places where the problem occured. tsearch.c:229 - 181 cases tsearch.c:269 - 16 cases tsearch.c:214 - 13 cases tsearch.c:243 - 8 cases
Created attachment 12986 [details] a unique one at tsearch.c:229 This one particular crash is unique in that it does NOT contain __libdw_intern_next_unit
Thanks for the backtraces, they show a different concurrent unsafe thing in libdw, the usage of (lazy) tsearch caches. I opened a separate bug for that, bug #26930
Created attachment 12992 [details] attachment-1102809-0.html One thing that I’m unclear about is how these two are different. It doesn’t matter to me that they are split up but I just don’t understand why? -ben On Sat, Nov 21, 2020 at 2:00 PM mark at klomp dot org < sourceware-bugzilla@sourceware.org> wrote: > https://sourceware.org/bugzilla/show_bug.cgi?id=26921 > > --- Comment #7 from Mark Wielaard <mark at klomp dot org> --- > Thanks for the backtraces, they show a different concurrent unsafe thing in > libdw, the usage of (lazy) tsearch caches. I opened a separate bug for > that, > bug #26930 > > -- > You are receiving this mail because: > You are on the CC list for the bug. > >
(In reply to Ben Woodard from comment #8) > One thing that I’m unclear about is how these two are different. > It doesn’t matter to me that they are split up but I just don’t understand > why? Simply so the bugs cover something tractable. There are other places where we update shared state in a way that is not safe for concurrent code. We don't have a list of all such places yet. The task to find and fix them all is IMHO too big for one bug. If you want you could create a meta bug for that which then can depend on these 2 bugs (plus any others for issues we find in the future).
https://inbox.sourceware.org/elfutils-devel/CAK-+vz0guN1=D5-EcuRTQAF0ZnL+aGYFg4n7pguuMkwNiOmW1Q@mail.gmail.com/