Bug 26921 - dwarf_getalt () not thread-safe
Summary: dwarf_getalt () not thread-safe
Status: NEW
Alias: None
Product: elfutils
Classification: Unclassified
Component: libdw (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-19 16:22 UTC by Mark Wielaard
Modified: 2023-10-06 11:15 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
Project(s) to access:
ssh public key:


Attachments
example backtrace tsearch.c:243 (1.19 KB, text/plain)
2020-11-20 19:11 UTC, Ben Woodard
Details
example backtrace tsearch.c:229 (1.06 KB, text/plain)
2020-11-20 19:13 UTC, Ben Woodard
Details
example backtrace tsearch.c:269 (1.02 KB, text/plain)
2020-11-20 19:17 UTC, Ben Woodard
Details
example backtrace tsearch.c:214 (1.13 KB, text/plain)
2020-11-20 19:20 UTC, Ben Woodard
Details
a unique one at tsearch.c:229 (1.14 KB, text/plain)
2020-11-20 19:43 UTC, Ben Woodard
Details
attachment-1102809-0.html (621 bytes, text/html)
2020-11-21 23:34 UTC, Ben Woodard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Wielaard 2020-11-19 16:22:32 UTC
When walking a DIE tree getting any attribute values that come from an alt file (or DWARF5 supplemental file) is not thread-safe. The dwarf_getalt () function does the following (without any locking):

Dwarf *
dwarf_getalt (Dwarf *main)
{
  /* Only try once.  */
  if (main == NULL || main->alt_dwarf == (void *) -1)
    return NULL;

  if (main->alt_dwarf != NULL)
    return main->alt_dwarf;

  find_debug_altlink (main);

  /* If we found nothing, make sure we don't try again.  */
  if (main->alt_dwarf == NULL)
    {
      main->alt_dwarf = (void *) -1;
      return NULL;
    }

  return main->alt_dwarf;
}

find_debug_altlink will search for the alt file (which is a normal ELF file, that will be opened with dwarf_begin) if it can be found (and no error occurs).

The rest of the code (except for dwarf_end and an explicit dwarf_setalt call) doesn't access dwarf->alt_dwarf directly, but all call dwarf_getalt to access it.
Comment 1 Ben Woodard 2020-11-20 19:11:44 UTC
Created attachment 12982 [details]
example backtrace tsearch.c:243
Comment 2 Ben Woodard 2020-11-20 19:13:01 UTC
Created attachment 12983 [details]
example backtrace tsearch.c:229
Comment 3 Ben Woodard 2020-11-20 19:17:50 UTC
Created attachment 12984 [details]
example backtrace tsearch.c:269
Comment 4 Ben Woodard 2020-11-20 19:20:53 UTC
Created attachment 12985 [details]
example backtrace tsearch.c:214
Comment 5 Ben Woodard 2020-11-20 19:26:08 UTC
There were 214 crashes in my test that ended up in tseach.c those 4 back traces represent the 4 places where the problem occured.

tsearch.c:229 - 181 cases
tsearch.c:269 - 16 cases
tsearch.c:214 - 13 cases
tsearch.c:243 - 8 cases
Comment 6 Ben Woodard 2020-11-20 19:43:39 UTC
Created attachment 12986 [details]
a unique one at tsearch.c:229

This one particular crash is unique in that it does NOT contain __libdw_intern_next_unit
Comment 7 Mark Wielaard 2020-11-21 22:00:23 UTC
Thanks for the backtraces, they show a different concurrent unsafe thing in libdw, the usage of (lazy) tsearch caches. I opened a separate bug for that, bug #26930
Comment 8 Ben Woodard 2020-11-21 23:34:02 UTC
Created attachment 12992 [details]
attachment-1102809-0.html

One thing that I’m unclear about is how these two are different.
It doesn’t matter to me that they are split up but I just don’t understand
why?

-ben


On Sat, Nov 21, 2020 at 2:00 PM mark at klomp dot org <
sourceware-bugzilla@sourceware.org> wrote:

> https://sourceware.org/bugzilla/show_bug.cgi?id=26921
>
> --- Comment #7 from Mark Wielaard <mark at klomp dot org> ---
> Thanks for the backtraces, they show a different concurrent unsafe thing in
> libdw, the usage of (lazy) tsearch caches. I opened a separate bug for
> that,
> bug #26930
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>
>
Comment 9 Mark Wielaard 2020-11-21 23:53:18 UTC
(In reply to Ben Woodard from comment #8)
> One thing that I’m unclear about is how these two are different.
> It doesn’t matter to me that they are split up but I just don’t understand
> why?

Simply so the bugs cover something tractable. There are other places where we update shared state in a way that is not safe for concurrent code. We don't have a list of all such places yet. The task to find and fix them all is IMHO too big for one bug. If you want you could create a meta bug for that which then can depend on these 2 bugs (plus any others for issues we find in the future).