[PATCH] libdw: add thread-safety to dwarf_getabbrev()

Jonathon Anderson jma14@rice.edu
Sun Oct 27 00:56:00 GMT 2019



On Sun, Oct 27, 2019 at 00:50, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> On Sat, 2019-10-26 at 11:45 -0500, Jonathon Anderson wrote:
>> For some overall perspective, this patch replaces the original libdw
>> allocator with a thread-safe variant. The original acts both as a
>> suballocator (to keep from paying the malloc tax on frequent small
>> allocations) and a garbage collection list (to free internal
>> structures on dwarf_end). The patch attempts to replicate the same
>> overall behavior in the more volatile parallel case.
> 
> That is a nice description. Basically it is a little obstack
> implementation. There are a lot of small allocations which we want to
> store together and free together when the Dwarf object is destroyed.
> 
> The allocations (and parsing of DWARF structures) are done lazily,
> so you only pay when you are actually using the data. E.g. if you
> skip a DIE (subtree) or CU, no parsing or allocations are done.
> 
> For example, when parsing all of the Linux kernel debug data we are
> talking about ~535000 allocations; a bit less than half (~233000)
> are of the same small size, 24 bytes.
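
For concreteness, a minimal sketch of that obstack-style scheme, with
all names and the block size invented for illustration and alignment
ignored, would be roughly:

#include <stdlib.h>

#define BLOCK_SIZE (64 * 1024)  /* invented; the real block size differs */

struct memblock
{
  size_t used;            /* bytes already handed out from mem[] */
  struct memblock *prev;  /* older block, kept only so it can be freed */
  char mem[BLOCK_SIZE];
};

/* Hand out SIZE bytes from the newest block, starting a fresh block
   when the current one can't fit the request.  Assumes SIZE <=
   BLOCK_SIZE, which holds for small internal structures like the
   24-byte ones above. */
static void *
suballoc(struct memblock **tail, size_t size)
{
  if (*tail == NULL || BLOCK_SIZE - (*tail)->used < size)
    {
      struct memblock *b = malloc(sizeof *b);
      if (b == NULL)
        return NULL;
      b->used = 0;
      b->prev = *tail;
      *tail = b;
    }
  void *res = (*tail)->mem + (*tail)->used;
  (*tail)->used += size;
  return res;
}

/* On dwarf_end: walk the chain once and free everything. */
static void
suballoc_free(struct memblock *tail)
{
  while (tail != NULL)
    {
      struct memblock *prev = tail->prev;
      free(tail);
      tail = prev;
    }
}

The thread-safety question is then what happens when several threads
call suballoc on the same Dwarf at once.
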
> 
>> On Sat, Oct 26, 2019 at 18:14, Florian Weimer <fw@deneb.enyo.de> wrote:
>> > * Mark Wielaard:
>> >
>> > > I'll see if I can create a case where that is a problem. Then we
>> > > can see how to adjust things to use fewer pthread_keys. Is there
>> > > a different pattern we can use?
>> >
>> > It's unclear what purpose thread-local storage serves in this
>> > context.
>> 
>> The thread-local storage provides the suballocator side: for each
>> Dwarf, each thread has its own "top block" to perform allocations
>> from. To make this simple, each Dwarf has a key to give threads
>> local storage specific to that Dwarf. Or at least that was the
>> intent; I didn't think to consider the limit, since we didn't run
>> into it in our use cases.
> 
> I see that getconf PTHREAD_KEYS_MAX gives 1024 on my machine.
> Is this tunable in any way?

From what I can tell, no. A quick Google search suggests as much, and
it's even #defined as 1024 on my machine.
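
For reference, the pattern here is one pthread_key_create per Dwarf,
along these lines (a simplified sketch; the names are invented and
error handling is elided):

#include <pthread.h>

struct memblock;  /* a thread's "top block", as described above */

struct dwarf_alloc              /* hypothetical per-Dwarf state */
{
  pthread_key_t mem_key;
};

static int
dwarf_alloc_init(struct dwarf_alloc *a)
{
  /* Fails with EAGAIN once PTHREAD_KEYS_MAX keys are in use, i.e.
     once that many Dwarfs are open at once, process-wide. */
  return pthread_key_create(&a->mem_key, NULL);
}

static struct memblock *
thread_top_block(struct dwarf_alloc *a)
{
  /* Each thread sees its own value; NULL means this thread hasn't
     allocated from this Dwarf yet. */
  return pthread_getspecific(a->mem_key);
}

So the keys are consumed per Dwarf, not per thread, which is why
PTHREAD_KEYS_MAX binds on the number of simultaneously open Dwarfs.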

> 
>> There may be other ways to handle this; I'm considering potential
>> alternatives at a high level (with more atomics, of course). The
>> difficulty is mostly in providing the same performance in the
>> single-threaded case.
>> 
>> > You already have a Dwarf *.  I would consider adding some sort of
>> > clone function which creates a shallow Dwarf * with its own
>> > embedded allocator or something like that.
>> 
>> The downside with this is that it's an API addition, which we (the
>> Dyninst + HPCToolkit projects) would need to enforce. That isn't a
>> huge deal for us, but I will need to make a case to those teams to
>> make the shift.
>>
>> On the upside, it does provide a very understandable semantic in the
>> case of parallelism. For an API without synchronization clauses,
>> this would put our work back into the realm of "reasonably correct"
>> (from "technically incorrect but works").
> 
> Could someone give an example of this pattern?
> I don't fully understand what is being proposed and how the interface
> would look exactly.

An application would do something along these lines (a sketch:
dwarf_clone is the proposed addition, and worker_fn is a hypothetical
worker-thread function):

// Main thread:
Dwarf *dbg = dwarf_begin(...);
Dwarf *dbg2 = dwarf_clone(dbg, ...);
pthread_t worker;
pthread_create(&worker, NULL, worker_fn, dbg2);  // hand dbg2 to the worker
// ...
dwarf_get_units(dbg, ...);       // main thread reads through dbg
// ...
pthread_join(worker, NULL);
dwarf_end(dbg);                  // cleans up the shared state

// worker_fn, running concurrently through dbg2:
// ...
dwarf_getabbrev(...);
// ...
dwarf_end(dbg2);

The idea being that dbg2 and dbg share most of the same internal
state, that concurrent access to that state happens between Dwarfs (or
"Dwarf_Views", maybe?), and that the shared state is cleaned up on the
original's dwarf_end. I suppose in database terms the Dwarfs are
acting like separate "cursors" for the internal DWARF data. For this
particular instance, the "top of stack" pointers would live in dbg and
dbg2 (the non-shared state), while the atomic mem_tail would be part
of the internal (shared) state.
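
Sketched as structs (hypothetical; all names are invented, and the
real split would surely differ):

#include <stdatomic.h>

struct memblock;

/* Shared internals, accessed concurrently by all views and freed by
   the original's dwarf_end. */
struct Dwarf_Shared
{
  /* ...parsed DWARF data, abbrev tables, etc... */
  _Atomic(struct memblock *) mem_tail;  /* GC list of all filled blocks */
};

/* A view/cursor: a cheap handle that is never itself shared. */
struct Dwarf
{
  struct Dwarf_Shared *shared;
  struct memblock *mem_top;  /* this view's private "top of stack" */
};

Allocation would then bump the private mem_top, and presumably only
pushing a filled block onto the shared mem_tail needs an atomic
operation.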

I'm not sure how viable implementing this sort of thing would be; it
might end up overhauling a lot of internals, and I'm not familiar
enough with all the components of the API to know whether there would
be quirks with this style. But at least then the implicit blanket
clause "Dwarfs must be externally synchronized (all operations issued
in serial)" wouldn't limit the parallelism at the API level. And those
of us who don't follow that rule wouldn't have to walk on eggshells to
avoid segfaulting.

> 
> Thanks,
> 
> Mark


