Hi,
On Sat, 2019-10-26 at 11:45 -0500, Jonathon Anderson wrote:
For some overall perspective, this patch replaces the original libdw
allocator with a thread-safe variant. The original acts both as a
suballocator (to keep from paying the malloc tax on frequent small
allocations) and a garbage collection list (to free internal structures
on dwarf_end). The patch attempts to replicate the same overall
behavior in the more volatile parallel case.
That is a nice description. Basically it is a little obstack
implementation. There are a lot of small allocations which we want to
store together and free together when the Dwarf object is destroyed.
The allocations (and the parsing of DWARF structures) are done lazily,
so you only pay when you are actually using the data. E.g. if you skip
a DIE (subtree) or CU, no parsing or allocations are done.
For example, when parsing all of the Linux kernel debug data we are
talking about ~535000 allocations, of which a bit less than half
(~233000) are of the same small size, 24 bytes.
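To make the "little obstack" idea concrete, here is a minimal
single-threaded sketch of that pattern: small requests are carved out
of larger malloc'd blocks, and everything is released in one pass at
the end. All names (struct arena, arena_alloc, arena_free_all) are
made up for illustration; this is not the actual libdw code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not libdw's actual code.  */
struct mem_block
{
  struct mem_block *prev;   /* Previously filled block, or NULL.  */
  size_t remaining;         /* Bytes still free in this block.  */
  char *next_free;          /* Next free byte inside this block.  */
};

struct arena
{
  struct mem_block *top;    /* Block currently used for allocations.  */
  size_t block_size;        /* Default payload size of a new block.  */
};

static void
arena_init (struct arena *a, size_t block_size)
{
  a->top = NULL;
  a->block_size = block_size;
}

/* Round SIZE up to a multiple of the maximum fundamental alignment.  */
static size_t
round_up (size_t size)
{
  size_t align = _Alignof (max_align_t);
  return (size + align - 1) / align * align;
}

/* Hand out SIZE bytes from the top block, starting a new block when
   the current one is too full.  Returns NULL if malloc fails.  */
static void *
arena_alloc (struct arena *a, size_t size)
{
  size = round_up (size);
  if (a->top == NULL || a->top->remaining < size)
    {
      size_t payload = size > a->block_size ? size : a->block_size;
      size_t header = round_up (sizeof (struct mem_block));
      struct mem_block *b = malloc (header + payload);
      if (b == NULL)
        return NULL;
      b->prev = a->top;
      b->remaining = payload;
      b->next_free = (char *) b + header;
      a->top = b;
    }
  void *result = a->top->next_free;
  a->top->next_free += size;
  a->top->remaining -= size;
  return result;
}

/* Free every block at once (the role dwarf_end plays in the text).  */
static void
arena_free_all (struct arena *a)
{
  struct mem_block *b = a->top;
  while (b != NULL)
    {
      struct mem_block *prev = b->prev;
      free (b);
      b = prev;
    }
  a->top = NULL;
}
```

With a 4096-byte block, a run of 24-byte requests costs one malloc per
block instead of one per request, which is the "malloc tax" being
avoided.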
On Sat, Oct 26, 2019 at 18:14, Florian Weimer <fw@deneb.enyo.de> wrote:
> * Mark Wielaard:
>
> > I'll see if I can create a case where that is a problem. Then we
> > can see how to adjust things to use fewer pthread_keys. Is there a
> > different pattern we can use?
>
> It's unclear what purpose thread-local storage serves in this
> context.
The thread-local storage provides the suballocator side: for each
Dwarf, each thread has its own "top block" to perform allocations from.
To make this simple, each Dwarf has a pthread key that gives threads
local storage specific to that Dwarf. Or at least that was the intent;
I didn't think to consider the key limit, since we didn't run into it
in our use cases.
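As a rough illustration of the scheme described above (one pthread key
per object, one "top block" per thread per object), something like the
following. The struct and function names are hypothetical stand-ins,
not the patch's real code; only the pthread_* calls are real API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Dwarf handle; not libdw's real layout.  */
struct fake_dwarf
{
  pthread_key_t mem_key;    /* One key per object: the scarce resource.  */
};

/* Hypothetical per-thread state: the calling thread's own "top block".  */
struct thread_top
{
  size_t used;              /* Bytes handed out from buf so far.  */
  char buf[4096];
};

/* Returns 0 on success, else the pthread_key_create error code
   (EAGAIN once the process has run out of keys).  */
static int
fake_dwarf_init (struct fake_dwarf *dbg)
{
  /* free is run on a thread's non-NULL value when that thread exits.  */
  return pthread_key_create (&dbg->mem_key, free);
}

/* Return the calling thread's top block for DBG, creating it lazily.  */
static struct thread_top *
fake_dwarf_thread_top (struct fake_dwarf *dbg)
{
  struct thread_top *top = pthread_getspecific (dbg->mem_key);
  if (top == NULL)
    {
      top = calloc (1, sizeof *top);
      if (top == NULL || pthread_setspecific (dbg->mem_key, top) != 0)
        {
          free (top);
          return NULL;
        }
    }
  return top;
}

static void
fake_dwarf_fini (struct fake_dwarf *dbg)
{
  /* pthread_key_delete does not run destructors, so release the
     calling thread's block by hand.  Blocks owned by other live
     threads would need separate tracking, which is omitted here.  */
  free (pthread_getspecific (dbg->mem_key));
  pthread_key_delete (dbg->mem_key);
}
```

The fast path is a single pthread_getspecific with no locking, which
is what makes the approach attractive. The cost is that every open
object holds one key for its whole lifetime.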
I see that getconf PTHREAD_KEYS_MAX gives 1024 on my machine.
Is this tunable in any way?
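For checking the limit from a program rather than from getconf, a
small sketch along these lines should work. sysconf with
_SC_THREAD_KEYS_MAX and pthread_key_create are standard POSIX; the
helper names and the 65536 probing cap are arbitrary choices made for
this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

/* The limit as reported at run time, or -1 if it is indeterminate.  */
static long
thread_keys_max (void)
{
  return sysconf (_SC_THREAD_KEYS_MAX);
}

/* Create keys until creation fails, then delete them all again.
   Stores the failing error code (or 0) in *ERR and returns how many
   keys could be created.  Not thread-safe: uses a static buffer.  */
static int
count_available_keys (int *err)
{
  enum { PROBE_CAP = 65536 };   /* Arbitrary cap for this sketch.  */
  static pthread_key_t keys[PROBE_CAP];
  int n = 0;
  *err = 0;
  while (n < PROBE_CAP)
    {
      int rc = pthread_key_create (&keys[n], NULL);
      if (rc != 0)
        {
          *err = rc;
          break;
        }
      n++;
    }
  for (int i = 0; i < n; i++)
    pthread_key_delete (keys[i]);
  return n;
}
```

The second helper reports how many keys are actually still free in
the current process, which can be lower than the nominal limit if
other libraries already hold keys.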
There may be other ways to handle this; I'm considering potential
alternatives at a high level (with more atomics, of course). The
difficulty is mostly in providing the same performance in the
single-threaded case.
> You already have a Dwarf *. I would consider adding some sort of
> clone function which creates a shallow Dwarf * with its own embedded
> allocator or something like that.
The downside with this is that it's an API addition, which we (the
Dyninst + HPCToolkit projects) would need to enforce. That isn't a
huge deal for us, but I will need to make a case to those teams to make
the shift.
On the upside, it does provide a very understandable semantic in the
case of parallelism. For an API without synchronization clauses, this
would put our work back into the realm of "reasonably correct" (from
"technically incorrect but works").
Could someone give an example of this pattern?
I don't fully understand what is being proposed and how the interface
would look exactly.
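One possible reading of the clone idea, purely as a sketch: each
thread gets a shallow handle that shares the read-only data of the
parent but owns its own allocation list, so no thread-local storage
or pthread keys are needed. Every name below (struct handle,
handle_clone_shallow, and so on) is hypothetical. libdw has no such
function today and the real interface might look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Everything here is hypothetical; libdw has no such clone function.  */
struct shared_data
{
  const char *debug_info;       /* Read-only data shared by all handles.  */
};

struct chunk
{
  struct chunk *next;
  max_align_t payload[];        /* Caller's memory, suitably aligned.  */
};

struct handle
{
  struct shared_data *shared;   /* Owned by the parent handle only.  */
  struct chunk *allocs;         /* Owned: freed when this handle ends.  */
  int is_clone;
};

static struct handle *
handle_begin (const char *debug_info)
{
  struct handle *h = calloc (1, sizeof *h);
  struct shared_data *s = malloc (sizeof *s);
  if (h == NULL || s == NULL)
    {
      free (h);
      free (s);
      return NULL;
    }
  s->debug_info = debug_info;
  h->shared = s;
  return h;
}

/* A shallow clone: same shared data, separate embedded allocator.
   The parent must outlive all of its clones.  */
static struct handle *
handle_clone_shallow (struct handle *parent)
{
  struct handle *h = calloc (1, sizeof *h);
  if (h == NULL)
    return NULL;
  h->shared = parent->shared;
  h->is_clone = 1;
  return h;
}

/* Allocate from H only.  No locking: one thread per handle.  */
static void *
handle_alloc (struct handle *h, size_t size)
{
  struct chunk *c = malloc (offsetof (struct chunk, payload) + size);
  if (c == NULL)
    return NULL;
  c->next = h->allocs;
  h->allocs = c;
  return c->payload;
}

static void
handle_end (struct handle *h)
{
  while (h->allocs != NULL)
    {
      struct chunk *next = h->allocs->next;
      free (h->allocs);
      h->allocs = next;
    }
  if (!h->is_clone)
    free (h->shared);
  free (h);
}
```

In use, each worker thread would call handle_clone_shallow once, do
all of its lookups through its own clone, and call handle_end on it
before the parent is ended. The open question this leaves is how
lazily parsed results would be shared between clones, which the
sketch does not attempt to answer.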