[PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
Sun Oct 27 16:13:00 GMT 2019
On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <firstname.lastname@example.org> wrote:
> > For example I am not sure I can proof that in resize_worker() this
> > always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> > In general the handling of the resizing_state field is pretty tricky.
> > What is the maximum number of concurrent threads it can handle?
> The master thread waits for all workers to deregister before
> transitioning from CLEANING to NO_RESIZING. Since the worker thread by
> that point has registered (and the synchronization is arranged as such)
> it will never see a transition all the way back to NO_RESIZING before
> getting its word in. This way a worker thread doesn't "miss the boat"
> and end up in an later resizing cycle.
Thanks. I think my confusion came from the fact that these states are
expressed by 2 STATE_BITs. I didn't realize they are distinct states
and cannot hold simultaneously. I somehow had the impression all states
could be "active" at the same time. The bit-fiddling with
atomic_fetch_xor_explicit is tricky.
So the busy loop to check for GET_ACTIVE_WORKERS in resize_master works
because the master itself never does an STATE_INCREMENT itself.
Are these busy loops not expensive? There are a couple in the code
where are thread is just spinning busily till all other threads are
ready and/or the state is changed.
> The maximum number of threads is a size_t minus 2 bits, so bare minimum
> is 16384 on a 16-bit machine (which standard C still technically
> supports). Any modern system will allow either a billion (30 bits) or 4
> quintillion (62 bits).
Ah, right, that was also connected to my confusion about how many bits
were used to hold the state. 1 billion threads should be enough for
> > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't
> > use
> > COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
> > with the concurrent version, but I think we should. It is only used
> > when there are debug type units in the Dwarf. Which is not the default
> > (and some gcc hackers really don't like them) so you will probably not
> > encounter them normally. But we should probably make looking them up
> > also concurrent safe as a followup patch.
> A quick grep -r shows that ITERATE and REVERSE are used for the
> asm_symbol_tab. If iteration and insertion are not concurrent we can
> easily add bidirectional iteration (using the same Treiber stack-like
> structure as used for the memory management). Also COMPARE is not
> defined to be (0) in this instance.
Yes. We would need some other solution for the libasm code. But I think
we can use the concurrent table for everything in libdw.
> > I only tested the single-threaded case. It is slightly slower, but not
> > very much. The previous patch made the code slightly faster, that
> > slight speed increase is gone now. But that also means it is about on
> > par with the old version without any concurrent improvements.
> > It does use a bit more memory though. For each CU the abbrev table
> > structure is 112 bytes larger. Given that there can be several 1000
> > CUs
> > in a (large) Dwarf that means a couple of hundred K of extra memory
> > use. It is probably acceptable since such files contain 100 MBs of DIE
> > data.
> That extra memory comes from the state for resizing mostly, so it
> probably could be reclaimed. Maybe it could live on the master thread's
> stack and then be accessed via pointer by the worker threads? If its
> enough of an issue to consider fixing I can work out a simple patch for
I am not too concerned by it. But haven't done much measurements. You
probably have one of the most demanding programs using this code. Are
you seeing a significant use in memory?
> > But I was wondering whether next_init_block and num_initialized_blocks
> > could be shared with next_move_block and num_moved_blocks. If I
> > understand the code in resize_helper() correctly then all threads are
> > either initializing or all threads are moving. So can't we just use
> > one
> > next_block and one num_blocks field?
> The main issue is that either a) somehow the fields have to be reset
> between phases (which means something like a double barrier, and/or a
> "chosen" thread), or b) the second phase has to go in reverse (which
> has the complication of juggling the difference in ending conditions,
> and dealing with the phase transition properly).
> Not impossible, just probably too complex to be worth the 16 byte gain.
Yes. Now that I have a better handle on the state changes I see that
this is trickier than I thought. It looks like just 16 bytes, but this
is per CU. Given that there can be a thousand CUs that does add up. But
probably not worth it at this point (see also above about memory usage
More information about the Elfutils-devel