[PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

Sat Oct 26 04:11:00 GMT 2019

On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> Sorry this took so long to review. But it is pretty complex code.
> I think I got how it works mostly. It is hard to proof correct though.
> How did you convince yourself that the code is correct?

No worries, its a complex piece of code.

Srdan designed the hash table portions, so I might have to defer to him 
on some points. That said, I'll try to answer what I can.

I think the core parallel logic can be clarified by noting:
 - The total order for insertions and lookups is based on the val_ptr 
atomic field(s). The hashval field(s) act as a sort of "peek" for 
catching collisions before issuing a full CAS.
 - Wrt resizing, the table is always in one of five phases: 
NO_RESIZING, ALLOCATING_MEMORY, MOVING_DATA (initialization), 
MOVING_DATA (insertion), and CLEANING. Four of those five transitions 
are handled by a "master" thread chosen by a CAS on the resizing_state 
field.

> 
> For example I am not sure I can proof that in resize_worker() this
> always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> In general the handling of the resizing_state field is pretty tricky.
> What is the maximum number of concurrent threads it can handle?

The master thread waits for all workers to deregister before 
transitioning from CLEANING to NO_RESIZING. Since the worker thread by 
that point has registered (and the synchronization is arranged as such) 
it will never see a transition all the way back to NO_RESIZING before 
getting its word in. This way a worker thread doesn't "miss the boat" 
and end up in an later resizing cycle.

The maximum number of threads is a size_t minus 2 bits, so bare minimum 
is 16384 on a 16-bit machine (which standard C still technically 
supports). Any modern system will allow either a billion (30 bits) or 4 
quintillion (62 bits).

> 
> I tried to simplify the code a little. You already observed that
> COMPARE can be zero. But it always is. We never try comparing values.
> So all the COMPARE and value passing to the find functions can simply
> be removed. So if you agree I would like to merge the attached
> simplification diff into this patch.

I'm fine with it, although at a glance it seems that this means 
insert_helper will never return -1. Which doesn't quite sound accurate, 
so I'll have to defer to Srdan on this one.

> 
> The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't use
> COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
> with the concurrent version, but I think we should. It is only used
> when there are debug type units in the Dwarf. Which is not the default
> (and some gcc hackers really don't like them) so you will probably not
> encounter them normally. But we should probably make looking them up
> also concurrent safe as a followup patch.

A quick grep -r shows that ITERATE and REVERSE are used for the 
asm_symbol_tab. If iteration and insertion are not concurrent we can 
easily add bidirectional iteration (using the same Treiber stack-like 
structure as used for the memory management). Also COMPARE is not 
defined to be (0) in this instance.

> 
> I only tested the single-threaded case. It is slightly slower, but not
> very much. The previous patch made the code slightly faster, that
> slight speed increase is gone now. But that also means it is about on
> par with the old version without any concurrent improvements.
> 
> It does use a bit more memory though. For each CU the abbrev table
> structure is 112 bytes larger. Given that there can be several 1000 
> CUs
> in a (large) Dwarf that means a couple of hundred K of extra memory
> use. It is probably acceptable since such files contain 100 MBs of DIE
> data.

That extra memory comes from the state for resizing mostly, so it 
probably could be reclaimed. Maybe it could live on the master thread's 
stack and then be accessed via pointer by the worker threads? If its 
enough of an issue to consider fixing I can work out a simple patch for 
it.

> 
> But I was wondering whether next_init_block and num_initialized_blocks
> could be shared with next_move_block and num_moved_blocks. If I
> understand the code in resize_helper() correctly then all threads are
> either initializing or all threads are moving. So can't we just use 
> one
> next_block and one num_blocks field?

The main issue is that either a) somehow the fields have to be reset 
between phases (which means something like a double barrier, and/or a 
"chosen" thread), or b) the second phase has to go in reverse (which 
has the complication of juggling the difference in ending conditions, 
and dealing with the phase transition properly).

Not impossible, just probably too complex to be worth the 16 byte gain.

> 
> I do think the code itself is good. The above are mostly just 
> nitpicks.
> But if you could reply and give your thoughts on them that would be
> appreciated.
> 
> Thanks,
> 
> Mark