This is the mail archive of the elfutils-devel@sourceware.org mailing list for the elfutils project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

From: Jonathon Anderson <jma14 at rice dot edu>
To: Mark Wielaard <mark at klomp dot org>
Cc: elfutils-devel at sourceware dot org, Srdan Milakovic <sm108 at rice dot edu>
Date: Sun, 27 Oct 2019 12:49:21 -0500
Subject: Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
References: <1566877968.10901.0@smtp.mail.rice.edu> <20190829131614.18190-1-mark@klomp.org> <20190829131614.18190-4-mark@klomp.org> <574ab190fb7be83abfaef0977f904a15de7a521a.camel@klomp.org> <1572063095.9092.1@rice.edu> <37fd5ac667e321734842ec1ec76e30ea880a53ab.camel@klomp.org>



On Sun, Oct 27, 2019 at 17:13, Mark Wielaard <mark@klomp.org> wrote:

On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <mark@klomp.org<mailto:mark@klomp.org>> wrote:
 > For example I am not sure I can proof that in resize_worker() this
 > always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> In general the handling of the resizing_state field is prettytricky.
 > What is the maximum number of concurrent threads it can handle?

 The master thread waits for all workers to deregister before
transitioning from CLEANING to NO_RESIZING. Since the worker threadbythat point has registered (and the synchronization is arranged assuch)it will never see a transition all the way back to NO_RESIZINGbeforegetting its word in. This way a worker thread doesn't "miss theboat"
 and end up in an later resizing cycle.
Thanks. I think my confusion came from the fact that these states are
expressed by 2 STATE_BITs. I didn't realize they are distinct states
and cannot hold simultaneously. I somehow had the impression allstates
could be "active" at the same time. The bit-fiddling with
atomic_fetch_xor_explicit is tricky.
So the busy loop to check for GET_ACTIVE_WORKERS in resize_masterworks
because the master itself never does an STATE_INCREMENT itself.

Are these busy loops not expensive? There are a couple in the code
where are thread is just spinning busily till all other threads are
ready and/or the state is changed.

In theory (if the system isn't overloaded) the threads should finishtheir individual work at around the same time, so the amount of waitingany one thread would do (should be) very short. Also, this is only onceper resize, which (shouldn't) happen very often anyway.

The other busy loops (in insert_helper) are also read-only with asingle concurrent store, so they (should) spin in cache until theinvalidation. Again, (should) be a short wait.

If it starts to be an issue we can add some exponential backoff, but Ihaven't noticed any issues in general (and, a nanosleep itself is aspin for suitably short durations, or so I've been told).

The maximum number of threads is a size_t minus 2 bits, so bareminimum
 is 16384 on a 16-bit machine (which standard C still technically
supports). Any modern system will allow either a billion (30 bits)or 4
 quintillion (62 bits).
Ah, right, that was also connected to my confusion about how many bits
were used to hold the state. 1 billion threads should be enough for
anybody :)
 > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't
 > use
> COMPARE (nor ITERATE and REVERSE). I haven't tried to replacethat one> with the concurrent version, but I think we should. It is onlyused> when there are debug type units in the Dwarf. Which is not thedefault> (and some gcc hackers really don't like them) so you willprobably not> encounter them normally. But we should probably make looking themup
 > also concurrent safe as a followup patch.

 A quick grep -r shows that ITERATE and REVERSE are used for the
 asm_symbol_tab. If iteration and insertion are not concurrent we can
easily add bidirectional iteration (using the same Treiberstack-like
 structure as used for the memory management). Also COMPARE is not
 defined to be (0) in this instance.
Yes. We would need some other solution for the libasm code. But Ithink
we can use the concurrent table for everything in libdw.
> I only tested the single-threaded case. It is slightly slower,but not
 > very much. The previous patch made the code slightly faster, that
> slight speed increase is gone now. But that also means it isabout on
 > par with the old version without any concurrent improvements.
 >
 > It does use a bit more memory though. For each CU the abbrev table
> structure is 112 bytes larger. Given that there can be several1000
 > CUs
> in a (large) Dwarf that means a couple of hundred K of extramemory> use. It is probably acceptable since such files contain 100 MBsof DIE
 > data.

 That extra memory comes from the state for resizing mostly, so it
probably could be reclaimed. Maybe it could live on the masterthread's
 stack and then be accessed via pointer by the worker threads? If its
enough of an issue to consider fixing I can work out a simple patchfor
 it.
I am not too concerned by it. But haven't done much measurements. You
probably have one of the most demanding programs using this code. Are
you seeing a significant use in memory?

Depends on your definition of significant, anything less than 10MBwould fall entirely below my radar, and the biggest thing I've triedhas 935 CUs. We also tend to close our Dwarfs regularly, I think themaximum open we've used so far is one per thread.

That said, I can also envision cases (say, a gdb on elfutils on a smallsystem) that might be prohibited (or at least annoyed) by this. I canlower the overhead to 64 bytes easily (moving bits to the masterthread's stack), but that's the limit with this design.

> But I was wondering whether next_init_block andnum_initialized_blocks
 > could be shared with next_move_block and num_moved_blocks. If I
> understand the code in resize_helper() correctly then all threadsare> either initializing or all threads are moving. So can't we justuse
 > one
 > next_block and one num_blocks field?

 The main issue is that either a) somehow the fields have to be reset
between phases (which means something like a double barrier, and/ora
 "chosen" thread), or b) the second phase has to go in reverse (which
has the complication of juggling the difference in endingconditions,
 and dealing with the phase transition properly).
Not impossible, just probably too complex to be worth the 16 bytegain.
Yes. Now that I have a better handle on the state changes I see that
this is trickier than I thought. It looks like just 16 bytes, but this
is per CU. Given that there can be a thousand CUs that does add up.But
probably not worth it at this point (see also above about memory usage
measurements).

Thanks,

Mark

Follow-Ups:
- Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  - From: Mark Wielaard

References:
- Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  - From: Mark Wielaard
- Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  - From: Jonathon Anderson
- Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  - From: Mark Wielaard

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]