This is the mail archive of the
binutils@sourceware.org
mailing list for the binutils project.
Re: DT_GNU_HASH: reducing working set ...
- From: Michael Meeks <michael dot meeks at novell dot com>
- To: Jakub Jelinek <jakub at redhat dot com>
- Cc: binutils at sources dot redhat dot com, libc-alpha at sources dot redhat dot com, Ulrich Drepper <drepper at redhat dot com>
- Date: Mon, 03 Jul 2006 16:15:20 +0100
- Subject: Re: DT_GNU_HASH: reducing working set ...
- References: <20060628170900.GX3823@sunsite.mff.cuni.cz> <1151605626.20187.72.camel@t60p.site> <20060629193926.GZ3823@sunsite.mff.cuni.cz>
- Reply-to: michael dot meeks at novell dot com
Hi there,
On Thu, 2006-06-29 at 21:39 +0200, Jakub Jelinek wrote:
> More comments later, but as you can see in the patch, undefined
> symbols are never added to .gnu.hash chains, only defined ones
Ah - cool :-) thanks for that.
> Jakub wrote in reply to:
> On Thu, Jun 29, 2006 at 07:27:06PM +0100, Michael Meeks wrote:
> > 1. Extensibility
>
> On the other side, it is IMHO much better to keep using one
> section for one purpose, as ELF has always been doing.
Yes - you're quite right - it's crack smoking to want to put -Bdirect
stuff in that section anyway, as you say: a few minutes thought in a
less excited state convinced me of that [ sadly while off line ].
OTOH - I do think that allowing the stealing of some bits from the hash
code [ for other purposes ] in future is prudent, and need not perform
badly:
if (!(new_hash ^ l->chain[foo]) & l->new_hash_mask))
... hit.
etc. shouldn't be overly slower than a straight comparison surely ?
> Applications keep growing every year and while I currently counted
> ~ 500000 different symbol names, it can soon double, triple and keep
> increasing.
From what I can see - your method was to grab ~all the symbols on your
system, and compare hashes for all of them ? and we only got <100
collisions ? I take Ulrich's point that perhaps many of these are in the
same module [be interesting to see] due to highly similar names etc. but
even so this is a tiny collision rate.
> And, all linker can do for this is try to keep all the chains
> short enough to fit into cachelines.
So - perhaps you can help me here; I've been working with a mental
model of several levels of cache - whereby the more you can compress the
working set, the faster access will be. Now I appreciate the cache line
argument - that after the 1st access, reads inside the same cache line
are really fast [ and (I infer from Ulrich's post) adjacent cache line
reads are free ].
On Fri, 2006-06-30 at 12:36 -0700, Ulrich Drepper wrote:
> On the other hand, what is gained? You already said we have rather
> short hash chains. Most likely not longer than 5 or 6. I can count the
> number of DSOs with longer chains on my hands and there is not more than
> one longer chain in each DSO.
Jakub recently posted this sort of thing:
[snip]
libstdc++.so.6.0.8
Histogram for `.gnu.hash' bucket list length (total of 1022 buckets):
Length Number % of total Coverage
0 52 ( 5.1%)
1 131 ( 12.8%) 4.1%
2 225 ( 22.0%) 18.4%
3 233 ( 22.8%) 40.5%
4 171 ( 16.7%) 62.2%
5 115 ( 11.3%) 80.4%
6 64 ( 6.3%) 92.5%
7 18 ( 1.8%) 96.5%
8 8 ( 0.8%) 98.5%
9 4 ( 0.4%) 99.7%
10 1 ( 0.1%) 100.0%
...
For libc-2.4.90.so:
...
Histogram for `.gnu.hash' bucket list length (total of 1011 buckets):
Length Number % of total Coverage
0 124 ( 12.3%)
1 254 ( 25.1%) 12.4%
2 296 ( 29.3%) 41.2%
3 198 ( 19.6%) 70.2%
4 98 ( 9.7%) 89.3%
5 29 ( 2.9%) 96.4%
6 10 ( 1.0%) 99.3%
7 2 ( 0.2%) 100.0%
[snip]
But sure - long chains should be unusual. It's perhaps interesting to
consider where the performance balance between expanding the base .hash
to get more 0xffffffff hits, vs. shrinking it to fit more of it into L1
cache is.
> Now do the math: even assuming even distribution of the chain we have
> on average 6 chain elements per cache line. If the linker sorts them
> cache lines we can fit in 14. That's enough even for the worst chain
> length.
By "sorts them cache lines" I assume you mean adding padding [perhaps undef]
records / alignment to .dynsym to maximise the length of chain inside
cache lines ? that could certainly be done later; interesting.
> So, on the one hand you save one 4-byte word which in almost no case
> will cause an additional cache line to be loaded. And even if it would,
> the cache lines are consecutive the the processors prefetching takes
> case of it. On the other hand you more than double the likelihood to
> have to compare strings. This is IMO reason enough to not change the
> format.
Nah - let me do the maths another way and perhaps I'll get the result
I'm looking for ;-) by crunching the above numbers that Jakub provided -
I see an avg. no. of comparisons per chain of:
libstdc++ 3.09 # sum({count*length})/sum(count)
glibc 2.03
So a finger-in-the-air avg. chain length is perhaps ~2.5. [1]
My proposed alternate design sacrifices 1 bit of the hash, lets (for
the sake of argument) assume this has the most awful performance
degredation of a factor of 10 (instead of perhaps ~2-4).
The proposed alternate design ('stolen bit') is engineered thus:
* By sorting ~all not-hashed .dynsym entried to the 'end' we end
up with contiguous runs of .dynsym entries [ as now ] in
chains, adjacent to each other. Thus - by stealing a
terminator bit we avoid needing a chain length. [ the 'offset'
in the .hash can then just be a .dynsym index ]
So - lets also assume we have 10^6 unique symbols across a few
libraries - just for argument.
It seems then that the likelihood (with a purely random access pattern)
of a cache hit [ ie. not taking a L1/L2 cache miss ;-] is ~=
sizeof(cache)/sizeof(data_set).
With a 64k L1 cache: and a 4Mb working set (stolen-bit) this gives a
1.6% hit rate. [ for a large L2 of course it's higher, but lets look a
the worst case, smallest cache, where this effect is lowest ]
If we increase the size of each field by 4 bytes per 2.5 records [ as
per the existing 'chain-len' design ] from our avg 400k chains we get: a
1.6Mb increase (factor of 1.4) to a 5.6Mb working set. This gives us a
lower cache hit rate of 1.1%.
But - of course; as you point out there is a penalty for hash
collisions - by reducing precision we loose on comparison:
32bits - 30/530000 = 0.005% false positives
31bits - 10x worse = 0.05%
So - taking our 1 million record hypothetical set:
L1 hits false positives
chain-len. 11k 56
stolen-bit 16k 560
So - if we assume a (conservative) cost of a false positive of ~10 L2
cache misses [ it's more like 3 ] then this seems reasonable.
So - of course each false positive is prolly less than the 10 L2 cache
misses necessary to make these compare (right?). As the cache grows, of
course the assumptions start to fail, but with a 2Mb L2 cache
L1 hits: 64k L2: 2Mb false positives
chain-len. 11k 500k 56
stolen-bit 16k 360k 560
So now we need a false positive to generate 280 L2 misses to get to a
performance parity for the (existing) chain-len solution.
Of course, it's entirely probable that my numbers, assumptions,
calculations and raison-d'etre are entirely incorrect :-) Of course,
wrt. data set size much more than just searching in the .dynsym section
is going on at link time diluting the argument [etc.] Is any of it even
slightly convincing ?
Either way - since I now have time; if you're interested I'm happy to
try to hack this up so it can be profiled both ways; presumably
cachegrind can give a nice repeatable answer. If cachegrind shows it
being more cache efficient - would you consider a re-work / simplify ?
Thanks,
Michael.
[1] - of course assuming that we almost never hit - which I believe is
the common case.
--
michael.meeks@novell.com <><, Pseudo Engineer, itinerant idiot