This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: DT_GNU_HASH: ~ 50% dynamic linking improvement


You still don't get the cost model at all.  So this is the last mail,
since there is nothing whatsoever new in any of your replies.

The additional hash table requires an additional cache line to be
loaded, so it had better be worth it.  I.e., it must avoid enough
second-level hash table lookups to pay for the second cache line load
incurred whenever there is a hash conflict.  So when is there an
advantage?  Only when the bit in the first table isn't set.
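
In code, the first-level check being costed here looks roughly like
this (a sketch only; the structure and names are illustrative, not
taken from the actual proposal):

  /* Sketch of a first-level filter check.  A clear bit means the
     second-level hash table need not be read at all.  */
  #include <stdint.h>

  struct first_level
  {
    uint32_t nbits;        /* sized at roughly 8 bits per symbol */
    const uint32_t *bits;  /* the bitmap itself */
  };

  /* Return nonzero if the second-level table must be consulted.  */
  static int
  maybe_present (const struct first_level *f, uint32_t hash)
  {
    uint32_t idx = hash % f->nbits;
    return (f->bits[idx / 32] >> (idx % 32)) & 1;
  }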

If you size the first-level hash table at 8*nsymbols bits you have at
most a 12.5% chance of hitting a set bit.  But in those 12.5% of cases
you have at least two cache lines to read.  I.e., the original cost is
(with on average 4 comparisons for an unsuccessful lookup)

   cacheline read + 4 * 10 cycles

while with your scheme you have

  87.5% * (cacheline read + 20 cycles)
  + 12.5% * (2 * cacheline read + 20 cycles + 4 * 10 cycles)

(here I assume each comparison + loop costs about 10 cycles, and the
bit operations for the bitmap test at most 20 cycles).

Now fill in values for the cache line read costs.  An L2 cache read
costs about 15 cycles on a modern machine.  A main memory read costs
> 240 cycles (and the 240 cycles are only achieved when reading
successive memory; random access is more expensive).

                     current               yours

 L2 hit            55 cycles            41.875 cycles

 cache miss        280 cycles           295 cycles
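
For reference, the numbers in this table drop straight out of the
formulas above; a throwaway computation (my own, using only the
per-operation cost assumptions stated above) is:

  #include <stdio.h>

  int
  main (void)
  {
    const double compare = 10.0;              /* one comparison + loop */
    const double bitmap = 20.0;               /* first-level bitmap test */
    const double line[2] = { 15.0, 240.0 };   /* L2 hit, main memory */
    const char *label[2] = { "L2 hit", "cache miss" };

    for (int i = 0; i < 2; ++i)
      {
        double current = line[i] + 4 * compare;
        double yours = 0.875 * (line[i] + bitmap)
                       + 0.125 * (2 * line[i] + bitmap + 4 * compare);
        printf ("%-10s  current %5.1f cycles   yours %7.3f cycles\n",
                label[i], current, yours);
      }
    return 0;
  }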


For successful lookups the formulas are (on average 3 comparisons,
plus the final strcmp):

  cacheline read + 3 * 10 cycles + strcmp

vs

  2 * cacheline read + 20 cycles + 3 * 10 cycles + strcmp

I.e.,

                     current                 yours

  L2 hit           45 cycles + strcmp      80 cycles + strcmp

  cache miss       270 cycles + strcmp     530 cycles + strcmp
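
The same throwaway arithmetic for the successful case (strcmp left out
of the computation since it is identical for both schemes):

  #include <stdio.h>

  int
  main (void)
  {
    const double compare = 10.0, bitmap = 20.0;
    const double line[2] = { 15.0, 240.0 };   /* L2 hit, main memory */
    const char *label[2] = { "L2 hit", "cache miss" };

    for (int i = 0; i < 2; ++i)
      printf ("%-10s  current %3.0f + strcmp   yours %3.0f + strcmp\n",
              label[i],
              line[i] + 3 * compare,                 /* one table walk */
              2 * line[i] + bitmap + 3 * compare);   /* bitmap + table */
    return 0;
  }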


These numbers suggest that the additional table is at best a wash, even
if you assume 95% unsuccessful lookups.  And since the memory accesses
are highly random, the 240-cycle estimate is far too low.  I have some
numbers for the paper I'm preparing for my RH summit talk.  For now
look at the slides:

  http://people.redhat.com/drepper/cpucache-slides.pdf

Page 10 shows the random access times.  Only the asymptotic costs are of
interest here because the nature of the lookups we optimize for means
that none of the data is in a cache.  The cache line read time is the
dominating factor, and the two reads are independent since the lines are
not adjacent.

A second point to keep in mind is that the gap between memory and CPU
speed isn't closing.  And third, the additional memory load increases
cache pressure, causing negative effects elsewhere which could be
avoided by reading only one cache line.

And fourth, the additional table costs disk and virtual memory space.
This is especially true if you try to shift the percentages by
allocating an even larger table.  We easily have DSOs with 5,000 symbols
today.  Multiply by 100 loaded DSOs and perhaps 2 bytes per symbol and
you are already at roughly 1 MB of extra tables.  And those are small
numbers.  Some of RH's customers have DSOs which are each hundreds of
MBs in size with tens of thousands of symbols.  It is not acceptable to
waste that much space, neither on disk nor in memory (we have people who
use every byte of the available address space).


I consider the time for arguing to be over.  If you want to show your
mechanism is better, implement it and time it.  I'm positively sure the
added complexity adds almost no improvement and may even hurt
performance.

-- 
Ulrich Drepper · Red Hat, Inc. · 444 Castro St · Mountain View, CA


