This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
You still don't get the cost model at all. So this is the last mail since there is nothing whatsoever new in any of your replies.

The additional hash table requires an additional cache line to be loaded. So it had better be worth it. I.e., it must prevent a sufficient number of second-level hash table lookups to justify the second cache line load in case there is a hash conflict.

So, when is there an advantage: only if the bit in the first table isn't set. If you size the first-level hash table as 8*nsymbols you have at most a 12.5% rate of hitting a set bit. But in these 12.5% you have at least two cache lines to read.

I.e., the original cost was (with on average 4 comparisons for an unsuccessful lookup)

  cacheline read + 4 * 10 cycles

while with your scheme you have

  87.5% * (cacheline read + 20 cycles)
  + 12.5% * (2 * cacheline read + 20 cycles + 4 * 10 cycles)

(here I assume each comparison + loop costs about 10 cycles, the most expensive bit operations for the bitmap operation 20 cycles). Now fill in values for the cacheline read costs. An L2 cache read costs about 15 cycles on a modern machine. A main memory read costs > 240 cycles (the 240 cycles are only achieved when reading successive memory; random access is more expensive).

             current          yours
  L1 hit     55 cycles        41.875 cycles
  L1 miss    280 cycles       295 cycles

For successful lookups the formulas are (on average 3 comparisons):

  cacheline read + 3 * 10 cycles + strcmp
  vs
  2 * cacheline read + 20 cycles + 3 * 10 cycles + strcmp

I.e.,

             current               yours
  L1 hit     45 cycles + strcmp    60 cycles + strcmp
  L1 miss    270 cycles + strcmp   510 cycles + strcmp

These numbers suggest that the additional table is at best a wash, even if you assume 95% unsuccessful lookups. Considering that the memory accesses are highly random, the 240 cycle estimate is far too low. I have some numbers for the paper I'll have ready for my RH summit talk. For now look at the slides:

  http://people.redhat.com/drepper/cpucache-slides.pdf

Page 10 shows the random access times. Only the asymptotic costs are of interest here because the nature of the lookups we optimize for means that none of the data is in a cache. The cacheline read time is the dominating factor and the two reads are independent since the lines are not adjacent.

A second point to keep in mind is that the gap between memory and CPU speed isn't closing. Third, the additional memory load increases the cache pressure, causing negative effects elsewhere when they could be avoided by reading only one cache line. And fourth, the additional table costs disk and virtual memory space. This is especially true if you try to shift the percentages by allocating an even larger table. We easily have DSOs with 5,000 symbols today. Multiply by 100 loaded DSOs and perhaps 2 bytes per symbol (5,000 * 100 * 2 bytes is already about 1MB of extra tables). And those are small numbers. Some of RH's customers have DSOs which are each hundreds of MBs in size with 10,000s of symbols. It is not acceptable to waste that much space, on disk or in memory (we have people who use every byte of the available address space).

I consider the time for arguing to be over. If you want to show your mechanism is better, implement it and time it. I'm positively sure the added complexity adds almost no improvement, and may even hurt performance.

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
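The scheme being argued about can be pictured roughly as follows. This is a hypothetical sketch based on one reading of the mail: the struct layout, the field names, and the single-bit first-level filter are invented for illustration and are not the actual glibc or GNU_HASH data structures.

#include <stdint.h>
#include <string.h>

/* Hypothetical per-DSO lookup tables (layout invented for illustration).  */
struct dso_tables
{
  const uint32_t *bitmask;      /* first-level table, ~8 * nsymbols bits    */
  uint32_t bitmask_words;       /* number of 32-bit words in the bitmask    */
  const uint32_t *buckets;      /* second-level hash buckets                */
  uint32_t nbuckets;
  const uint32_t *chains;       /* hash chains, indexed like the symtab     */
  const char *const *names;     /* symbol names, indexed like the symtab    */
};

/* Return the symbol index, or 0 if the symbol is not defined in this DSO.
   The trade-off quantified in the mail is visible here: most unsuccessful
   lookups touch only the bitmask cache line, but a set bit forces at least
   one more independent cache line read for the bucket/chain walk.  */
uint32_t
lookup (const struct dso_tables *t, const char *name, uint32_t hash)
{
  /* First-level filter: one bit per hash value, usually one cache line.  */
  uint32_t nbits = t->bitmask_words * 32;
  uint32_t bit = hash % nbits;
  if ((t->bitmask[bit / 32] & (1u << (bit % 32))) == 0)
    return 0;                   /* definitely not here: one cache line read */

  /* Second-level table: walk the chain, at least one extra cache line.  */
  for (uint32_t i = t->buckets[hash % t->nbuckets]; i != 0; i = t->chains[i])
    if (strcmp (t->names[i], name) == 0)
      return i;
  return 0;
}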
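For the arithmetic itself, a small self-contained program can plug the quoted per-operation costs into the formulas above: a 15-cycle cacheline read when the line is in cache, a >240-cycle read from memory, 10 cycles per comparison+loop, 20 cycles for the bitmap operations, and a 12.5% set-bit rate for an 8*nsymbols first-level table. The constant names are invented; the program only evaluates the cost model as written, it does not measure anything.

#include <stdio.h>

/* Assumed per-operation costs, taken from the figures quoted above.  */
#define COMPARE_CYCLES  10.0    /* one hash-chain comparison + loop step    */
#define BITMAP_CYCLES   20.0    /* worst-case first-level bit operations    */
#define SET_BIT_RATE    0.125   /* 8 * nsymbols bitmask => <= 12.5% set bits */

/* Unsuccessful lookup, current scheme: one cache line, ~4 comparisons.  */
double cost_current_miss (double cacheline)
{
  return cacheline + 4 * COMPARE_CYCLES;
}

/* Unsuccessful lookup, proposed scheme: bitmap check always; the second
   level is only read for the fraction of lookups whose bit is set.  */
double cost_proposed_miss (double cacheline)
{
  return (1.0 - SET_BIT_RATE) * (cacheline + BITMAP_CYCLES)
         + SET_BIT_RATE * (2 * cacheline + BITMAP_CYCLES + 4 * COMPARE_CYCLES);
}

/* Successful lookup (strcmp cost left out): ~3 comparisons on the chain.  */
double cost_current_hit (double cacheline)
{
  return cacheline + 3 * COMPARE_CYCLES;
}

double cost_proposed_hit (double cacheline)
{
  return 2 * cacheline + BITMAP_CYCLES + 3 * COMPARE_CYCLES;
}

int main (void)
{
  /* Cacheline read cost: ~15 cycles from cache, >240 cycles from memory.  */
  const double cases[2] = { 15.0, 240.0 };
  const char *names[2] = { "cache hit ", "cache miss" };

  for (int i = 0; i < 2; ++i)
    printf ("%s  unsuccessful: current %6.2f  proposed %6.2f\n"
            "%s  successful:   current %6.2f  proposed %6.2f  (+ strcmp)\n",
            names[i], cost_current_miss (cases[i]), cost_proposed_miss (cases[i]),
            names[i], cost_current_hit (cases[i]), cost_proposed_hit (cases[i]));
  return 0;
}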