This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Optimizing hash table lookup in symbol binding


On 11/18/19 8:58 AM, Florian Weimer wrote:
> On a second-generation AMD EPYC, I didn't see a difference at all for
> some reason.  On Cascade Lake, I see a moderate improvement for the
> dlsym test, but I don't know how realistic this microbenchmark is.  Both
> patches had performance that was on par.
> 
> I also tried to remove the bitmask check altogether, but it was not an
> improvement.  I suspect the bitmasks are much smaller, so consulting
> them avoids the cache misses in the full table lookup.
> 
> If any of the architecture maintainers think this is worth doing, we can
> still incorporate it.
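
For readers following along, the bitmask check discussed above can be
sketched as a small Bloom filter in C.  This is only an illustration
modeled on the DT_GNU_HASH layout, not the glibc code itself: the
struct bloom type and the bloom_add / bloom_maybe_contains helpers are
hypothetical names, and 64-bit filter words are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* The string hash used by DT_GNU_HASH: h = h * 33 + c, seeded with 5381.  */
static uint32_t
gnu_hash (const char *s)
{
  uint32_t h = 5381;
  for (const unsigned char *p = (const unsigned char *) s; *p != '\0'; ++p)
    h = h * 33 + *p;
  return h;
}

/* Hypothetical, simplified filter; assumes 64-bit words.  */
struct bloom
{
  uint64_t *words;   /* Filter words, zero-initialized by the caller.  */
  uint32_t nwords;   /* Number of words; must be a power of two.  */
  uint32_t shift;    /* Shift used to derive the second bit index.  */
};

/* Record HASH by setting its two bits in the word it selects.  */
static void
bloom_add (struct bloom *b, uint32_t hash)
{
  assert (b->nwords != 0 && (b->nwords & (b->nwords - 1)) == 0);
  uint64_t *w = &b->words[(hash / 64) & (b->nwords - 1)];
  *w |= (uint64_t) 1 << (hash % 64);
  *w |= (uint64_t) 1 << ((hash >> b->shift) % 64);
}

/* Return 0 if HASH is definitely absent, 1 if it may be present.
   Only one filter word is read, so a negative answer avoids touching
   the bucket and chain arrays at all.  */
static int
bloom_maybe_contains (const struct bloom *b, uint32_t hash)
{
  uint64_t w = b->words[(hash / 64) & (b->nwords - 1)];
  unsigned int bit1 = hash % 64;
  unsigned int bit2 = (hash >> b->shift) % 64;
  return (int) ((w >> bit1) & (w >> bit2) & 1);
}
```

A negative answer costs a single word load from a small array, which
is consistent with the suspicion above that the filter saves cache
misses relative to going straight to the full table.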

At this point you are probably straying into the realm of the
special-purpose proprietary analysis tools that hardware vendors provide
for getting the most performance out of a given program (stalls, cache
misses, interlocks, etc.).

Have you used perf at all to look at anything beyond raw run time?
That is to say, to confirm that the bitmask check actually saves cache
misses compared with the full table lookup?

What about PGO for this case? I often wonder, in cases like this,
whether feeding "normative" workloads into a PGO build would give us
better generated code, but that is an entire project in itself.

To accept this code I'd want a microbenchmark added, and then we'd
commit the code, and ask the machine maintainers to review that nothing
got terribly worse or was out of kilter with the microbenchmark numbers.

Thoughts?

-- 
Cheers,
Carlos.

