This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: ppc64 vDSO in mainline
- From: Steve Munroe <sjmunroe at us dot ibm dot com>
- To: Ulrich Drepper <drepper at redhat dot com>
- Cc: Alan Modra <amodra at bigpond dot net dot au>, Benjamin Herrenschmidt <benh at kernel dot crashing dot org>, libc-alpha at sources dot redhat dot com, Roland McGrath <roland at redhat dot com>
- Date: Tue, 29 Mar 2005 08:50:14 -0600
- Subject: Re: ppc64 vDSO in mainline
Ulrich Drepper <drepper@redhat.com> wrote on 03/28/2005 05:59:36 PM:
> Steve Munroe wrote:
> > 3) a function which is currently
> > exported by libc , but a better optimized version (with a different
> > symbol) is also exported by the VDSO.
>
> There is no reason to add any complications or dependency problems for
> this. Just using a pointer in libc itself, a test for NULL and if not,
> jump to the function is enough. The penalty for this extra indirection
> is minimal compared to all the other work involved.
>
But this level of indirection is unacceptable overhead for some functions.
I have rewritten memcpy twice and will rewrite it again. I am in the
process of rewriting memcmp and then strncmp. Why? Because they show up as
hotspots in important benchmarks like SPEC and TPC-C.
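For reference, the pointer-plus-NULL-check scheme proposed in the quoted text might look roughly like the sketch below. The names vdso_memcmp_ptr, generic_memcmp and dispatch_memcmp are hypothetical, chosen only for illustration; they are not glibc or kernel symbols, and how the pointer would actually be filled in from the vDSO is left out. Every call pays for a load, a compare and an indirect branch before any real work starts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: a pointer that libc startup code would set if the vDSO
   exported an optimized memcmp. NULL means no such export was found. */
typedef int (*memcmp_fn)(const void *, const void *, size_t);
static memcmp_fn vdso_memcmp_ptr = NULL;

/* Generic byte-at-a-time fallback (illustrative, not glibc's code). */
static int generic_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;
    for (size_t i = 0; i < n; i++) {
        if (p[i] != q[i])
            return p[i] < q[i] ? -1 : 1;
    }
    return 0;
}

/* The dispatch itself: load the pointer, test it for NULL, then make
   an indirect call. This is the per-call overhead under discussion. */
int dispatch_memcmp(const void *a, const void *b, size_t n)
{
    memcmp_fn f = vdso_memcmp_ptr;
    if (f != NULL)
        return f(a, b, n);
    return generic_memcmp(a, b, n);
}
```

With the pointer left NULL the generic routine runs; pointing it at any function with the memcmp signature redirects every later call.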
I can do this aggressive optimization for powerpc64 because I have access
to all currently released 64-bit implementations. I can't do the same for
powerpc32 because there are so many different varieties of 32-bit
implementations. My best efforts for a powerpc32 memcpy/memcmp on
POWER4/POWER5 might make users of older pMAC and 4xx embedded hardware
very unhappy. But if I know I am running on a PPC64 kernel I will know
exactly which processor I am running on and can provide appropriately
optimized string functions for both powerpc32 and powerpc64.
There is no mechanism in glibc to deal with this problem
(processor-specific optimization).
So the additional overhead (at least 9 cycles with the NULL pointer check)
does matter. The new memcmp can compare 8 bytes per cycle, so 9 cycles of
overhead cost as much as comparing 72 bytes (8 x 9 == 72). That overhead
is significant.
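The arithmetic can be checked with a small sketch. It assumes an idealized model in which the compare loop sustains a fixed number of bytes per cycle and has no startup cost of its own; the function names are hypothetical.

```c
#include <stddef.h>

/* Bytes the compare loop could have processed during the dispatch
   overhead, assuming a constant throughput in bytes per cycle. */
static size_t bytes_lost_to_overhead(size_t overhead_cycles,
                                     size_t bytes_per_cycle)
{
    return overhead_cycles * bytes_per_cycle;
}

/* Share of total cycles spent on dispatch overhead for an n-byte
   compare, as a percentage, under the same idealized model. */
static double overhead_percent(size_t n, size_t overhead_cycles,
                               size_t bytes_per_cycle)
{
    double work_cycles = (double)n / (double)bytes_per_cycle;
    return 100.0 * (double)overhead_cycles
           / ((double)overhead_cycles + work_cycles);
}
```

With 9 cycles of overhead and 8 bytes per cycle, the first function gives 72 bytes, and under this model a 72-byte compare spends half of its cycles on dispatch.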
If you don't believe me, put that G5 to good use and find out for
yourself: http://www.alphaworks.ibm.com/tech/simppc,
http://sourceforge.net/projects/perfinsp
Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center