[PATCH] Performance improvement for log()
Steven Munroe
munroesj@linux.vnet.ibm.com
Fri Apr 11 12:43:00 GMT 2008
On Thu, 2008-04-10 at 10:27 -0700, Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Steven Munroe wrote:
> > This patch redefines ABS(x) as __builtin_fabs(x).
>
> And to do this you copy the entire file? This is insane. Do what is
> done elsewhere. Write a wrapper which defines macros and then uses the
> orginal code. I know this means changes to some headers. But code
> duplication isn't acceptable.
>
Would it be better to fix the ABS(x) macro in ./ieee754/dbl-64/mpa.h?
This a more general solution but not sure if all platforms handle
__builtin_fabs(x) these days.
The other problem I am trying to address is the that powerpc64 is
actually faster using 64-bit long vs 32-bit int. So also changed the
local int vars to long and added a longnum union.
Also for bit mashing like the inline nan/inf checks using a union
powerpc32 does better with a single long long then with two ints. I
don't think this is common to other platforms and there is no 32/64
split in the generic ./ieee754 source structure (unless you are
comfortable changing all the local ints to longs in ./ieee754/dbl-64?)
>
> > Also the call out to __isnan() in the posix wrapper code (w_log()) was
> > showing up hot in the profiles. So I added --with-cpu=[power4|power6]
> > implementations that effectively in-line the isnan() test within
> > __log().
>
> That seems to be a bad idea as well. It's no ppc-specific problem.
> Again, use macros to abstract out the actual code use so that other
> archs can parametrize this code as well.
Yes and no. The EXTRACT_WORDS, GET_HIGH_WORD, GET_LOW_WORD, macros are
just terrible because they don't allow scheduling of the following load
away from the store into the union. The result is the load-hit-store
hazard which is primary thing I am trying to avoid.
NET transfer FPRs to/from GPRs is powerpc specific issue.
So while I could macroize the isnan() as a wrapper around EXTRACT_WORDS,
does not solve the scheduling problem. I needed to mark the union
volatile:
double z;
volatile union double_long tmp;
long long x_ll;
/* Copy the double value x to a long to test for NaN without the
overhead of calling isnan or triggering VXSNAN.
Mark the union volatile and store x before calling __ieee754_log()
so we can access the value as a 64-bit long long after
__ieee754_log(). This avoids the load-hit-store hazard when the
store queue is deep. */
tmp.d = x;
z = __ieee754_log(x);
x_ll = tmp.ll;
This is what it takes to keep GCC from moving the load up next to the
store (which creates the load-hit-store hazard I was trying to avoid.)
Is this a problem for other platforms?
Once the the data is transferred then a isnan_bits() macro that uses the
resulting long long value would clean up the source things up.
And w_log.c is the simple case, w_pow.c is the nasty one (with multiple
isnan(), finite(), and signbit() calls). Here transferring the double to
to int/long exactly once pays big.
So does this justify powerpc specific code?
I could merge the power6 and power4 into to a single version if that
would help. Otherwise this would require major restructuring of the
math_private.h macros and the common ./ieee754/dbl-64/ implementations.
Perhaps that should be done, but not just before a release due to the
risk.
More information about the Libc-alpha
mailing list