This is the mail archive of the
mailing list for the glibc project.
Re: new alpha division routines
- From: Richard Henderson <rth at redhat dot com>
- To: Falk Hueffner <hueffner at informatik dot uni-tuebingen dot de>
- Cc: libc-alpha at sources dot redhat dot com, Richard Henderson <rth at redhat dot com>
- Date: Sun, 28 Mar 2004 13:28:59 -0800
- Subject: Re: new alpha division routines
- References: <firstname.lastname@example.org>
On Sun, Mar 28, 2004 at 03:24:31PM +0200, Falk Hueffner wrote:
> incidentally, I was also playing with the division routines lately, or
> more precisely __remqu, since it seems to take most time (a whopping
> 30% of all libc cycles when compiling stuff with gcc).
Indeed. I have a patch to the hashing routines to pre-compute
a multiplictive inverse of the primes that we use. I've been
meaning to do some timing tests to see how much it helps.
> * Special case power-of-two divisor. People do that quite often.
I ran my system for a while with an LD_PRELOAD library to capture
divisions. I found
(This is from before my strtol patch, so virtually all of the
x==-1 and y==10 instances came from that.) So yes, Y a power of
two does occur moderately frequently.
Without cttz, I'm not sure it's worth special-casing to avoid
the fp division. Except of course I was thinking about this
more in terms of division than remainder; clearly it's trivial
> * Special case small dividends (compared to divisor).
In what way?
> * Never loop, use second divt/c if lacking precision. This avoids data
> dependent branches and gives good results for the important case of
> huge dividends (nearly all 64 bit set) and medium dividends which
> occurs in hash tables.
I'm surprised you measured a win here; divt is a minimum of 22 cycles;
using the first divt result as an estimate, we go around the fixup loop
usually no more than 3 times. Oh, wait, 22 was ev5; ev6 does divt in
15 cycles. Hmm.
I wonder if we can use divs for the second division? It would save
about three cycles on ev6, but between 7-29 on ev5 and 29 cycles on ev4.
We'd have to be able to show that the remainder has no more than 24
bits set at this point... We'd have to cvtts the divisor in order to
avoid UNPREDICTABLE results, but there should be pleanty of spare
cycles around to get that issued.
Another option might be to unroll the fixup loop N times such that N
covers the majority of operands that we actually see. That would
cure the bulk of the branch predicion problem and might even come
in at less than ev6 12 cycle divs.
> By the way, what do you think about inlining the 32-bit ops from gcc?
It's a good idea. Need to think about how to do it though, since
now we've got to inline the trap-if-zero code too.
I'll have a look at your algorithm on pre-ev6 as well.