new alpha division routines
Richard Henderson
rth@redhat.com
Sun Mar 28 22:06:00 GMT 2004
On Sun, Mar 28, 2004 at 03:24:31PM +0200, Falk Hueffner wrote:
> incidentally, I was also playing with the division routines lately, or
> more precisely __remqu, since it seems to take most time (a whopping
> 30% of all libc cycles when compiling stuff with gcc).
Indeed. I have a patch to the hashing routines to pre-compute
a multiplictive inverse of the primes that we use. I've been
meaning to do some timing tests to see how much it helps.
> * Special case power-of-two divisor. People do that quite often.
I ran my system for a while with an LD_PRELOAD library to capture
divisions. I found
count 77460708
x_lt_2to48 70298527
x_lt_y 4942243
y_pow_2 5109493
y_eq_10 6313795
y_eq_100 882
y_eq_1000 515145
y_eq_10000 307
y_eq_100000 0
y_eq_1000000 15335
x_eq_m1 6443176
(This is from before my strtol patch, so virtually all of the
x==-1 and y==10 instances came from that.) So yes, Y a power of
two does occur moderately frequently.
Without cttz, I'm not sure it's worth special-casing to avoid
the fp division. Except of course I was thinking about this
more in terms of division than remainder; clearly it's trivial
for remainder.
> * Special case small dividends (compared to divisor).
In what way?
> * Never loop, use second divt/c if lacking precision. This avoids data
> dependent branches and gives good results for the important case of
> huge dividends (nearly all 64 bit set) and medium dividends which
> occurs in hash tables.
I'm surprised you measured a win here; divt is a minimum of 22 cycles;
using the first divt result as an estimate, we go around the fixup loop
usually no more than 3 times. Oh, wait, 22 was ev5; ev6 does divt in
15 cycles. Hmm.
I wonder if we can use divs for the second division? It would save
about three cycles on ev6, but between 7-29 on ev5 and 29 cycles on ev4.
We'd have to be able to show that the remainder has no more than 24
bits set at this point... We'd have to cvtts the divisor in order to
avoid UNPREDICTABLE results, but there should be pleanty of spare
cycles around to get that issued.
Another option might be to unroll the fixup loop N times such that N
covers the majority of operands that we actually see. That would
cure the bulk of the branch predicion problem and might even come
in at less than ev6 12 cycle divs.
> By the way, what do you think about inlining the 32-bit ops from gcc?
It's a good idea. Need to think about how to do it though, since
now we've got to inline the trap-if-zero code too.
I'll have a look at your algorithm on pre-ev6 as well.
r~
More information about the Libc-alpha
mailing list