This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: new alpha division routines


On Sun, Mar 28, 2004 at 03:24:31PM +0200, Falk Hueffner wrote:
> incidentally, I was also playing with the division routines lately, or
> more precisely __remqu, since it seems to take most time (a whopping
> 30% of all libc cycles when compiling stuff with gcc).

Indeed.  I have a patch to the hashing routines to pre-compute
a multiplictive inverse of the primes that we use.  I've been
meaning to do some timing tests to see how much it helps.

> * Special case power-of-two divisor. People do that quite often.

I ran my system for a while with an LD_PRELOAD library to capture
divisions.  I found

count             77460708
x_lt_2to48        70298527
x_lt_y             4942243
y_pow_2            5109493
y_eq_10            6313795
y_eq_100               882
y_eq_1000           515145
y_eq_10000             307
y_eq_100000              0
y_eq_1000000         15335
x_eq_m1            6443176

(This is from before my strtol patch, so virtually all of the
x==-1 and y==10 instances came from that.) So yes, Y a power of
two does occur moderately frequently.

Without cttz, I'm not sure it's worth special-casing to avoid
the fp division.  Except of course I was thinking about this
more in terms of division than remainder; clearly it's trivial
for remainder.

> * Special case small dividends (compared to divisor).

In what way?

> * Never loop, use second divt/c if lacking precision. This avoids data
>   dependent branches and gives good results for the important case of
>   huge dividends (nearly all 64 bit set) and medium dividends which
>   occurs in hash tables.

I'm surprised you measured a win here; divt is a minimum of 22 cycles;
using the first divt result as an estimate, we go around the fixup loop
usually no more than 3 times.  Oh, wait, 22 was ev5; ev6 does divt in
15 cycles.  Hmm.

I wonder if we can use divs for the second division?  It would save
about three cycles on ev6, but between 7-29 on ev5 and 29 cycles on ev4.
We'd have to be able to show that the remainder has no more than 24
bits set at this point...  We'd have to cvtts the divisor in order to
avoid UNPREDICTABLE results, but there should be pleanty of spare
cycles around to get that issued.

Another option might be to unroll the fixup loop N times such that N
covers the majority of operands that we actually see.  That would 
cure the bulk of the branch predicion problem and might even come
in at less than ev6 12 cycle divs.

> By the way, what do you think about inlining the 32-bit ops from gcc?

It's a good idea.  Need to think about how to do it though, since
now we've got to inline the trap-if-zero code too.

I'll have a look at your algorithm on pre-ev6 as well.


r~


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]