new alpha division routines
Falk Hueffner
hueffner@informatik.uni-tuebingen.de
Mon Mar 29 02:03:00 GMT 2004
Richard Henderson <rth@redhat.com> writes:
> On Sun, Mar 28, 2004 at 03:24:31PM +0200, Falk Hueffner wrote:
>> incidentally, I was also playing with the division routines lately, or
>> more precisely __remqu, since it seems to take most time (a whopping
>> 30% of all libc cycles when compiling stuff with gcc).
>
> Indeed. I have a patch to the hashing routines to pre-compute
> a multiplictive inverse of the primes that we use. I've been
> meaning to do some timing tests to see how much it helps.
Another option would be to just use powers of two as table size and
use a non-sucking hash function. I think that is already used in some
of the special-cases hash tables in gcc.
>> * Special case power-of-two divisor. People do that quite often.
>
> I ran my system for a while with an LD_PRELOAD library to capture
> divisions.
Could you perhaps send me the source for that? I'd be interested in
some __remqu stats.
> So yes, Y a power of two does occur moderately frequently.
I'd expect it to be more frequent for the remainder functions. It's
probably not worth doing for divisions.
>> * Special case small dividends (compared to divisor).
>
> In what way?
x < y and x < 2y is detected early. This seems to only apply to
modulo.
> I wonder if we can use divs for the second division? It would save
> about three cycles on ev6, but between 7-29 on ev5 and 29 cycles on ev4.
> We'd have to be able to show that the remainder has no more than 24
> bits set at this point...
At first glance, that should actually work. There ought to be only
about 12 bits left.
> Another option might be to unroll the fixup loop N times such that N
> covers the majority of operands that we actually see. That would
> cure the bulk of the branch predicion problem and might even come
> in at less than ev6 12 cycle divs.
I tried that briefly, but it didn't work so well considering each
unrolled loop takes 3 cycles. It should be better on ev5 though where
cmov is cheaper and division more expensive.
>> By the way, what do you think about inlining the 32-bit ops from gcc?
>
> It's a good idea. Need to think about how to do it though, since
> now we've got to inline the trap-if-zero code too.
Well, division by zero is undefined behaviour... do you think there's
code which relies on it being trappable and resumable?
Anyway, here's my not very well tested patch for gcc... if you think
I'm on the right track I could try to clean it up and submit it to
gcc-patches.
--
Falk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gcc-alpha-inline-divmod2.patch
Type: text/x-patch
Size: 9517 bytes
Desc: not available
URL: <http://sourceware.org/pipermail/libc-alpha/attachments/20040329/a76bd2d1/attachment.bin>
More information about the Libc-alpha
mailing list