This is the mail archive of the
`libc-alpha@sources.redhat.com`
mailing list for the glibc project.

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |

Other format: | [Raw text] |

*From*: Richard Henderson <rth at redhat dot com>*To*: Falk Hueffner <hueffner at informatik dot uni-tuebingen dot de>*Cc*: libc-alpha at sources dot redhat dot com, Richard Henderson <rth at redhat dot com>*Date*: Sun, 28 Mar 2004 13:28:59 -0800*Subject*: Re: new alpha division routines*References*: <87y8pl2jg0.fsf@informatik.uni-tuebingen.de>

On Sun, Mar 28, 2004 at 03:24:31PM +0200, Falk Hueffner wrote: > incidentally, I was also playing with the division routines lately, or > more precisely __remqu, since it seems to take most time (a whopping > 30% of all libc cycles when compiling stuff with gcc). Indeed. I have a patch to the hashing routines to pre-compute a multiplictive inverse of the primes that we use. I've been meaning to do some timing tests to see how much it helps. > * Special case power-of-two divisor. People do that quite often. I ran my system for a while with an LD_PRELOAD library to capture divisions. I found count 77460708 x_lt_2to48 70298527 x_lt_y 4942243 y_pow_2 5109493 y_eq_10 6313795 y_eq_100 882 y_eq_1000 515145 y_eq_10000 307 y_eq_100000 0 y_eq_1000000 15335 x_eq_m1 6443176 (This is from before my strtol patch, so virtually all of the x==-1 and y==10 instances came from that.) So yes, Y a power of two does occur moderately frequently. Without cttz, I'm not sure it's worth special-casing to avoid the fp division. Except of course I was thinking about this more in terms of division than remainder; clearly it's trivial for remainder. > * Special case small dividends (compared to divisor). In what way? > * Never loop, use second divt/c if lacking precision. This avoids data > dependent branches and gives good results for the important case of > huge dividends (nearly all 64 bit set) and medium dividends which > occurs in hash tables. I'm surprised you measured a win here; divt is a minimum of 22 cycles; using the first divt result as an estimate, we go around the fixup loop usually no more than 3 times. Oh, wait, 22 was ev5; ev6 does divt in 15 cycles. Hmm. I wonder if we can use divs for the second division? It would save about three cycles on ev6, but between 7-29 on ev5 and 29 cycles on ev4. We'd have to be able to show that the remainder has no more than 24 bits set at this point... We'd have to cvtts the divisor in order to avoid UNPREDICTABLE results, but there should be pleanty of spare cycles around to get that issued. Another option might be to unroll the fixup loop N times such that N covers the majority of operands that we actually see. That would cure the bulk of the branch predicion problem and might even come in at less than ev6 12 cycle divs. > By the way, what do you think about inlining the 32-bit ops from gcc? It's a good idea. Need to think about how to do it though, since now we've got to inline the trap-if-zero code too. I'll have a look at your algorithm on pre-ev6 as well. r~

**Follow-Ups**:**Re: new alpha division routines***From:*Falk Hueffner

**References**:**Re: new alpha division routines***From:*Falk Hueffner

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |