[PATCH v1] libm,ieee754:New algorithm of fmod function.

Mon Nov 30 13:33:48 GMT 2020

Hi Kirill,

> I like an adventures games, where hint-by-hint you can reach a goal, but 
> in your cases I give up. Can you write the algorithm explicitly, with 
> only integer arithmetic and without division, with two multiplications, 
> and nothing more heavy like comparison, loops, many shifts etc which can 
> be better for both 64 and 32 bits targets? I really don't know how to do 
> this.

The basic algorithm is simple, it's just long division using a reciprocal
estimate:

num = (num << N) - recip_div (num, recip) * den

where recip_div takes the top bits of the numerator and reciprocal of den
(both must be > N bits) to estimate the next N bits of the long division:

recip_div (num, recip) = ((num >> shift) * recip) >> shift2

The tricky part is ensuring the estimate is accurate enough that it is never
off by more than 1 (this means num can up to 2 * den). This gets harder as
you choose N closer to the number of bits in the reciprocal of den.

Also rather than correcting the error, the idea is to let the next iteration sort
it out and only deal with it at the end of the loop. This uses an extra bit from
recip_div but avoiding the correction is much faster.

On a 64-bit target we can use 64 bits of reciprocal, and I guesN = 62. This needs one
64-bit MUL and a 64-bit mul-high. On a 32-bit target you can get a quick estimate
of the mul-high using 3 32x32->64 bit multiplies and 2 adds, but the increased
error means we likely need to limit N to 60.

The alternative option is to use 32 bits of reciprocal (N=30), so that recip_div uses
a 32x32->64 bit multiply, and recip_div (num, recip) * den is a 32x64->64 bit
multiply.

It's not obvious which is best on 32-bit targets, but it's possible to compare relative
performance of the basic loops with random inputs.

> I can't agree with you, that the new version is generally 20-30% slower. 

It is for any case that uses a division. If it uses a fast path then there is either
no difference or the new version is a bit faster. I'm not too concerned about
the worst case that uses like 186 divisions (since we can remove those divisons),
but the more common cases that use 1-2 divisions, eg. fmod (5.0, 3.0).

> I bought raspberry pi zero, and run my tests there. The raw results are 
> attached. From my side, only "main test" which is test, but not 
> benchmark is 26% slower. In benchmarks the new function have similar or 
> even better performance, compare to libm.

I think only your "main test" tests the division case - the other tests hit the fast
paths or test special cases like denormals with very few bits set. Try fmod (5.0, 3.0)
or for the worst case fmod (DBL_MAX, DBL_MIN * 1.3).

Cheers,
Wilco