[PATCH] libm,ieee754: New algorithm of fmod function for dbl-64/wordsize-64

Thu Nov 19 13:28:02 GMT 2020

I am proposing a new algorithm for fmod calculations for the double type
on wordsize-64 architectures. The algorithm description is in the file.
I successfully ran the internal tests on x86_64 processors.
I also ran extensive tests and benchmarks with my own test suite on
x86_64 and ARM64. See
https://github.com/orex/test_fmod
My tests on x86_64 (Intel, AMD) and ARM64 show that the new algorithm is
up to 20 times faster for "extreme" cases and up to two times faster for
regular uses of the function.
My unit tests also show that the old and new algorithms give
bit-identical results for each of billions of different pairs (x, y) over
a wide range of numbers, including normal, subnormal, and special values
(NaN, INF, 0).
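
For illustration, a bit-for-bit comparison of this kind can be sketched as
below. This is not the code from the test suite linked above; the names
same_bits and first_mismatch are hypothetical. Comparing raw bit patterns
rather than using == matters here because NaN never compares equal to
itself and +0.0 == -0.0 is true even though the representations differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (double) == sizeof (uint64_t),
               "this sketch assumes a 64-bit double");

/* Return nonzero if A and B have exactly the same 64-bit representation.
   Unlike a == b, this distinguishes +0.0 from -0.0 and treats two NaNs
   with the same payload as equal.  */
static int
same_bits (double a, double b)
{
  uint64_t ua, ub;
  memcpy (&ua, &a, sizeof ua);
  memcpy (&ub, &b, sizeof ub);
  return ua == ub;
}

/* Compare N results of the old and new implementations.  Return the index
   of the first pair that differs in any bit, or N if all pairs match.  */
static size_t
first_mismatch (const double *old_res, const double *new_res, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (!same_bits (old_res[i], new_res[i]))
      return i;
  return n;
}
```

A harness would fill old_res and new_res by calling both implementations
on the same (x, y) pairs and report the inputs at the returned index.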

In reply to adhemerval.zanella@linaro.org from the libc-help thread:

> I won't comment to your implementation until it is submitted to
> libc-alpha with proper Copyright assignment, but some remarks:
FSF copyright assignment status: in progress.
>
>   1. If you are benchmarking on i686, it won't use the generic
>      sysdeps/ieee754/dbl-64/e_fmod.c but rather a asm specific
>      implementation (sysdeps/i386/fpu/e_fmod.S).  It would be good
>      if you check if we can remove the i386 assembly implementation
>      in favor of the generic one.
You can of course remove it in favour of the generic implementation, but
I assume you would get a strong performance degradation. It is worth
checking. I don't have an i686 machine, and I don't know how relevant
tests on my CPU would be.
>
>   2. __builtin_clz might be a libcall on some targets, which might
>      be worse than using the loop.  It is usually not an issue on
>      some implementation (it is used on some float128 which is
>      usually soft implementation anyway), but it is something to
>      keep in mind.
I checked the libm code. Apart from ldbl-128, the function is already used in:

sysdeps/ieee754/flt-32/s_logbf.c:      rix -= __builtin_clz (ix) - 9;
sysdeps/ieee754/dbl-64/s_logb.c:      int m = __builtin_clzll (ix);

Second, the loop is not efficient anyway. There is a more efficient
table-based approach (for CLZ and CTZ):

https://www.geeksforgeeks.org/count-trailing-zero-bits-using-lookup-table/

Perhaps it is already implemented in GCC?
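
As a rough illustration of the table idea for CLZ, here is a minimal
sketch using a 16-entry nibble table. It is an assumption-laden example,
not the code from the patch or from GCC; clz4_table, clz64_table and
clz64_selfcheck are hypothetical names. It needs at most 16 iterations
instead of up to 64 for a bit-by-bit loop.

```c
#include <assert.h>
#include <stdint.h>

/* Leading-zero count of every 4-bit value (4 for zero).  */
static const unsigned char clz4_table[16] =
  { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

/* Count leading zeros of X by scanning nibbles from the most significant
   end.  Returns 64 for X == 0, where __builtin_clzll is undefined.  */
static int
clz64_table (uint64_t x)
{
  int n = 0;
  for (int shift = 60; shift >= 0; shift -= 4)
    {
      unsigned int nibble = (unsigned int) (x >> shift) & 0xf;
      if (nibble != 0)
        return n + clz4_table[nibble];
      n += 4;
    }
  return 64;
}

/* Cross-check against the compiler builtin (GCC/Clang) for every
   single-bit input.  */
static void
clz64_selfcheck (void)
{
  for (int bit = 0; bit < 64; bit++)
    {
      uint64_t x = UINT64_C (1) << bit;
      assert (clz64_table (x) == __builtin_clzll (x));
    }
}
```

On targets with a hardware CLZ instruction the builtin should still be
the better choice; the table is only a fallback where the builtin would
become a libcall.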

>
>   3. It would be good if could get rid of the wordsize-64
>      implementation and just have a generic one good enough regardless
>      of word size.
I can't imagine such an algorithm. A double is 64 bits wide. Splitting
it across two 32-bit variables will, of course, have a performance impact.
The algorithm I propose (see the description) works strictly with 64-bit
variables.
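
To show what working with a single 64-bit variable looks like, here is a
small sketch of pulling the fields of a double out of one uint64_t. It
assumes the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52
mantissa bits); the helper names are hypothetical and this is not code
taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (double) == sizeof (uint64_t),
               "this sketch assumes a 64-bit double");

/* Reinterpret the bits of D as one 64-bit integer.  memcpy avoids
   aliasing problems and compiles to a plain register move.  */
static uint64_t
double_to_bits (double d)
{
  uint64_t u;
  memcpy (&u, &d, sizeof u);
  return u;
}

/* Biased exponent field (0 for zero/subnormal, 0x7ff for INF/NaN).  */
static int
biased_exponent (double d)
{
  return (int) ((double_to_bits (d) >> 52) & 0x7ff);
}

/* The 52 explicitly stored mantissa bits, without the implicit 1.  */
static uint64_t
mantissa_bits (double d)
{
  return double_to_bits (d) & ((UINT64_C (1) << 52) - 1);
}
```

With two 32-bit words, each of these operations (and every shift or
subtraction on the mantissa) has to handle the boundary between the high
and low word explicitly.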

Best,
Kirill.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-libm-ieee754-New-algorithm-of-fmod-function-for-dbl-.patch
Type: text/x-patch
Size: 10749 bytes
Desc: not available
URL: <https://sourceware.org/pipermail/libc-alpha/attachments/20201119/ea06d697/attachment-0001.bin>