Created attachment 15523 [details] Taylor series - data I tested it on mips64, with no optimized implementation yet. I get an almost 40% performance boost.
Created attachment 15524 [details] Taylor series - source
Created attachment 15525 [details] Taylor series - source for data
Please post these to libc-alpha@sourceware.org for review by the development community. Please review: https://sourceware.org/glibc/wiki/Contribution%20checklist Please note that there are log2 microbenchmarks that should be run via `make bench` to validate the before and after performance.
You also need to check a complete implementation, with all error tests, against the glibc math checks, which check not only for spurious/invalid exceptions and return codes but also the precision for a set of inputs. Keep in mind that optimizing error paths does influence performance (that's one of the reasons we removed the old SVID wrappers).

I tried to check your implementation and as-is it shows a *lot* of regressions, even with the extra Taylor expansions (d001_15_5 and d001_15_6):

  Test suite completed:
    264 test cases plus 260 tests for exception flags and
    260 tests for errno executed.
    266 errors occurred.

While some are small ulp increases for some inputs, other tests show that the implementation is wrong in most cases:

  Failure: Test: log2 (0)
  Result:
   is:          0.0000000000000000e+00   0x0.0000000000000p+0
   should be:  -inf  -inf
  Failure: log2 (-0): Exception "Divide by zero" not set
  Failure: log2 (-0): errno set to 0, expected 34 (ERANGE)
  [...]
  Failure: Test: log2 (0x1.07465bdc7e41cp+0)
  Result:
   is:          4.0425841401434279e-02   0x1.4b2b2257702c6p-5
   should be:   4.0425841401429338e-02   0x1.4b2b22576fffep-5
   difference:  4.9404924595819466e-15   0x1.6400000000000p-48
   ulp       :  712.0000
   max.ulp   :  1.0000
  [...]
  Failure: Test: log2 (0xb.54170d5cfa9p-4)
  Result:
   is:         -4.9811801879064710e-01  -0x1.fe12a661043e0p-2
   should be:  -4.9811801879074219e-01  -0x1.fe12a66104a91p-2
   difference:  9.5090602059144658e-14   0x1.ac40000000000p-44
   ulp       :  1713.0000
   max.ulp   :  1.0000
  [...]
  Failure: Test: log2_downward (0x8p-972)
  Result:
   is:          0.0000000000000000e+00   0x0.0000000000000p+0
   should be:  -9.6900000000000000e+02  -0x1.e480000000000p+9
   difference:  9.6900000000000000e+02   0x1.e480000000000p+9
   ulp       :  8523414138519552.0000
   max.ulp   :  3.0000

Such high ULPs usually mean that the numerical method is not a good fit. The current log2, which originally came from ARM Optimized Routines [1], was crafted not only for the best performance but also for correctness and good precision.
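For anyone trying to reproduce the ulp figures in the failures above outside the test suite, here is a minimal sketch. It counts how many representable doubles lie between the result and the expected value by comparing their bit patterns. This is a simplification and not the test harness's own code; the helper names ordered_bits and ulp_distance are made up for illustration, and NaN inputs are assumed not to occur.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a double onto a signed integer line that is
   monotonic in the double's value, so that adjacent doubles differ by 1.
   Both +0 and -0 map to 0.  NaNs are not handled.  */
static int64_t
ordered_bits (double x)
{
  int64_t i;
  memcpy (&i, &x, sizeof i);
  /* Negative doubles have the sign bit set; mirror them below zero.  */
  return i < 0 ? INT64_MIN - i : i;
}

/* Hypothetical helper: number of representable doubles between A and B.
   The subtraction is done in unsigned arithmetic to avoid signed
   overflow when A and B have opposite signs.  */
static uint64_t
ulp_distance (double a, double b)
{
  int64_t ia = ordered_bits (a);
  int64_t ib = ordered_bits (b);
  return ia > ib ? (uint64_t) ia - (uint64_t) ib
                 : (uint64_t) ib - (uint64_t) ia;
}
```

Calling ulp_distance (0x1.4b2b2257702c6p-5, 0x1.4b2b22576fffep-5) gives 712 and ulp_distance (-0x1.fe12a661043e0p-2, -0x1.fe12a66104a91p-2) gives 1713, matching the two precision failures quoted above.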
Also, you need a more robust performance evaluation than a simple loop (the glibc benchtests at least try some random values). And even with the glibc benchtests this implementation is way slower (running on an aarch64 N1):

  aarch64-linux-gnu$ ./benchtests/bench-log2
    "log2": {
     "": {
      "duration": 1.00061e+09,
      "iterations": 6.24e+07,
      "max": 281.22,
      "min": 14.64,
      "mean": 16.0354
     }
    }

Compared to the current implementation:

  aarch64-linux-gnu$ ./benchtests/bench-log2
    "log2": {
     "": {
      "duration": 9.97923e+08,
      "iterations": 1.516e+08,
      "max": 292.54,
      "min": 6.44,
      "mean": 6.5826
     }
    }

[1] https://github.com/ARM-software/optimized-routines
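To illustrate the point about random inputs, here is a rough standalone sketch of a timing loop that feeds a function pseudo-random arguments instead of one fixed value. It is not the glibc benchtests harness and its numbers are not comparable with bench-log2 output. All names (next_u64, fill_inputs, mean_ns_per_call, baseline) are hypothetical, and it measures CPU time with the standard clock (), which is coarse.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: xorshift64 generator, deterministic per seed.  */
static uint64_t
next_u64 (uint64_t *state)
{
  uint64_t x = *state;
  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  return *state = x;
}

/* Fill BUF with N pseudo-random doubles spread between LO and HI.  */
static void
fill_inputs (double *buf, size_t n, double lo, double hi, uint64_t seed)
{
  uint64_t state = seed ? seed : 1;	/* xorshift must not start at 0.  */
  for (size_t i = 0; i < n; i++)
    {
      /* Top 53 bits scaled into [0, 1).  */
      double u = (double) (next_u64 (&state) >> 11) * 0x1p-53;
      buf[i] = lo + u * (hi - lo);
    }
}

/* Mean CPU time per call of FN in nanoseconds, over ROUNDS passes
   through the N inputs in IN.  The volatile sink keeps the compiler
   from discarding the calls.  */
static double
mean_ns_per_call (double (*fn) (double), const double *in, size_t n,
		  size_t rounds)
{
  volatile double sink = 0.0;
  clock_t start = clock ();
  for (size_t r = 0; r < rounds; r++)
    for (size_t i = 0; i < n; i++)
      sink = fn (in[i]);
  clock_t end = clock ();
  (void) sink;
  return (double) (end - start) / CLOCKS_PER_SEC * 1e9
	 / ((double) rounds * (double) n);
}

/* Does nothing; timing it estimates the loop and call overhead.  */
static double
baseline (double x)
{
  return x;
}
```

To time log2 itself, pass log2 from <math.h> as FN (linking with -lm) and subtract the figure obtained for baseline. Varying LO and HI shows whether the speed depends on the input range, which a loop over a single value cannot reveal.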
I would also advise against adding an arch-specific implementation of complex functions like exp/log; this is usually a big maintainability burden. Recently we removed a lot of Intel implementations after the ARM Optimized Routines ones were added, because the generic implementation used better numerical methods.
Yes, you are right. I guess it may be a problem in the MIPS compiler, which cannot produce a good enough binary. I will try to find the real problem.
Created attachment 15537 [details] Taylor series implementation Just for anybody who is interested: this implementation is somewhat faster than the current one on Loongson 3A4000 (>50%), but somewhat slower on ARM64 (30%). Just for reference.