Bug 31744 - Math: maybe better generic log2 implementation
Status: RESOLVED NOTABUG
Alias: None
Product: glibc
Classification: Unclassified
Component: math
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-16 10:04 UTC by YunQiang Su
Modified: 2024-05-27 02:57 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Taylor series - data (881 bytes, text/plain)
2024-05-16 10:04 UTC, YunQiang Su
Taylor series - source (700 bytes, text/x-csrc)
2024-05-16 10:05 UTC, YunQiang Su
Taylor series - source for data (165 bytes, text/x-csrc)
2024-05-16 10:05 UTC, YunQiang Su
Taylor series implementation (29.99 KB, application/x-gzip)
2024-05-27 02:57 UTC, YunQiang Su

Description YunQiang Su 2024-05-16 10:04:50 UTC
Created attachment 15523 [details]
Taylor series - data

I tested this on mips64, which has no optimized log2 implementation yet.
I get an almost 40% performance boost.
Comment 1 YunQiang Su 2024-05-16 10:05:17 UTC
Created attachment 15524 [details]
Taylor series - source
Comment 2 YunQiang Su 2024-05-16 10:05:43 UTC
Created attachment 15525 [details]
Taylor series - source for data
Comment 3 Carlos O'Donell 2024-05-16 11:28:18 UTC
Please post these to libc-alpha@sourceware.org for review by the development community.

Please review:
https://sourceware.org/glibc/wiki/Contribution%20checklist

Please note that there are log2 microbenchmarks that should be run via `make bench` to validate the before and after performance.
Comment 4 Adhemerval Zanella 2024-05-16 12:34:42 UTC
You also need to check a complete implementation, with all error tests, against the glibc math checks, which check not only for spurious/invalid exceptions and return codes but also the precision for a set of inputs. Keep in mind that optimizing error paths does influence performance (that's one of the reasons we removed the old SVID wrappers).

I tried to check your implementation and as-is it shows a *lot* of regressions even with the extra Taylor expansions (d001_15_5 and d001_15_6):

Test suite completed:
  264 test cases plus 260 tests for exception flags and
    260 tests for errno executed.
  266 errors occurred.

While some failures are small ULP increases for some inputs, other tests show that the implementation is wrong in most cases:

Failure: Test: log2 (0)
Result:
 is:          0.0000000000000000e+00   0x0.0000000000000p+0
 should be:  -inf  -inf
Failure: log2 (-0): Exception "Divide by zero" not set
Failure: log2 (-0): errno set to 0, expected 34 (ERANGE)
[...]
Failure: Test: log2 (0x1.07465bdc7e41cp+0)
Result:
 is:          4.0425841401434279e-02   0x1.4b2b2257702c6p-5
 should be:   4.0425841401429338e-02   0x1.4b2b22576fffep-5
 difference:  4.9404924595819466e-15   0x1.6400000000000p-48
 ulp       :  712.0000
 max.ulp   :  1.0000
[...]
Failure: Test: log2 (0xb.54170d5cfa9p-4)
Result:
 is:         -4.9811801879064710e-01  -0x1.fe12a661043e0p-2
 should be:  -4.9811801879074219e-01  -0x1.fe12a66104a91p-2
 difference:  9.5090602059144658e-14   0x1.ac40000000000p-44
 ulp       :  1713.0000
 max.ulp   :  1.0000
[...]
Failure: Test: log2_downward (0x8p-972)
Result:
 is:          0.0000000000000000e+00   0x0.0000000000000p+0
 should be:  -9.6900000000000000e+02  -0x1.e480000000000p+9
 difference:  9.6900000000000000e+02   0x1.e480000000000p+9
 ulp       :  8523414138519552.0000
 max.ulp   :  3.0000

Such high ULP errors usually mean that the numerical method is not a good fit. The current log2, which originally came from ARM Optimized Routines [1], was crafted not only to have the best performance but also to show correctness and good precision.
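
To make the ULP figures in the failures above concrete, here is a minimal sketch that counts how many representable doubles lie between two values by comparing their bit patterns. It is not the code the glibc harness uses; `ordered_bits` and `ulp_distance` are hypothetical names, NaN inputs are not handled, and the count only agrees with an "error divided by the ULP of the expected value" measure when both values fall in the same binade.

```c
#include <stdint.h>
#include <string.h>

/* Map a double onto an integer line where adjacent doubles differ by 1.
   Assumes IEEE 754 binary64; hypothetical helper.  */
static int64_t
ordered_bits (double x)
{
  int64_t i;
  memcpy (&i, &x, sizeof i);
  /* Negative doubles sort in reverse bit order; mirror them below zero
     so that -0.0 and +0.0 both map to 0.  */
  return i < 0 ? INT64_MIN - i : i;
}

/* Number of representable doubles between a and b (0 if they are equal).
   NaN inputs are not handled.  */
static uint64_t
ulp_distance (double a, double b)
{
  int64_t oa = ordered_bits (a);
  int64_t ob = ordered_bits (b);
  /* Subtract as unsigned so widely separated values cannot overflow.  */
  return oa >= ob ? (uint64_t) oa - (uint64_t) ob
                  : (uint64_t) ob - (uint64_t) oa;
}
```

Applied to the first precision failure, `ulp_distance (0x1.4b2b2257702c6p-5, 0x1.4b2b22576fffep-5)` gives 712, and the second pair gives 1713, matching the reported values.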

Also, you need a more robust performance evaluation than a simple loop (glibc benchtests at least try some random values). And even with the glibc benchtests this implementation is way slower (running on an aarch64 N1):

aarch64-linux-gnu$ ./benchtests/bench-log2
  "log2": {
   "": {
    "duration": 1.00061e+09,
    "iterations": 6.24e+07,
    "max": 281.22,
    "min": 14.64,
    "mean": 16.0354
   }
  }

Compared to the current implementation:

aarch64-linux-gnu$ ./benchtests/bench-log2
  "log2": {
   "": {
    "duration": 9.97923e+08,
    "iterations": 1.516e+08,
    "max": 292.54,
    "min": 6.44,
    "mean": 6.5826
   }
  }

[1] https://github.com/ARM-software/optimized-routines
Comment 5 Adhemerval Zanella 2024-05-16 12:39:20 UTC
I would also advise against adding arch-specific implementations of complex functions like exp/log; this is usually a big maintainability burden. Recently we removed a lot of Intel implementations after the ARM optimized routines ones were added, because the generic implementation used better numerical methods.
Comment 6 YunQiang Su 2024-05-17 00:12:50 UTC
Yes, you are right. I guess the problem may be in the MIPS compiler, which cannot produce a good enough binary.
I will try to find the real problem.
Comment 7 YunQiang Su 2024-05-27 02:57:14 UTC
Created attachment 15537 [details]
Taylor series implementation

For anyone who is interested:

This implementation is somewhat faster than the current one on Loongson 3A4000 (>50%).

But it is slower on ARM64, by 30%.

Just for reference.