Created attachment 6310 [details] test case Attached test case shows that pow is over 10,000 slower for some inputs than for other (nearby) inputs. This was first noticed with eglibc-2.11, and reproduced with current git trunk. Results: 32-bit build: ./elf/ld.so --library-path .:math:dlfcn ./a.out pow(0.562500, 1.5): 13.644814 calls/usec pow(0.563250, 1.5): 13.647048 calls/usec ratio: 1 64-bit build: ./elf/ld.so --library-path .:math:dlfcn ./a.out pow(0.562500, 1.5): 0.000982 calls/usec pow(0.563250, 1.5): 13.574397 calls/usec ratio: 13828 perf shows: # Events: 15K cycles # # Overhead Command Shared Object Symbol # ........ ....... ................. ................................... # 63.31% ld.so libm.so [.] __mul 18.44% ld.so libm.so [.] __ieee754_pow_sse2 13.81% ld.so libm.so [.] __exp1 1.20% ld.so libm.so [.] __pow 1.09% ld.so libm.so [.] sub_magnitudes 0.43% ld.so a.out [.] main 0.38% ld.so libm.so [.] __cpy 0.21% ld.so libm.so [.] add_magnitudes 0.19% ld.so libm.so [.] @plt

I can still reproduce it with 2.12 GNU C Library stable release version 2.12, by Roland McGrath et al. Copyright (C) 2010 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Compiled by GNU CC version 4.4.6 20120305 (Red Hat 4.4.6-4). Compiled on a Linux 2.6.32 system on 2012-07-18. Available extensions: The C stubs add-on version 2.1.2. crypt add-on version 2.1 by Michael Glad and others GNU Libidn by Simon Josefsson Native POSIX Threads Library by Ulrich Drepper et al BIND-8.2.3-T5B RT using linux kernel aio libc ABIs: UNIQUE IFUNC For bug reporting instructions, please see: <http://www.gnu.org/software/libc/bugs.html>. A testcase: #include <math.h> #ifndef POW #define POW pow #endif #define N 1500 int total[N][N]; double alpha = 2.81; int n = N; int main(int argc, char *argv[]) { int x = 30976397; x += 0; double trail = 1. / ( (double) N * (double) x) ; long i,j; for ( i = 0 ; i < N ; i++ ) { for ( j = 0 ; j < i ; j++ ) { total[i][j] = POW(trail, alpha); } } return 0; } $ gcc -o slowpow slowpow.c -lm $ /usr/bin/time ./slowpow 213.75user 0.04system 3:34.27elapsed 99%CPU (0avgtext+0avgdata 12784maxresident)k 0inputs+0outputs (0major+838minor)pagefaults 0swaps (Interrupted, it was running forever) $ gcc -DPOW=powl -o slowpow slowpow.c -lm $ /usr/bin/time ./slowpow 0.20user 0.01system 0:00.22elapsed 98%CPU (0avgtext+0avgdata 35040maxresident)k 0inputs+0outputs (0major+2229minor)pagefaults 0swaps $ gcc -DPOW=powf -o slowpow slowpow.c -lm $ /usr/bin/time ./slowpow 0.23user 0.01system 0:00.25elapsed 95%CPU (0avgtext+0avgdata 35088maxresident)k 160inputs+0outputs (1major+2231minor)pagefaults 0swaps At least, it would be nice if the man page mentioned this issue and what kind of inputs may lead to it. It causes a lot of frustration: http://entropymine.com/imageworsener/slowpow/

(In reply to comment #1) > I can still reproduce it with 2.12 2.12 is ancient history. Current git trunk: # i686: Could not get the test to work. Something appears to be broken with my toolchain, or with trunk in i686 mode. # x86_64: ./elf/ld.so --library-path .:math:dlfcn /tmp/test_pow64 pow(0.562500, 1.5): 0.000990 calls/usec pow(0.563250, 1.5): 12.753271 calls/usec ratio: 12887

Assigning this to me since I'm working on cleaning up the mp code, which should make this better.

(In reply to comment #3) > Assigning this to me since I'm working on cleaning up the mp code, which should > make this better. Thanks for working on this. As I said above, it would be nice if for the time being, a note was added to the man page of pow warning about this issue. If you happen to know what kind of inputs trigger it, that would be interesting information.

If we did, we wouldn't have had to do all those iterations at different precision levels and the result would have been much faster :) Nevertheless, I've added patches to master that ought to improve performance by at least 4x in the worst case, bringing it close to mpfr performance (still about 2x slower though). There's still scope to make it faster, which is why I haven't closed this bug yet. The man pages are not part of glibc, they're a separate project by themselves. There could be a case for adding a note explaining the multiprecision fallback (why it's there, overview, impact, etc.) in the glibc manual though.

Created bug 15267 to track the documentation fix.

(In reply to comment #5) > Nevertheless, I've added patches to master that ought to improve performance by > at least 4x in the worst case, bringing it close to mpfr performance (still > about 2x slower though). There's still scope to make it faster, which is why I > haven't closed this bug yet. Hi, thanks for your work. Personally, I would be interested to have a pow version that is perhaps less precise but that does not have such nasty corner-cases. A 4x improvement is nice, but given that the worst-case may be 12,000x slower, it is not a definitive solution. There are some implementations on the net that perhaps could be used when -ffast-math is given? http://jrfonseca.blogspot.be/2008/09/fast-sse2-pow-tables-or-polynomials.html http://www.dctsystems.co.uk/Software/power.html http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/ I am no expert in IEEE floating-point arithmetic, so it may well be that those implementations produce extremely bad approximations for some inputs. Is that the case?

(In reply to comment #7) > Hi, thanks for your work. Personally, I would be interested to have a pow > version that is perhaps less precise but that does not have such nasty > corner-cases. A 4x improvement is nice, but given that the worst-case may be > 12,000x slower, it is not a definitive solution. > > There are some implementations on the net that perhaps could be used when > -ffast-math is given? > > http://jrfonseca.blogspot.be/2008/09/fast-sse2-pow-tables-or-polynomials.html > > http://www.dctsystems.co.uk/Software/power.html > > http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/ > > I am no expert in IEEE floating-point arithmetic, so it may well be that those > implementations produce extremely bad approximations for some inputs. Is that > the case? They're all in a way subsets of what the fast phase of libm does, so we don't really need to reimplement libm to take advantage of those ideas. That said, there is indeed a strong case for fast and (for some inputs) imprecise computations of transcendentals. That probably won't be the default in glibc, but I won't rule out the possibility of having an alternate library or configuration, provided there is consensus in the glibc community for it.

*** Bug 16898 has been marked as a duplicate of this bug. ***

(In reply to Paul Pluzhnikov from comment #2) > # x86_64: > ./elf/ld.so --library-path .:math:dlfcn /tmp/test_pow64 > pow(0.562500, 1.5): 0.000990 calls/usec > pow(0.563250, 1.5): 12.753271 calls/usec > ratio: 12887 Note that mathematically x^1.5 = sqrt(x)^3. Now, 0.562500 is a perfect square: ? sqrt(0.562500) %1 = 0.75000000000000000000000000000000000000 This is probably related to the slowness. Unless the implementation tries to provide the correct inexact flag (which I doubt), I don't see a theoretical reason for which it would be slow in rounding to nearest. It would be very different in a directed rounding mode since this is an exact case. Still, this could be a corner case for an algorithm that doesn't expect exact results or something like that.

This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, master has been updated via c3d466cba1692708a19c6ff829d0386c83a0c6e5 (commit) from 7bb087bd7bfe3616c4c0974a3f7352b593353ea5 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c3d466cba1692708a19c6ff829d0386c83a0c6e5 commit c3d466cba1692708a19c6ff829d0386c83a0c6e5 Author: Wilco Dijkstra <wdijkstr@arm.com> Date: Mon Feb 12 10:42:42 2018 +0000 Remove slow paths from pow Remove the slow paths from pow. Like several other double precision math functions, pow is exactly rounded. This is not required from math functions and causes major overheads as it requires multiple fallbacks using higher precision arithmetic if a result is close to 0.5ULP. Ridiculous slowdowns of up to 100000x have been reported when the highest precision path triggers. All GLIBC math tests pass on AArch64 and x64 (with ULP of pow set to 1). The worst case error is ~0.506ULP. A simple test over a few hundred million values shows pow is 10% faster on average. This fixes BZ #13932. [BZ #13932] * sysdeps/ieee754/dbl-64/uexp.h (err_1): Remove. * benchtests/pow-inputs: Update comment for slow path cases. * manual/probes.texi (slowpow_p10): Delete removed probe. (slowpow_p10): Likewise. * math/Makefile: Remove halfulp.c and slowpow.c. * sysdeps/aarch64/libm-test-ulps: Set ULP of pow to 1. * sysdeps/generic/math_private.h (__exp1): Remove error argument. (__halfulp): Remove. (__slowpow): Remove. * sysdeps/i386/fpu/halfulp.c: Delete file. * sysdeps/i386/fpu/slowpow.c: Likewise. * sysdeps/ia64/fpu/halfulp.c: Likewise. * sysdeps/ia64/fpu/slowpow.c: Likewise. * sysdeps/ieee754/dbl-64/e_exp.c (__exp1): Remove error argument, improve comments and add error analysis. * sysdeps/ieee754/dbl-64/e_pow.c (__ieee754_pow): Add error analysis. (power1): Remove function: (log1): Remove error argument, add error analysis. (my_log2): Remove function. * sysdeps/ieee754/dbl-64/halfulp.c: Delete file. * sysdeps/ieee754/dbl-64/slowpow.c: Likewise. * sysdeps/m68k/m680x0/fpu/halfulp.c: Likewise. * sysdeps/m68k/m680x0/fpu/slowpow.c: Likewise. * sysdeps/powerpc/power4/fpu/Makefile: Remove CPPFLAGS-slowpow.c. * sysdeps/x86_64/fpu/libm-test-ulps: Set ULP of pow to 1. * sysdeps/x86_64/fpu/multiarch/Makefile: Remove slowpow-fma.c, slowpow-fma4.c, halfulp-fma.c, halfulp-fma4.c. * sysdeps/x86_64/fpu/multiarch/e_pow-fma.c (__slowpow): Remove define. * sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c (__slowpow): Likewise. * sysdeps/x86_64/fpu/multiarch/halfulp-fma.c: Delete file. * sysdeps/x86_64/fpu/multiarch/halfulp-fma4.c: Likewise. * sysdeps/x86_64/fpu/multiarch/slowpow-fma.c: Likewise. * sysdeps/x86_64/fpu/multiarch/slowpow-fma4.c: Likewise. ----------------------------------------------------------------------- Summary of changes: ChangeLog | 37 ++++++ benchtests/pow-inputs | 6 +- manual/probes.texi | 16 --- math/Makefile | 4 +- sysdeps/aarch64/libm-test-ulps | 2 + sysdeps/generic/math_private.h | 4 +- sysdeps/i386/fpu/halfulp.c | 1 - sysdeps/i386/fpu/slowpow.c | 1 - sysdeps/ia64/fpu/halfulp.c | 1 - sysdeps/ia64/fpu/slowpow.c | 1 - sysdeps/ieee754/dbl-64/e_exp.c | 42 +++---- sysdeps/ieee754/dbl-64/e_pow.c | 172 ++++----------------------- sysdeps/ieee754/dbl-64/halfulp.c | 152 ----------------------- sysdeps/ieee754/dbl-64/slowpow.c | 125 ------------------- sysdeps/ieee754/dbl-64/uexp.h | 2 +- sysdeps/m68k/m680x0/fpu/halfulp.c | 1 - sysdeps/m68k/m680x0/fpu/slowpow.c | 1 - sysdeps/powerpc/power4/fpu/Makefile | 1 - sysdeps/x86_64/fpu/libm-test-ulps | 2 + sysdeps/x86_64/fpu/multiarch/Makefile | 12 +-- sysdeps/x86_64/fpu/multiarch/e_pow-fma.c | 1 - sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c | 1 - sysdeps/x86_64/fpu/multiarch/halfulp-fma.c | 4 - sysdeps/x86_64/fpu/multiarch/halfulp-fma4.c | 4 - sysdeps/x86_64/fpu/multiarch/slowpow-fma.c | 11 -- sysdeps/x86_64/fpu/multiarch/slowpow-fma4.c | 11 -- 26 files changed, 90 insertions(+), 525 deletions(-) delete mode 100644 sysdeps/i386/fpu/halfulp.c delete mode 100644 sysdeps/i386/fpu/slowpow.c delete mode 100644 sysdeps/ia64/fpu/halfulp.c delete mode 100644 sysdeps/ia64/fpu/slowpow.c delete mode 100644 sysdeps/ieee754/dbl-64/halfulp.c delete mode 100644 sysdeps/ieee754/dbl-64/slowpow.c delete mode 100644 sysdeps/m68k/m680x0/fpu/halfulp.c delete mode 100644 sysdeps/m68k/m680x0/fpu/slowpow.c delete mode 100644 sysdeps/x86_64/fpu/multiarch/halfulp-fma.c delete mode 100644 sysdeps/x86_64/fpu/multiarch/halfulp-fma4.c delete mode 100644 sysdeps/x86_64/fpu/multiarch/slowpow-fma.c delete mode 100644 sysdeps/x86_64/fpu/multiarch/slowpow-fma4.c

(In reply to cvs-commit@gcc.gnu.org from comment #11) > Remove slow paths from pow Instead of removing the slow paths, wouldn't have been better to fix the bug (see comment #10, where the slowness cannot be explained by the correct rounding, since this is an easy case), then decide?

From the bug report, it is __ieee754_pow_sse2 that was used, and if I understand correctly, the code is in "sysdeps/x86_64/fpu/multiarch/e_pow.c", which uses "sysdeps/ieee754/dbl-64/e_pow.c". The slow path would imply that __exp1 returns a non-positive value. __exp1 is defined in "sysdeps/ieee754/dbl-64/e_exp.c". There, the rounding test is: if (res == (res + cor * (1.0 + error + err_1))) For the example of slow case pow(0.562500, 1.5), the exact result is 0.421875 = 27/64, so that cor should be much smaller than ulp(res) and a positive value should be returned, which doesn't explain the behavior.

Moreover, the code is quite horrible. mydefs.h defines: typedef int int4; typedef union { int4 i[2]; double x; } mynumber; and in e_exp.c, it does bitwise operations on these signed int: binexp.i[HIGH_HALF] = (junk1.i[LOW_HALF] + 1023) << 20; i = ((junk2.i[LOW_HALF] >> 8) & 0xfffffffe) + 356; j = (junk2.i[LOW_HALF] & 511) << 1; which yield undefined or implementation-defined behavior. So I wonder whether the generated code does what is expected.

(In reply to Vincent Lefèvre from comment #12) > (In reply to cvs-commit@gcc.gnu.org from comment #11) > > Remove slow paths from pow > > Instead of removing the slow paths, wouldn't have been better to fix the bug > (see comment #10, where the slowness cannot be explained by the correct > rounding, since this is an easy case), then decide? Removing buggy code is far better than fixing it... (In reply to Vincent Lefèvre from comment #13) > From the bug report, it is __ieee754_pow_sse2 that was used, and if I > understand correctly, the code is in "sysdeps/x86_64/fpu/multiarch/e_pow.c", > which uses "sysdeps/ieee754/dbl-64/e_pow.c". The slow path would imply that > __exp1 returns a non-positive value. __exp1 is defined in > "sysdeps/ieee754/dbl-64/e_exp.c". There, the rounding test is: > > if (res == (res + cor * (1.0 + error + err_1))) > > For the example of slow case pow(0.562500, 1.5), the exact result is > 0.421875 = 27/64, so that cor should be much smaller than ulp(res) and a > positive value should be returned, which doesn't explain the behavior. Whenever you're close to 0.5 ULP, cor will be just below 0.5ULP, eg. 0.4999999999, so any value of error + err_1 that is larger than 2^-53ULP will cause it to round up, thus falling into the slow path.

(In reply to Wilco from comment #15) > Removing buggy code is far better than fixing it... The buggy code is in the fast path. You didn't remove it. > (In reply to Vincent Lefèvre from comment #13) > > For the example of slow case pow(0.562500, 1.5), the exact result is > > 0.421875 = 27/64, so that cor should be much smaller than ulp(res) and a > > positive value should be returned, which doesn't explain the behavior. > > Whenever you're close to 0.5 ULP, It is not close to 0.5 ulp. This is an *exact* case (27/64), so that the correction term should be very small, much smaller than 0.5 ulp. The correction term is close to 0.5 ulp (possibly above due to approximation errors) when the exact result is a midpoint or close to a midpoint (cases where the Table Maker's Dilemma occurs).

(In reply to Vincent Lefèvre from comment #16) > > (In reply to Vincent Lefèvre from comment #13) > > > For the example of slow case pow(0.562500, 1.5), the exact result is > > > 0.421875 = 27/64, so that cor should be much smaller than ulp(res) and a > > > positive value should be returned, which doesn't explain the behavior. > > > > Whenever you're close to 0.5 ULP, > > It is not close to 0.5 ulp. This is an *exact* case (27/64), so that the > correction term should be very small, much smaller than 0.5 ulp. If it were close to 0.5 ulp, it would mean that there would be around 53 bits of accuracy, which does not match your error analysis: "Maximum relative error before rounding is 8.8e-22 (69.9 bits)." So how can you explain the switch to the slow path with the previous code on this exact case with the 69.9 bits of accuracy for (al,rem)?

(In reply to Vincent Lefèvre from comment #17) > (In reply to Vincent Lefèvre from comment #16) > > > (In reply to Vincent Lefèvre from comment #13) > > > > For the example of slow case pow(0.562500, 1.5), the exact result is > > > > 0.421875 = 27/64, so that cor should be much smaller than ulp(res) and a > > > > positive value should be returned, which doesn't explain the behavior. > > > > > > Whenever you're close to 0.5 ULP, > > > > It is not close to 0.5 ulp. This is an *exact* case (27/64), so that the > > correction term should be very small, much smaller than 0.5 ulp. > > If it were close to 0.5 ulp, it would mean that there would be around 53 > bits of accuracy, which does not match your error analysis: "Maximum > relative error before rounding is 8.8e-22 (69.9 bits)." > > So how can you explain the switch to the slow path with the previous code on > this exact case with the 69.9 bits of accuracy for (al,rem)? I don't think pow(0.562500, 1.5) was ever slow. The case I tested as slow was pow(0.999999999999999889, 1.5). Either way the calculation of the error term was incorrect, for example in pow.c there was: error = error * fabs (y); t = __exp1 (a1, a2, 1.9e16 * error); /* return -10 or 0 if wasn't computed exactly */ The error for pow scales with log(x) * y which is bounded by 710.

(In reply to Wilco from comment #18) > I don't think pow(0.562500, 1.5) was ever slow. That's from the bug report: ------------------------------------------------------------ ./elf/ld.so --library-path .:math:dlfcn ./a.out pow(0.562500, 1.5): 0.000982 calls/usec pow(0.563250, 1.5): 13.574397 calls/usec ratio: 13828 ------------------------------------------------------------ But indeed I cannot reproduce this slowness with 0.562500 on glibc 2.26. > The case I tested as slow was pow(0.999999999999999889, 1.5). Indeed this one is very slow (mainly due to the use of 768 bits in the slow path, which is much more than actually necessary). > Either way the calculation of the error term was incorrect, I hadn't checked that. I suppose that a full rewrite would be needed for a crpow() function. This could be useful: https://hal.inria.fr/inria-00583988

(In reply to Vincent Lefèvre from comment #19) > (In reply to Wilco from comment #18) > > I don't think pow(0.562500, 1.5) was ever slow. > > That's from the bug report: > > ------------------------------------------------------------ > ./elf/ld.so --library-path .:math:dlfcn ./a.out > pow(0.562500, 1.5): 0.000982 calls/usec > pow(0.563250, 1.5): 13.574397 calls/usec > ratio: 13828 > ------------------------------------------------------------ The bug report was inaccurate. By outputting values with more precision: pow(0.56249999999999978, 1.5): 0.004758 calls/usec pow(0.5632499999999997, 1.5): 10.506687 calls/usec The exact value for 0.56249999999999978 is 0.5624999999999997779553950749686919152736663818359375, and the significand of its power to 1.5 is: 7599824371187707.50000000000000044... This is a hard-to-round case, which eventually explains the slow path.

If the commit fixes the bug, it should be resolved as FIXED with target milestone set.

(In reply to joseph@codesourcery.com from comment #21) > If the commit fixes the bug, it should be resolved as FIXED with target > milestone set. Hmm, I don't seem to be able to make any changes from this account...

If you use an @gcc.gnu.org / @sourceware.org account you should automatically have editbugs permission.

Fixed by c3d466cba1692708a19c6ff829d0386c83a0c6e5