Bug 14412 - Removal of sysdeps/x86_64/fpu/s_sincos.S causes regressions
Summary: Removal of sysdeps/x86_64/fpu/s_sincos.S causes regressions
Status: RESOLVED DUPLICATE of bug 5781
Alias: None
Product: glibc
Classification: Unclassified
Component: math (show other bugs)
Version: 2.16
: P2 enhancement
Target Milestone: 2.18
Assignee: Not yet assigned to anyone
URL: https://bugs.freedesktop.org/show_bug...
Keywords:
: 15936 (view as bug list)
Depends on:
Blocks:
 
Reported: 2012-07-26 21:07 UTC by Markus Trippelsdorf
Modified: 2014-06-17 18:55 UTC (History)
10 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Markus Trippelsdorf 2012-07-26 21:07:44 UTC
The removal of sysdeps/x86_64/fpu/s_sincos.S more than doubles
the time it takes to render a particular pdf testpage with Okular.
 
commit b35fe25ed9b3b406754fe681b1785f330d9faf62
Author: Andreas Jaeger <aj@suse.de>
Date:   Wed Mar 7 14:51:39 2012 +0100

        [BZ #13658]
        * sysdeps/x86_64/fpu/s_sincos.S: Delete.
        * math/libm-test.inc (sincos_test): Add test for large input.

See the following poppler bug:
https://bugs.freedesktop.org/show_bug.cgi?id=52551

With glibc-2.16 it takes ~1 minute to open that page:

# Overhead  Command  Shared Object            Symbol
    35.90%   okular  libm-2.16.so             [.] feraiseexcept
    12.30%   okular  libm-2.16.so             [.] __sin
     9.03%   okular  libm-2.16.so             [.] __cos
     5.01%   okular  libpoppler.so.26.0.0     [.] void
std::__introsort_loop<SplashIntersect*, long,
cmpIntersectFunctor>(SplashIntersect*, SplashInt
ersect*, long, cmpIntersectFunctor)
     2.80%   okular  libpoppler.so.26.0.0     [.] SplashXPathScanner::computeIntersections()

Before commit b35fe25ed9b3b40 it only takes 24 seconds:

# Overhead  Command  Shared Object            Symbol
    24.09%       okular  libm.so                  [.] __sincos
    10.42%       okular  libpoppler.so.26.0.0     [.] SplashIntersect* std::__unguarded_partition<SplashIntersect*, SplashIntersect, cmpIntersectFunc
tor>(SplashIntersect*, SplashIntersect*, SplashIntersect const&, cmpIntersectFunctor)
     5.73%       okular  libpoppler.so.26.0.0     [.] SplashXPathScanner::computeIntersections()
Comment 1 Andreas Jaeger 2012-07-27 06:44:27 UTC
Please have a look at the gcc -ffast-math option if you want fast sincos and can live with the limitations.
Comment 2 Markus Trippelsdorf 2012-07-27 07:18:37 UTC
(In reply to comment #1)
> Please have a look at the gcc -ffast-math option if you want fast sincos and
> can live with the limitations.

Yes. I've already tried this, but unfortunately it doesn't seem to help
in this case. Libpoppler behaves exactly the same when compiled with or without -ffast-math (or -Ofast).
Comment 3 Markus Trippelsdorf 2012-07-27 12:33:24 UTC
Actually, the current glibc behavior is equivalent to using 
"-fno-builtin-sin -fno-builtin-cos" when building poppler.
Comment 4 Rich Felker 2012-07-28 22:15:13 UTC
Judging from the profiling figures in the bug report, it looks like the problem is the cost of feraiseexcept, not the cost of correct sin/cos computation. Would dropping use of feraiseexcept and replacing it with simple arithmetic that generates the desired exception flags be a viable solution?
Comment 5 Carlos O'Donell 2012-08-03 03:51:32 UTC
Marking this as an enhancement. We should try to fix this for 2.17 since it's a large performance regression for this function.
Comment 6 Markus Trippelsdorf 2012-08-09 10:14:25 UTC
Using Intel's SSE2 sinf and cosf http://sourceware.org/ml/libc-alpha/2012-06/msg00692.html solves the issue.

It only takes ~26 seconds to open the testcase:

# Overhead      Command  Shared Object                  Symbol

    11.16%       okular  libpoppler.so.26.0.0        [.] SplashXPathScanner::computeIntersections()
    11.04%       okular  libpoppler.so.26.0.0        [.] void std::__introsort_loop<SplashIntersect*, long, cmpIntersectFunctor>(SplashIntersect*, Sp
lashIntersect*, long, cmpIntersectFunctor) [clone .isra.15]
     8.27%       okular  libpoppler.so.26.0.0        [.] Gfx::doRadialShFill(GfxRadialShading*)
     5.43%       okular  libc-2.16.90.so             [.] _int_malloc
     5.31%       okular  libm-2.16.90.so             [.] __cosf
     5.16%       okular  libpoppler.so.26.0.0        [.] SplashXPathScanner::addIntersection(double, double, unsigned int, int, int, int)
     5.01%       okular  libm-2.16.90.so             [.] __sinf
     2.98%       okular  libc-2.16.90.so             [.] _int_free
     2.83%       okular  libpoppler.so.26.0.0        [.] Lexer::getObj(Object*, int)

I had to manually hook up the new functions with:
#undef sin
#define sin sinf
#undef cos
#define cos cosf
Comment 7 Markus Trippelsdorf 2012-08-19 09:31:02 UTC
*** Bug 14496 has been marked as a duplicate of this bug. ***
Comment 8 David S. Miller 2012-12-03 22:13:26 UTC
This needs more work and discussion so retargetting this bug to 2.18
Comment 9 wbrana 2013-04-25 21:48:11 UTC
this is critical bug
it shouldn't be marked as enhancement
Comment 10 Carlos O'Donell 2013-04-25 23:35:51 UTC
(In reply to comment #9)
> this is critical bug
> it shouldn't be marked as enhancement

If we fix sinf to be fast again it will be imprecise which is a critical bug for another class of users. There is no implementation that can satisfy all requirements from all users. At the very least the precise implementation is at least correct if slow. Therefore this is an enhancement request to make the implementation fast.

We are working on the possibility of having 3 libm variants, fast, default, and precise. At that point -ffast-math will select the fast variant and provide a fast but imprecise implementation.

Until then we need help looking into these performance issues to see if we can relax the feraiseexcept in some cases or optimize the code to call it less times.
Comment 11 Rich Felker 2013-04-25 23:58:19 UTC
Carlos, as I noted above, the horrible slowness seems to be coming not from the correct algorithm but from the silly use of feraiseexcept to raise exceptions. Fixing this should get the performance back at least close to what it was before when the function was broken.
Comment 12 Rich Felker 2013-04-26 00:14:13 UTC
Sorry, I see that you did acknowledge that already. I just skimmed through the code but I'm not familiar enough with out glibc's math library is laid out to figure out where feraiseexcept is getting called from... Once it's found, fixing the problem should just be a matter of replacing the call with code that would generate the exception naturally. However since the only exception that should be needed is the inexact exception, it should already be getting raised by the code to compute sin/cos.

A quick test to see if this is the source of the problem would be to make a no-op implementation of feraiseexcept and link it in (might need to use static linking to get libm to use it) then measure the performance.
Comment 13 Siddhesh Poyarekar 2013-04-26 02:54:59 UTC
This should have been fixed with the patch for bug 14496.  Markus, can you confirm that?
Comment 14 Rich Felker 2013-04-26 03:08:51 UTC
I can see how the fix to #14496 would help, but it's still a bug if feraiseexcept is being called at all. Certainly the code that was fixed in that patch should not even be taken as part of a call to sin or cos. There's simply no reason to be calling feraiseexcept.
Comment 15 Siddhesh Poyarekar 2013-04-26 03:50:20 UTC
That's the point of the fix to bug 14496.  The root of the problem is that we save and restore rounding modes in functions that we don't have non-RN modes implemented.  This calls feupdateenv (IIRC), which in turn calls feraiseexcept if there was an exception to be raised.  The exception mask check in x86_64 was incorrect, causing feraiseexcept to be called all the time.  with the fix, it is only called when there actually is an exception.
Comment 16 Markus Trippelsdorf 2013-04-26 06:29:48 UTC
(In reply to comment #13)
> This should have been fixed with the patch for bug 14496.  Markus, can you
> confirm that?

It now takes ~40 seconds to render the test-page on trunk.
Better than 60 seconds but still 40% slower than s_sincos.S.

 25.82%  libm-2.17.90.so                           [.] __cos                   
 22.06%  libm-2.17.90.so                           [.] __sin        
  6.33%  libm-2.17.90.so                           [.] __dubsin              
  4.59%  libpoppler.so.36.0.0                      [.] GfxPath::lineTo(double, double)  
  4.50%  libpoppler.so.36.0.0                      [.] Gfx::doRadialShFill(GfxRadialShading*)
Comment 17 Jakub Jelinek 2013-04-26 06:40:03 UTC
With the split into 3 libm variants, fast alternative definitely doesn't have to support rounding other than round to even, but I'd say that even the default version shouldn't support it, only the precise one.  After all, until very recently, those rounding modes were not supported in transcendentals in glibc, and -frounding-math isn't the default in gcc either.
Perhaps we should add a predefined macro for -frounding-math in gcc, and the precise variant of libm could be selected using that macro, if the correct rounding precise variants would be suffixed differently (like the *_finite entrypoints are for -ffast-math), then all the 3 libm variants could live in the same libm.so.6 and people would just choose what they want using -ffast-math vs. default vs. -frounding-math.
Comment 18 Siddhesh Poyarekar 2013-04-26 10:00:09 UTC
(In reply to comment #16)
> (In reply to comment #13)
> > This should have been fixed with the patch for bug 14496.  Markus, can you
> > confirm that?
> 
> It now takes ~40 seconds to render the test-page on trunk.
> Better than 60 seconds but still 40% slower than s_sincos.S.

That's expected.  The assembly implementation used the fsincos instruction and the current implementation calls the sin() and cos() functions.  The rounding mode changes have nothing to do with this part of the performance regression.  Maybe we could implement a sincos_finite that calls the fsincos instruction and actually gets used with -ffinite-math.

(In reply to comment #17)
> Perhaps we should add a predefined macro for -frounding-math in gcc, and the
> precise variant of libm could be selected using that macro, if the correct
> rounding precise variants would be suffixed differently (like the *_finite
> entrypoints are for -ffast-math), then all the 3 libm variants could live in
> the same libm.so.6 and people would just choose what they want using
> -ffast-math vs. default vs. -frounding-math.

I like this idea.
Comment 19 jsm-csl@polyomino.org.uk 2013-04-26 11:31:29 UTC
On Fri, 26 Apr 2013, jakub at redhat dot com wrote:

> Perhaps we should add a predefined macro for -frounding-math in gcc, and the
> precise variant of libm could be selected using that macro, if the correct
> rounding precise variants would be suffixed differently (like the *_finite

Basing things on a predefined macro like that won't work with any future 
GCC support for #pragma STDC FENV_ACCESS on (within the scope of that 
pragma, function versions supporting rounding modes should be called).  
So since you'd need a new compiler feature anyway, maybe something like 
__builtin_rounding_math () would be better as it would allow FENV_ACCESS 
support without further library changes (the safe default for older 
compilers without -ffast-math being to assume -frounding-math may be in 
effect).  So you could have e.g.

#define sin(x) (__builtin_rounding_math () ? sin (x) : __sin_noround (x))

(I'm using plain sin as the version supporting rounding modes for safety 
for programs declaring functions themselves rather than including the 
system header.)

DTS 18661-1 constant rounding modes (see 
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1676.pdf>) would require 
a further compiler feature to say "call this function with the dynamic 
rounding mode in effect being the constant mode for the current scope".
Comment 20 jsm-csl@polyomino.org.uk 2013-04-26 11:32:58 UTC
Note that having extra variants like this would definitely need 
libm-test.inc tests to be run for the extra variants (there's an existing 
issue on the wiki todo list to run the tests not involving infinities and 
NaNs with -ffinite-math-only to provide coverage of the *_finite 
versions).
Comment 21 Rich Felker 2013-04-26 13:52:36 UTC
Siddhesh, still none of the fenv stuff should be called at all if the initial rounding mode was nearest (except of course checking the initial mode). This should be special-cased since it's true basically all the time in real-world applications.
Comment 22 Rich Felker 2013-04-26 13:55:49 UTC
Jakub, the default libm should conform to IEEE requirements and should not cause correct programs to crash or go into infinite loops (both of which happen if you just remove the check for rounding mode due to poor design in the IBM math library). This is A LOT weaker than the current aim of glibc, which is to provide correct rounding for all transcendental functions despite the fact that IEEE explicitly chose not to require it (just <1ulp error). In my opinion, if there are 3 versions, they should be:

1. Fast: Whatever the gamer/graphics folks want.
2. Default: IEEE-conforming
3. Best: Correctly-rounded results for all functions
Comment 23 jsm-csl@polyomino.org.uk 2013-04-26 15:07:02 UTC
It's only a limited subset of transcendental functions for which glibc is 
trying to be correctly rounding, and only for double for those functions; 
for most functions the aim is generally errors within "a few" ulps (and 
ISO C doesn't specify particular accuracy requirements), though correctly 
rounding would be some sort of ideal.

Implementing the proposed TS 18661 bindings to IEEE 754-2008 would be a 
huge amount of work (and require compiler work as well), but if anyone 
gets the time for that work then I believe part 4 of the draft TS will 
reserve names such as crsin for correctly rounding functions (and 
crsinf128 for correctly rounding sine for _Float128, etc.), without 
requiring those functions, so it might make sense eventually to provide 
those names for correctly-rounding functions where available without 
making them the default versions.  So far I'm still working on going 
through the accumulated libm bugs that don't require such huge amounts of 
work as DTS 18661 or ISO 24747 implementation - where one reported bug 
often shows up lots of related bugs not previously in Bugzilla, that also 
need to be fixed - rather than trying to find time for larger libm 
projects (and I'm also mostly leaving aside libm bugs in areas where 
others have shown interest in fixing them, such as performance or errno 
handling).

(The present functions, producing correctly-rounded results only for 
round-to-nearest, would not of course meet the requirements for functions 
such as crsin, although it's probably possible to adapt them by just doing 
a little bit of final arithmetic in the original rounding mode, having 
done the intermediate computations to produce a higher-precision result in 
round-to-nearest.)
Comment 24 wbrana 2013-04-26 15:20:54 UTC
(In reply to comment #10)
> If we fix sinf to be fast again it will be imprecise which is a critical bug
> for another class of users. There is no implementation that can satisfy all
> requirements from all users. At the very least the precise implementation is at
> least correct if slow. Therefore this is an enhancement request to make the
> implementation fast.
> 
> We are working on the possibility of having 3 libm variants, fast, default, and
> precise. At that point -ffast-math will select the fast variant and provide a
> fast but imprecise implementation.
> 

http://sourceware.org/bugzilla/show_bug.cgi?id=13658

impreciseness of fast sincos can't be critical bug because it took 15 years (1997-2012) 
to report this bug which means s_sincos.S is precise enough for 99% of apps.
s_sincos.S should be restored because slow sincos can break many apps 
because most open source apps aren't properly tested and app devs aren't aware 
that sincos is 50% slower since 2.16.
Only users will notice than e.g. new Ubuntu is slow but users usually can't
submit proper bug report like this one to app devs.
App devs won't know that they should add -ffast-math to restore performance.
Very few apps are using -ffast-math in present.
-ffast-math can add new bugs to apps which aren't broken by fast sincos.

New function named e.g. sincos_precise should be added to glibc for 1% apps
which are affected by impreciseness of fast sincos.
Comment 25 Siddhesh Poyarekar 2013-04-29 10:29:46 UTC
I've posted a patch to implement __sincos_finite:

http://sourceware.org/ml/libc-alpha/2013-04/msg00720.html

This should allow you to compile with -ffast-math or -ffinite-math-only to use this fast implementation of sincos.
Comment 26 wbrana 2013-04-29 10:50:24 UTC
(In reply to comment #25)
> I've posted a patch to implement __sincos_finite:
> 
> http://sourceware.org/ml/libc-alpha/2013-04/msg00720.html
> 
> This should allow you to compile with -ffast-math or -ffinite-math-only to use
> this fast implementation of sincos.

Will you also test about 35 000 Linux apps and libs if they are affected by slow sincos and not affected by -ffinite-math-only and submit bug reports to app/lib devs with request for addition of -ffinite-math-only to environment defined CFLAGS/CXXFLAGS?
Comment 27 Rich Felker 2013-04-29 11:55:09 UTC
wbrana, "getting the right answer is slow" is not an excuse for giving the wrong answer. The "optimized" version of sincos has ZERO bits of precision for certain inputs.

With that said, there are not "35000" apps and libraries to check and send bug reports to. A quick check with nm(1) will show you which ones even call trig functions, and it's a tiny fraction of that. Moreover, of those that do, a good number (anything except gaming and graphics) actually want the correct answer, not the fast-but-wrong answer, and had bugs that had gone unnoticed with the old, wrong sincos.

Getting fast-but-wrong math results is easy; you don't need the standard library for this. The purpose of the C language providing math functions (and recommending that they conform to IEEE requirements) in the standard library is to give applications something that would be very difficult to achieve themselves. Being wrong-by-default but providing a special option to get correct answers is not a solution because portable programs that were not written specifically to glibc will not be able to get the right answers. They would have to encode glibc-specific knowledge to override the incorrect default. Requiring platform-specific knowledge to get the optimal performance on a given platform (glibc) is unfortunate but at least justifiable. Requiring platform-specific knowledge to get the correct results is not justifiable.

With all that said, I still question the cause of the huge performance drop. The only cost of the correct sin/cos versus the incorrect one should be argument reduction, which should be a no-op if the argument is already in the range [-pi,pi] or so and cheap even a good ways outside of that range...
Comment 28 wbrana 2013-04-29 12:32:34 UTC
according to http://sourceware.org/bugzilla/show_bug.cgi?id=13658#c2
fast functions can be used when source operand is in certain range.
Slow functions should be used only outside interval.

"The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to
indicate that the source operand is beyond the allowable range of ±2^63 and
clear the C2 flag if the source operand is within the allowable range."

So, outside the interval [-2^63,+2^63] ("allowable range"), these instructions
must not be used (or they can be used, but with a fallback if the C2 flag is
set to 1). But note that the glibc implementation is more accurate, even with
(very probably) correct rounding, so that it is better to use it anyway.
Comment 29 Siddhesh Poyarekar 2013-04-29 13:06:40 UTC
(In reply to comment #28)
> according to http://sourceware.org/bugzilla/show_bug.cgi?id=13658#c2
> fast functions can be used when source operand is in certain range.
> Slow functions should be used only outside interval.
> 
> "The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to
> indicate that the source operand is beyond the allowable range of ±2^63 and
> clear the C2 flag if the source operand is within the allowable range."
> 
> So, outside the interval [-2^63,+2^63] ("allowable range"), these instructions
> must not be used (or they can be used, but with a fallback if the C2 flag is
> set to 1). But note that the glibc implementation is more accurate, even with
> (very probably) correct rounding, so that it is better to use it anyway.

The problem is not just about range reduction.  fsincos is not always accurate within the allowable range either.

(In reply to comment #27)
> With all that said, I still question the cause of the huge performance drop.
> The only cost of the correct sin/cos versus the incorrect one should be
> argument reduction, which should be a no-op if the argument is already in the
> range [-pi,pi] or so and cheap even a good ways outside of that range...

I considered the possibility of keeping the assembly default and then falling into the default implementation if the exception flags indicate that the result was not accurate enough, but that doesn't catch all cases.  In fact, the limited internal precision of PI (66-bit mantissa) guarantees that there will be cases where fsincos thinks it has hit the target, but it hasn't.

So until there's a way to flag inaccurate results reliably, I think the __sincos_finite way is the best option.  There is definitely some scope to look at the implementation of __sin and __cos functions, or even figure out some clever way to get results faster in sincos than just calling __sin and __cos, but that will need more work.
Comment 30 Markus Trippelsdorf 2013-04-29 13:15:02 UTC
(In reply to comment #25)
> I've posted a patch to implement __sincos_finite:
> 
> http://sourceware.org/ml/libc-alpha/2013-04/msg00720.html
> 
> This should allow you to compile with -ffast-math or -ffinite-math-only to use
> this fast implementation of sincos.

Your patch doesn't work for me. I've re-build glibc and then
re-compiled poppler with "-ffinite-math-only". But I still
only see calls to __sin and __cos with "perf top".

The poppler code in question can be found here:
http://cgit.freedesktop.org/poppler/poppler/tree/poppler/Gfx.cc#n3112

When I just add the following file to glibc:

markus@x4 glibc % cat sysdeps/x86_64/fpu/s_sincos.S
#include <machine/asm.h>

        .text
ENTRY (__sincos)

        movsd   %xmm0, -8(%rsp)
        fldl    -8(%rsp)
        fsincos
        fnstsw  %ax
        testl   $0x400,%eax
        jnz     1f
        fstpl   (%rsi)
        fstpl   (%rdi)

        retq

1:      fldpi
        fadd    %st(0)
        fxch    %st(1)
2:      fprem1
        fnstsw  %ax
        testl   $0x400,%eax
        jnz     2b
        fstp    %st(1)
        fsincos
        fstpl   (%rsi)
        fstpl   (%rdi)

        retq
END (__sincos)
weak_alias (__sincos,sincos)

poppler gets fast again and uses __sincos.
Comment 31 Ondrej Bilka 2013-04-29 13:21:18 UTC
On Mon, Apr 29, 2013 at 12:32:34PM +0000, wbrana at gmail dot com wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=14412
> 
> --- Comment #28 from wbrana at gmail dot com 2013-04-29 12:32:34 UTC ---
> according to http://sourceware.org/bugzilla/show_bug.cgi?id=13658#c2
> fast functions can be used when source operand is in certain range.
> Slow functions should be used only outside interval.
> 
> "The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to
> indicate that the source operand is beyond the allowable range of ±2^63 and
> clear the C2 flag if the source operand is within the allowable range."
> 
> So, outside the interval [-2^63,+2^63] ("allowable range"), these instructions
> must not be used (or they can be used, but with a fallback if the C2 flag is
> set to 1). But note that the glibc implementation is more accurate, even with
> (very probably) correct rounding, so that it is better to use it anyway.
>
Well results of sin,cos.. with large numbers are pure garbage no matter
what you try. As cos(1.0e18*PI) = cos(1.0e18*PI + PI) you can basicaly print random
number between -1.0 and 1.0 and be correct.
Comment 32 Rich Felker 2013-04-29 13:25:12 UTC
wbrana, unfortunately that information is mistaken; fsin DOES NOT work on the interval [-2^63,+2^63]. A quick test reveals:

#include <math.h>
#include <stdio.h>

double x87_sin(double x)
{
        __asm__("fsin" : "+t"(x));
        return x;
}

int main()
{
        double x = 0x1p63;
        printf("%.15g %.15g\n", sin(x), x87_sin(x));
}

$ ./a.out
0.999930376673442 9.22337203685478e+18

I suspect this is just a boundary issue (the [] should be () in the range) since slightly smaller values give more reasonable, but still badly wrong, results. For example if x is 0x1p62,

-0.702922443619209 -0.707132927452779

which is wrong in the third decimal place.

Siddhesh, could you explain the motivation for "overloading" the meaning of "_finite" with "fast but wrong"? I'm guessing the idea is that you're thinking finite math implies not just lack of infinities but also "unreasonably large" inputs? This is definitely the glibc team's call, since there's no reasonable basis for assuming you'll get correct results from non-default performance-oriented math settings, but I think some justification would be nice.
Comment 33 Rich Felker 2013-04-29 13:28:10 UTC
Ondrej, the results are not pure garbage. There are only two correct results sin(0x1p1000) can give, either of the nearest representable neighbors (less than 1ulp error) of the number 2**1000. Your fallacy is writing PI in your example. There is no such floating point number as PI, and this is part of why implementing correct trig functions is nontrivial; implementing ones that work in degree units, or a base-2 division of the unit circle, would be much easier.
Comment 34 Rich Felker 2013-04-29 13:32:45 UTC
Markus, your implementation has incorrect argument reduction. For inputs outside of a small range (somewhere around [-2pi,2pi] but maybe slightly larger), and near multiples of pi/2 (where either sin or cos has a near-zero result and thus needs extra precision) you really need to fall back to the C range-reduction code (or good luck reimplementing THAT in asm...). You could perhaps use the x87 instruction in limited cases, though.
Comment 35 wbrana 2013-04-29 13:36:24 UTC
(In reply to comment #32)
> wbrana, unfortunately that information is mistaken; fsin DOES NOT work on the
> interval [-2^63,+2^63]. A quick test reveals:

what about smaller interval [-2^31,+2^31]
Comment 36 Siddhesh Poyarekar 2013-04-29 13:44:08 UTC
(In reply to comment #30)
> Your patch doesn't work for me. I've re-build glibc and then
> re-compiled poppler with "-ffinite-math-only". But I still
> only see calls to __sin and __cos with "perf top".

Looks like your code does not call sincos directly.  sin and cos calls on the
same input seem to get optimized into sincos, so I'm not sure if
ffinite-math-only can detect that.  Also, the patch also changes an installed header, so you need to somehow include that during your compilation.
Comment 37 Rich Felker 2013-04-29 13:51:50 UTC
At the outer edges of that range, the error is "only" around 1000 ulps. This isn't correct, but at least it's better than just having 4 or 5 bits of precision (or none at all). Since the x87 internally uses a 66-bit representation of pi, and doubles are 52-bit, I suspect precision starts to trail off around 16384*pi, but it will be bad even for much smaller values when the result is close to zero.

Experimentally, 0x1p18 seems to be the first power of two for which the result is wrong within 15 decimal places, but 0x1p14*M_PI is already badly wrong (only the first 4 decimal places are correct).
Comment 38 Ondrej Bilka 2013-04-29 14:04:37 UTC
On Mon, Apr 29, 2013 at 01:28:10PM +0000, bugdal at aerifal dot cx wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=14412
> 
> --- Comment #33 from Rich Felker <bugdal at aerifal dot cx> 2013-04-29 13:28:10 UTC ---
> Ondrej, the results are not pure garbage. There are only two correct results
> sin(0x1p1000) can give, either of the nearest representable neighbors (less
> than 1ulp error) of the number 2**1000. Your fallacy is writing PI in your
> example. There is no such floating point number as PI, and this is part of why
> implementing correct trig functions is nontrivial; implementing ones that work
> in degree units, or a base-2 division of the unit circle, would be much easier.
>
Ok, assume you wrote function cos_deg that takes input degree in
degrees. You still have
cosdeg(360*(2**60)) == cosdeg(360*(2**60)+180)
In short when you have inputs with zero significant digits you can
expect only garbage as output.
Comment 39 Siddhesh Poyarekar 2013-04-29 14:19:35 UTC
(In reply to comment #36)
> (In reply to comment #30)
> > Your patch doesn't work for me. I've re-build glibc and then
> > re-compiled poppler with "-ffinite-math-only". But I still
> > only see calls to __sin and __cos with "perf top".
> 
> Looks like your code does not call sincos directly.  sin and cos calls on the
> same input seem to get optimized into sincos, so I'm not sure if
> ffinite-math-only can detect that.  Also, the patch also changes an installed
> header, so you need to somehow include that during your compilation.

I've just verified locally that if gcc's transformation of sin + cos into sincos works fine with the further transformation into __sincos_finite with -ffinite-math-only, so it's most likely that your code is not using the header file changes.  Easiest way to test this is to add the patch on top of your distribution glibc, build and install it.

BTW, the question of whether -ffinite-math-only is the right option to do this kind of thing is still open, so it's likely that the resulting fix will be different.  The crux would be similar though, i.e. a fast function assembly that is not always accurate.
Comment 40 Markus Trippelsdorf 2013-04-29 14:29:29 UTC
(In reply to comment #39)
> (In reply to comment #36)
> > (In reply to comment #30)
> > > Your patch doesn't work for me. I've re-build glibc and then
> > > re-compiled poppler with "-ffinite-math-only". But I still
> > > only see calls to __sin and __cos with "perf top".
> > 
> > Looks like your code does not call sincos directly.  sin and cos calls on the
> > same input seem to get optimized into sincos, so I'm not sure if
> > ffinite-math-only can detect that.  Also, the patch also changes an installed
> > header, so you need to somehow include that during your compilation.
> 
> I've just verified locally that if gcc's transformation of sin + cos into
> sincos works fine with the further transformation into __sincos_finite with
> -ffinite-math-only, so it's most likely that your code is not using the header
> file changes.  Easiest way to test this is to add the patch on top of your
> distribution glibc, build and install it.

Hmm, that was exactly how I've tested your patch. And <math.h> is included
in Gfx.cc, so bits/math-finite.h should be included when 
-ffinite-math-only is used.
Will double-check later.
Comment 41 Rich Felker 2013-04-29 14:48:43 UTC
Ondrej, the input always has 52 significant bits unless it's denormal. If, due to the way you computed the input value, none of them are significant to you, then that's your problem. sin() and cos() are still required to return the correct result assuming all bits of input are significant.
Comment 42 Ondrej Bilka 2013-04-29 16:29:40 UTC
On Mon, Apr 29, 2013 at 02:48:43PM +0000, bugdal at aerifal dot cx wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=14412
> 
> --- Comment #41 from Rich Felker <bugdal at aerifal dot cx> 2013-04-29 14:48:43 UTC ---
> Ondrej, the input always has 52 significant bits unless it's denormal. If, due
> to the way you computed the input value, none of them are significant to you,
> then that's your problem.

> sin() and cos() are still required to return the
> correct result assuming all bits of input are significant.
> 
Citation needed.
Comment 43 Rich Felker 2013-04-29 17:30:14 UTC
I don't have a copy of the latest IEEE 754 on hand (there's no free online version), but the consensus among implementors seems to be that a few special functions (such as sqrt) require correct rounding, and that the remaining functions (mostly transcendental) are required to give results with less than 1ulp of error. If sin(0x1p1000) is computed with incorrect argument reduction, it may have up to approximately 1e16 ulp error.

Moreover, the C standard specifies:

"2 The sin functions compute the sine of x (measured in radians).

Returns

3 The sin functions return sin x."

Without annex F, the C standard places no requirements on the precision or accuracy of floating point results (e.g. 2.0+2.0==5.0 is legal), but annex F recommends IEEE conforming behavior.

If anyone else can provide some better citations, they'd be welcome.
Comment 44 jsm-csl@polyomino.org.uk 2013-04-29 20:58:15 UTC
On Mon, 29 Apr 2013, bugdal at aerifal dot cx wrote:

> I don't have a copy of the latest IEEE 754 on hand (there's no free online
> version), but the consensus among implementors seems to be that a few special
> functions (such as sqrt) require correct rounding, and that the remaining
> functions (mostly transcendental) are required to give results with less than
> 1ulp of error. If sin(0x1p1000) is computed with incorrect argument reduction,
> it may have up to approximately 1e16 ulp error.

IEEE 754-2008, subclause 5.4 "formatOf general-computational operations", 
specifies squareRoot and fusedMultiplyAdd on the same basis as addition, 
subtraction, multiplication, division, convertFromInt, various operations 
converting to integer formats and a range of other operations.  As the 
2008 revision, these are "formatOf" operations (so including e.g. fused 
multiply-add of long double values, rounded once to float as the 
destination format), which do not necessarily have C bindings as of C11 - 
for the formatOf bindings you need DTS 18661-1.

Those operations are fully defined.  C11 Annex F for sqrt says "sqrt is 
fully specified as a basic arithmetic operation in IEC 60559. The returned 
value is dependent on the current rounding direction mode." (F.10.4.5) and 
similarly for fma (F.10.10.1) says "fma(x, y, z) computes xy + z, 
correctly rounded once.".

For most functions, not fully defined in IEEE 754 or C11, no specific 
accuracy requirements are imposed, although there are requirements on 
particular special cases in Annex F.  That is, Annex F leaves the 
implementation-defined accuracy from 5.2.4.2.2 paragraph 6 for most of the 
functions.

IEEE 754-2008 clause 9 "Recommended operations" lists a range of other 
functions with the requirement "A conforming function shall return results 
correctly rounded for the applicable rounding direction for all operands 
in its domain.".  These include, for example, both sin and sinPi (sine of 
pi times the argument).  There are no ISO C bindings for these operations 
at present.  I believe the intent is that TS 18661-4 will specify bindings 
to them, using names such as crsin for the correctly rounding functions 
(while probably adding names such as sinpi for non-correctly-rounding 
versions).  I believe the intent is that the correctly-rounding versions 
will be optional for implementations of TS 18661-4.  (The exhaustive 
searches required to prove correctly-rounding, bounded-resource-use 
implementations of various functions are as far as I know still not 
feasible for binary128, for example, and correct rounding without 
arbitrary-precision computation is particularly hard for functions of 
multiple arguments.)

I don't know where the idea of a 1ulp limit on errors being some sort of 
IEEE requirement came from.  As far as I'm concerned, sin(0x1p1000) giving 
a result that is "reasonably close" (say a few ulp) to the actual sine of 
0x1p1000 is a matter of quality of implementation, and giving a result 
that isn't wildly different is a matter of meeting the requirement that 
sin implements the sine function rather than some other function that 
returns random numbers for large inputs.

(For some of the additional functions in ISO 24747, the results for large 
values of certain inputs are specified as implementation-defined because 
of perceived difficulty of implementing the functions accurately for large 
inputs.  But none of the functions in C11 itself have such latitude to do 
something arbitrary for large inputs.)
Comment 45 Siddhesh Poyarekar 2013-04-30 05:41:27 UTC
(In reply to comment #32)
> I suspect this is just a boundary issue (the [] should be () in the range)
> since slightly smaller values give more reasonable, but still badly wrong,
> results. For example if x is 0x1p62,

Yes, it is () and not [], according to the manual.

> 
> -0.702922443619209 -0.707132927452779
> 
> which is wrong in the third decimal place.
> 

This is not restricted to numbers on the boundary.  A quick google search on accuracy of fsincos will show that there are numbers even close to pi/2 that are way off in terms of accuracy.

> Siddhesh, could you explain the motivation for "overloading" the meaning of
> "_finite" with "fast but wrong"? I'm guessing the idea is that you're thinking
> finite math implies not just lack of infinities but also "unreasonably large"
> inputs? This is definitely the glibc team's call, since there's no reasonable
> basis for assuming you'll get correct results from non-default
> performance-oriented math settings, but I think some justification would be
> nice.

That was a mistake - I somehow muddled it up with -ffast-math in my head.

I'm going to first try to see if the default sincos can be made faster.  Jakub pointed out offline that the argument reduction code is duplicated in the calls to __sin and __cos and consolidating that in sincos should definitely be a good first step.
Comment 46 Rich Felker 2013-04-30 21:37:51 UTC
Ondrej, I think the citation we were looking for is C99 7.12.1 Treatment of error conditions:

"1 The behavior of each of the functions in <math.h> is specified for all representable values of its input arguments, except where stated otherwise."
Comment 47 Rich Felker 2013-05-01 22:25:14 UTC
Another citation, from the C99 Rationale (http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf), supporting the option to give wrong results for trig functions:

Page 27, lines 9-13:

"Some math functions such as those that do argument reduction modulo an approximation of π have good accuracy for small arguments, but poor accuracy for large arguments. It is not unusual for an implementation of the trigonometric functions to have zero bits correct in the computed result for large arguments. For cases like this, an implementation might break the domain of the function into disjoint regions and specify the accuracy in each region."

Note that an implementation should still document the accuracy, and in the case of wrong argument reduction, the error could theoretically be as large as 2**1074 ULPs for some inputs of sin/cos (if the correct result were denormal but the function returned 1.0 or -1.0) and much worse (even possibly infinite) for tan.
Comment 48 Carlos O'Donell 2013-09-07 06:55:07 UTC
*** Bug 15936 has been marked as a duplicate of this bug. ***
Comment 49 Joseph Myers 2014-02-06 18:31:12 UTC
Although, formally, the regression part came from removing that file, the substance (that the IBM libm functions are slow in special cases and correctly-rounded functions should be under different names, not the default for sin, cos, sincos) is the same as bug 5781.

*** This bug has been marked as a duplicate of bug 5781 ***