This is the mail archive of the mailing list for the glibc project.


Re: [PATCH] x86-64: Add sinf with FMA

On 04/12/2017 20:42, H.J. Lu wrote:
> On Mon, Dec 4, 2017 at 12:59 PM, Joseph Myers <> wrote:
>> On Mon, 4 Dec 2017, H.J. Lu wrote:
>>> Is
>>> (*__errno_location ()) = xxx
>> If anything I'd expect a direct TLS initial-exec access to errno to be
>> faster.
> I will update x86-64 s_sinf.S to use errno.
>>> faster?  On x86-64, there should be no call to __floor.
>> The x86_64 __floor inline in math_private.h is only when compiling glibc
>> for SSE4.1 or later.
>> The case of inlining floor / __floor and related functions for x86_64
>> without SSE4.1 is tricky.  Supposing we had appropriate asm redirects to
>> allow libm to call floor / ceil / trunc etc. directly so the compiler
>> could inline them but __* are still called if not inlined, the default
>> SSE2 inlines would come into play.  But those inlines are slower on SSE4.1
>> hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if
>> you're building a generic SSE2 glibc that may well be used on SSE4.1
>> hardware, you may wish either to avoid those inlines or, if there is a
>> significant performance difference in benchmarks, have an SSE4.1 IFUNC of
>> the calling function using floor (or __floor, with the present inline).
>> The expf etc. set of optimized float functions have several different
>> choices of how conversions to integer are handled, which may be configured
>> by an architecture.  That may make sense in other cases as well.
> x86-64 s_sinf.S avoids floor () with cvttsd2si.  I don't think it can be used
> with generic sinf.

I think we can use the compiler builtins in math_private.h as a fallback for
the non-SSE4.1 case:

@@ -109,11 +116,15 @@ extern __always_inline double
 __floor (double d)
 {
   double res;
+#ifdef __SSE4_1__
 # if defined __AVX__ || defined SSE2AVX
   asm ("vroundsd $1, %1, %0, %0" : "=x" (res) : "xm" (d));
 # else
   asm ("roundsd $1, %1, %0" : "=x" (res) : "xm" (d));
 # endif
+#else
+  res = __builtin_floor (d);
+#endif
   return res;
 }

At least with GCC 7, the compiler will expand it to an inline version even for
the generic tune option.  However, what shows up in profiling is not the code
that actually calls __floor, but rather the int to fp conversion.

Generic implementation shows on my system (gcc version 7.1.1, i7-4790K):

  "sinf": {
   "": {
    "duration": 4.00221e+10,
    "iterations": 1.4128e+09,
    "max": 587.555,
    "min": 12.747,
    "mean": 28.3281

And with a simple modification to avoid int to fp conversion:

diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
index 40d3d19..a2fd3cf 100644
--- a/sysdeps/ieee754/flt-32/s_sinf.c
+++ b/sysdeps/ieee754/flt-32/s_sinf.c
@@ -75,7 +75,7 @@ static const double invpio4_table[] = {

-static const int ones[] = { +1, -1 };
+static const double ones[] = { 1.0, -1.0 };

 /* Compute the sine value using Chebyshev polynomials where
    THETA is the range reduced absolute value of the input
@@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n,
   const double theta2 = theta * theta;
   /* We are operating on |x|, so we need to add back the original
      signbit for sinf.  */
-  int sign;
+  double sign;
   /* Determine positive or negative primary interval.  */
   sign = ones[((n >> 2) & 1) ^ signbit];
   /* Are we in the primary interval of sin or cos?  */

I get:

  "sinf": {
   "": {
    "duration": 4.0015e+10,
    "iterations": 1.4535e+09,
    "max": 640.456,
    "min": 11.437,
    "mean": 27.5301

That is roughly a 3% improvement on the mean and 11.5% on the min.  I think we
can improve it even more by avoiding the int to fp conversion used to get the
sign right, and operating with the sign as a double argument throughout.
