This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Sparc exp(), expf() performance improvement


Here is my report on -mcpu=niagara4 for benchtests/math (and make check).
See below for results and recommendations.

On 8/1/2017 11:06 AM, Patrick McGehearty wrote:
> On 7/31/2017 4:21 PM, David Miller wrote:
>> You miss my point.
>>
>> You are doing two _completely_ different things here.
>>
>> First, you could simply build the existing exp() and expf() C code in
>> glibc with niagara4.  In fact, if this float<-->int move instruction
>> helps so much, you probably want to build the entire math library
>> this way with appropriate ifunc hooks.  Not just exp/expf.
>>
>> Second, you could then introduce the new C code implementation of exp
>> and expf functions and:
>>
>> 1) See if it is faster on other sparc cpus.
>>
>> 2) Ask other glibc developers to test whether it is faster on
>>     non-sparc cpus as well.
>>
>> Making both changes and only targeting post-niagara4 cpus is
>> completely the wrong way to go about this.
>
> I'm preparing to do a trial run on -mcpu=niagara4 for glibc.
> I'll report back on any interesting differences for make bench
> with/without -mcpu=niagara4 for the current sourceware tree.
>
> I will note from my point of view, this project is focused only
> on exp() and expf() as Sparc/Solaris/Studio showed dramatically
> better performance on those specific functions. There are a few
> other functions which run faster on Sparc/Solaris/Studio, but
> nothing like the performance difference for exp() and expf().
>
> - patrick

Benchtests comparison based on the Aug 1, 2017 glibc tree,
using -mcpu=niagara4 on an S7

Niagara4 provides new features, including direct fma support, direct
transfers between fp and int registers, and new types of branch
instructions.
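
The direct fp/int register transfers matter for libm because much of
its code inspects the bit pattern of a floating-point value. A minimal
sketch in C of that kind of access follows; the helper name
unbiased_exponent is hypothetical and not taken from glibc. On a CPU
without a direct move instruction, the compiler typically has to route
the value through memory instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a glibc function): return the unbiased
   binary exponent of a normal, finite IEEE 754 double. */
static int unbiased_exponent(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    /* memcpy is the portable way to reinterpret the bits; compilers
       usually reduce it to a single fp-to-int register move when the
       target provides one. */
    memcpy(&bits, &x, sizeof bits);
    return (int)((bits >> 52) & 0x7ff) - 1023;
}
```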

Roughly 40% of tests showed more than 10% change.
I've grouped them by type of test.

Mean execution time (in nsec/call)
Fcn       Base     -mcpu  % improvement
exp       5318      5109       4%
log        534       343      56%
log2       182       157      16%
powf       511       396      29%

acos      1148       730      57%
asin      1120       691      62%
asinh      313       249      26%
atan      1199       915      31%
atanh      242       211      15%
cos       1527      1199      27%
cosh       304       265      15%
sin       1562      1194      31%
sincos    1671      1295      29%
sinh       573       523      10%
tan       1365      1098      24%
tanh       311       223      39%

fmin        14.6      17.6   -17% <<
fmax        14.6      17.6   -19% <<
fminf       14.6      18.1   -19% <<
fmaxf       14.6      18.2   -20% <<
ffsll       32        22      45%
(ffsll: find first set bit in a long long)
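
The "% improvement" column appears to be a speedup ratio, base time
divided by -mcpu time, minus one. That formula is an inference from
the numbers and is not stated in the report; the helper name
improvement_pct is hypothetical. A small sketch:

```c
#include <assert.h>

/* Hypothetical helper: percent improvement expressed as a speedup,
   base/tuned - 1.  Assumption: this is the formula behind the table;
   it is inferred from the rows, not stated by the author. */
static double improvement_pct(double base_ns, double tuned_ns)
{
    assert(tuned_ns > 0.0);
    return (base_ns / tuned_ns - 1.0) * 100.0;
}
```

For example, improvement_pct(534, 343) is about 55.7, which rounds to
the 56% shown for log, and improvement_pct(14.6, 17.6) is about -17.0,
consistent with the fmin row.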

I suspect the fmin/fmax/fminf/fmaxf regression comes from the compiler
placing a branch instruction in a branch delay slot, which causes a
pipeline hiccup, or something similar, but I have not investigated the
specifics for any of these tests.

I also found that using -mcpu=niagara4 caused warnings to be
generated for math/e_sqrt.c. That's because -mcpu=niagara4 defines
__FP_FAST_FMA which affects the definition of EMULV and DLA_FMS.
Either e_sqrt.c needs to be revised to avoid the warnings or
-mcpu=niagara4 needs to not be used on that routine.
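
For context, here is a sketch of the kind of construct involved: an
error-free product can be computed with one fused multiply-add when
that is fast, and with Dekker-style splitting otherwise. This is an
illustration only and is not glibc's EMULV or DLA_FMS definition; the
names split and exact_mul are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Veltkamp splitting: h gets the high part of x (at most 26
   significant bits), t the remainder.  134217729 is 2^27 + 1. */
static void split(double x, double *h, double *t)
{
    double p = 134217729.0 * x;
    *h = (x - p) + p;
    *t = x - *h;
}

/* Compute hi and lo such that hi + lo == a * b exactly, barring
   overflow and underflow.  hi is the rounded product. */
static void exact_mul(double a, double b, double *hi, double *lo)
{
    assert(hi != NULL && lo != NULL);
    *hi = a * b;
#ifdef __FP_FAST_FMA
    /* One fused multiply-add recovers the rounding error directly. */
    *lo = __builtin_fma(a, b, -*hi);
#else
    /* No fast FMA: Dekker's product built from split operands. */
    double ha, ta, hb, tb;
    split(a, &ha, &ta);
    split(b, &hb, &tb);
    *lo = ((ha * hb - *hi) + ha * tb + ta * hb) + ta * tb;
#endif
}
```

The two branches need different temporaries, which is one way a macro
written for the splitting path could end up with unused variables on
the FMA path. Whether that is the cause of the e_sqrt.c warnings is
not confirmed here.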

Also, make check got the following new failures when using -mcpu=niagara4.
FAIL: math/test-float-cpow
FAIL: math/test-float-finite-cpow
FAIL: math/test-ifloat-cpow
(the 3 above fail identically on a single value: 5 ulp error instead of 4 ulp)
FAIL: math/test-float-ctanh
FAIL: math/test-float-finite-ctanh
FAIL: math/test-ifloat-ctanh
(the 3 above fail identically on a single value: 2 ulp error instead of 1 ulp)
These may be due to the use of FMA or other optimizations enabled by
-mcpu=niagara4. They have not been investigated in depth.
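
To make the ulp figures concrete, here is a minimal sketch that counts
how many representable floats separate two values. It only illustrates
what a one-ulp step means; the glibc test harness compares against an
expected result with its own, more involved logic. The helper name
ulp_distance is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of representable-float steps between
   two finite floats of the same sign.  For such values the IEEE 754
   bit patterns are ordered like the magnitudes, so the difference of
   the patterns is the step count. */
static uint32_t ulp_distance(float a, float b)
{
    uint32_t ia, ib;
    assert(sizeof ia == sizeof a);
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}
```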

I'll note that the benchtests are not exhaustive. Specifically, only powf,
fminf, and fmaxf are included as 32-bit precision tests. Complex
functions are not tested. Some other common functions such as expm1
and log10 are not present. That's not a criticism, just an
observation that more tests are needed before a blanket change of
default behavior should be recommended.

I concur that it would be beneficial to use -mcpu=niagara4 for
selected libm functions, but I do not believe it should be on by default
without more extensive testing (and likely more compiler tuning). I'd
like to move forward on that issue, but those tasks are beyond the
scope of my current assignment. I will put them on our internal list
of desirable enhancements but can't make any commitments as to whether
resources will be assigned to making them happen.

My goal/assignment is for exp() and expf() on Linux to match Solaris
libm performance. exp() is included in libMicro and in one or more
of the new SPECcpu2017 applications, making it highly visible when
doing benchmark comparisons between systems.
Even without -mcpu=niagara4, the new code is much faster than the
ieee754 code, but to achieve full parity, the -mcpu=niagara4 switch is
needed. I believe that switch should be only used selectively on
functions where a clear benefit is shown. Also, for those functions, a
fall-back version that does not use -mcpu=niagara4 should be included
and controlled by ifunc to allow a Sparc-based glibc to run on any
Sparc platform even if it is built on a newer Sparc platform.
The approach I've implemented supports both of those requirements.
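
The selection idea can be sketched in portable C with a function
pointer. This is not the ifunc code from the patch: the stub bodies,
the names exp_generic_stub, exp_niagara4_stub and select_exp, and the
cpu_has_niagara4 flag are all hypothetical, and CPU detection is left
to the caller.

```c
typedef double (*exp_impl)(double);

/* Placeholder bodies standing in for a generic build and a build
   compiled with -mcpu=niagara4.  Neither is a real exp(); they differ
   only so that the selected variant can be told apart. */
static double exp_generic_stub(double x)  { return 1.0 + x; }
static double exp_niagara4_stub(double x) { return 1.0 + x + 0.5 * x * x; }

/* Pick the tuned variant only when the running CPU supports it, and
   fall back to the generic one everywhere else. */
static exp_impl select_exp(int cpu_has_niagara4)
{
    return cpu_has_niagara4 ? exp_niagara4_stub : exp_generic_stub;
}
```

With a real ifunc the choice is made by a resolver when the symbol is
bound, rather than by a call such as select_exp() in user code.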

I can remove the Makefile code that uses -mcpu=niagara4 only if the
compiler supports it. That will mean any attempt to build glibc on Sparc
with an older version of gcc will fail.  Internal to Oracle, some Oracle SW
still depends on gcc 4.4.7, meaning I would have to maintain two different
versions, one for Oracle Linux/Sparc and one for external use.

respectfully submitted by Patrick McGehearty

