This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Sparc exp(), expf() performance improvement
- From: Patrick McGehearty <patrick dot mcgehearty at oracle dot com>
- To: libc-alpha at sourceware dot org
- Date: Wed, 2 Aug 2017 14:53:19 -0500
- Subject: Re: [PATCH] Sparc exp(), expf() performance improvement
- Authentication-results: sourceware.org; auth=none
- References: <1501529969-96949-1-git-send-email-patrick.mcgehearty@oracle.com> <20170731.124719.1163288220939988504.davem@davemloft.net> <18ef0698-02a5-eb2d-fc87-ce234ab70ac6@oracle.com> <20170731.142137.2230165415917491707.davem@davemloft.net> <11ace70c-ca7b-b213-bc63-ac7f252ef8ba@oracle.com>
Here is my report on -mcpu=niagara4 for benchtests/math (and make check).
See below for results and recommendations.
On 8/1/2017 11:06 AM, Patrick McGehearty wrote:
> On 7/31/2017 4:21 PM, David Miller wrote:
>> You miss my point.
>> You are doing two _completely_ different things here.
>> First, you could simply build the existing exp() and expf() C code in
>> glibc with niagara4. In fact, if this float<-->int move instruction
>> helps so much, you probably want to build the entire math library
>> this way with appropriate ifunc hooks. Not just exp/expf.
>> Second, you could then introduce the new C code implementation of exp
>> and expf functions and:
>> 1) See if it is faster on other sparc cpus.
>> 2) Ask other glibc developers to test whether it is faster on
>> non-sparc cpus as well.
>> Making both changes and only targeting post-niagara4 cpus is
>> completely the wrong way to go about this.
> I'm preparing to do a trial run on -mcpu=niagara4 for glibc.
> I'll report back on any interesting differences for make bench
> with/without -mcpu=niagara4 for the current sourceware tree.
> I will note from my point of view, this project is focused only
> on exp() and expf() as Sparc/Solaris/Studio showed dramatically
> better performance on those specific functions. There are a few
> other functions which run faster on Sparc/Solaris/Studio, but
> nothing like the performance difference for exp() and expf().
> - patrick
Benchtests comparison based on Aug 1, 2017 glibc
using -mcpu=niagara4 on an S7
Niagara4 provides new features, including direct fma support, direct
transfers between fp and int registers, and new types of branch
instructions.
Roughly 40% of tests showed more than 10% change.
I've grouped them by type of test.
Mean execution time (in nsec/call)
Fcn       Base   -mcpu  % improvement
exp       5318    5109     4%
log        534     343    56%
log2       182     157    16%
powf       511     396    29%
acos      1148     730    57%
asin      1120     691    62%
asinh      313     249    26%
atan      1199     915    31%
atanh      242     211    15%
cos       1527    1199    27%
cosh       304     265    15%
sin       1562    1194    31%
sincos    1671    1295    29%
sinh       573     523    10%
tan       1365    1098    24%
tanh       311     223    39%
fmin      14.6    17.6   -17% <<
fmax      14.6    17.6   -19% <<
fminf     14.6    18.1   -19% <<
fmaxf     14.6    18.2   -20% <<
ffsll       32      22    45%
(ffsll: find first bit set in a word)
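The "% improvement" column appears to be the base time divided by the
-mcpu=niagara4 time, minus one (for log: 534/343 - 1 is about 56%).
That formula is inferred from the numbers, not stated anywhere in this
report, and the helper name below is made up for illustration. A
minimal C sketch under that assumption:

```c
#include <assert.h>

/* Hypothetical helper: relative speedup in percent, assuming the table's
   "% improvement" column is (base / new - 1) * 100.  Positive means the
   -mcpu=niagara4 build is faster; negative means it is slower.  */
static double
improvement_pct (double base_ns, double new_ns)
{
  assert (base_ns > 0.0 && new_ns > 0.0);
  return (base_ns / new_ns - 1.0) * 100.0;
}
```

With the table's values, improvement_pct (534, 343) is about 55.7 and
improvement_pct (14.6, 17.6) is about -17.0, which match the log and
fmin rows after rounding.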
I suspect the fmin/fmax/fminf/fmaxf regression comes from the compiler
placing a branch instruction in a branch delay slot, which causes a
pipeline hiccup, or something similar, but I have not investigated the
specifics for any of these tests.
I also found that using -mcpu=niagara4 caused warnings to be
generated for math/e_sqrt.c. That's because -mcpu=niagara4 defines
__FP_FAST_FMA, which affects the definitions of EMULV and DLA_FMS.
Either e_sqrt.c needs to be revised to avoid the warnings or
-mcpu=niagara4 needs to not be used on that routine.
Also, make check got the following new failures when using -mcpu=niagara4.
FAIL: math/test-float-cpow
FAIL: math/test-float-finite-cpow
FAIL: math/test-ifloat-cpow
(the above 3 fail identically on a single value: 5 ulp error instead of 4 ulp)
FAIL: math/test-float-ctanh
FAIL: math/test-float-finite-ctanh
FAIL: math/test-ifloat-ctanh
(the above 3 fail identically on a single value: 2 ulp error instead of 1 ulp)
These may be due to the use of FMA or other optimizations enabled by
-mcpu=niagara4. They have not been investigated in depth.
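For reference, an error of "N ulp" means the result is N units in the
last place away from the expected value. The sketch below counts how
many representable floats separate two finite values of the same sign.
It only illustrates the unit and is not necessarily how the glibc test
harness measures errors; the function name is invented here, and IEEE
754 binary32 floats are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of representable floats between a and b.
   Assumes IEEE 754 binary32 and that both inputs are finite and share
   the same sign, so their bit patterns are ordered like their
   magnitudes.  */
static uint32_t
float_ulp_distance (float a, float b)
{
  uint32_t ia, ib;
  memcpy (&ia, &a, sizeof ia);   /* Copy the bits out without aliasing UB.  */
  memcpy (&ib, &b, sizeof ib);
  assert ((ia >> 31) == (ib >> 31));
  return ia > ib ? ia - ib : ib - ia;
}
```

Going from a 4 ulp to a 5 ulp error therefore means the computed float
landed one representable value further from the expected result.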
I'll note that the benchtests are not exhaustive. Specifically, only powf,
fminf, and fmaxf are included as 32-bit precision tests. Complex
functions are not tested. Some other common functions such as expm1
and log10 are not present. That's not a criticism, just an
observation that more tests are needed before a blanket change of
default behavior should be recommended.
I concur that it would be beneficial to use -mcpu=niagara4 for
selected libm functions, but do not believe it should be on by default
without more extensive testing (and likely more compiler tuning). I'd
like to move forward on that issue, but those tasks are beyond the
scope of my current assignment. I will put them on our internal list
of desirable enhancements but can't make any commitments as to whether
resources will be assigned to making them happen.
My goal/assignment is for exp() and expf() on Linux to match Solaris
libm performance. exp() is included in libmicro and in one or more
of the new SPECcpu2017 applications, making it highly visible when
doing benchmark comparisons between systems.
Even without -mcpu=niagara4, the new code is much faster than the
ieee754 code, but to achieve full parity, the -mcpu=niagara4 switch is
needed. I believe that switch should be used only selectively, on
functions where a clear benefit is shown. Also, for those functions, a
fall-back version that does not use -mcpu=niagara4 should be included
and controlled by ifunc to allow a Sparc-based glibc to run on any
Sparc platform even if it is built on a newer Sparc platform.
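The fall-back arrangement can be sketched in plain C. The real
mechanism is an ifunc resolver run at load time; the version below
uses an ordinary function pointer chosen from a capability mask
instead, so it compiles anywhere. The flag value, the function names,
and the stand-in bodies are all hypothetical and are not taken from
the patch.

```c
#include <assert.h>

/* Hypothetical capability bit standing in for "CPU has niagara4 features".  */
#define HYPOTHETICAL_CAP_NIAGARA4 (1UL << 0)

typedef double (*exp_fn) (double);

/* Stand-in bodies.  A real build would compile the same exp() source
   twice, once with -mcpu=niagara4 and once without.  These return
   different marker values only so the selection is observable.  */
static double variant_generic (double x) { return x + 1.0; }
static double variant_niagara4 (double x) { return x + 2.0; }

/* Plays the role of the ifunc resolver: pick the fast variant only when
   the running CPU reports the capability, otherwise the portable one.  */
static exp_fn
select_exp (unsigned long caps)
{
  return (caps & HYPOTHETICAL_CAP_NIAGARA4) ? variant_niagara4
                                            : variant_generic;
}
```

A library built on a newer machine would then still run on older Sparc
hardware, because select_exp (0) returns the generic variant.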
The approach I've implemented supports both of those requirements.
I can remove the Makefile code that uses -mcpu=niagara4 only if the compiler
supports it. That would mean any attempt to build glibc on Sparc with
an older version of gcc fails. Internal to Oracle, some Oracle SW
still depends on gcc 4.4.7, so I would have to maintain two different
versions, one for Oracle Linux/Sparc and one for external use.
respectfully submitted by Patrick McGehearty