This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Sparc exp(), expf() performance improvement


On 01/08/2017 13:06, Patrick McGehearty wrote:
> On 7/31/2017 4:21 PM, David Miller wrote:
>> From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
>> Date: Mon, 31 Jul 2017 16:06:44 -0500
>>
>>> Sparc has a significant performance issue with RAW (read after write).
>>> That is, if a value is stored to a particular address and then read
>>> from that address before the store has reached L2 cache, a pipeline
>>> hiccup occurs and a 30+ cycle delay is seen. Most commonly this issue
>>> is seen in the case of register spill/fills, but it also occurs when
>>> a value in an integer register is stored to a temporary in memory and
>>> then loaded to a floating point register. The int to fp and fp to int
>>> operations are common in exp() algorithms due to cracking the
>>> exponent from the mantissa to determine which special case to use in
>>> handling particular input data ranges.
>>>
>>> Starting with Niagara4 (T4), direct int to fp and fp to int transfer
>>> instructions were added, avoiding this performance issue. If we
>>> compile for any Sparc platform instead of for T4 and later, we can't
>>> use the direct transfers. Note that T4 was first introduced in 2011,
>>> meaning most current Sparc/Linux platforms will have this support.
>>>
>>> For comparison, recent x86 chips from Intel have thrown enough HW at
>>> the RAW issue to not have any delays when a read-after-write occurs.
>>>
>>> The new algorithm is significantly different from the existing
>>> sysdeps/ieee754 algorithm. The new algorithm matches the one used by
>>> the Solaris/Studio libm exp(), expf() code. My effort was involved
>>> in porting (with Oracle corporate permission), not algorithm
>>> construction.
>>>
>>> It seems likely that this code could be faster on other CPUs, but
>>> I've only tested it on Sparc as those are the machines I have ready
>>> access to. The advantage may be much less on other platforms.
>> You miss my point.
>>
>> You are doing two _completely_ different things here.
>>
>> First, you could simply build the existing exp() and expf() C code in
>> glibc with niagara4.  In fact, if this float<-->int move instruction
>> helps so much, you probably want to build the entire math library
>> this way with appropriate ifunc hooks.  Not just exp/expf.
>>
>> Second, you could then introduce the new C code implementation of exp
>> and expf functions and:
>>
>> 1) See if it is faster on other sparc cpus.
>>
>> 2) Ask other glibc developers to test whether it is faster on
>>     non-sparc cpus as well.
>>
>> Making both changes and only targetting post-niagara4 cpus is
>> completely the wrong way to go about this.
> 
> I'm preparing to do a trial run on -mcpu=niagara4 for glibc.
> I'll report back on any interesting differences for make bench
> with/without -mcpu=niagara4 for the current sourceware tree.
> 
> I will note from my point of view, this project is focused only
> on exp() and expf() as Sparc/Solaris/Studio showed dramatically
> better performance on those specific functions. There are a few
> other functions which run faster on Sparc/Solaris/Studio, but
> nothing like the performance difference for exp() and expf().
> 
> - patrick
> 

I agree with David; we should refrain from adding even more platform
specific assembly optimizations where default C code could be just as
good and also improve generic performance on other platforms as well.

The problem you describe is very similar to the one on POWER before POWER8,
where a floating point to integer transfer issues a load-hit-store that
increases latency.  I tried to mitigate this in sin/cos by tweaking the
internal code using some hackish hooks (commit 77a2a8b4a19f0), but I am
now convinced that a new algorithm for single precision exp, sin, cos
(and probably others) is in fact a better solution.

In fact, some architectures have already gone down this path, although by
providing specific assembly implementations of algorithms that could be
provided in a more generic C way as a possible default.  For instance,
power8 [1], x86_64 [2], and i686 sse2 [3] seem to use the same algorithm
(based on the initial code comments).  You could use these as a base as
well to evaluate which one shows better performance and precision.

We also had a similar discussion for the AArch64 sinf/cosf optimization
proposal [4].

[1] sysdeps/powerpc/powerpc64/power8/fpu/e_expf.S
[2] sysdeps/x86_64/fpu/e_expf.S
[3] sysdeps/i386/i686/fpu/multiarch/e_expf-sse2.S
[4] https://sourceware.org/ml/libc-alpha/2017-06/msg00503.html

