This is the mail archive of the mailing list for the glibc project.


Re: [PATCH] x86-64: Optimize e_expf with FMA [BZ #21912]

On 08/16/2017 10:31 AM, Arjan van de Ven wrote:
> On 8/16/2017 7:04 AM, Carlos O'Donell wrote:
>> On 08/16/2017 09:34 AM, H.J. Lu wrote:
>>> FMA optimized e_expf improves performance by more than 50% on
>>> Skylake.
>>> Any comments?
>> Exactly how much of e_expf-fma.S do you need to achieve that 50%
>> speedup?
> the core "fast path" (the bit after
> /* Main path: here if 2^(-28)<=|x|<125*log(2) */ )

>> How does this algorithm compare to what is already implemented for
>> e_expf?
> I started with the SSE version of that e_expf, turned it into AVX,
> used FMA where possible and fixed a few glass jaws in the fast path
> that you hit on Skylake.
> The slow path is more of a direct 1:1 translation from SSE to AVX
> (because mixing SSE and AVX is generally a bad idea).


>> My questions are basically leading to this:
>> (a) Can we write a generic e_expf in C with the special cases
>> written in C, and...
> the special cases are tiny for expf() (compared to say, powf() )
>> (b) Is there a core kernel computation that we can then implement
>> in assembly?
> some of the hot path instructions are pretty long latency in the cpu,
> and to get the performance, those are done before the special case
> tests (so that their execution overlaps, effectively making all the
> special case tests free)
> if you do the first part in C you lose all of that. The whole expf()
> (when hitting the fast path) has a throughput cost of somewhere
> between 10 and 12 cycles (depending on the CPU generation). A big
> chunk of achieving that requires overlapping some of the expensive
> things with more mundane instructions (like the special case tests).
> Giving that up for doing these checks in C first likely doubles the
> cost of expf() or something very close to that.
> (note this new code does not change the basics, the existing SSE code
> already has all of this in assembly) 

OK, so that is what I thought would happen. We saw similar problems
when instrumenting the malloc hot paths: you needed access to the
entire function to hide as much of the cost of the profiling as
possible.

I don't object to HJ's patch, it looks good to me, and is just another
assembly implementation optimization.

My goal with this review is just to double-check that we can't do this
in any other way that makes it easier to maintain.

