This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: PPC64 libmvec sincos/sincosf ABI


On 8/6/19 12:42 PM, Wilco Dijkstra wrote:
> Hi,
>
>> 1. What is the best vector ABI (best performance) for sincos on PPC64?  
>> That may be a function of the particular vector instructions available on 
>> PPC64; the best choice of ABI on PPC64 need not correspond to the best 
>> choice on x86_64.
> I don't think it is related to the target - the fastest ABI is one that avoids
> unnecessary work. For example scalar sincos is slow due to the inefficient
> ABI which forces the results through memory (fixing that gives a 50% speedup). 
>
> Similarly for the vector ABI I think returning 2 vectors in registers will be the
> fastest option in all cases. The actual vector instructions shouldn't affect the
> ABI beyond the vector widths that can be supported.
>
> Wilco
>
Let me jump in here to answer a general question that I think Bert has
had for a while.

For the PPC64LE ABI, we should be returning everything through registers
wherever possible.  The ABI supports multiple return values of the same
type (up to 8 vector return values, for example), using the same
registers used for passing parameters.  For simplicity in this example,
I'll use the AltiVec-style types (vector double), but this works
identically if you use more generically defined vector types.

#include <altivec.h>

struct sincosret
{
    vector double sinvals;
    vector double cosvals;
};

struct sincosret
mysincos (vector double a)
{
    struct sincosret scr;
    scr.sinvals = a+a;  // May be slightly incorrect
    scr.cosvals = a*a;  // Ditto
    return scr;
}

This will result in the values being returned in VR2 and VR3 (which
appear as VSX registers 34 and 35 in the generated code):

    xvmuldp 35,34,34
    xvadddp 34,34,34
    blr

This is preferable to returning values indirectly through memory, which
on older POWER processors can cause stalls when the store and the
subsequent load of the same data are too close together and possibly
executed out of order.  The cost is pretty much negligible compared to
the cost of computing sin/cos, but we might as well do it the best way
that the ABI provides.
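
As a rough sketch of the same idea with generic vector types (the
GCC/Clang vector_size attribute instead of altivec.h), the following
contrasts the by-value return with a pointer-based interface.  The names
v2df, fake_sin, fake_cos, sincos_by_value, sincos_by_pointer, and
check_variants are hypothetical and chosen only for illustration; the
arithmetic is the same placeholder as in the example above, not a real
sine or cosine.

```c
#include <assert.h>

/* Generic GCC/Clang vector type: two doubles in 16 bytes.
   (Assumption: stands in for "vector double" from altivec.h.)  */
typedef double v2df __attribute__ ((vector_size (16)));

struct sincosret
{
  v2df sinvals;
  v2df cosvals;
};

/* Placeholder arithmetic only; these do not compute sin or cos.  */
static inline v2df fake_sin (v2df a) { return a + a; }
static inline v2df fake_cos (v2df a) { return a * a; }

/* Variant 1: both results returned by value in a homogeneous struct.  */
struct sincosret
sincos_by_value (v2df a)
{
  struct sincosret scr;
  scr.sinvals = fake_sin (a);
  scr.cosvals = fake_cos (a);
  return scr;
}

/* Variant 2: results stored through pointers, so the caller has to
   load them back from memory.  */
void
sincos_by_pointer (v2df a, v2df *sinp, v2df *cosp)
{
  *sinp = fake_sin (a);
  *cosp = fake_cos (a);
}

/* Check that both variants produce the same lanes for one input.  */
void
check_variants (double x0, double x1)
{
  v2df a = { x0, x1 };
  struct sincosret r = sincos_by_value (a);
  v2df s, c;

  sincos_by_pointer (a, &s, &c);
  for (int i = 0; i < 2; i++)
    {
      assert (r.sinvals[i] == s[i]);
      assert (r.cosvals[i] == c[i]);
    }
}
```

Building both variants at -O2 for powerpc64le and comparing the
generated assembly should show the extra stores only in the pointer
version.  On other targets the ABI may still return the struct through
memory, so the comparison is specific to the register-return rules
described above.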

Now, as I've said elsewhere, dealing with sincos in the -mveclibabi
framework in GCC may be less than straightforward, due to the different
description of the output types, but perhaps AArch64 has already laid
some groundwork here.  I'm not up to date on the pending patches.

Hope this helps,
Bill

