[PATCH v2][AArch64] Improve integer memcpy

Fri Mar 13 16:43:21 GMT 2020

On 11/03/2020 13:32, Wilco Dijkstra wrote:
> Hi Adhemerval,
> 
>> I wonder if the optimization for sizes up to 128 yields the same gain
>> for the other chip-specific memcpy implementations (thunderx,
>> thunderx2, and falkor).
> 
> Most definitely - the new memcpy is 15-20% faster than __memcpy_thunderx2
> on TX2.

OK, what I would like to avoid is having to keep maintaining subpar
architecture-specific implementations once the generic implementation
improves.

So, for ThunderX the only optimization its implementation uses is
prefetch for sizes larger than 32KB. Is it really paying off?
Could it switch to the generic implementation as well?

For ThunderX2, it uses Q registers and a 128-byte loop for aligned
copies, and the jump table for unaligned ones.  Is the jump table still
a gain for ThunderX2?  Also, it might be an option to have a generic
memcpy that uses Q registers with a larger window (so ThunderX and
newer cores might prefer it instead of the generic one).

> 
>> The main differences seem to be how each chip handles
>> large copies, with thunderx and falkor doing 64 bytes per loop,
>> while thunderx2 does either 128 bytes (when source and dest are
>> aligned) or 64 for unaligned inputs (it also does not issue
>> unaligned access, doing aligned load plus merge using a jump table).
> 
> Yes that jump table is insane at 1KB of code... It may seem great in
> microbenchmarks but it falls apart in the real world.
> 
>> So it seems that I don't see a straightforward way to unify the
>> implementations, maybe adding a common shared code for sizes
>> less than 128 bytes.
> 
> Yes we could share the code for small cases across implementations.
> I was thinking about having an ifunc for large copies so we could
> statically link a common routine to handle small copies and avoid
> PLT overheads in 99% of cases.
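
A rough C sketch of the split described above, only to make the idea
concrete.  It is not the glibc code: the names copy_dispatch,
large_copy, large_copy_generic and large_copy_calls are hypothetical,
the 128-byte threshold is taken from this thread, and a plain function
pointer stands in for the ifunc-resolved symbol.  The small path is a
byte loop for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names; a sketch of the idea, not the glibc code.  */

typedef void *(*large_copy_fn) (void *, const void *, size_t);

/* Counts how often the indirect (large) path is taken.  */
static int large_copy_calls;

/* Stand-in for a CPU-tuned large-copy routine.  */
static void *
large_copy_generic (void *dst, const void *src, size_t n)
{
  large_copy_calls++;
  return memcpy (dst, src, n);
}

/* With a real ifunc the dynamic linker would pick the target once;
   here an ordinary function pointer plays that role.  */
static large_copy_fn large_copy = large_copy_generic;

/* Small copies are handled directly by the caller-visible routine,
   so they never pay for the indirect call.  */
static inline void *
copy_dispatch (void *dst, const void *src, size_t n)
{
  if (n <= 128)
    {
      unsigned char *d = dst;
      const unsigned char *s = src;
      for (size_t i = 0; i < n; i++)
        d[i] = s[i];
      return dst;
    }
  /* Only large copies go through the indirection.  */
  return large_copy (dst, src, n);
}
```

With this shape, a call such as copy_dispatch (dst, src, 100) never
touches the function pointer, while a 300-byte copy does.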
> 
>> One question is if doing operation for large sizes using
>> ldp/stp might yield some gains (as thunderx2 does, at least
>> for aligned case), or if the cost of checking and using some
>> specific cases does not pay off.
> 
> You mean LDP/STP of SIMD registers? There is some gain for those on
> modern cores.
> 
>> +   Large copies use a software pipelined loop processing 64 bytes per iteration.
>> +   The destination pointer is 16-byte aligned to minimize unaligned accesses.
>> +   The loop tail is handled by always copying 64 bytes from the end.
>> +*/
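
The strategy in that comment can be sketched in C as follows.  This is
an illustration under stated assumptions, not the patch itself: the
name large_copy_sketch is hypothetical, the buffers must not overlap,
and n must be at least 128.  The software pipelining of the real loop
(issuing the next loads before the current stores) is not modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one 64-byte block; the real code would use LDP/STP pairs.  */
static inline void
copy64 (unsigned char *d, const unsigned char *s)
{
  memcpy (d, s, 64);
}

/* Hypothetical sketch.  Assumes n >= 128 and non-overlapping buffers.  */
void *
large_copy_sketch (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  unsigned char *dend = d + n;
  const unsigned char *send = s + n;

  /* Copy the first 16 bytes unaligned, then step forward by 1..16
     bytes so that the destination pointer is 16-byte aligned.  */
  memcpy (d, s, 16);
  size_t skew = 16 - ((uintptr_t) d & 15);
  d += skew;
  s += skew;

  /* Main loop: 64 bytes per iteration while more than 64 remain.  */
  while ((size_t) (dend - d) > 64)
    {
      copy64 (d, s);
      d += 64;
      s += 64;
    }

  /* Tail: always copy the last 64 bytes, measured from the end.
     This may rewrite bytes the loop already stored, which is
     harmless and avoids any branching on the remaining length.  */
  copy64 (dend - 64, send - 64);
  return dst;
}
```

The unconditional 64-byte tail is what removes the need for a
length-dependent epilogue: at most 64 bytes remain after the loop, so
one fixed-size copy anchored at the end always covers them.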
> 
>> Ok, so it now uses a similar strategy to the ThunderX/Falkor memcpy
>> (Falkor limits the copy to one register due to a hardware prefetcher
>> limitation).
> 
> Well this is what it always did. It's faster on in-order cores and supports
> overlapping copies (unlike the Falkor memcpy).
> 
> I'll fix up the long lines before commit.
> 
> Cheers,
> Wilco
>