[PATCH v2] aarch64: thunderx2 memmove performance improvements

Wed May 1 11:34:00 GMT 2019

On 30/04/2019 13:40, Anton Youdkevitch wrote:
> Now with the patch
> 
> On Tue, Apr 30, 2019 at 03:37:32PM +0300, Anton Youdkevitch wrote:
>> Here is the patch to make memmove use thunderx2
>> capabilities more efficiently.
>>
>> The performance improvement is about 20%-30% for
>> larger cases and about 1%-5% for smaller cases.

This or a similar statement about the performance
improvement on thunderx2 should be added to the
commit message.

>>
>> Used SIMD load/store instead of GPR for overlapping
>> forward move.
>>
>> Reused existing memcpy implementation for small or
>> overlapping backward move.
>>
>> Fixed the existing memcpy implementation to allow it
>> to deal with the overlapping case.
>>
>> Simplified loop tails in the memcpy implementation -
>> use branchless overlapping sequence of fixed length
>> load/stores instead of branching depending on the
>> size.
>>
>> Added some missing optimizations, mainly ldr/str
>> to ldp/stp conversion.
>>
>> Added __memmove_thunderx2 to the list of the
>> available implementations.
>>
>>
>> make check on linux/aarch64 - no regressions
>> make bench on thunderx2     - improvements
>>
>> Looks OK?
>>
>> * sysdeps/aarch64/multiarch/ifunc-impl-list.c: Added
>>   __memmove_thunderx2 to the list of implementations
>> * sysdeps/aarch64/multiarch/memmove.c: Likewise
>> * sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
>>   (__memmove_thunderx2): rewritten using SIMD ld/st
>>   (__memcpy_thunderx2): fixed to handle overlapping cases

This is ok to commit with the commit message fixed.
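
For readers unfamiliar with the "branchless overlapping sequence of
fixed length load/stores" mentioned in the quoted description, here is
a minimal C sketch of the general idea. It is not taken from the patch:
the helper name copy_16_to_32 and the 16..32 byte size range are
assumptions chosen for illustration, and the real code is AArch64
assembly using SIMD registers. The sketch copies the first and the last
16 bytes of the buffer; for sizes below 32 the two chunks overlap in
the middle, so no branch on the exact size is needed. Both loads are
done before either store, which is what lets the same sequence handle
overlapping source and destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the patch): copy n bytes, 16 <= n <= 32,
   without branching on the exact value of n. */
static void copy_16_to_32(unsigned char *dst, const unsigned char *src,
                          size_t n)
{
    unsigned char head[16], tail[16];

    assert(n >= 16 && n <= 32);

    /* Load phase: read the first and the last 16 bytes of the source.
       For n < 32 these two chunks overlap in the middle. */
    memcpy(head, src, 16);
    memcpy(tail, src + n - 16, 16);

    /* Store phase: all loads are complete, so the stores cannot
       clobber source bytes that still need to be read, even when
       dst and src overlap. */
    memcpy(dst, head, 16);
    memcpy(dst + n - 16, tail, 16);
}
```

In a compiled build the fixed-size memcpy calls typically become plain
16-byte loads and stores, which roughly corresponds to one q-register
ldr/str pair per chunk in the assembly version.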