This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[PATCH v2] aarch64: thunderx2 memmove performance improvements
- From: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
- To: libc-alpha at sourceware dot org
- Date: Tue, 30 Apr 2019 15:37:32 +0300
- Subject: [PATCH v2] aarch64: thunderx2 memmove performance improvements
Here is the patch to make memmove use thunderx2
capabilities more efficiently.
The performance improvement is about 20%-30% for
larger cases and about 1%-5% for smaller cases.
Used SIMD loads/stores instead of GPRs for the
overlapping forward move.
Reused the existing memcpy implementation for small or
overlapping backward moves.
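For readers unfamiliar with the forward/backward distinction, here is a
minimal C sketch of how a memmove can pick the copy direction. It is not
the code from the patch: sketch_memmove is a hypothetical name, the
byte-wise loops stand in for the SIMD sequences, and the dispatch
conditions used in memcpy_thunderx2.S may well differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: copy n bytes, choosing
   the direction so that overlapping buffers are handled correctly.  */
static void *
sketch_memmove (void *dstv, const void *srcv, size_t n)
{
  unsigned char *dst = dstv;
  const unsigned char *src = srcv;

  /* Unsigned wrap-around: the difference is >= n when dst is below src
     or when the buffers do not overlap, so a forward copy is safe.  */
  if ((uintptr_t) dst - (uintptr_t) src >= n)
    {
      for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    }
  else
    {
      /* dst lies inside [src, src + n): copy from the end backwards so
         no source byte is overwritten before it has been read.  */
      for (size_t i = n; i > 0; i--)
        dst[i - 1] = src[i - 1];
    }
  return dstv;
}
```

The single unsigned comparison covers both the "dst below src" and the
"no overlap at all" cases, which keeps the dispatch to one branch.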
Fixed the existing memcpy implementation to allow it
to deal with the overlapping case.
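One common way to make a small copy tolerate overlap is to finish all
loads before issuing the first store. The C sketch below illustrates
that idea only. The block16 type (standing in for one 16-byte SIMD
register), the function name and the 16..32 byte range are assumptions
made for the example, not details taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte block standing in for one SIMD (Q) register.  */
typedef struct { uint8_t b[16]; } block16;

/* Copy 16..32 bytes.  Both loads complete before the first store, so
   the result is correct even when src and dst overlap.  */
static void
copy16_32_overlap_safe (uint8_t *dst, const uint8_t *src, size_t n)
{
  assert (n >= 16 && n <= 32);
  block16 head, tail;
  memcpy (&head, src, 16);            /* first 16 bytes */
  memcpy (&tail, src + n - 16, 16);   /* last 16 bytes, may overlap head */
  memcpy (dst, &head, 16);
  memcpy (dst + n - 16, &tail, 16);
}
```

The memcpy calls here only move data between the caller's buffers and
local temporaries, which never overlap, so each call is well defined.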
Simplified the loop tails in the memcpy implementation:
they now use a branchless, overlapping sequence of
fixed-length loads/stores instead of branching on the
size.
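To illustrate the fixed-length tail idea, here is a small C sketch. It
assumes non-overlapping buffers and a copy of at least one block. The
64-byte BLOCK size and the function name are assumptions for the
example and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 64 };   /* hypothetical block size for this sketch */

/* Copy n >= BLOCK bytes between non-overlapping buffers.  The tail is
   one fixed-length copy of the last BLOCK bytes rather than a chain of
   size-dependent branches.  */
static void
copy_with_fixed_tail (uint8_t *dst, const uint8_t *src, size_t n)
{
  assert (n >= BLOCK);
  size_t i = 0;

  /* Main loop: whole blocks, stopping while more than zero and at most
     BLOCK bytes remain.  */
  while (n - i > BLOCK)
    {
      memcpy (dst + i, src + i, BLOCK);
      i += BLOCK;
    }

  /* Tail: always copy the final BLOCK bytes.  Some of them may already
     have been written by the loop; rewriting them is harmless.  */
  memcpy (dst + n - BLOCK, src + n - BLOCK, BLOCK);
}
```

The price is a few redundant byte copies in the tail. The benefit is
that the tail has no data-dependent branches to mispredict.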
Added some missing optimizations, mainly the conversion
of ldr/str pairs to ldp/stp.
Added __memmove_thunderx2 to the list of the
available implementations.
make check on linux/aarch64 - no regressions
make bench on thunderx2 - improvements
Looks OK?
* sysdeps/aarch64/multiarch/ifunc-impl-list.c: Add
__memmove_thunderx2 to the list of implementations.
* sysdeps/aarch64/multiarch/memmove.c: Likewise.
* sysdeps/aarch64/multiarch/memcpy_thunderx2.S
(__memmove_thunderx2): Rewrite using SIMD ld/st.
(__memcpy_thunderx2): Fix to handle overlapping cases.