[PATCH v2] aarch64: thunderx2 memmove performance improvements

Here is a patch that makes memmove use ThunderX2
capabilities more efficiently.

The performance improvement is about 20%-30% for
larger cases and about 1%-5% for smaller cases.

Used SIMD load/store instead of GPR for overlapping
forward move.

Reused existing memcpy implementation for small or
overlapping backward move.
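
For reviewers less familiar with the overlap cases, here is a
minimal C sketch of how a memmove can classify the copy
direction. The enum and the classify_move name are hypothetical
and are not taken from the patch; the assembly may make this
decision differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum copy_order { COPY_ANY_ORDER, COPY_ASCENDING, COPY_DESCENDING };

static enum copy_order
classify_move (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;

  if (d == s)
    return COPY_ANY_ORDER;      /* Same buffer, nothing can be clobbered.  */
  /* Unsigned subtraction wraps, so each test checks one half-open range.  */
  if (d - s < n)
    return COPY_DESCENDING;     /* dst starts inside [src, src + n).  */
  if (s - d < n)
    return COPY_ASCENDING;      /* src starts inside [dst, dst + n).  */
  return COPY_ANY_ORDER;        /* Buffers do not overlap.  */
}
```

A copy that loads all of its source bytes before storing any of
them is safe in every case, which is why a small move can share
the memcpy path.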

Fixed the existing memcpy implementation to allow it
to deal with the overlapping case.

Simplified loop tails in the memcpy implementation -
use branchless overlapping sequence of fixed length
load/stores instead of branching depending on the
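
The branchless overlapping idea can be sketched in C as follows.
This is an illustration under assumptions, not the code from the
patch: the copy16_32 name and the 16..32 byte range are
hypothetical, and the fixed-size memcpy calls stand in for
16-byte SIMD loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, 16 <= n <= 32, without
   branching on n.  The two 16-byte chunks overlap in the middle
   whenever n < 32.  */
static void
copy16_32 (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[16], tail[16];

  assert (n >= 16 && n <= 32);

  /* Load both chunks before storing anything, so the sequence
     also works when dst and src overlap.  */
  memcpy (head, src, 16);
  memcpy (tail, src + n - 16, 16);
  memcpy (dst, head, 16);
  memcpy (dst + n - 16, tail, 16);
}
```

Some bytes in the middle are written twice with the same value,
which trades a little redundant work for a fixed-length sequence.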

Added some missed optimizations, mainly converting
ldr/str pairs to ldp/stp.

Added __memmove_thunderx2 to the list of the
available implementations.

make check on linux/aarch64 - no regressions
make bench on thunderx2     - improvements

Looks OK?

* sysdeps/aarch64/multiarch/ifunc-impl-list.c: Add
  __memmove_thunderx2 to the list of implementations.
* sysdeps/aarch64/multiarch/memmove.c: Likewise.
* sysdeps/aarch64/multiarch/memcpy_thunderx2.S
  (__memmove_thunderx2): Rewrite using SIMD ld/st.
  (__memcpy_thunderx2): Fix to handle overlapping cases.

