This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


[PATCH v2] aarch64: thunderx2 memmove performance improvements

From: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
To: libc-alpha at sourceware dot org
Date: Tue, 30 Apr 2019 15:37:32 +0300
Subject: [PATCH v2] aarch64: thunderx2 memmove performance improvements

Here is the patch to make memmove use thunderx2
capabilities more efficiently.

The performance improvement is about 20%-30% for
larger sizes and about 1%-5% for smaller sizes.

Used SIMD load/store instead of GPR for overlapping
forward move.

Reused existing memcpy implementation for small or
overlapping backward move.
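
To illustrate the forward/backward distinction above, here is a
minimal C sketch of how a memmove can pick the copy direction. It
is not the code from the patch (which is AArch64 assembly); the
names needs_backward_copy and sketch_memmove are hypothetical, and
the byte loops only stand in for the real SIMD and memcpy paths.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: returns nonzero when dst
   starts inside [src, src + n), i.e. when an ascending copy could
   overwrite source bytes before they have been read.  If dst < src
   the unsigned subtraction wraps to a huge value and the test fails. */
static int needs_backward_copy(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    return d - s < n;
}

/* Illustrative memmove: byte loops stand in for the real copy paths. */
static void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (needs_backward_copy(dst, src, n)) {
        /* Backward move: copy from the highest address down. */
        while (n--)
            d[n] = s[n];
    } else {
        /* Forward move: safe when dst is below src or the buffers
           do not overlap at all. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

When dst equals src the helper selects the backward path, which is
harmless because every byte is copied onto itself.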

Fixed the existing memcpy implementation to allow it
to deal with the overlapping case.

Simplified the loop tails in the memcpy implementation:
they now use a branchless, overlapping sequence of
fixed-length loads/stores instead of branching on the
size.
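
The branchless tail idea can be sketched in C roughly as follows.
This is an illustration under my own assumptions, not the assembly
from the patch: block16 and copy_16_to_32 are hypothetical names,
and the 16-byte struct stands in for one SIMD register. For any
length from 16 to 32 bytes, the first and the last 16 bytes are
loaded, then both are stored; the two blocks simply overlap in the
middle when the length is less than 32, so no branch on the exact
size is needed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a 16-byte SIMD register. */
typedef struct { unsigned char b[16]; } block16;

/* Copies n bytes, 16 <= n <= 32, with two fixed-length loads and
   two fixed-length stores.  Both loads complete before the first
   store, so the routine also works when dst and src overlap. */
static void copy_16_to_32(void *dst, const void *src, size_t n)
{
    block16 head, tail;

    memcpy(&head, src, 16);                               /* first 16 bytes */
    memcpy(&tail, (const unsigned char *)src + n - 16, 16); /* last 16 bytes */

    memcpy(dst, &head, 16);
    memcpy((unsigned char *)dst + n - 16, &tail, 16);
}
```

Doing all loads before any store is also what lets such a
sequence tolerate overlapping buffers.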

Added some missed optimizations, mainly converting
ldr/str to ldp/stp.

Added __memmove_thunderx2 to the list of the
available implementations.


make check on linux/aarch64 - no regressions
make bench on thunderx2     - improvements

Looks OK?

* sysdeps/aarch64/multiarch/ifunc-impl-list.c: Add
  __memmove_thunderx2 to the list of implementations.
* sysdeps/aarch64/multiarch/memmove.c: Likewise.
* sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
  (__memmove_thunderx2): Rewrite using SIMD ld/st.
  (__memcpy_thunderx2): Fix to handle overlapping cases.
Follow-Ups:
- Re: [PATCH v2] aarch64: thunderx2 memmove performance improvements
  - From: Anton Youdkevitch