This is the mail archive of the libc-alpha
mailing list for the glibc project.
Re: [Patch][Aarch64] memcpy IFUNC for Cavium ThunderX2
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: sellcey at cavium dot com, libc-alpha <libc-alpha at sourceware dot org>
- Cc: nd at arm dot com
- Date: Fri, 16 Feb 2018 18:39:24 +0000
On 15/02/18 00:04, Steve Ellcey wrote:
> This patch adds a new memcpy ifunc for Cavium ThunderX2. The difference
> between this and the ThunderX version is in the prefetching. ThunderX2
> has different cache characteristics and so uses a different prefetching
> strategy. Note that I prefetch past the end of the buffer being copied
> but my understanding is that that is legal and should never generate any
> errors. I tried adding code to not prefetch past the end of the source
> but those changes slowed down memcpy so I did not include them.
>
> I did not copy memcpy_thunderx.S to memcpy_thunderx2.S but just use
> memcpy_thunderx2.S to set some macros and then include memcpy_thunderx.S.
> This is to reduce duplicate code.
>
> I have attached the memcpy benchmark output files from a ThunderX2 run,
> the main differences are in bench-memcpy-large.out.
>
> Tested with no regressions, OK to check in?

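For readers who want to see the prefetch-past-the-end idea in isolation, here is a minimal C sketch. It is not the code from memcpy_thunderx.S: the function name, BLOCK_SIZE and PREFETCH_DISTANCE are hypothetical, and it relies on the GCC/Clang builtin __builtin_prefetch, which is documented not to fault on an invalid address as long as the address expression itself is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values chosen for illustration only; they are not the
   numbers used in memcpy_thunderx.S or memcpy_thunderx2.S.  */
#define BLOCK_SIZE 64          /* bytes copied per loop iteration */
#define PREFETCH_DISTANCE 512  /* how far ahead of the read pointer to prefetch */

/* Copy n bytes from src to dst, issuing a read prefetch a fixed distance
   ahead of the source pointer for every block.  Near the end of the copy
   the prefetched address lies past src + n; no check is made for that.  */
void *
copy_with_prefetch (void *dst, const void *src, size_t n)
{
  assert (dst != NULL && src != NULL);
  unsigned char *d = dst;
  const unsigned char *s = src;

  while (n >= BLOCK_SIZE)
    {
      /* Form the address as an integer so that no out-of-bounds pointer
         arithmetic is performed; the prefetch is only a hint.  */
      __builtin_prefetch ((const void *) ((uintptr_t) s + PREFETCH_DISTANCE),
                          0 /* read */, 3 /* keep in all cache levels */);
      for (size_t i = 0; i < BLOCK_SIZE; i++)
        d[i] = s[i];
      d += BLOCK_SIZE;
      s += BLOCK_SIZE;
      n -= BLOCK_SIZE;
    }

  /* Copy the tail that is shorter than one block.  */
  for (size_t i = 0; i < n; i++)
    d[i] = s[i];

  return dst;
}
```

Avoiding the prefetch past the end would need an extra compare and branch per block, which is the kind of overhead the message above reports as a slowdown.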
the code looks ok, and it is ok to commit if you think this gives a
benefit on thunderx2 (it should not affect other targets, apart
from code bloat).

i prefer not to add a new memcpy every time there is a new uarch,
so i think in the long term old ones should be removed or merged
(i'm not yet sure what the right policy is here, e.g. a variant
could be removed if its target is not available to anyone in the
community for benchmarking, or if it does not give enough
performance benefit).

i don't see a huge performance difference in the benchmark logs,
and there are a few weird cases, e.g. in bench-memcpy.out

  "timings": [151.016, 1905.47, 150.547, 257.656, 147.969, 151.172]

where memcpy_thunderx2 is very slow (and memcpy_falkor is the fastest).

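To put a number on that outlier, a small C sketch can compare the slowest entry with the median of the quoted line. The helper names and MAX_TIMINGS are hypothetical, and the sketch does not assume which array position belongs to which memcpy variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_TIMINGS 16  /* hypothetical upper bound, enough for this sketch */

static int
compare_doubles (const void *a, const void *b)
{
  double x = *(const double *) a, y = *(const double *) b;
  return (x > y) - (x < y);
}

/* Median of n timings (1 <= n <= MAX_TIMINGS); the input is not modified.  */
double
median_timing (const double *t, size_t n)
{
  double sorted[MAX_TIMINGS];
  assert (n >= 1 && n <= MAX_TIMINGS);
  for (size_t i = 0; i < n; i++)
    sorted[i] = t[i];
  qsort (sorted, n, sizeof sorted[0], compare_doubles);
  return n % 2 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
}

/* Index of the largest (slowest) timing.  */
size_t
slowest_index (const double *t, size_t n)
{
  size_t worst = 0;
  for (size_t i = 1; i < n; i++)
    if (t[i] > t[worst])
      worst = i;
  return worst;
}

/* The timings quoted from bench-memcpy.out above.  */
const double quoted_timings[] =
  { 151.016, 1905.47, 150.547, 257.656, 147.969, 151.172 };
```

Dividing quoted_timings[slowest_index (...)] by median_timing (...) gives a ratio between 12 and 13 for this line, which is why the entry stands out against the otherwise small differences.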
> 2018-02-14  Steve Ellcey  <sellcey at cavium dot com>
>
>     * sysdeps/aarch64/multiarch/Makefile (sysdep_routines):
>     * sysdeps/aarch64/multiarch/ifunc-impl-list.c (MAX_IFUNC):
>     Increment to 4.
>     (__libc_ifunc_impl_list): Add __memcpy_thunderx2.
>     * sysdeps/aarch64/multiarch/memcpy.c (libc_ifunc): Add IS_THUNDERX2
>     and IS_THUNDERX2PA checks.
>     * sysdeps/aarch64/multiarch/memcpy_thunderx.S (USE_THUNDERX2):
>     Use macro to set name appropriately.
>     (memcpy): Use USE_THUNDERX2 macro to modify prefetches.
>     * sysdeps/aarch64/multiarch/memcpy_thunderx2.S: New file.
>     * sysdeps/unix/sysv/linux/aarch64/cpu-features.h (IS_THUNDERX2PA):
>     (IS_THUNDERX2): New macro.
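
The ChangeLog describes selecting the new variant through IS_THUNDERX2 and IS_THUNDERX2PA checks. The C sketch below shows roughly how such a selection on the MIDR value could look. It is an illustration and not the patch itself: the implementer and part-number constants are assumptions that should be checked against cpu-features.h, the two copy routines are stand-ins that simply call memcpy, and a plain function pointer is used in place of the real ifunc machinery. Only the MIDR field layout (implementer in bits 31:24, part number in bits 15:4) is taken from the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* MIDR_EL1 fields: implementer in bits [31:24], part number in bits [15:4].  */
#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 0xff)
#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)

/* Assumed identification values, for illustration only.  */
#define IS_THUNDERX2PA(midr) \
  (MIDR_IMPLEMENTOR (midr) == 'B' && MIDR_PARTNUM (midr) == 0x516)
#define IS_THUNDERX2(midr) \
  (MIDR_IMPLEMENTOR (midr) == 'C' && MIDR_PARTNUM (midr) == 0xaf)

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-ins for the generic and ThunderX2 variants (hypothetical names).  */
void *
memcpy_generic_sketch (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

void *
memcpy_thunderx2_sketch (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

/* Pick an implementation from the MIDR value, in the way an ifunc
   resolver would do once at load time.  */
memcpy_fn
select_memcpy (uint64_t midr)
{
  if (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr))
    return memcpy_thunderx2_sketch;
  return memcpy_generic_sketch;
}

/* Build a MIDR value from an implementer code and a part number, leaving
   the variant, architecture and revision fields zero.  */
uint64_t
make_midr (unsigned int implementor, unsigned int partnum)
{
  return ((uint64_t) (implementor & 0xff) << 24)
         | ((uint64_t) (partnum & 0xfff) << 4);
}
```

Because the choice is made once per process, each additional variant costs nothing per call on other targets, which matches the remark above that the only effect elsewhere is code size.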