This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [Patch][Aarch64] memcpy IFUNC for Cavium ThunderX2
- From: Steve Ellcey <sellcey at cavium dot com>
- To: Szabolcs Nagy <szabolcs dot nagy at arm dot com>, libc-alpha <libc-alpha at sourceware dot org>
- Cc: nd at arm dot com
- Date: Tue, 20 Feb 2018 10:52:22 -0800
- Subject: Re: [Patch][Aarch64] memcpy IFUNC for Cavium ThunderX2
- References: <1518653077.14236.76.camel@cavium.com> <64c7a6e8-92c3-da12-c340-89151bbf4041@arm.com>
- Reply-to: sellcey at cavium dot com
On Fri, 2018-02-16 at 18:39 +0000, Szabolcs Nagy wrote:
>
> the code looks ok, and it is ok to commit if you think this gives
> benefit on thunderx2 (it should not affect other targets other
> than code bloat).
>
> i prefer not to add a new memcpy every time there is a new uarch,
> so i think in the long term old ones should be removed or merged
> (i'm not yet sure what's the right policy here, e.g. if a target
> is not available to anyone in the community for benchmarking it
> will be removed or if there is not enough performance benefit).
> i don't see a huge performance difference in the benchmark logs
> and there are a few weird cases e.g. in bench-memcpy.out
>
> {
> "length": 1888,
> "align1": 0,
> "align2": 59,
> "timings": [151.016, 1905.47, 150.547, 257.656, 147.969, 151.172]
> },
>
> the memcpy_thunderx2 is very slow (and memcpy_falkor is the fastest).
The benefit isn't huge, but I think it is large enough to help some
programs. I reran the benchmarks and looked at the results again.
This particular anomaly seems to have gone away (though falkor is still
faster for this specific case). I am not sure why it showed up
before. I was on a machine with no other users, but something may still
have been running that caused the hiccup.
{
"length": 1888,
"align1": 0,
"align2": 59,
"timings": [154.062, 1905.78, 151.641, 150.625, 146.094, 150.312]
},
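For anyone who wants to compare the two runs mechanically, here is a
minimal sketch (not part of the patch or of the glibc benchtests). It
takes the two entries quoted in this thread and reports which timing
columns changed by more than a chosen relative tolerance. The function
name changed_columns and the 25% tolerance are assumptions made up for
illustration. The mapping from column index to memcpy variant comes from
the header of bench-memcpy.out, which is not quoted here, so the sketch
only reports indices.

```python
import json

# Entries copied from the two bench-memcpy.out excerpts quoted in this thread.
first_run = json.loads(
    '{"length": 1888, "align1": 0, "align2": 59,'
    ' "timings": [151.016, 1905.47, 150.547, 257.656, 147.969, 151.172]}'
)
rerun = json.loads(
    '{"length": 1888, "align1": 0, "align2": 59,'
    ' "timings": [154.062, 1905.78, 151.641, 150.625, 146.094, 150.312]}'
)


def changed_columns(before, after, tolerance=0.25):
    """Return the indices of timings whose relative change exceeds tolerance.

    'tolerance' is an arbitrary illustrative threshold, not a glibc setting.
    """
    changed = []
    for index, (old, new) in enumerate(zip(before["timings"], after["timings"])):
        if abs(new - old) / old > tolerance:
            changed.append(index)
    return changed


if __name__ == "__main__":
    print("columns that moved between runs:", changed_columns(first_run, rerun))
```

Only one column moves noticeably between the two runs; the others,
including the very slow second column, are stable.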
The code that runs for ThunderX and ThunderX2 in this case is the
same, because a length of 1888 is not large enough to trigger the
prefetch loop.
I will check this in later unless someone raises an objection.
Steve Ellcey
sellcey@cavium.com