This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] aarch64: Optimized memcpy for Qualcomm Falkor processor
- From: Siddhesh Poyarekar <siddhesh at gotplt dot org>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: nd <nd at arm dot com>
- Date: Fri, 23 Jun 2017 19:48:49 +0530
- Subject: Re: [PATCH] aarch64: Optimized memcpy for Qualcomm Falkor processor
- Authentication-results: sourceware.org; auth=none
- References: <AM5PR0802MB2610E536F00CB8A45FF3A78C83D80@AM5PR0802MB2610.eurprd08.prod.outlook.com>
On Friday 23 June 2017 06:19 PM, Wilco Dijkstra wrote:
> Those are odd results. Omnetpp doesn't use memcpy and xalancbmk profile has
> memcpy at ~2%, so getting a 6% improvement couldn't be related to memcpy!
Ah, you're right, I did not notice that. I ran the tests using the
glibc.tune.cpu tunable to rule out build-related differences, because
different versions of gcc also appear to generate different sets of
memcpy calls through the tree-loop-distribution pass. I attributed the
performance difference to the possibility of additional generated
memcpy calls; I should have checked.
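For anyone wanting to repeat that kind of comparison, here is a minimal sketch of running one unchanged binary under different values of the glibc.tune.cpu tunable through the GLIBC_TUNABLES environment variable. The benchmark path and the CPU names "falkor" and "generic" are assumptions for illustration, not something taken from the runs described above.

```python
import os
import subprocess


def tunables_env(cpu_name):
    """Return a copy of the environment that selects a CPU via glibc.tune.cpu."""
    env = dict(os.environ)
    # GLIBC_TUNABLES takes name=value pairs; cpu_name is an assumed value here.
    env["GLIBC_TUNABLES"] = f"glibc.tune.cpu={cpu_name}"
    return env


def run_with_cpu(benchmark_argv, cpu_name):
    """Run the same binary under the given tunable value and return its stdout."""
    result = subprocess.run(
        benchmark_argv,
        env=tunables_env(cpu_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage; "./bench" stands in for whatever benchmark is being run:
#   falkor_out = run_with_cpu(["./bench"], "falkor")
#   generic_out = run_with_cpu(["./bench"], "generic")
```

Because the binary is identical in both runs, any difference comes from the memcpy variant selected at load time rather than from compiler output.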
> Similarly the random memcpy benchmark only does a small number of copies
> larger than 96 bytes (where your new code is used), so I find it hard to believe it
> could make a difference. On Cortex-A57 I get identical performance for this patch vs
> the generic version (btw __memcpy_thunderx is very close, __memcpy_thunderx2
> is 18% slower).
Here's what I get on falkor:
Function: memcpy
                     __memcpy_falkor   __memcpy_generic
Variant: random
================================================================================
max-size=4096:              34820.00           36905.10 (-5.99%)
max-size=8192:              33637.30           35860.00 (-6.61%)
max-size=16384:             35167.80           37527.10 (-6.71%)
max-size=32768:             34477.50           36517.70 (-5.92%)
max-size=65536:             35028.00           37272.50 (-6.41%)
So it is a definite overall improvement. However, this still does not
justify the double-prefetch, which seems more likely to have had a
negative effect on the numbers.
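As a quick sanity check on the table above, the percentages can be reproduced by taking the __memcpy_falkor column as the base and expressing the __memcpy_generic value relative to it. The formula below is inferred from the numbers, not taken from the benchmark scripts, and it assumes the values are timings where lower is better.

```python
# Values copied from the table above: (__memcpy_falkor, __memcpy_generic).
timings = {
    4096: (34820.00, 36905.10),
    8192: (33637.30, 35860.00),
    16384: (35167.80, 37527.10),
    32768: (34477.50, 36517.70),
    65536: (35028.00, 37272.50),
}


def relative_diff(base, other):
    """Percentage difference of `other` relative to `base` (inferred formula)."""
    return (base - other) * 100.0 / base


for max_size, (falkor, generic) in timings.items():
    # A negative value means the generic variant took longer than the base.
    print(f"max-size={max_size}: {relative_diff(falkor, generic):.2f}%")
```

With this reading, a negative percentage means the generic variant is slower than the Falkor one by that fraction of the Falkor time.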
> If prefetching in larger copies doesn't help Falkor in the large copy benchmark, then
> what's the reasoning behind this patch?
It appears to help internal workloads that Qualcomm are evaluating, which
benefit from large copies being interleave-prefetched. Now that it is
clear other factors are aiding performance in spec2006, I'll have to
revisit their results and see if there's anything wrong there. I need
to do more work on this.
Thanks,
Siddhesh