This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] aarch64: Optimized memcpy for Qualcomm Falkor processor

From: Siddhesh Poyarekar <siddhesh at gotplt dot org>
To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
Cc: nd <nd at arm dot com>
Date: Fri, 23 Jun 2017 19:48:49 +0530
Subject: Re: [PATCH] aarch64: Optimized memcpy for Qualcomm Falkor processor
Authentication-results: sourceware.org; auth=none
References: <AM5PR0802MB2610E536F00CB8A45FF3A78C83D80@AM5PR0802MB2610.eurprd08.prod.outlook.com>

On Friday 23 June 2017 06:19 PM, Wilco Dijkstra wrote:
> Those are odd results. Omnetpp doesn't use memcpy and xalancbmk profile has
> memcpy at ~2%, so getting a 6% improvement couldn't be related to memcpy!

Ah, you're right, I did not notice that.  I ran the tests using the
glibc.tune.cpu tunable so that no build-related changes creep in, because
different versions of gcc also appear to generate different sets of memcpy
calls through the tree-loop-distribution pass.  I attributed the
performance difference to the possibility of additional generated memcpy
calls; I should have checked.
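
For anyone wanting to reproduce that kind of run, here is a minimal Python sketch of switching the memcpy selection through the tunable while keeping the binary identical.  GLIBC_TUNABLES is the environment variable glibc reads tunables from; the value names "falkor" and "generic" and the benchmark path in the usage comment are assumptions for illustration, not details taken from the runs above.

```python
import os
import subprocess


def env_with_cpu_tunable(cpu_name, base_env=None):
    """Return a copy of the environment with glibc.tune.cpu set to cpu_name."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: no other tunables are already set.  GLIBC_TUNABLES takes a
    # colon-separated list, so an existing value would need to be merged.
    env["GLIBC_TUNABLES"] = "glibc.tune.cpu=" + cpu_name
    return env


def run_with_cpu(command, cpu_name):
    """Run the same command under the given CPU selection; return its stdout."""
    result = subprocess.run(
        command,
        env=env_with_cpu_tunable(cpu_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage (the benchmark path is a placeholder):
#   for cpu in ("falkor", "generic"):
#       print(cpu, run_with_cpu(["./bench-memcpy-random"], cpu))
```

Since the executable does not change between runs, any difference should come from the memcpy variant picked at startup rather than from compiler-generated calls.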

> Similarly the random memcpy benchmark only does a small number of copies
> larger than 96 bytes (where your new code is used), so I find it hard to believe it
> could make a difference. On Cortex-A57 I get identical performance for this patch vs
> the generic version (btw __memcpy_thunderx is very close, __memcpy_thunderx2
> is 18% slower).

Here's what I get on Falkor:

Function: memcpy
Variant: random
                 __memcpy_falkor  __memcpy_generic
================================================================================
max-size=4096:          34820.00  36905.10 (-5.99%)
max-size=8192:          33637.30  35860.00 (-6.61%)
max-size=16384:         35167.80  37527.10 (-6.71%)
max-size=32768:         34477.50  36517.70 (-5.92%)
max-size=65536:         35028.00  37272.50 (-6.41%)

So it is a definite overall improvement.  However, that still does not
justify the double prefetch, which seems more likely to have had a
negative effect on the numbers.
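
The percentages in the table are consistent with taking the __memcpy_falkor column as the baseline and expressing the __memcpy_generic value relative to it.  The short Python sketch below recomputes them from the figures above; it assumes the figures are timings where lower is better, which the output itself does not state, and the helper name is only illustrative.

```python
# Values copied from the table above:
# (max-size, __memcpy_falkor, __memcpy_generic)
RESULTS = [
    (4096, 34820.00, 36905.10),
    (8192, 33637.30, 35860.00),
    (16384, 35167.80, 37527.10),
    (32768, 34477.50, 36517.70),
    (65536, 35028.00, 37272.50),
]


def relative_diff_percent(baseline, other):
    """Difference of other against baseline, as a percentage of baseline."""
    return (baseline - other) / baseline * 100.0


for max_size, falkor, generic in RESULTS:
    diff = relative_diff_percent(falkor, generic)
    print("max-size=%d: %.2f%%" % (max_size, diff))
```

A negative value here means the generic routine's figure is larger than the Falkor one by that fraction of the Falkor figure.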

> If prefetching in larger copies doesn't help Falkor in the large copy benchmark, then
> what's the reasoning behind this patch?

It appears to help internal workloads that Qualcomm is evaluating, which
benefit from large copies being interleave-prefetched.  Now that it is
clear other factors are aiding performance in SPEC2006, I'll have to
revisit their results and see if there's anything wrong there.  I need
to do more work on this.

Thanks,
Siddhesh

References:
- Re: [PATCH] aarch64: Optimized memcpy for Qualcomm Falkor processor
  - From: Wilco Dijkstra
