[PATCH][AArch64] Tune memcpy
Fri Nov 6 15:42:00 GMT 2015
Andrew Pinski wrote:
> On Fri, Nov 6, 2015 at 10:34 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > firstname.lastname@example.org wrote:
> >> > def_fn memcpy p2align=6
> >> > + prfm PLDL1KEEP, [src]
> >> Why keep rather than strm for the prefetches?
> > It improves small copies by prefetching the input immediately, so
> > setting it to streaming would have an adverse effect as it claims the
> > line will not be used again. For huge copies the initial prefetch has no
> This might be true on ARM's cores but not on all AARCH64 cores.
> For ThunderX, we don't have a hardware prefetcher, and STRM does
> immediately start fetching that cache line, and it marks the cache line
> in L2 as not going to be used any time afterwards.
> So it could improve the performance there over the keep one.
I don't see how it could improve performance of small copies where it is highly likely that the data is reused. The streaming variants are only meant for really large data streams, but then using LDNP is likely better than software prefetch on CPUs with a hardware prefetcher.
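To make the keep-versus-streaming distinction concrete, here is a minimal C sketch that uses the GCC/Clang `__builtin_prefetch` builtin rather than the hand-written assembly in the patch. The helper names `copy_reused` and `copy_streamed` are hypothetical and are not part of newlib. The comments assume that GCC's AArch64 backend lowers locality 3 to PLDL1KEEP and locality 0 to PLDL1STRM; that is worth confirming in the generated assembly for your compiler version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: small copy whose source is likely to be reused.
   Locality 3 asks for the line to be kept in all cache levels
   (assumed to become "prfm PLDL1KEEP" on AArch64). */
static void *copy_reused(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    __builtin_prefetch(src, 0 /* read */, 3 /* high temporal locality */);
    return memcpy(dst, src, n);
}

/* Hypothetical helper: large one-pass copy whose source is not expected
   to be touched again. Locality 0 marks the data as streaming
   (assumed to become "prfm PLDL1STRM" on AArch64). */
static void *copy_streamed(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    __builtin_prefetch(src, 0 /* read */, 0 /* no temporal locality */);
    return memcpy(dst, src, n);
}
```

Both helpers copy identical bytes; only the cache hint differs. Whether either hint helps depends on the core, which is the point under discussion, so this is only an illustration of the two hints and not a tuning recommendation.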