This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor


Hi Derek,

>> Well, these results show a very significant 4% win for Falkor memcpy! It seems strange to only optimize
>> for large sizes when the vast majority of copies in real code are very small (note the distribution of
>> sizes and alignments for the random benchmark comes from SPEC).
>
> Sure, we agree the Falkor memcpy has a 4% win on small sizes. However, when we started optimizing
> memcpy for Kunpeng, one of the most important cases was the database case, which really needs more
> improvement on large sizes.

Being fast on small copies is not mutually exclusive with being fast on large copies - there are different
code paths for these cases. I'd find it hard to believe databases do huge memcopies; that would be stupid!
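
To illustrate the point about separate code paths, here is a minimal sketch in C. It is not the Falkor or
Kunpeng implementation: the name sketch_copy, the SMALL_COPY_MAX threshold, and the 64-byte block size are
all hypothetical, chosen only to show that the small-size branch and the large-size loop do not share code,
so tuning one need not slow the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold; real implementations tune this per core. */
#define SMALL_COPY_MAX 16

/* Reported only so the example can show which path ran. */
enum copy_path { PATH_SMALL, PATH_LARGE };

/* Copy n bytes between non-overlapping buffers and report the path taken. */
static enum copy_path sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(d != NULL && s != NULL);

    if (n <= SMALL_COPY_MAX) {
        /* Small path: a short byte loop with no setup cost. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
        return PATH_SMALL;
    }

    /* Large path: copy fixed 64-byte blocks, then the remaining tail. */
    size_t i = 0;
    for (; i + 64 <= n; i += 64)
        memcpy(d + i, s + i, 64);
    memcpy(d + i, s + i, n - i);
    return PATH_LARGE;
}
```

A call with n = 8 returns PATH_SMALL and never reaches the block loop, while a call with n = 200 returns
PATH_LARGE and never runs the byte loop.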

> What confuses us now is that we removed the dst_unaligned code in memcpy following the previous comments,
> which did not affect performance in our memcpy tests. But when memmove is called and takes the memcpy
> path, unaligned cases are significantly slower than aligned cases, according to the results of the first
> half of memmove-walk shown at the bottom. So do you think we should still remove the dst_unaligned code?

Well, it seems to me the issue is related to prefetching/caching. memcpy-walk walks both src and dst backwards;
memmove-walk is pretty much identical for the non-overlap case, but it does a forward walk on dst and a backward
one on src. There shouldn't be any performance difference between the two cases. On most microarchitectures I see
no difference between these walks: memmove-walk and memcpy-walk basically have identical performance both for
aligned and unaligned cases (alignment doesn't matter for large copies at all).
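
The two access patterns described above can be sketched roughly as follows. This is not the glibc benchtests
source: the function names walk_both_backward and walk_opposite and their parameters are hypothetical, and
the sketch assumes two separate (non-overlapping) buffers of buf_size bytes each, copied in chunks of len bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* memcpy-walk style pattern: src and dst both step backwards through their buffers. */
static size_t walk_both_backward(unsigned char *dst_buf, const unsigned char *src_buf,
                                 size_t buf_size, size_t len)
{
    size_t copies = 0;

    assert(len > 0);
    for (size_t off = buf_size; off >= len; off -= len) {
        memcpy(dst_buf + off - len, src_buf + off - len, len);
        copies++;
    }
    return copies;
}

/* memmove-walk style pattern (non-overlap case): dst steps forwards, src steps backwards. */
static size_t walk_opposite(unsigned char *dst_buf, const unsigned char *src_buf,
                            size_t buf_size, size_t len)
{
    size_t copies = 0;
    size_t d = 0;

    assert(len > 0);
    for (size_t s = buf_size; s >= len; s -= len, d += len) {
        memmove(dst_buf + d, src_buf + s - len, len);
        copies++;
    }
    return copies;
}
```

Both loops issue the same number of copies of the same length; only the direction in which the dst address
moves differs, which is why a gap between the two points at prefetching or caching behaviour rather than at
the copy routine itself.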

> Our analysis is that the extra checks at the beginning of memmove may weaken the processor's ability to
> handle this case, and so dst_unaligned makes a difference.

The extra instructions at the start couldn't possibly make a difference apart from slowing down the small memmoves.

Wilco
