[PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction

Ondřej Bílka neleai@seznam.cz
Fri Jun 7 08:48:00 GMT 2013


On Thu, Jun 06, 2013 at 08:11:15PM +0800, Ling Ma wrote:
> (To keep the mail thread consistent, I am sending this again from this email address.)
> Hi Ondra,
> 
> Thanks for your correction!
> Until today I have always used test-memcpy.c from glibc to check and
> compare performance. Based on it we found the best result and sent
> out our patch. Should we discard it now?
> Soon I will test those functions with your profile and other release versions.
> If I am wrong, please correct me.
> 
> Thanks
> Ling
>
Yes, it is as you wrote.

I had some afterthoughts on how to improve memcpy/memset.

The first is to copy in the backward direction. It may be more cache
friendly: recently constructed data has its end in L1 cache, and after a
backward copy we finish with the start in L1 cache, which is more likely
to be accessed next.
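
A minimal sketch of the backward-direction idea, in portable C rather than
the assembly a real implementation would use. The name copy_backward and the
8-byte step are assumptions made for illustration only, and the regions are
assumed not to overlap:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the patch): copy n bytes from src to dst,
   walking from the highest address down to the lowest, so the bytes touched
   last are the ones at the start of the buffers. */
static void *copy_backward(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = (unsigned char *)dst + n;
    const unsigned char *s = (const unsigned char *)src + n;

    /* Main loop: 8 bytes per step, highest addresses first. */
    while (n >= 8) {
        uint64_t w;
        d -= 8;
        s -= 8;
        n -= 8;
        memcpy(&w, s, 8);   /* fixed-size copy, safe for unaligned source */
        memcpy(d, &w, 8);   /* fixed-size copy, safe for unaligned destination */
    }

    /* Tail: the remaining low bytes, still in descending order. */
    while (n > 0) {
        --d;
        --s;
        --n;
        *d = *s;
    }
    return dst;
}
```

Whether this ordering actually helps depends on what the caller touches next,
so it would have to be measured rather than assumed.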

The second is to look at how effective prefetches are on Haswell.
I did not add prefetching because I cannot do that generically. For a
single architecture I could determine whether it helps and the size from
which it helps, but the results were too chaotic to generalize.
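
To make the two tunables concrete, here is a rough sketch of a copy loop
that prefetches only above a size threshold. It assumes GCC or Clang for
__builtin_prefetch. PREFETCH_MIN_SIZE, PREFETCH_DISTANCE and the function
name are hypothetical placeholders, not measured values; they are exactly
what would need to be determined per architecture:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tunables: both would have to be found by benchmarking
   on each target CPU. The numbers here are placeholders. */
#define PREFETCH_MIN_SIZE 4096   /* size from which prefetching is enabled */
#define PREFETCH_DISTANCE 256    /* how far ahead of the load to prefetch  */

/* Hypothetical helper: forward copy in 64-byte blocks, with software
   prefetch of the source only for large sizes. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    int use_prefetch = n >= PREFETCH_MIN_SIZE;
    size_t i = 0;

    for (; i + 64 <= n; i += 64) {
        /* Read prefetch with low temporal locality, kept inside the buffer. */
        if (use_prefetch && i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(s + i + PREFETCH_DISTANCE, 0, 0);
        memcpy(d + i, s + i, 64);   /* fixed size, typically inlined */
    }
    if (i < n)
        memcpy(d + i, s + i, n - i);   /* tail shorter than one block */
    return dst;
}
```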

You might also try to improve strlen with avx2.
I tried only a simple variant that did everything with avx2, and it
turned out to be asymptotically better but with worse overhead due to
the higher avx2 latency.

I guess an sse header with an avx2 loop should be better. I use a similar benchmark at

kam.mff.cuni.cz/~ondra/strlen_profile.tar.bz2
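
The sse header plus avx2 loop idea could be sketched roughly as follows.
This is an illustration only, not tested for performance and not the code
from the benchmark above. It assumes x86-64 with GCC or Clang; the function
name is hypothetical, and the caller is assumed to have checked that the CPU
supports avx2. Like typical libc implementations, it uses aligned loads that
may read bytes outside the string but never cross into another page:

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: SSE2 handles the first bytes (cheap startup for
   short strings), AVX2 handles the 32-byte aligned main loop. */
__attribute__((target("avx2")))
static size_t strlen_sse_avx2(const char *s)
{
    assert(s != NULL);

    const __m128i zero16 = _mm_setzero_si128();

    /* Header, step 1: aligned 16-byte block that contains s. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero16));
    mask >>= (unsigned)(s - p);          /* ignore bytes before s */
    if (mask)
        return (size_t)__builtin_ctz(mask);
    p += 16;

    /* Header, step 2: one more 16-byte block if p is not 32-byte aligned. */
    if ((uintptr_t)p & 31) {
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero16));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }

    /* Main loop: 32-byte aligned AVX2 loads. */
    const __m256i zero32 = _mm256_setzero_si256();
    for (;;) {
        unsigned m = (unsigned)_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(_mm256_load_si256((const __m256i *)p), zero32));
        if (m)
            return (size_t)(p - s) + (size_t)__builtin_ctz(m);
        p += 32;
    }
}
```

A real version would likely unroll the avx2 loop and would need to be
compared against the existing sse2 strlen on short inputs, where the header
cost dominates.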

Ondra


