This is the mail archive of the mailing list for the glibc project.

RE: New x86-64 memcpy

Hi, Jakub. 

> > They had 8 bytes each in order to allow direct comparisons with the count
> > in a register without having to load the value.  Even if in memcpy they
> > can be used as 4-byte variables, I have other routines that would benefit
> > from them being 8 bytes long.
> In the last round of routines you sent I haven't seen that, but sure, if
> some var has justification for being 64-bit, so be it.  The important
> thing is just (%rip) addressing.

Got it.  It actually made the patch much leaner, as it doesn't touch on
RTLD stuff anymore.

> > I guess that using the red zone is better.  As the routine has several
> > exit points to improve performance, after each one new CFI directives
> > would have to be added, which complicates maintaining the code.
> Even with red zone you need some CFI directives (which say where
> %r12/%r13/%r14 have been saved or cfi_restore for them), but don't need
> any CFA adjustments.

I chose to use the red zone, with the CFI directives.

> > I'll double-check that RDI has the expected value always.  Otherwise, I'll
> > just use an entry in the red zone.
> I believe so.  L(1{,a,b,c,d,loop}) always increment %rdi by the size they
> stored into (%rdi).  All other ret's are preceded by jnz L(1), which relies
> on %rdi pointing after the last byte stored.

Indeed.  The tail code is a tad harder to read, though.
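
To make the %rdi invariant above easier to follow, here is a rough C sketch
of the idea; it is not the assembly in the patch.  The names sketch_mempcpy
and sketch_memcpy and the 8-byte chunk size are made up for illustration.
Every store advances the destination pointer by exactly the number of bytes
written, so whichever path exits, the pointer ends up just past the last
byte stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only, not the glibc routine.  Each store is
   followed by advancing dst by the size just written, mirroring how the
   assembly keeps %rdi pointing after the last byte stored. */
void *sketch_mempcpy(void *dst_, const void *src_, size_t n)
{
    unsigned char *dst = dst_;
    const unsigned char *src = src_;
    assert(dst != NULL && src != NULL);

    /* Main loop: copy 8 bytes at a time, then advance by 8. */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, src, 8);   /* unaligned-safe 8-byte load */
        memcpy(dst, &w, 8);   /* unaligned-safe 8-byte store */
        src += 8;
        dst += 8;
        n -= 8;
    }

    /* Tail: copy the remaining 0..7 bytes, advancing by 1 each time. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }

    /* dst now points one past the last byte stored. */
    return dst;
}

/* The memcpy flavour returns the original destination instead. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    sketch_mempcpy(dst, src, n);
    return dst;
}
```

With the pointer maintained this way, a mempcpy-style return value needs no
extra bookkeeping at the exit points.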

Again, in addition to the source-code patches, I also attached the
resulting data obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a
3GHz Core2 with DDR2-533.  The file memcpy-opteron-old.txt has the
original output of string/test-memcpy on the Athlon64 system and the
file memcpy-opteron-new.txt the output using the new routine.  The files
memcpy-core2-old.txt and memcpy-core2-new.txt contain the same results
but on the Core2 system.  

I also plotted the performance of the new routine relative to the old
one (where a ratio of 1 stands for performance parity and >1 for
performance improvement) in movs-opteron-new-movs-opteron-old.png for
the Athlon64 system and in movs-core2-new-movs-core2-old.png for the
Core2 system.  
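
For clarity, the ratio in the plots can be thought of as in the small C
sketch below.  It assumes that the test output gives a time per test where
lower is better, so the ratio is the old time divided by the new time; the
function name speedup_ratio and its parameters are only illustrative and
are not part of the attached files.

```c
#include <assert.h>

/* Hypothetical helper: old_time and new_time are timings of the same test
   in the same unit (lower is better).  A result of 1 means parity and a
   result greater than 1 means the new routine is faster. */
double speedup_ratio(double old_time, double new_time)
{
    assert(old_time > 0.0 && new_time > 0.0);
    return old_time / new_time;
}
```

For example, a test that took 200 units with the old routine and 100 with
the new one would plot at a ratio of 2.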

2007-05-04 Evandro Menezes <>

	* sysdeps/x86_64/memcpy.S: New code to handle more block sizes.
	* sysdeps/x86_64/mempcpy.S: Modified macro definition.
	* sysdeps/unix/sysv/linux/x86_64/sysconf.c: Moved code to detect
	cache sizes...
	* sysdeps/x86_64/cacheinfo.c: ... here.
	* sysdeps/x86_64/Makefile: Added cacheinfo.c.
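
As a user-level illustration of what cache-size detection provides, the C
sketch below queries the L1 data cache size through sysconf.  It assumes a
C library that defines the _SC_LEVEL1_DCACHE_SIZE name, as glibc does; the
helper names and the 32 KiB fallback are hypothetical and are not taken
from the patch.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: use the reported size when it is positive,
   otherwise fall back to a caller-supplied default. */
long cache_size_or_default(long reported, long fallback)
{
    assert(fallback > 0);
    return reported > 0 ? reported : fallback;
}

/* Query the L1 data cache size in bytes.  sysconf may return 0 or -1 when
   the size is unknown, and the name may be missing on other C libraries,
   so an assumed default of 32 KiB is used in those cases. */
long l1_data_cache_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_SIZE
    long reported = sysconf(_SC_LEVEL1_DCACHE_SIZE);
#else
    long reported = -1;
#endif
    return cache_size_or_default(reported, 32 * 1024);
}
```

A copy routine could compare the byte count against such a value to decide
when a block no longer fits in the cache.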

Could you please review it?


Evandro Menezes               AMD            Austin, TX

Attachment: movs-core2-new-movs-core2-old-ratio.png

Attachment: movs-opteronf-new-movs-opteronf-old-ratio.png

Attachment: memcpy.diff.bz2
