This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: GLIBC - I would like to contribute some platform optimizations
Hello List,
Hello Mike,
Please find attached the source "libfastmemcpy.c".
Its a simple memcpy which achieves 60-80% more throughput than
glibc-memcpy on many CPUs.
The memcpy uses MMX and I've tested it on the following CPUs
AMD Duron, AMD Athlon 64, Intel XEON, AMD Opteron.
To archive optimal memcpy performamce a few things are done:
a) Source data should be prefetched to avoid memory latency bubbles.
The below routine will use "prefetchnta" instruction for this.
b) To improve write speed its adviseable to align the
destination properly. The destination will be aligned
to 32 bit boundary first and for bigger copies, the destrination
will be aligned to 64 byte (AMD cache line) boundary.
c) Using a non cache poluting copy will save the 2nd level
cache for other usage. While in rare cases this might be a small
disadvantage, in general this will be a big overall speed
improvement. The below routine uses "movntq" to avoid cache polution
In addition to the source I have attached the source of a small memcpy
benchmark routine, called "memcpy_bench.c"
The memcpy_bench will compare different memcpy routines and measure
their performance. The shown memory throughput will be printed as
MEMORY-BUS throughput in MB/sec. As a memcpy of 50 MB is reading of 50
MB + writing of 50 MB it will be shown as Bus speed of 100 MB.
These different memcpy routines that will be compared:
a) glibc-memcpy
b) bmove512 memcpy routine used by the MySQL-server to copy
bigger blocks as data
c) simple loop copying the data 8bit wise
d) simple loop copying the data 32bit wise
e) simple loop copying the data 64bit wise (using float commands)
This one is also known as STREAM-copy
f) The fast memcpy
The test will copy blocks of different sizes from 16 MB to 16 Byte.
Each test will always copy 16 MB.
E.G. the 1 MB copy will run 16 times to copy 16 MB.
With each run the copy will be moved to another block of 1 MB.
To the 16 iterations of 1 MB will in fact copy 16 MB of different data.
The purpose of this is to really measure the memory-bus performance as
we shift the copy window during the test and the data cache will have
the same (little) influence on each test independent of the size of the
copied block.
The test will be repeated on different aligned data to show the effects
of the alignment on the copy speed. On some CPUs some routines e.g the
64 bit wise float will run slow on misalignement data.
For full results please compile and run the benchmark yourself.
- The test shows that glibc-memcpy achieves
for medium sized and big blocks : 1405 MB/sec
and for small blocks : 650 MB/sec
- The fast memcpy does achieve
for medium sized and big blocks : 2500 MB/sec + 77 %
and for small blocks : 1308 MB/sec + 100 %
I claim no copyright for any of the sources.
Please fell free to use it for whatever you want.
Would be more than silly to claim copyright for such simple code anyway)
I hope that this source is a help for you.
Please tell me if you need anything else.
Please reply to my email address as well as I'm not in this mailing list.
Cheers
Gunnar
Dump of parts of membench output (Please excuse bad email formatting)
For a more complete test with more tests on various aligments please run
the memcpy_bench
Good aligned:
----------------------------------------------------------------------------------------------------------------
Alignment 0
---------------------------------------------------------------------------------------------------------------
16MB 4MB 1MB 256KB 64KB 16KB 4KB 1KB
512B 256B 128B 64B 32B 16B
----------------------------------------------------------------------------------------------------------------
glibc memcpy 1405 1404 1406 1393 1403 1408 1409 1384
1377 1380 1375 1314 903 649
bmove512 1415 1411 1410 1410 1410 1410 1397 1408 1409
copy 8 1133 1130 1142 1142 1138 1137 1132 1130
1126 1118 1113 1109 1014 907
copy 32 1416 1430 1411 1417 1434 1432 1427 1428
1426 1426 1424 1380 1348 1415
copy 64f 1464 1457 1465 1453 1440 1448 1463 1458
1455 1454 1445 1442 1393 1420
memcpy_mmx 2500 2497 2488 2500 2499 2495 2483 2491
2461 2453 1406 1334 1336 1308
Aligned on odd adress
----------------------------------------------------------------------------------------------------------------
Alignment 1
----------------------------------------------------------------------------------------------------------------
16MB 4MB 1MB 256KB 64KB 16KB 4KB 1KB
512B 256B 128B 64B 32B 16B
----------------------------------------------------------------------------------------------------------------
glibc memcpy 1464 1461 1462 1458 1449 1457 1455 1444
1437 1385 1386 1299 1119 682
bmove512 1411 1417 1411 1414 1414 1394 1409 1409 1406
copy 8 1173 1172 1174 1175 1170 1173 1157 1150
1143 1143 1122 1066 971 863
copy 32 1414 1428 1429 1424 1433 1406 1421 1422
1417 1409 1388 1382 1399 1378
copy 64f 1462 1443 1458 1458 1453 1454 1443 1434
1431 1418 1427 1425 1421 1403
memcpy_mmx 2506 2518 2506 2518 2510 2459 2338 1998
1646 1459 1447 1420 1381 1264
Mike Frysinger wrote:
On Sunday 02 September 2007, Gunnar von Boehn wrote:
I would like to contribute some performance optimizations to the GLIBC.
I did some work on memory functions as e.g memcpy.
My results are very promising and with small changes e.g. memcpy could
get up to 50% more throughput on many platforms for transfer > 1KB.
(Tested on AMD/K7/K8 and PowerPC 603/750/7447/970)
Can you please tell me the procedures to be able to contribute the
optimizations to you?
(1) post the actual changes to the mailing list
(2) post data supporting your improvement claims and the methodology for
collecting said data
(3) significant changes require FSF copyright assignment
-mike
//
// This code was written by Gunnar von Boehn <gunnar@greyhound-data.com>
// No copyright is claimed for the source.
// Uncontitional free usage of all included code is granted
// Please feel free to use this source in your projects (commercial/GPL/whatever).
//
// The below code can be used to compile an alternative memcpy routine
// This alternative routine is up to 60% faster than the normal glibc routine
// You can use it with LD_PRELOAD to speed up existing applications
//
// Compile with:
// gcc -O2 -fPIC -c libfastmemcpy.c -o libfastmemcpy.o; gcc -shared -o libfastmemcpy.so libfastmemcpy.o -ldl
//
// Compare of standard glibc memcpy vs the below fast memcpy
// Compare was done on Athlon64 (Tests on Duron / Penti4 / Xeon / Opteron did show similar results)
//
// The Test shows speed of various copies ranging from 16 MegaByte to 16 Byte - The result are in MB/sec
// ----------------------------------------------------------------------------------------------------------------
// copied block 16MB 4MB 1MB 256KB 64KB 16KB 4KB 1KB 512B 256B 128B 64B 32B 16B
// ----------------------------------------------------------------------------------------------------------------
// glibc memcpy 1452 1447 1429 1444 1445 1444 1443 1436 1431 1427 1416 1360 911 657
// fast memcpy 2525 2531 2527 2535 2534 2529 2526 2523 2491 2492 1427 1362 1380 1317
//
// For more test results please see the "memcpy_bench" application by Gunnar von Boehn
//
#include <stdio.h>
#define uint8 unsigned char
#define uint16 unsigned short
#define uint32 unsigned int
// Fast memcpy for x86 CPUs with MMX
//
// To archive optimal memcpy performamce several things are to consider
//
// a) Source data should be prefetched to avoid memory latency bubbles.
// The below routine will use "prefetchnta" instruction for this.
// b) To improve write speed its adviseable to align the destination properly.
// The destination will be aligned to 32 bit boundary first and for bigger
// copies the destrination will be aligned to 64 byte (cache line) boundary
// c) Using a non cache poluting copy will save the 2nd level cache for other usage
// While in rare cases this might be a small disadvantage,
// in general this will be a big overall speed improvement.
// The below routine uses "movntq" to avoid cache polution
//
// The copy routine will first align the destiantion to 16 bit
// Then the copy routine will align the destination to 32 bit
// For copy over a certain size (>=256 byte) we will align the
// destination to a whola cache line of 64 byte and then use
// mmx copy commands and prefetching to optimally copy the memory.
// Of some CPUs is of advantage to use to use the MMX copy block
// even on smaller block and start with sizes of >= 128 byte.
// After the fast mmx copy block we will copy the remaining part
// using 32bit copy and 8bit copies
//
void *memcpy(void *dst, const void *src, size_t size){
uint32 i;
if(size<4) goto memcpy_less4; // Tiny copy? No need to align
if( (uint32)dst & 1) { // align destination to 16 bit
*((uint8*)dst++) = *((uint8*)src++); //
size--; //
} //
if ((uint32)dst & 2) { // align destitnation to 32 bit
*((uint16*)dst) = *((uint16*)src); //
src+=2; //
dst+=2; //
size -= 2; //
} //
if(size>=256){ // use cache line, prefetching routine for sizes >=256
__asm__ __volatile__ ( // prefetch src some cache lines ahead
"prefetchnta 64(%0) \n" //
"prefetchnta 96(%0) \n" //
"prefetchnta 128(%0) \n" //
"prefetchnta 160(%0) \n" //
"prefetchnta 196(%0) \n" //
"prefetchnta 228(%0) \n" //
: : "r" (src));
while( (uint32)dst & 63) { // align dest to 512 bit (For 64 byte cache line)
*((uint32*)dst) = *((uint32*)src); //
src+=4; //
dst+=4; //
size -= 4; //
} //
for (i=size/(16*sizeof(uint32));i;i--) { // now we are well aligned and can copy data 64 byte (cache line) wise
__asm__ __volatile__ ( //
"prefetchnta 256(%0)\n" // prefetch cache line 256 bytes ahead
"prefetchnta 288(%0)\n" //
"\tmovq (%0), %%mm0\n" // loading 64 bytes into MMX
"\tmovq 8(%0), %%mm1\n" //
"\tmovq 16(%0), %%mm2\n" //
"\tmovq 24(%0), %%mm3\n" //
"\tmovq 32(%0), %%mm4\n" //
"\tmovq 40(%0), %%mm5\n" //
"\tmovq 48(%0), %%mm6\n" //
"\tmovq 56(%0), %%mm7\n" //
//
"\tmovntq %%mm0, (%1)\n" // storing 64 bytes
"\tmovntq %%mm1, 8(%1)\n" // we use non cache trashing stores
"\tmovntq %%mm2, 16(%1)\n" // this will maintain our data cache content
"\tmovntq %%mm3, 24(%1)\n" //
"\tmovntq %%mm4, 32(%1)\n" //
"\tmovntq %%mm5, 40(%1)\n" //
"\tmovntq %%mm6, 48(%1)\n" //
"\tmovntq %%mm7, 56(%1)\n" //
: : "r" (src), "r" (dst) : "%mm0","%mm1","%mm2","%mm3","%mm4","%mm5","%mm6","%mm7");
src+=64; // 8 x 64 bit words
dst+=64; //
} //
__asm__ __volatile__ ( "emms"); // switch back from MMX
size &= 16*sizeof(uint32)-1; //
}
for (i=size/sizeof(uint32);i;i--) { // copy all remaining 32 bit words
*((uint32*)dst) = *((uint32*)src); //
src+=4; //
dst+=4; //
} //
size &= sizeof(uint32)-1; //
memcpy_less4:
while (size--) { // copy all remaining 8 bit words (max 3)
*((uint8*)dst++) = *((uint8*)src++); //
} //
return dst;
}
//-----------------------------------------------------------------------
//
// Program: memcpy Benchmark
//
// Version 0.3
// Last Change 2007-09-01
// Writen by Gunnar von Boehn <gunnar@greyhound-data.com>
// NO copyright is claimed for this - please use the source for what you want
//
// compile with : gcc -O3 memcpy_bench.c -o memcpy_bench
//
// The purpose of this tool is to measure the memcpy performance of different algorythm
// The program is a subset of the membench program which measures read/write and copy performance
//
// Multiple routines will be used for each of these tests to
// outline the performance differences when accessing 8 BIT, 32 BIT, 64 BIT wise.
// Additional routines that work on cahce line sizes and use cache prevetching will be tested.
//
// The archived troughput will be compared to the GLIBC functions
// The memory throughput will be show as BUS throughput in MB/S.
// E.G reading of 100 MB in one second will be shown as 100 MB/S
// E.G writing of 100 MB in one second will be shown as 100 MB/S
// E.G copying of 50 MB in one second will be shown as 100 MB/S,
// As a memcpy of 50 MB is in fact reading of 50 MB + writing of 50 MB.
// the test will print this out as BUS speed of 100 MB.
//
// The BUS is the physical limit of a CPU. The goal of an optimal
// software routine is to come as close as possible to this limit.
//
// The test results will typically show that the standard C function for copying memory
// does NOT perform optimally on most systems, when copying larger block of memory.
// On many systems the MEMORY througput could be improved by 50 or 100%
// by using copy routines optimized for CPU cache prefetching.
//
// This tool was inspired by the original "stream" by John D. McCalpin
// The original stream code is often used as reference for memory throughput.
//
//-----------------------------------------------------------------------
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <sys/time.h>
# include <sys/types.h>
#include <signal.h>
#include <setjmp.h>
static sigjmp_buf jmpbuf;
static volatile sig_atomic_t canjump = 0;
static void sigill_handler (int sig) {
if (!canjump) {
signal (sig, SIG_DFL);
raise (sig);
}
canjump = 0;
siglongjmp (jmpbuf, 1);
}
int hasaltivec=0;
# define N 1024*1024*2
# define LOOPS 1
# define NTIMES 3
#define memsteps 21
#define KB (1024)
#define MB (KB*KB)
#define PAGE_SIZE (16*KB)
#define TEST_SIZE (16*MB + 2*PAGE_SIZE)
void * memcpy_asmFC64(void *dst, const void *src, size_t len);
void * moto_memcpy(void *dst, const void *src, size_t len);
void * moto_memcmp(const void *dst, const void *src, size_t len);
void * moto_memset(void *dst, const int c, size_t len);
volatile u_long *ptra;
volatile u_long *ptrb;
double times[30][30][NTIMES];
# define HLINE "----------------------------------------------------------------------------------------------------------------\n"
# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif
static double avgtime[40] = {0},
maxtime[40] = {0},
mintime[40] = {0};
static char *label_copy[25] = {
"glibc memcpy ",
"bmove512 ",
"copy 8 ",
"copy 32 ",
"copy 64f ",
"copy 64fx2 ",
"copy 64fx4 ",
"copy 32x2 ",
"copy 32x4 ",
"copy 32x8 ",
#if defined(__i386__) || defined(__x86_64__)
"memcpy_mmx ",
"memcpy_mmx2 ",
#else
"copy 32 P2 ",
#endif
"copy 32 P3 ",
};
extern double mysecond();
/* bmove 521
* Simple copy routine used by MYSQL to copy block of N x 512 bytes
*
*/
void bmove512(unsigned long *to,unsigned long *from, unsigned int length)
{
register unsigned long *f,*t,*end;
end = (long*) ((char*) from+length);
f= (unsigned long*) from;
t= (unsigned long*) to;
#if defined(m88k) || defined(sparc) || defined(HAVE_LONG_LONG)
do {
t[0]=f[0]; t[1]=f[1]; t[2]=f[2]; t[3]=f[3];
t[4]=f[4]; t[5]=f[5]; t[6]=f[6]; t[7]=f[7];
t[8]=f[8]; t[9]=f[9]; t[10]=f[10]; t[11]=f[11];
t[12]=f[12]; t[13]=f[13]; t[14]=f[14]; t[15]=f[15];
t[16]=f[16]; t[17]=f[17]; t[18]=f[18]; t[19]=f[19];
t[20]=f[20]; t[21]=f[21]; t[22]=f[22]; t[23]=f[23];
t[24]=f[24]; t[25]=f[25]; t[26]=f[26]; t[27]=f[27];
t[28]=f[28]; t[29]=f[29]; t[30]=f[30]; t[31]=f[31];
t[32]=f[32]; t[33]=f[33]; t[34]=f[34]; t[35]=f[35];
t[36]=f[36]; t[37]=f[37]; t[38]=f[38]; t[39]=f[39];
t[40]=f[40]; t[41]=f[41]; t[42]=f[42]; t[43]=f[43];
t[44]=f[44]; t[45]=f[45]; t[46]=f[46]; t[47]=f[47];
t[48]=f[48]; t[49]=f[49]; t[50]=f[50]; t[51]=f[51];
t[52]=f[52]; t[53]=f[53]; t[54]=f[54]; t[55]=f[55];
t[56]=f[56]; t[57]=f[57]; t[58]=f[58]; t[59]=f[59];
t[60]=f[60]; t[61]=f[61]; t[62]=f[62]; t[63]=f[63];
#ifdef HAVE_LONG_LONG
t+=64; f+=64;
#else
t[64]=f[64]; t[65]=f[65]; t[66]=f[66]; t[67]=f[67];
t[68]=f[68]; t[69]=f[69]; t[70]=f[70]; t[71]=f[71];
t[72]=f[72]; t[73]=f[73]; t[74]=f[74]; t[75]=f[75];
t[76]=f[76]; t[77]=f[77]; t[78]=f[78]; t[79]=f[79];
t[80]=f[80]; t[81]=f[81]; t[82]=f[82]; t[83]=f[83];
t[84]=f[84]; t[85]=f[85]; t[86]=f[86]; t[87]=f[87];
t[88]=f[88]; t[89]=f[89]; t[90]=f[90]; t[91]=f[91];
t[92]=f[92]; t[93]=f[93]; t[94]=f[94]; t[95]=f[95];
t[96]=f[96]; t[97]=f[97]; t[98]=f[98]; t[99]=f[99];
t[100]=f[100]; t[101]=f[101]; t[102]=f[102]; t[103]=f[103];
t[104]=f[104]; t[105]=f[105]; t[106]=f[106]; t[107]=f[107];
t[108]=f[108]; t[109]=f[109]; t[110]=f[110]; t[111]=f[111];
t[112]=f[112]; t[113]=f[113]; t[114]=f[114]; t[115]=f[115];
t[116]=f[116]; t[117]=f[117]; t[118]=f[118]; t[119]=f[119];
t[120]=f[120]; t[121]=f[121]; t[122]=f[122]; t[123]=f[123];
t[124]=f[124]; t[125]=f[125]; t[126]=f[126]; t[127]=f[127];
t+=128; f+=128;
#endif
} while (f < end);
#else
do {
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
*t++ = *f++; *t++ = *f++; *t++ = *f++; *t++ = *f++;
} while (f < end);
#endif
return;
} /* bmove512 */
// Very simply copy using a 8 bit loop to copy the data
//
void *copy_8(char *dest, char *src, int size) {
for(;size; size--){
*dest++ = *src++;
}
return dest;
}
// simply copy using a 32 bit loop to copy the data
// any remaining bytes are copies with 8 bit loop
//
void *copy_32(int *dest, int *src, int size) {
int size32;
size32=size/4;
for(;size32; size32--){
*dest++ = *src++;
}
size=size&3;
unsigned char *dest8, *src8;
dest8 = (unsigned char *)dest;
src8 = (unsigned char *)src;
for(;size; size--){
*dest8++ = *src8++;
}
return dest8;
}
// simply copy using a FLOT instructions to copy the data 64 bit wise
// This copy is also known as SPEC STREAM copy
//
void *copy_64(double *dest,double *src, int size) {
int size64;
size64=size/8;
for (;size64; size64--){
*dest++ = *src++;
}
size=size&7;
unsigned char *dest8, *src8;
dest8 = (unsigned char *)dest;
src8 = (unsigned char *)src;
for(;size; size--){
*dest8++ = *src8++;
}
return dest8;
}
//
//
void cleartimecounter(){
register int j, k,l,m;
for(m=0; m<30; m++){
for (j=0; j<30; j++){
for (k=0 ; k < NTIMES ; k++){
times[m][j][k]=0;
}
avgtime[j] = 0,
maxtime[j] = 0;
mintime[j] = FLT_MAX;
}
}
}
//
//
int printout_results(int blocksize, int bus_usage){
register int j, k,l,m;
printf(HLINE);
char** label=label_copy;
for(m=0; m<memsteps ; m+=1){
for (k=0; k<NTIMES; k++){ /* note -- skip first iteration */
for (j=0; j<25; j++){
avgtime[j] = avgtime[j] + times[m][j][k];
}
}
}
printf(" ");
for(m=0; m<memsteps ; ){
if((blocksize>>m) >= MB){
printf("%4dMB ", (blocksize>>m) /MB);
}else if((blocksize>>m) >= KB){
printf("%4dKB ", (blocksize>>m)/KB );
}else{
printf(" %4dB ", (blocksize>>m) );
}
if((blocksize>>m) > KB){
m+=2;
}else{
m+=1;
}
}
printf("\n");
printf(HLINE);
for (j=0; j<25; j++) {
if(times[0][j][0]!=0){
printf("%s ", label[j]);
for(m=0; m<memsteps ; m+=1){
avgtime[j] = 0;
mintime[j] = FLT_MAX;
for (k=0; k<NTIMES; k++){ /* note -- skip first iteration */
avgtime[j] = avgtime[j] + times[m][j][k];
mintime[j] = MIN(mintime[j], times[m][j][k]);
maxtime[j] = MAX(maxtime[j], times[m][j][k]);
}
if(avgtime[j]!=0){
printf("%6.0f ", 1.048576E-06 * bus_usage * blocksize /mintime[j] );
// Result in MB/Sec MB =10^6 NOT 2^20
}
}
printf("\n");
}
}
printf("\n");
printf(HLINE);
}
#define uint8 unsigned char
#define uint16 unsigned short
#define uint32 unsigned int
#if defined(__i386__) || defined(__x86_64__)
// Fast memcpy for x86 CPUs with MMX
//
// To archive optimal memcpy performamce several things are to consider
//
// a) Source data should be prefetched to avoid memory latency bubbles.
// The below routine will use "prefetchnta" instruction for this.
// b) To improve write speed its adviseable to align the destination properly.
// The destination will be aligned to 32 bit boundary first and for bigger
// copies the destrination will be aligned to 64 byte (cache line) boundary
// c) Using a non cache poluting copy will save the 2nd level cache for other usage
// While in rare cases this might be a small disadvantage,
// in general this will be a big overall speed improvement.
// The below routine uses "movntq" to avoid cache polution
//
// The copy routine will first align the destiantion to 16 bit
// Then the copy routine will align the destination to 32 bit
// For copy over a certain size (>=256 byte) we will align the
// destination to a whola cache line of 64 byte and then use
// mmx copy commands and prefetching to optimally copy the memory.
// Of some CPUs is of advantage to use to use the MMX copy block
// even on smaller block and start with sizes of >= 128 byte.
// After the fast mmx copy block we will copy the remaining part
// using 32bit copy and 8bit copies
//
void *memcpy_mmx(void *dst, const void *src, size_t size){
uint32 i;
if(size<4) goto memcpy_less4; // Tiny copy? No need to align
if( (uint32)dst & 1) { // align destination to 16 bit
*((uint8*)dst++) = *((uint8*)src++); //
size--; //
} //
if ((uint32)dst & 2) { // align destitnation to 32 bit
*((uint16*)dst) = *((uint16*)src); //
src+=2; //
dst+=2; //
size -= 2; //
} //
if(size>=256){ // use cache line, prefetching routine for sizes >=256
__asm__ __volatile__ ( //
"prefetchnta 64(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
"prefetchnta 96(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"prefetchnta 128(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"prefetchnta 160(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
: : "r" (src));
while( (uint32)dst & 63) { // align dest to 512 bit (For 64 byte cache line)
*((uint32*)dst) = *((uint32*)src); //
src+=4; //
dst+=4; //
size -= 4; //
} //
for (i=size/(16*sizeof(uint32));i;i--) { // now we are well aligned and can copy data 64 byte (cache line) wise
__asm__ __volatile__ ( //
// "prefetchnta 64(%0) \n" // prefetch 32 bytes of source 1 cache line ahead
// "prefetchnta 96(%0) \n" // prefetch 32 bytes
// "prefetchnta 196(%0) \n" // prefetch 32 bytes of source 3 chache lines ahead
// "prefetchnta 228(%0) \n" // prefetch 32 bytes
"prefetchnta 256(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
"prefetchnta 288(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"\tmovq (%0), %%mm0\n" // loading 64 bytes into MMX
"\tmovq 8(%0), %%mm1\n" //
"\tmovq 16(%0), %%mm2\n" //
"\tmovq 24(%0), %%mm3\n" //
"\tmovq 32(%0), %%mm4\n" //
"\tmovq 40(%0), %%mm5\n" //
"\tmovq 48(%0), %%mm6\n" //
"\tmovq 56(%0), %%mm7\n" //
//
"\tmovntq %%mm0, (%1)\n" // storing 64 bytes
"\tmovntq %%mm1, 8(%1)\n" // we use non cache trashing stores
"\tmovntq %%mm2, 16(%1)\n" // this will maintain our data cache content
"\tmovntq %%mm3, 24(%1)\n" //
"\tmovntq %%mm4, 32(%1)\n" //
"\tmovntq %%mm5, 40(%1)\n" //
"\tmovntq %%mm6, 48(%1)\n" //
"\tmovntq %%mm7, 56(%1)\n" //
: : "r" (src), "r" (dst) : "%mm0","%mm1","%mm2","%mm3","%mm4","%mm5","%mm6","%mm7");
src+=64; // 8 x 64 bit words
dst+=64; //
} //
__asm__ __volatile__ ( "emms"); // switch back from MMX
size &= 16*sizeof(uint32)-1; //
}
for (i=size/sizeof(uint32);i;i--) { // copy all remaining 32 bit words
*((uint32*)dst) = *((uint32*)src); //
src+=4; //
dst+=4; //
} //
size &= sizeof(uint32)-1; //
memcpy_less4:
while (size--) { // copy all remaining 8 bit words (max 3)
*((uint8*)dst++) = *((uint8*)src++); //
} //
return dst;
}
void *memcpy_mmx2(void *dst, const void *src, size_t size){
uint32 i;
if(size<4) goto memcpy_less4; // Tiny copy? No need to align
if( (uint32)dst & 1) { // align destination to 16 bit
*((uint8*)dst++) = *((uint8*)src++); //
size--; //
} //
if ((uint32)dst & 2) { // align destitnation to 32 bit
*((uint16*)dst) = *((uint16*)src); //
src+=2; //
dst+=2; //
size -= 2; //
} //
if(size>=256){ // use cache line, prefetching routine for sizes >=256
__asm__ __volatile__ ( //
"prefetchnta 64(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
"prefetchnta 96(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"prefetchnta 128(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"prefetchnta 160(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"prefetchnta 196(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
"prefetchnta 228(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
: : "r" (src));
while( (uint32)dst & 63) { // align dest to 512 bit (For 64 byte cache line)
*((uint32*)dst) = *((uint32*)src); //
src+=4; //
dst+=4; //
size -= 4; //
} //
for (i=size/(16*sizeof(uint32));i;i--) { // now we are well aligned and can copy data 64 byte (cache line) wise
__asm__ __volatile__ ( //
// "prefetchnta 64(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
// "prefetchnta 96(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
// "prefetchnta 128(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
// "prefetchnta 160(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
// "prefetchnta 196(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
// "prefetchnta 228(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"prefetchnta 256(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
"prefetchnta 288(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"\tmovq (%0), %%mm0\n" // loading 64 bytes into MMX
"\tmovq 8(%0), %%mm1\n" //
"\tmovq 16(%0), %%mm2\n" //
"\tmovq 24(%0), %%mm3\n" //
"\tmovq 32(%0), %%mm4\n" //
"\tmovq 40(%0), %%mm5\n" //
"\tmovq 48(%0), %%mm6\n" //
"\tmovq 56(%0), %%mm7\n" //
//
"\tmovntq %%mm0, (%1)\n" // storing 64 bytes
"\tmovntq %%mm1, 8(%1)\n" // we use non cache trashing stores
"\tmovntq %%mm2, 16(%1)\n" // this will maintain our data cache content
"\tmovntq %%mm3, 24(%1)\n" //
"\tmovntq %%mm4, 32(%1)\n" //
"\tmovntq %%mm5, 40(%1)\n" //
"\tmovntq %%mm6, 48(%1)\n" //
"\tmovntq %%mm7, 56(%1)\n" //
: : "r" (src), "r" (dst) : "%mm0","%mm1","%mm2","%mm3","%mm4","%mm5","%mm6","%mm7");
src+=64; // 8 x 64 bit words
dst+=64; //
} //
__asm__ __volatile__ ( "emms"); // switch back from MMX
size &= 16*sizeof(uint32)-1; //
}
for (i=size/sizeof(uint32);i;i--) { // copy all remaining 32 bit words
*((uint32*)dst) = *((uint32*)src); //
src+=4; //
dst+=4; //
} //
size &= sizeof(uint32)-1; //
memcpy_less4:
while (size--) { // copy all remaining 8 bit words (max 3)
*((uint8*)dst++) = *((uint8*)src++); //
} //
return dst;
}
void *memcpy_mmx_old(void *dst, const void *src, int size){
uint8 *dst8, *src8;
uint32 *src32, *dst32;
uint32 i;
dst8 = (uint8 *)dst;
src8 = (uint8 *)src;
if(size<4) goto memcpy_less4;
/* align dst8 to 16 bit */
if ((uint32)dst8 & 1) {
*dst8++ = *src8++;
size--;
}
/* align dst8 to 32 bit */
if ((uint32)dst8 & 2) {
uint16 *src16 = (uint16 *)src8;
uint16 *dst16 = (uint16 *)dst8;
*dst16++ = *src16++;
size -= 2;
src8 = (uint8 *)src16;
dst8 = (uint8 *)dst16;
}
if(size>=256){
/* align dst8 to 512 bit (For 64 byte cache line) */
while ((uint32)dst8 & 63) {
src32 = (uint32 *)src8;
dst32 = (uint32 *)dst8;
*dst32++ = *src32++;
size -= 4;
src8 = (uint8 *)src32;
dst8 = (uint8 *)dst32;
}
src32 = (uint32 *)src8;
dst32 = (uint32 *)dst8;
for (i=size/(16*sizeof(uint32));i;i--) {
__asm__ __volatile__ (
"prefetchnta 64(%0) \n" // prefetch 32 bytes of source 64 bytes ahead
"prefetchnta 96(%0) \n" // prefetch 32 bytes of source 96 bytes ahead
"\tmovq (%0), %%mm0\n" // loading 64 bytes into MMX
"\tmovq 8(%0), %%mm1\n"
"\tmovq 16(%0), %%mm2\n"
"\tmovq 24(%0), %%mm3\n"
"\tmovq 32(%0), %%mm4\n"
"\tmovq 40(%0), %%mm5\n"
"\tmovq 48(%0), %%mm6\n"
"\tmovq 56(%0), %%mm7\n"
"\tmovntq %%mm0, (%1)\n" // storing 64 bytes
"\tmovntq %%mm1, 8(%1)\n" // we use non cache trashing stores
"\tmovntq %%mm2, 16(%1)\n" // this will maintain our data cache content
"\tmovntq %%mm3, 24(%1)\n"
"\tmovntq %%mm4, 32(%1)\n"
"\tmovntq %%mm5, 40(%1)\n"
"\tmovntq %%mm6, 48(%1)\n"
"\tmovntq %%mm7, 56(%1)\n"
: : "r" (src32), "r" (dst32) : "%mm0","%mm1","%mm2","%mm3","%mm4","%mm5","%mm6","%mm7");
src32+=16; // 16 x 32 words
dst32+=16; //
}
__asm__ __volatile__ ( "emms"); // MMX switch back
size &= 16*sizeof(uint32)-1;
src8 = (uint8 *)src32;
dst8 = (uint8 *)dst32;
}
if (size >= sizeof(uint32)) {
src32 = (uint32 *)src8;
dst32 = (uint32 *)dst8;
for (i=size/sizeof(uint32);i;i--) {
*dst32++ = *src32++;
}
size &= sizeof(uint32)-1;
src8 = (uint8 *)src32;
dst8 = (uint8 *)dst32;
}
memcpy_less4:
while (size--) {
*dst8++ = *src8++;
}
return dst8;
}
#endif
void messen_copy(int blocksize, int loops, int startoffset){
register int j, k,l,m;
double scalar, t;
int z;
int runblock, runloop;
int offset=0;
char *workptra;
char *workptrb;
cleartimecounter();
for(m=0; m<memsteps ; ){
runblock= blocksize*8>>m;
runloop= loops<<m;
offset=startoffset;
for (k=0; k<NTIMES; k++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
t = mysecond();
for (l=0; l<runloop; l++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
memcpy( (void*)workptrb, (void*)workptra, runblock);
}
times[m][0][k] = mysecond() - t;
if(runblock>=512){
t = mysecond();
for (l=0; l<runloop; l++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
bmove512( (void*)workptrb, (void*)workptra,runblock);
}
times[m][1][k] = mysecond() - t;
}
t = mysecond();
for (l=0; l<runloop; l++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
copy_8( (char*)workptra, (char*)workptrb,runblock);
}
times[m][2][k] = mysecond() - t;
t = mysecond();
for (l=0; l<runloop; l++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
copy_32( (int*)workptra, (int*)workptrb,runblock);
}
times[m][3][k] = mysecond() - t;
t = mysecond();
for (l=0; l<runloop; l++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
copy_64( (double*)workptra, (double*)workptrb, runblock); // Stream
}
times[m][4][k] = mysecond() - t;
#if defined(__i386__) || defined(__x86_64__)
t = mysecond();
for (l=0; l<runloop; l++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
memcpy_mmx( (int*)workptrb, (int*)workptra,runblock);
}
times[m][10][k] = mysecond() - t;
/*
t = mysecond();
for (l=0; l<runloop; l++){
offset+=runblock;
if(offset>=blocksize*8) offset-=(blocksize*8);
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
memcpy_mmx2( (int*)workptrb, (int*)workptra,runblock);
}
times[m][11][k] = mysecond() - t;
*/
#endif
}
if((blocksize*8 >>m) > KB){
m+=2;
}else{
m+=1;
}
}
// printf("Copy test (copying array A -> B).\n");
printout_results(blocksize*8,loops*2);
// if(z==0) printf(HLINE); // use z to prevent agressive optimizing the tests away
}
int main()
{
register int j, k,l;
int quantum, checktick();
int BytesPerWord;
double t;
volatile u_long *memA;
volatile u_long *memB;
memA = (u_long*) malloc(TEST_SIZE);
memB = (u_long*) malloc(TEST_SIZE);
ptra = (void*) ( ( ((u_long)memA) + PAGE_SIZE - 1 ) / PAGE_SIZE * PAGE_SIZE );
ptrb = (void*) ( ( ((u_long)memB) + PAGE_SIZE - 1 ) / PAGE_SIZE * PAGE_SIZE );
/* --- SETUP --- determine precision and check timing --- */
printf(HLINE);
printf("Benchmark memcpy performance v0.3\n");
printf(HLINE);
printf("The Test will run some time please be patient.\n");
BytesPerWord = sizeof(double);
printf("Total memory required = %.1f MB.\n", (2.0 * BytesPerWord) * ( (double) N / 1000000.0 ));
/* Get initial value for system clock. */
#pragma omp parallel for
for (j=0; j<N; j++) {
ptra[j] = j;
ptrb[j] = j;
}
printf(HLINE);
printf("\n\nMemory throughput Working on Arrays of %.1f MB.\n",(BytesPerWord) * ( (double) N/ 1000000.0 ) );
printf("We are now comparing different memcpy routines\n");
printf("Results are in MB/sec. Higher value means faster.\n");
printf("The test will copy block of differnt sizes from 16 MB to 16 Byte.\n");
printf("The test will be repeated on different aligned data.\n");
printf(HLINE);
memset( (int*)ptra, 0, N*8);
memset( (int*)ptrb, 0, N*8);
printf("Alignment 0\n");
messen_copy (N,LOOPS,0);
printf("Alignment 1\n");
messen_copy (N,LOOPS,1);
printf("Alignment 2\n");
messen_copy (N,LOOPS,2);
printf("Alignment 3\n");
messen_copy (N,LOOPS,3);
printf("Alignment 4\n");
messen_copy (N,LOOPS,4);
printf("Alignment 7\n");
messen_copy (N,LOOPS,7);
printf("Alignment 8\n");
messen_copy (N,LOOPS,8);
printf("Alignment 15\n");
messen_copy (N,LOOPS,15);
printf("Alignment 31\n");
messen_copy (N,LOOPS,31);
printf("Alignment 48\n");
messen_copy (N,LOOPS,31);
printf("Alignment 63\n");
messen_copy (N,LOOPS,31);
char *workptra;
char *workptrb;
int offset=0;
for (l=0; l<32; l++){
workptra=(char*)ptra;
workptrb=(char*)ptrb;
workptra+=offset;
workptrb+=offset;
offset++;
memcpy_mmx( (int*)workptrb, (int*)workptra,1024);
}
return 0;
}
# define M 20
int
checktick()
{
int i, minDelta, Delta;
double t1, t2, timesfound[M];
/* Collect a sequence of M unique time values from the system. */
for (i = 0; i < M; i++) {
t1 = mysecond();
while( ((t2=mysecond()) - t1) < 1.0E-6 )
;
timesfound[i] = t1 = t2;
}
/*
* Determine the minimum difference between these M values.
* This result will be our estimate (in microseconds) for the
* clock granularity.
*/
minDelta = 1000000;
for (i = 1; i < M; i++) {
// Delta = (int) nearbyint( 1.0E6 * (timesfound[i]-timesfound[i-1]));
minDelta = MIN(minDelta, MAX(Delta,0));
}
return(minDelta);
}
/* A gettimeofday routine to give access to the wall
clock timer on most UNIX-like systems. */
#include <sys/time.h>
double mysecond()
{
struct timeval tp;
struct timezone tzp;
int i;
i = gettimeofday(&tp,&tzp);
return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}