This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling.ma.program@gmail.com>
- To: Ondřej Bílka <neleai@seznam.cz>
- Cc: libc-alpha@sourceware.org, ling.ml@alibaba-inc.com
- Date: Tue, 30 Jul 2013 17:26:09 +0800
- Subject: Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- References: <CAOGi=dMfjBWkFOhUh7QjBM=XiJqkP+6sEsVSHgz+=wC9z1+O=w@mail.gmail.com> <20130730071521.GA8596@domone.kolej.mff.cuni.cz> <20130730071717.GA8741@domone.kolej.mff.cuni.cz>
We have never found prefetcht1 to be a good instruction for prefetching
data on Core2, Nehalem, Sandy Bridge, or Haswell. In our experiments,
prefetchw performs best in these cases.

In your code, memset only handles 256 bytes. At that size we do not need
to use prefetch, because hardware prefetch is sufficient for small
sizes; however, the test can still tell us whether prefetch hurts
performance, so we ran it. The results below show that prefetchw is
harmless on Haswell, even though it is redundant code in memset there.
[root@localhost memset_cache]# ./test
size: 32000
0.10 0.10
0.10 0.10
0.10 0.10
0.10 0.10
0.10 0.10
0.10 0.11
0.11 0.11
0.10 0.10
0.10 0.11
0.10 0.10
size: 256000
0.21 0.22
0.22 0.21
0.21 0.21
0.21 0.21
0.22 0.21
0.21 0.21
0.21 0.21
0.22 0.22
0.21 0.21
0.22 0.20
size: 1024000
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
size: 204800
0.20 0.21
0.20 0.20
0.19 0.19
0.20 0.20
0.20 0.19
0.19 0.19
0.19 0.19
0.19 0.20
0.20 0.20
0.20 0.21
size: 4048000
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
size: 8096000
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.45 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
Then we modified memset2 in test.c to handle 4096 bytes, as below:
...
char ary[SIZE+4096];
...
memset2(ary+(512*((unsigned)rand_r(&seed)))%SIZE,0,4096);
and ran your code on Haswell. The results below show that prefetchw
gets better performance and is harmless.
[root@localhost memset_cache]# ./test
size: 32000
1.01 0.91
0.98 0.90
0.98 0.91
0.98 0.91
0.98 0.91
0.97 0.91
0.98 0.91
0.97 0.91
1.00 0.91
0.97 0.91
size: 256000
1.34 1.36
1.34 1.33
1.35 1.35
1.37 1.35
1.35 1.34
1.36 1.34
1.36 1.34
1.37 1.36
1.38 1.35
1.36 1.35
size: 1024000
1.81 1.81
1.81 1.81
1.82 1.81
1.81 1.81
1.81 1.81
1.82 1.81
1.81 1.81
1.81 1.81
1.81 1.81
1.82 1.81
size: 204800
1.29 1.27
1.30 1.30
1.32 1.33
1.34 1.31
1.31 1.27
1.30 1.31
1.35 1.32
1.32 1.33
1.36 1.33
1.34 1.31
size: 4048000
1.95 1.94
1.95 1.95
1.95 1.95
1.95 1.94
1.95 1.95
1.94 1.95
1.95 1.95
1.95 1.95
1.95 1.94
1.95 1.95
size: 8096000
2.14 2.14
2.15 2.16
2.15 2.15
2.15 2.16
2.16 2.17
2.17 2.17
2.17 2.18
2.16 2.19
2.16 2.17
2.18 2.17
We will also test prefetchw in our code with gcc.403; based on that data
we will make the corresponding changes in the next version.

Thanks,
Ling