This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling.ma.program@gmail.com>
- To: Ondřej Bílka <neleai@seznam.cz>
- Cc: libc-alpha@sourceware.org, ling.ml@alibaba-inc.com
- Date: Tue, 30 Jul 2013 17:26:09 +0800
- Subject: Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- References: <CAOGi=dMfjBWkFOhUh7QjBM=XiJqkP+6sEsVSHgz+=wC9z1+O=w@mail.gmail.com> <20130730071521.GA8596@domone.kolej.mff.cuni.cz> <20130730071717.GA8741@domone.kolej.mff.cuni.cz>
We have never found prefetcht1 to be a good instruction for prefetching
data on Core2, Nehalem, Sandy Bridge, or Haswell. In our experiments,
prefetchw performs best in these cases.

In your code, memset only handles 256 bytes. At that size we do not need
to use prefetch, because hardware prefetch is sufficient for small
sizes; however, the test can still tell us whether prefetch hurts
performance, so we ran it. The results below show that prefetchw is
harmless on Haswell, even though it is redundant code in memset there.
[root@localhost memset_cache]# ./test
size: 32000
0.10 0.10
0.10 0.10
0.10 0.10
0.10 0.10
0.10 0.10
0.10 0.11
0.11 0.11
0.10 0.10
0.10 0.11
0.10 0.10
size: 256000
0.21 0.22
0.22 0.21
0.21 0.21
0.21 0.21
0.22 0.21
0.21 0.21
0.21 0.21
0.22 0.22
0.21 0.21
0.22 0.20
size: 1024000
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
0.38 0.38
size: 204800
0.20 0.21
0.20 0.20
0.19 0.19
0.20 0.20
0.20 0.19
0.19 0.19
0.19 0.19
0.19 0.20
0.20 0.20
0.20 0.21
size: 4048000
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
size: 8096000
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.45 0.44
0.44 0.44
0.44 0.44
0.44 0.44
0.44 0.44
Then we modified memset2 in test.c to handle 4096 bytes, as below:
...
char ary[SIZE+4096];
...
memset2(ary+(512*((unsigned)rand_r(&seed)))%SIZE,0,4096);
and ran your code on Haswell. The results below show that prefetchw
gets better performance and is harmless.
[root@localhost memset_cache]# ./test
size: 32000
1.01 0.91
0.98 0.90
0.98 0.91
0.98 0.91
0.98 0.91
0.97 0.91
0.98 0.91
0.97 0.91
1.00 0.91
0.97 0.91
size: 256000
1.34 1.36
1.34 1.33
1.35 1.35
1.37 1.35
1.35 1.34
1.36 1.34
1.36 1.34
1.37 1.36
1.38 1.35
1.36 1.35
size: 1024000
1.81 1.81
1.81 1.81
1.82 1.81
1.81 1.81
1.81 1.81
1.82 1.81
1.81 1.81
1.81 1.81
1.81 1.81
1.82 1.81
size: 204800
1.29 1.27
1.30 1.30
1.32 1.33
1.34 1.31
1.31 1.27
1.30 1.31
1.35 1.32
1.32 1.33
1.36 1.33
1.34 1.31
size: 4048000
1.95 1.94
1.95 1.95
1.95 1.95
1.95 1.94
1.95 1.95
1.94 1.95
1.95 1.95
1.95 1.95
1.95 1.94
1.95 1.95
size: 8096000
2.14 2.14
2.15 2.16
2.15 2.15
2.15 2.16
2.16 2.17
2.17 2.17
2.17 2.18
2.16 2.19
2.16 2.17
2.18 2.17
We will also test prefetchw in our code with gcc.403; based on that data
we will make the corresponding changes in the next version.

Thanks,
Ling