This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, liubov dot dmitrieva at gmail dot com
- Date: Tue, 30 Jul 2013 20:38:32 +0800
- Subject: Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- References: <CAOGi=dMfjBWkFOhUh7QjBM=XiJqkP+6sEsVSHgz+=wC9z1+O=w at mail dot gmail dot com> <20130730071521 dot GA8596 at domone dot kolej dot mff dot cuni dot cz> <20130730071717 dot GA8741 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dOCH41BCXY+yN7_w4Ed4DCAHQKJMvJhKUs-pi3EkxHp=g at mail dot gmail dot com> <20130730113445 dot GA4577 at domone dot kolej dot mff dot cuni dot cz>
2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
> On Tue, Jul 30, 2013 at 05:26:09PM +0800, Ling Ma wrote:
>> We have never found prefetcht1 to be a good instruction for prefetching
>> data on core2, nehalem, sandybridge, or haswell. Our experiments show
>> prefetchw is best in your cases.
>
> But your code was following:
Ling: Yes. What I said is that in your case we found prefetchw to be the
best, and that we will run further tests to verify whether prefetchw is
also better in the gcc.403 cases. If prefetchw is better in gcc.403, we
will replace prefetcht0 with prefetchw.
>
> +L(gobble_128_loop):
> + prefetcht0 0x1c0(%rdi)
> + vmovaps %ymm0, (%rdi)
> + prefetcht0 0x280(%rdi)
> + vmovaps %ymm0, 0x20(%rdi)
> + vmovaps %ymm0, 0x40(%rdi)
> + vmovaps %ymm0, 0x60(%rdi)
> + lea 0x80(%rdi), %rdi
> + sub $0x80, %rdx
> + jae L(gobble_128_loop)
>
> Which uses prefetcht0. (The prefetcht1 in my benchmark was a typo.)
>
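For readers who want to try the "with and without prefetch" comparison outside of assembly, the quoted loop can be approximated in C. This is only a rough sketch and not the submitted patch: `fill_128_blocks` and `use_prefetch` are hypothetical names, `__builtin_prefetch` is the GCC/Clang builtin, and whether a write hint (second argument 1) becomes prefetchw or prefetcht0 depends on the compiler and target flags. The distances 0x1c0 and 0x280 are taken from the quoted loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C analogue of the quoted L(gobble_128_loop).
   Fills len bytes at dst with value, 128 bytes per iteration. */
static void fill_128_blocks(unsigned char *dst, int value, size_t len,
                            int use_prefetch)
{
    while (len >= 128) {
        if (use_prefetch) {
            /* rw = 1 requests a prefetch for writing; locality 3 asks to
               keep the line in all cache levels. Addresses are formed via
               uintptr_t because they may lie past the end of the buffer;
               a prefetch hint does not fault. */
            __builtin_prefetch((const void *)((uintptr_t)dst + 0x1c0), 1, 3);
            __builtin_prefetch((const void *)((uintptr_t)dst + 0x280), 1, 3);
        }
        for (size_t i = 0; i < 128; i++)
            dst[i] = (unsigned char)value;
        dst += 128;
        len -= 128;
    }
    /* Tail shorter than one 128-byte block. */
    for (size_t i = 0; i < len; i++)
        dst[i] = (unsigned char)value;
}
```

Timing the same call with `use_prefetch` set to 0 and to 1 compares like with like, which is the comparison asked for in this thread.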
> I updated the benchmark (attached) to run your code with and without prefetching.
> 1)
>
> Ljuba could you test it on haswell?
Ling: Ljuba, please also add a prefetchw variant, thanks.
>
>> In your code, memset only handles 256 bytes. In this case we don't need
>> to use prefetch because the hardware prefetcher is enough for us at small
>> sizes, but it can tell us whether prefetch will hurt performance so we
>
> Did haswell improve the hardware prefetcher so that it fetches from the
> next page? I changed the layout of the benchmark so that data ends at a
> page boundary.
Ling: we use software prefetch because it has a longer stride than
hw prefetch, so it is good for bigger sizes.
>
>> run it. The result is below; it indicates prefetchw on haswell is
>> harmless, even if it is redundant code in memset on haswell.
>>
> Your test was invalid as you compared apples with oranges
> (prefetcht0 vs prefetchw). To see how your code fares you should replace
> it with your implementation with and without prefetch.
> You need that to be exactly what you submitted, and if that means
> prefetchw then post a new version.
>
>> Then we modified memset2 to handle 4096 bytes
>> in test.c as below
>> ...
>> char ary[SIZE+4096];
>> ...
>> memset2(ary+(512*((unsigned)rand_r(&seed)))%SIZE,0,4096);
>> and ran your code on haswell as below. The result shows prefetchw gets
>> better performance and is harmless
>
> With that 'improvement' you defeated the purpose of the benchmark. It
> demonstrated increased cache usage by touching only the first half of
> the data and having the second half fetched by prefetch.
>
> As you only changed the size, the writes now overlap and there will be
> no extra memory usage.
>
> Also, changing it to 4096 decreases the percentage of wasted memory.
> Before it was 50% (256 saved/512 fetched); now it's around 11% (4096
> saved/4608 fetched).
>
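The cache-usage argument above can be checked with a small model that counts which cache lines get touched. This is only a sketch under stated assumptions: a 64-byte cache line, writes issued at every multiple of 512 bytes (the stride used in the quoted test.c line), and a prefetch reach past the end of each write of 256 or 512 bytes, inferred from the figures quoted above (512 fetched for 256 written, 4608 fetched for 4096 written). The function names and the 64 KiB span are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define LINE 64          /* assumed cache-line size in bytes */
#define MAX_LINES 4096   /* bitmap capacity, covers 256 KiB */

/* Mark every cache line that overlaps [start, start + len). */
static void mark_lines(unsigned char *map, size_t start, size_t len)
{
    size_t end = (start + len + LINE - 1) / LINE;
    for (size_t l = start / LINE; l < end && l < MAX_LINES; l++)
        map[l] = 1;
}

/* Count distinct lines touched when a write of len bytes, followed by
   extra bytes of prefetch past its end, is issued at every multiple of
   stride below span. */
static size_t lines_touched(size_t span, size_t stride, size_t len,
                            size_t extra)
{
    static unsigned char map[MAX_LINES];
    size_t count = 0;

    memset(map, 0, sizeof map);
    for (size_t off = 0; off < span; off += stride)
        mark_lines(map, off, len + extra);
    for (size_t l = 0; l < MAX_LINES; l++)
        count += map[l];
    return count;
}

/* Percentage of fetched lines that no write ever touches. */
static double wasted_percent(size_t written, size_t fetched)
{
    return 100.0 * (double)(fetched - written) / (double)fetched;
}
```

Under these assumptions, `lines_touched(65536, 512, 256, 0)` is 512 and `lines_touched(65536, 512, 256, 256)` is 1024, so half of the fetched lines are never written. With 4096-byte writes the counts are 1080 and 1088: the writes overlap, and the prefetched lines are almost all written by a neighbouring call anyway. A single isolated 4096-byte call gives 64 lines written against 72 fetched, which is the roughly 11% figure.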