This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: Ondřej Bílka <neleai at seznam dot cz>, GNU C Library <libc-alpha at sourceware dot org>, Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- Date: Thu, 1 Aug 2013 08:32:36 -0700
- Subject: Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction
- References: <CAOGi=dMfjBWkFOhUh7QjBM=XiJqkP+6sEsVSHgz+=wC9z1+O=w at mail dot gmail dot com> <20130730071521 dot GA8596 at domone dot kolej dot mff dot cuni dot cz> <20130730071717 dot GA8741 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dOCH41BCXY+yN7_w4Ed4DCAHQKJMvJhKUs-pi3EkxHp=g at mail dot gmail dot com> <20130730113445 dot GA4577 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dMPnGq_35r9TmTHkPn6oS-kbjb=eFmFWQL+N9DBMreu-A at mail dot gmail dot com>
On Tue, Jul 30, 2013 at 5:38 AM, Ling Ma <ling.ma.program@gmail.com> wrote:
> 2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
>> On Tue, Jul 30, 2013 at 05:26:09PM +0800, Ling Ma wrote:
>>> We have never found prefetcht1 to be a good instruction for
>>> prefetching data on Core 2, Nehalem, Sandy Bridge, or Haswell. Our
>>> experiments show prefetchw is best in your cases.
>>
>> But your code was the following:
> Ling: Yes, I said that we found prefetchw to be the best in your case,
> and we also said we will do further tests to verify whether prefetchw is
> better in the gcc.403 cases too. If prefetchw is better in gcc.403, we
> will replace prefetcht0 with prefetchw.
>
>>
>> +L(gobble_128_loop):
>> + prefetcht0 0x1c0(%rdi)
>> + vmovaps %ymm0, (%rdi)
>> + prefetcht0 0x280(%rdi)
>> + vmovaps %ymm0, 0x20(%rdi)
>> + vmovaps %ymm0, 0x40(%rdi)
>> + vmovaps %ymm0, 0x60(%rdi)
>> + lea 0x80(%rdi), %rdi
>> + sub $0x80, %rdx
>> + jae L(gobble_128_loop)
>>
>> This uses prefetcht0. (The prefetcht1 in my benchmark was a typo.)
>>
>> I updated the benchmark (attached) with your code, with and without prefetching.
>> 1)
>>
>> Ljuba, could you test it on Haswell?
> Ling: Ljuba, please also add a run with prefetchw, thanks.
>>
>>> In your code, memset only handles 256 bytes. In this case we don't need
>>> to use prefetch because the hardware prefetcher is enough for small
>>> sizes, but it can tell us whether prefetch will hurt performance so we
>>
>> Did Haswell improve the hardware prefetcher to fetch from the next page? I
>> changed the layout of the benchmark so that the data ends at a page boundary.
> Ling: We use software prefetch because it has a longer stride than the
> hardware prefetcher,
> so it is good for bigger sizes.
>>
>>> run it. The result is below; it indicates prefetchw on Haswell is
>>> harmless, even if it is redundant code in memset on Haswell.
>>>
>> Your test was invalid because you compared apples with oranges
>> (prefetcht0 vs. prefetchw). To see how your code fares, you should
>> benchmark your implementation with and without prefetch.
>> It needs to be exactly what you submitted, and if that means
>> prefetchw, then post a new version.
There is no need to test prefetchw on Haswell since it isn't
supported. I think this is a rare case where prefetcht0 helps.
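
For anyone who wants to run the like-for-like comparison requested above
(same loop with no prefetch, a read prefetch, and a write prefetch), here
is a rough C sketch of the 128-byte loop. It is not the submitted patch
and not glibc code. The names memset_128_sketch, prefetch_mode and
PREFETCH_DISTANCE are made up for illustration. It assumes GCC or Clang
for __builtin_prefetch. On x86 I would expect the read hint with locality
3 to become prefetcht0, and the write hint to become prefetchw only when
the target enables it (e.g. -mprfchw), but check the generated assembly
rather than taking that on trust.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only; not part of glibc.  */
enum prefetch_mode { PF_NONE, PF_READ, PF_WRITE };

/* Same distance as the first prefetcht0 in the quoted loop.  */
#define PREFETCH_DISTANCE 0x1c0

/* Fill N bytes at DST with byte C, 128 bytes per iteration, optionally
   issuing one prefetch hint ahead of the store position.  */
static void *
memset_128_sketch (void *dst, int c, size_t n, enum prefetch_mode mode)
{
  unsigned char *p = dst;

  while (n >= 128)
    {
      /* Only prefetch while the target is still inside the buffer, so
         the pointer arithmetic stays well defined in C.  */
      if (n > PREFETCH_DISTANCE)
        {
          if (mode == PF_READ)
            __builtin_prefetch (p + PREFETCH_DISTANCE, 0, 3);
          else if (mode == PF_WRITE)
            __builtin_prefetch (p + PREFETCH_DISTANCE, 1, 3);
        }
      memset (p, c, 128);   /* Stands in for the four 32-byte vmovaps.  */
      p += 128;
      n -= 128;
    }

  memset (p, c, n);         /* Tail of fewer than 128 bytes.  */
  return dst;
}
```

Timing the three modes over the same buffer sizes and alignments gives
the with/without-prefetch numbers on equal footing. All three modes must
produce identical buffer contents; only the timing should differ.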
--
H.J.