This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction


2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
> On Tue, Jul 30, 2013 at 05:26:09PM +0800, Ling Ma wrote:
>> We have never found prefetcht1 to be a good instruction for
>> prefetching data on Core 2, Nehalem, Sandy Bridge, or Haswell. Our
>> experiments show prefetchw is best in your cases.
>
> But your code was following:
Ling: Yes, I said that in your case prefetchw is the best. We also
said we will do further tests to verify whether prefetchw is better in
the gcc.403 cases too; if it is, we will replace prefetcht0 with
prefetchw.

>
> +L(gobble_128_loop):
> +       prefetcht0      0x1c0(%rdi)
> +       vmovaps %ymm0, (%rdi)
> +       prefetcht0      0x280(%rdi)
> +       vmovaps %ymm0, 0x20(%rdi)
> +       vmovaps %ymm0, 0x40(%rdi)
> +       vmovaps %ymm0, 0x60(%rdi)
> +       lea     0x80(%rdi), %rdi
> +       sub     $0x80, %rdx
> +       jae     L(gobble_128_loop)
>
> Which uses prefetcht0. (The prefetcht1 in my benchmark was a typo.)
>
> I updated the benchmark (attached) with your code, with and without prefetching.
> 1)
>
> Ljuba, could you test it on Haswell?
Ling: Ljuba, please also add a variant with prefetchw, thanks.
>
>> In your code, memset only handles 256 bytes. In this case we don't need
>> to use prefetch because hardware prefetch is enough for small sizes,
>> but it can tell us whether prefetch will hurt performance, so we
>
> Did Haswell improve the hardware prefetcher to fetch from the next page?
> I changed the layout of the benchmark so that the data ends at a page boundary.
Ling: We use software prefetch because it has a longer stride than
hardware prefetch, so it is good for bigger sizes.
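
To make the idea of a software prefetch running a fixed distance ahead of the stores concrete, here is a minimal C sketch. It is not the patch itself: the function name and constants are hypothetical, the 128-byte chunk and the 0x1c0 distance are simply borrowed from the quoted loop, and plain memset stands in for the vmovaps stores. The GCC/Clang builtin __builtin_prefetch with rw = 1 asks for a prefetch for writing; whether the compiler emits prefetchw or another prefetch instruction for it depends on the target flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constants taken from the quoted loop: 128 bytes are
   stored per iteration and the prefetch runs 0x1c0 bytes ahead. */
#define CHUNK 128
#define PREFETCH_AHEAD 0x1c0

/* Illustrative sketch only: fill n bytes at dst with the byte c. */
static void *memset_prefetch_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;

    assert(dst != NULL);
    while (n >= CHUNK) {
        /* Prefetch only while the target address is still inside the
           buffer, so the pointer expression itself stays valid.
           rw = 1: prefetch for writing; locality = 3: keep in cache. */
        if (n > PREFETCH_AHEAD)
            __builtin_prefetch(p + PREFETCH_AHEAD, 1, 3);
        memset(p, c, CHUNK);   /* stands in for the four 32-byte stores */
        p += CHUNK;
        n -= CHUNK;
    }
    memset(p, c, n);           /* tail of fewer than CHUNK bytes */
    return dst;
}
```

For a 256-byte call the prefetch in this sketch never fires, which matches the point that small sizes can rely on the hardware prefetcher alone.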
>
>> run it. The result is below; it indicates prefetchw on Haswell is
>> harmless, even if it is redundant code in memset on Haswell.
>>
> Your test was invalid, as you compared apples with oranges
> (prefetcht0 vs prefetchw). To see how your code fares you should replace
> it with your implementation with and without prefetch.
> You need that to be exactly what you submitted, and if that means
> prefetchw then post a new version.
>
>> Then we modified memset2 to handle 4096 bytes
>> in test.c as below
>> ...
>> char ary[SIZE+4096];
>> ...
>> memset2(ary+(512*((unsigned)rand_r(&seed)))%SIZE,0,4096);
>> and ran your code on Haswell as below. The result shows prefetchw gets
>> better performance and is harmless
>
> With that 'improvement' you defeated the purpose of the benchmark. It
> demonstrated increased cache usage by touching only the first half of
> the data and having the second half fetched by prefetch.
>
> As you only changed the size, the writes now overlap and there will be
> no extra memory usage.
>
> Also, changing it to 4096 decreases the percentage of wasted memory.
> Before it was 50% (256 saved/512 fetched); now it's around 11% (4096
> saved/4608 fetched).
>
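
The two percentages quoted above can be checked with a few lines of C. This is only a sketch of the arithmetic, under the assumption that "saved" means the bytes actually written and that the wasted share is the fetched bytes never written, divided by all fetched bytes. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of fetched bytes that are never written, assuming
   wasted = fetched - written (an interpretation of the figures
   in the message, not something taken from the benchmark source). */
static double wasted_percent(size_t written, size_t fetched)
{
    assert(fetched > 0 && written <= fetched);
    return 100.0 * (double)(fetched - written) / (double)fetched;
}
```

With this definition, wasted_percent(256, 512) gives 50 and wasted_percent(4096, 4608) gives 512/4608, a little over 11, which agrees with the numbers in the message.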

