This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, rth at twiddle dot net, aj at suse dot com, liubov dot dmitrieva at gmail dot com, hjl dot tools at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Fri, 30 May 2014 17:02:29 +0800
- Subject: Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- Authentication-results: sourceware.org; auth=none
- References: <1396850238-29041-1-git-send-email-ling dot ma at alipay dot com> <20140513173616 dot GC5047 at domone dot podge> <20140515201458 dot GA24885 at domone dot podge>
Hi all,
Here is the latest memset patch: http://www.yunos.org/tmp/memset-avx2.patch
When I sent the patch with git-send-email, libc-alpha@sourceware.org
refused to show it.
Sorry for the inconvenience.
Thanks
Ling
2014-05-16 4:14 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
> A correction for the following:
>
> On Tue, May 13, 2014 at 07:36:16PM +0200, Ondřej Bílka wrote:
>> > + ALIGN(4)
>> > +L(gobble_data):
>> > +#ifdef SHARED_CACHE_SIZE_HALF
>> > + mov $SHARED_CACHE_SIZE_HALF, %r9
>> > +#else
>> > + mov __x86_shared_cache_size_half(%rip), %r9
>> > +#endif
>> > + shl $4, %r9
>> > + cmp %r9, %rdx
>> > + ja L(gobble_big_data)
>> > + mov %rax, %r9
>> > + mov %esi, %eax
>> > + mov %rdx, %rcx
>> > + rep stosb
>> > + mov %r9, %rax
>> > + vzeroupper
>> > + ret
>> > +
>> > + ALIGN(4)
>> > +L(gobble_big_data):
>> > + sub $0x80, %rdx
>> > +L(gobble_big_data_loop):
>> > + vmovntdq %ymm0, (%rdi)
>> > + vmovntdq %ymm0, 0x20(%rdi)
>> > + vmovntdq %ymm0, 0x40(%rdi)
>> > + vmovntdq %ymm0, 0x60(%rdi)
>> > + lea 0x80(%rdi), %rdi
>> > + sub $0x80, %rdx
>> > + jae L(gobble_big_data_loop)
>> > + vmovups %ymm0, -0x80(%r8)
>> > + vmovups %ymm0, -0x60(%r8)
>> > + vmovups %ymm0, -0x40(%r8)
>> > + vmovups %ymm0, -0x20(%r8)
>> > + vzeroupper
>> > + sfence
>> > + ret
>>
>> That loop does not seem to help on Haswell at all; it is
>> indistinguishable from the rep stosb loop above. I used the following
>> benchmark to check that with different sizes, but performance stayed
>> the same.
>>
>> #include <stdlib.h>
>> #include <string.h>
>> int main() {
>>   int i;
>>   char *x = malloc(100000000);
>>   for (i = 0; i < 100; i++)
>>     MEMSET(x, 0, 100000000);
>> }
>>
>>
>> for I in `seq 1 10`; do
>> echo avx
>> gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
>> time LD_LIBRARY_PATH=. ./a.out
>> echo rep
>> gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
>> time LD_LIBRARY_PATH=. ./a.out
>> done
>
> Sorry, I forgot that __memset_rep also has a branch for large inputs,
> so what I wrote was wrong.
>
> I retested it with a fixed rep stosq and your loop is around 20% slower
> on a similar test, so it's better to remove that loop.
>
> $ gcc big.c -o big
> $ time LD_PRELOAD=./memset-avx2.so ./big
>
> real 0m0.076s
> user 0m0.066s
> sys 0m0.010s
>
> $ time LD_PRELOAD=./memset_rep.so ./big
>
> real 0m0.063s
> user 0m0.042s
> sys 0m0.021s
>
> I used a different benchmark to be sure; it can be downloaded here.
> Run the commands above in that directory.
>
> http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
>
> For a different implementation you need to create a .so with a memset
> function; there is a script, compile, that compiles all .s files
> provided that their first line has the shape
>
> # arch_requirement function_name color
>
>