This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Ling Ma <ling dot ma dot program at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>, Richard Henderson <rth at twiddle dot net>, Andreas Jaeger <aj at suse dot com>, Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Thu, 5 Jun 2014 10:29:36 -0700
- Subject: Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- Authentication-results: sourceware.org; auth=none
- References: <1396850238-29041-1-git-send-email-ling dot ma at alipay dot com> <20140513173616 dot GC5047 at domone dot podge> <20140515201458 dot GA24885 at domone dot podge> <CAOGi=dNmn2bPfB65VoXUGjQ7t6RLVJ2hj2QDarrUjZV75kTbDA at mail dot gmail dot com> <20140530113041 dot GB26528 at domone dot podge> <CAOGi=dPdWegEo1s8=wG4WzOANaQ3x=boLFitQ_wBp+Xf+hxexQ at mail dot gmail dot com> <CAMe9rOqv5RYK1MO2M098n3o50-KmmZJuvsvMmXqkBBt0g3OY_g at mail dot gmail dot com> <CAOGi=dMNyzckY8s3uF0qRpKuqUwHHhzQeyy-j29ydLNn_s9Bog at mail dot gmail dot com> <20140605163224 dot GA8041 at domone dot podge>
On Thu, Jun 5, 2014 at 9:32 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Jun 04, 2014 at 03:00:05PM +0800, Ling Ma wrote:
>> H.J
>>
>> The website changed IP, now the code is available again:
>> http://www.yunos.org/tmp/memset-avx2.patch ,
>> and it is also attached gzipped to this mail.
>>
>> Thanks
>> Ling
>>
>
> Now the performance looks OK to me, but there are a few formatting problems.
> With these fixed I would be satisfied. H.J., do you have comments?
I don't have any additional comments. Thanks.
> A possible followup would be to also optimize __bzero as we do in the
> general case.
>
> A second followup would be to decrease the function size by reshuffling
> blocks; in several places there are 15/16 free bytes due to alignment.
>
> Formatting problems are here:
>
> + vpxor %xmm0, %xmm0, %xmm0
> + vmovd %esi, %xmm1
> + mov %rdi, %rsi
> + mov %rdi, %rax
>
> here
>
> +L(less_16bytes):
> + vmovd %xmm0, %rcx
> + cmp $8, %dl
> + jb L(less_8bytes)
> + mov %rcx, (%rdi)
> + mov %rcx, -0x08(%rsi)
> + ret
> +
> + .p2align 4
> +L(less_8bytes):
> + cmp $4, %dl
> + jb L(less_4bytes)
> + mov %ecx, (%rdi)
> + mov %ecx, -0x04(%rsi)
> + ret
>
> and here
>
> + mov %rax, %rsi
> + vmovd %xmm0, %eax
> + mov %rdx, %rcx
>
> As I mentioned regarding code size, one trick is that instructions with a
> -128 immediate are shorter than those with 128. You could save 16
> bytes with the following modification; however, it must be tested whether
> it improves performance.
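
[The encoding detail behind this trick: x86 arithmetic instructions with an immediate operand have two forms, opcode 0x83 with a sign-extended 8-bit immediate and opcode 0x81 with a full 32-bit immediate. -128 fits in a signed byte while +128 does not. A minimal sketch with hand-assembled bytes (written out from the instruction encoding tables, not taken from the patch) comparing the two forms:]

```python
# Hand-assembled x86-64 machine code for two equivalent operations:
#   sub $128, %edx  -> 81 /5 id : 128 > 127, so a 32-bit immediate is needed
#   add $-128, %edx -> 83 /0 ib : -128 fits in a sign-extended 8-bit immediate
sub_imm32 = bytes([0x81, 0xEA, 0x80, 0x00, 0x00, 0x00])  # sub $128, %edx
add_imm8  = bytes([0x83, 0xC2, 0x80])                    # add $-128, %edx

# Three bytes saved per occurrence; together with dropping the
# `mov $0x80, %rcx` this is where the savings mentioned above come from.
print(len(sub_imm32) - len(add_imm8))  # -> 3
```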
>
>
> --- x 2014-06-05 18:20:35.313645591 +0200
> +++ sysdeps/x86_64/multiarch/memset-avx2.S 2014-06-05 18:22:25.068642767 +0200
> @@ -95,7 +95,6 @@
> .p2align 4
> L(256bytesormore):
> vinserti128 $1, %xmm0, %ymm0, %ymm0
> - mov $0x80, %rcx
> add %rdx, %rsi
> mov %rdi, %r9
> vmovdqu %ymm0, (%rdi)
> @@ -105,15 +104,15 @@
> add %r9, %rdx
> cmp $4096, %rdx
> ja L(gobble_data)
> - sub %ecx, %edx
> + add $-128, %edx
> L(gobble_128_loop):
> vmovdqa %ymm0, (%rdi)
> vmovdqa %ymm0, 0x20(%rdi)
> vmovdqa %ymm0, 0x40(%rdi)
> vmovdqa %ymm0, 0x60(%rdi)
> - add %rcx, %rdi
> - sub %ecx, %edx
> - jae L(gobble_128_loop)
> + sub $-128, %rdi
> + add $-128, %edx
> + jb L(gobble_128_loop)
> vmovdqu %ymm0, -0x80(%rsi)
> vmovdqu %ymm0, -0x60(%rsi)
> vmovdqu %ymm0, -0x40(%rsi)
>
--
H.J.