This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86_32: memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk optimized with SSE2 unaligned loads/stores


Hi, Ondrej,

2014-07-08 22:54 GMT+04:00 Ondrej Bilka <neleai@seznam.cz>:
> On Tue, Jul 08, 2014 at 03:25:26PM +0400, Andrew Senkevich wrote:
>> >
>> > Does that prefetch improve performance? On x64 it harmed performance and 128 bytes looks too small to matter.
>> > +
>> > +       prefetcht0 -128(%edi, %esi)
>> > +
>> > +       movdqu  -64(%edi, %esi), %xmm0
>> > +       movdqu  -48(%edi, %esi), %xmm1
>> > +       movdqu  -32(%edi, %esi), %xmm2
>> > +       movdqu  -16(%edi, %esi), %xmm3
>> > +       movdqa  %xmm0, -64(%edi)
>> > +       movdqa  %xmm1, -48(%edi)
>> > +       movdqa  %xmm2, -32(%edi)
>> > +       movdqa  %xmm3, -16(%edi)
>> > +       leal    -64(%edi), %edi
>> > +       cmp     %edi, %ebx
>> > +       jb      L(mm_main_loop_backward)
>> > +L(mm_main_loop_backward_end):
>> > +       POP (%edi)
>> > +       POP (%esi)
>> > +       jmp     L(mm_recalc_len)
>>
>> Disabling prefetch here and in the case below leads to a 10% degradation
>> on Silvermont in 3 tests. On Haswell performance is almost the same.
>>
> I had Silvermont optimization on my todo list; it needs a separate
> implementation, as it behaves differently from most other architectures.

But this implementation is already optimized for Silvermont.
The memcpy performance improvement on Silvermont is:

on long lengths:
rand: +6.2%
rand_L2: +6.1%
rand_L3: +3%
rand_nocache: +2.5%
rand_noicache: +20.8%

on short lengths:
rand: +14.1%
rand_L2: +21.5%
rand_L3: +8%
rand_nocache: +14.5%
rand_noicache: +33.9%

These numbers are from the tarballs attached the first time.
The memmove improvement on Silvermont is even better.
Other architectures also get a very good performance benefit.

> From the testing that I have done it looks like simply using a rep movsq is
> faster for strings up to around 1024 bytes.

How would that compare with the current performance data?
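
Just to be sure we mean the same variant, I assume it is roughly the
following (a sketch only, not code from the patch; the symbol name is
made up):

        .text
        .globl  __memcpy_rep_movsq_sketch
__memcpy_rep_movsq_sketch:
        movq    %rdi, %rax      /* Return the destination.  */
        movq    %rdx, %rcx
        shrq    $3, %rcx        /* Number of 8-byte words.  */
        rep     movsq           /* Copy qwords from (%rsi) to (%rdi).  */
        movl    %edx, %ecx
        andl    $7, %ecx        /* Then the remaining 0..7 bytes.  */
        rep     movsb
        ret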

>> > +L(mm_recalc_len):
>> > +/* Compute in %ecx how many bytes are left to copy after
>> > +       the main loop stops.  */
>> > +       movl    %ebx, %ecx
>> > +       subl    %edx, %ecx
>> > +       jmp     L(mm_len_0_or_more_backward)
>> > +
>> > That also looks slow, as it adds an unpredictable branch. On x64 we read the start and end into registers before the loop starts, and write those registers out when it ends.
>> > If you align to 16 bytes instead of 64, you need only 4 registers to save the end, 4 working registers, and 16 bytes of the start saved on the stack.
>>
>> It is not very clear what you mean. On x64 there is no memmove and no
>> backward case...
>>
> I was referring to memcpy, as the same trick could be applied to the memmove
> backward case. A forward loop looks like this, but there could be a problem
> that you run out of registers.
>
>  /* Load the last 64 and the first 16 bytes of the source into
>     registers before the loop; they are stored after the loop, so
>     no separate tail code or branch is needed.  */
>  movdqu -16(%rsi,%rdx), %xmm4
>  movdqu -32(%rsi,%rdx), %xmm5
>  movdqu -48(%rsi,%rdx), %xmm6
>  movdqu -64(%rsi,%rdx), %xmm7
>  lea    (%rdi, %rdx), %r10   /* %r10 = end of destination.  */
>  movdqu (%rsi), %xmm8
>
>  /* Take the backward path if a forward copy would overwrite
>     not-yet-read source bytes (dst - src < len).  */
>  movq   %rdi, %rcx
>  subq   %rsi, %rcx
>  cmpq   %rdx, %rcx
>  jb     .Lbwd
>
>  /* Round the destination up to a 16-byte boundary (reusing %rdx as
>     the aligned destination pointer), advance the source by the same
>     amount and compute the number of 64-byte iterations.  */
>  leaq   16(%rdi), %rdx
>  andq   $-16, %rdx
>  movq   %rdx, %rcx
>  subq   %rdi, %rcx
>  addq   %rcx, %rsi
>  movq   %r10, %rcx
>  subq   %rdx, %rcx
>  shrq   $6, %rcx
>
>  .p2align 4
> .Lloop:
>  movdqu (%rsi), %xmm0
>  movdqu 16(%rsi), %xmm1
>  movdqu 32(%rsi), %xmm2
>  movdqu 48(%rsi), %xmm3
>  movdqa %xmm0, (%rdx)
>  addq   $64, %rsi
>  movdqa %xmm1, 16(%rdx)
>  movdqa %xmm2, 32(%rdx)
>  movdqa %xmm3, 48(%rdx)
>  addq   $64, %rdx
>  sub    $1, %rcx
>  jnz    .Lloop
>  /* Write out the saved head and tail; the unaligned stores cover
>     whatever the main loop did not.  */
>  movdqu %xmm8, (%rdi)
>  movdqu %xmm4, -16(%r10)
>  movdqu %xmm5, -32(%r10)
>  movdqu %xmm6, -48(%r10)
>  movdqu %xmm7, -64(%r10)
>  ret

Clear. But how much of a performance benefit do you think could be obtained?

>> >
>> > +       movdqu  %xmm0, (%edx)
>> > +       movdqu  %xmm1, 16(%edx)
>> > +       movdqu  %xmm2, 32(%edx)
>> > +       movdqu  %xmm3, 48(%edx)
>> > +       movdqa  %xmm4, (%edi)
>> > +       movaps  %xmm5, 16(%edi)
>> > +       movaps  %xmm6, 32(%edi)
>> > +       movaps  %xmm7, 48(%edi)
>> > Why did you add floating point moves here?
>>
>> Because movaps with an offset is 4 bytes long, which improves
>> instruction alignment and code size.
>> I also inserted it in more places.
>>
>> > +
>> > +/* We should stop two iterations before the termination
>> > +       (in order not to misprefetch).  */
>> > +       subl    $64, %ecx
>> > +       cmpl    %ebx, %ecx
>> > +       je      L(main_loop_just_one_iteration)
>> > +
>> > +       subl    $64, %ecx
>> > +       cmpl    %ebx, %ecx
>> > +       je      L(main_loop_last_two_iterations)
>> > +
>> > Same comment: prefetching is unlikely to help, so you need to show that it helps versus a variant where you omit it.
>>
>> Disabling prefetching here gives a degradation of up to -11% on Silvermont;
>> on Haswell there are no significant changes.
>>
>> > +
>> > +       .p2align 4
>> > +L(main_loop_large_page):
>> >
>> > However, here prefetching should help as it is sufficiently large; also, the loads could be nontemporal.
>>
>> Prefetching here gives no significant performance change (all results attached).
>>
>> Could you clarify which nontemporal loads you mean? Unaligned loads are
>> needed here, but I know only of aligned nontemporal loads.
>> It is also not clear why prefetch should help with nontemporal access...
>>
> Try prefetchnta.

prefetchnta also gives no improvement for memcpy on Silvermont (and a -4%
degradation on one test).
On Haswell the benefit on small lengths is less than 0.9%, with a 1.4%
degradation on one test on long lengths.
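
To make sure we mean the same thing, the kind of loop I am talking
about is roughly the following (illustration only, not the actual
patch code; the label is made up, %esi holds src - dst as in the
backward loop quoted above, %ebx marks the end of the main-loop
region, and the prefetch distance is arbitrary):

        .p2align 4
L(main_loop_large_page_sketch):
        /* Nontemporal hint two cache lines ahead of the unaligned
           loads; the loads and stores themselves are unchanged.  */
        prefetchnta 128(%edi, %esi)
        movdqu  (%edi, %esi), %xmm0
        movdqu  16(%edi, %esi), %xmm1
        movdqu  32(%edi, %esi), %xmm2
        movdqu  48(%edi, %esi), %xmm3
        movdqa  %xmm0, (%edi)
        movdqa  %xmm1, 16(%edi)
        movdqa  %xmm2, 32(%edi)
        movdqa  %xmm3, 48(%edi)
        leal    64(%edi), %edi
        cmpl    %edi, %ebx
        ja      L(main_loop_large_page_sketch)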

>
>> Attaching the edited patch.
>> The 32-bit build was tested with no new regressions.

Attaching the latest measurements.

Attachment: results_memcpy_slm.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_hsw.tar.bz2
Description: BZip2 compressed data

