This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Hello, Ondrej,

> > > +L(mm_recalc_len):
> > > +/* Compute in %ecx how many bytes are left to copy after
> > > +   the main loop stops.  */
> > > +	movl	%ebx, %ecx
> > > +	subl	%edx, %ecx
> > > +	jmp	L(mm_len_0_or_more_backward)
> > > +
> > > That also looks slow as it adds unpredictable branch. On x64 we read
> > > start and end into registers before loop starts, and write these
> > > registers when it ends.
> > > If you align to 16 bytes instead 64 you need only 4 registers to save
> > > end, 4 working and save 16 bytes at start into stack.
> >
> > Not very clear what do you mean. On x64 where are no memmove and no
> > backward case...
> >
> I was referring to memcpy as same trick could be applied to memmove
> backward case. A forward loop looks like this but there could be problem
> that you run out of registers.
>
> 	movdqu	-16(%rsi,%rdx), %xmm4
> 	movdqu	-32(%rsi,%rdx), %xmm5
> 	movdqu	-48(%rsi,%rdx), %xmm6
> 	movdqu	-64(%rsi,%rdx), %xmm7
> 	lea	(%rdi, %rdx), %r10
> 	movdqu	(%rsi), %xmm8
> 	movq	%rdi, %rcx
> 	subq	%rsi, %rcx
> 	cmpq	%rdx, %rcx
> 	jb	.Lbwd
> 	leaq	16(%rdi), %rdx
> 	andq	$-16, %rdx
> 	movq	%rdx, %rcx
> 	subq	%rdi, %rcx
> 	addq	%rcx, %rsi
> 	movq	%r10, %rcx
> 	subq	%rdx, %rcx
> 	shrq	$6, %rcx
> 	.p2align 4
> .Lloop:
> 	movdqu	(%rsi), %xmm0
> 	movdqu	16(%rsi), %xmm1
> 	movdqu	32(%rsi), %xmm2
> 	movdqu	48(%rsi), %xmm3
> 	movdqa	%xmm0, (%rdx)
> 	addq	$64, %rsi
> 	movdqa	%xmm1, 16(%rdx)
> 	movdqa	%xmm2, 32(%rdx)
> 	movdqa	%xmm3, 48(%rdx)
> 	addq	$64, %rdx
> 	sub	$1, %rcx
> 	jnz	.Lloop
> 	movdqu	%xmm8, (%rdi)
> 	movdqu	%xmm4, -16(%r10)
> 	movdqu	%xmm5, -32(%r10)
> 	movdqu	%xmm6, -48(%r10)
> 	movdqu	%xmm7, -64(%r10)
> 	ret

I have also tried this approach for the memmove backward loop. It gives
about a 2.5% benefit for the x86_64 implementation I am also working on
(tested on Silvermont and Haswell), and neither a benefit nor a regression
in this i686 case (tested on Haswell, Silvermont, Ivy Bridge, Sandy Bridge
and Westmere). I decided to use it anyway because it is simpler. The new
patch and performance test results are attached.
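In C, the head/tail-saving trick described above looks roughly like the
sketch below. This is only an illustration, not the patch code: the function
name copy_head_tail_saved is made up, and it assumes len >= 64 and
non-overlapping buffers, with the head and tail arrays standing in for
%xmm8 and %xmm4-%xmm7 in the assembly.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a forward copy that saves the unaligned head and tail up
   front so no remainder has to be recomputed after the main loop.
   Assumes len >= 64 and that the buffers do not overlap.  */
static void
copy_head_tail_saved (unsigned char *dst, const unsigned char *src,
                      size_t len)
{
  unsigned char head[16], tail[64];

  /* Save the unaligned head and tail first; in the assembly these stay
     in registers (%xmm8 and %xmm4-%xmm7).  */
  memcpy (head, src, 16);
  memcpy (tail, src + len - 64, 64);

  /* Round dst up to the next 16-byte boundary and advance src by the
     same amount, mirroring the leaq/andq/subq sequence above.  */
  size_t skew = 16 - ((uintptr_t) dst & 15);
  unsigned char *d = dst + skew;
  const unsigned char *s = src + skew;

  /* Main loop: whole 64-byte blocks, destination 16-byte aligned.  */
  for (size_t n = (size_t) ((dst + len) - d) >> 6; n != 0; n--)
    {
      memcpy (d, s, 64);
      d += 64;
      s += 64;
    }

  /* Writing the saved head and tail last covers whatever the loop did
     not reach.  */
  memcpy (dst, head, 16);
  memcpy (dst + len - 64, tail, 64);
}

Storing the saved head and tail after the loop is what removes the
L(mm_recalc_len)-style remainder computation and its hard-to-predict
branch; the same idea applies to the backward memmove loop with the
roles of head and tail swapped.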
Attachment:
perf_memmove_sse2_unaligned_i386_v3.tar.gz
Description: GNU Zip compressed data
Attachment:
memcpy_mempcpy_memmove_with_chk_sse2_unaligned_i386_v3.patch
Description: Binary data