[PATCH] x86_32: memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk optimized with SSE2 unaligned loads/stores

Andrew Senkevich andrew.n.senkevich@gmail.com
Tue Aug 5 11:43:00 GMT 2014


Hello, Ondrej,

> > > +L(mm_recalc_len):
> > > +/* Compute in %ecx how many bytes are left to copy after
> > > +       the main loop stops.  */
> > > +       movl    %ebx, %ecx
> > > +       subl    %edx, %ecx
> > > +       jmp     L(mm_len_0_or_more_backward)
> > > +
> > > That also looks slow, as it adds an unpredictable branch. On x64 we read the start and the end into registers before the loop starts, and write these registers out when it ends.
> > > If you align to 16 bytes instead of 64, you need only 4 registers to save the end, 4 working registers, and 16 bytes of the start saved on the stack.
> >
> > It is not very clear what you mean. On x64 there is no memmove and no
> > backward case...
> >
> I was referring to memcpy, as the same trick could be applied to the memmove
> backward case. A forward loop looks like this, but there could be a problem
> that you run out of registers.

> movdqu -16(%rsi,%rdx), %xmm4
> movdqu -32(%rsi,%rdx), %xmm5
> movdqu -48(%rsi,%rdx), %xmm6
> movdqu -64(%rsi,%rdx), %xmm7
> lea    (%rdi, %rdx), %r10
> movdqu (%rsi), %xmm8

> movq   %rdi, %rcx
> subq   %rsi, %rcx
> cmpq   %rdx, %rcx
> jb     .Lbwd

> leaq   16(%rdi), %rdx
> andq   $-16, %rdx
> movq   %rdx, %rcx
> subq   %rdi, %rcx
> addq   %rcx, %rsi
> movq   %r10, %rcx
> subq   %rdx, %rcx
> shrq   $6, %rcx

> .p2align 4
> .Lloop:
> movdqu (%rsi), %xmm0
> movdqu 16(%rsi), %xmm1
> movdqu 32(%rsi), %xmm2
> movdqu 48(%rsi), %xmm3
> movdqa %xmm0, (%rdx)
> addq   $64, %rsi
> movdqa %xmm1, 16(%rdx)
> movdqa %xmm2, 32(%rdx)
> movdqa %xmm3, 48(%rdx)
> addq   $64, %rdx
> sub    $1, %rcx
> jnz    .Lloop
> movdqu %xmm8, (%rdi)
> movdqu %xmm4, -16(%r10)
> movdqu %xmm5, -32(%r10)
> movdqu %xmm6, -48(%r10)
> movdqu %xmm7, -64(%r10)
> ret
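
In C terms the quoted forward loop is roughly the sketch below (SSE2
intrinsics, assuming len >= 64 and a plain forward copy; the function
name and layout are only illustrative, not code from the patch):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Save the first 16 and last 64 bytes in registers before the loop,
   do aligned 64-byte stores inside the loop, and write the saved
   head/tail with unaligned stores at the end.  */
static void *
copy_fwd_sketch (void *dst, const void *src, size_t len)
{
  const char *s = src;
  char *d = dst;
  char *dend = d + len;

  /* Head and tail are read before any store can clobber them.  */
  __m128i head = _mm_loadu_si128 ((const __m128i *) s);
  __m128i t0 = _mm_loadu_si128 ((const __m128i *) (s + len - 16));
  __m128i t1 = _mm_loadu_si128 ((const __m128i *) (s + len - 32));
  __m128i t2 = _mm_loadu_si128 ((const __m128i *) (s + len - 48));
  __m128i t3 = _mm_loadu_si128 ((const __m128i *) (s + len - 64));

  /* Round the destination up to a 16-byte boundary and advance the
     source by the same amount.  */
  char *da = (char *) (((uintptr_t) d + 16) & ~(uintptr_t) 15);
  s += da - d;

  /* Copy whole 64-byte chunks with aligned stores.  */
  size_t n = (size_t) (dend - da) >> 6;
  for (; n != 0; n--, s += 64, da += 64)
    {
      __m128i x0 = _mm_loadu_si128 ((const __m128i *) s);
      __m128i x1 = _mm_loadu_si128 ((const __m128i *) (s + 16));
      __m128i x2 = _mm_loadu_si128 ((const __m128i *) (s + 32));
      __m128i x3 = _mm_loadu_si128 ((const __m128i *) (s + 48));
      _mm_store_si128 ((__m128i *) da, x0);
      _mm_store_si128 ((__m128i *) (da + 16), x1);
      _mm_store_si128 ((__m128i *) (da + 32), x2);
      _mm_store_si128 ((__m128i *) (da + 48), x3);
    }

  /* The saved head and tail cover whatever the loop did not reach.  */
  _mm_storeu_si128 ((__m128i *) d, head);
  _mm_storeu_si128 ((__m128i *) (dend - 16), t0);
  _mm_storeu_si128 ((__m128i *) (dend - 32), t1);
  _mm_storeu_si128 ((__m128i *) (dend - 48), t2);
  _mm_storeu_si128 ((__m128i *) (dend - 64), t3);
  return dst;
}

The point of the trick is that the head and tail stores at the end absorb
all misalignment and remainder bytes, so the loop itself needs no tail
handling and no extra branch like the one in L(mm_recalc_len).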

I've tried this approach for the memmove backward loop as well. It gives
about a 2.5% benefit for the x86_64 implementation I am also working on
(tested on Silvermont and Haswell), but neither benefit nor regression in
this i686 case (tested on Haswell, Silvermont, Ivy Bridge, Sandy Bridge
and Westmere). I decided to use it anyway because it is simpler.
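
Applied to the backward case, the same idea looks roughly like the sketch
below (SSE2 intrinsics again, assuming len >= 64 and a destination above
the source; the name copy_bwd_sketch is only illustrative, this is not
the assembly from the patch):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Save the first 64 and last 16 bytes in registers before the loop,
   copy whole 64-byte chunks from the end toward the start with aligned
   stores, then write the saved bytes at the very end.  */
static void *
copy_bwd_sketch (void *dst, const void *src, size_t len)
{
  const char *s = src;
  char *d = dst;

  __m128i tail = _mm_loadu_si128 ((const __m128i *) (s + len - 16));
  __m128i h0 = _mm_loadu_si128 ((const __m128i *) s);
  __m128i h1 = _mm_loadu_si128 ((const __m128i *) (s + 16));
  __m128i h2 = _mm_loadu_si128 ((const __m128i *) (s + 32));
  __m128i h3 = _mm_loadu_si128 ((const __m128i *) (s + 48));

  /* Round the end of the destination down to a 16-byte boundary and
     move the end of the source by the same amount.  */
  char *dend = (char *) ((uintptr_t) (d + len) & ~(uintptr_t) 15);
  const char *send = s + (dend - d);

  /* Copy whole 64-byte chunks backward with aligned stores.  All four
     loads precede the stores, so an overlapping chunk is read before
     it can be clobbered.  */
  size_t n = (size_t) (dend - d) >> 6;
  for (; n != 0; n--)
    {
      send -= 64;
      dend -= 64;
      __m128i x0 = _mm_loadu_si128 ((const __m128i *) send);
      __m128i x1 = _mm_loadu_si128 ((const __m128i *) (send + 16));
      __m128i x2 = _mm_loadu_si128 ((const __m128i *) (send + 32));
      __m128i x3 = _mm_loadu_si128 ((const __m128i *) (send + 48));
      _mm_store_si128 ((__m128i *) dend, x0);
      _mm_store_si128 ((__m128i *) (dend + 16), x1);
      _mm_store_si128 ((__m128i *) (dend + 32), x2);
      _mm_store_si128 ((__m128i *) (dend + 48), x3);
    }

  /* The saved head and tail cover whatever the loop did not reach.  */
  _mm_storeu_si128 ((__m128i *) (d + len - 16), tail);
  _mm_storeu_si128 ((__m128i *) d, h0);
  _mm_storeu_si128 ((__m128i *) (d + 16), h1);
  _mm_storeu_si128 ((__m128i *) (d + 32), h2);
  _mm_storeu_si128 ((__m128i *) (d + 48), h3);
  return dst;
}
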
New patch and performance test results attached.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: perf_memmove_sse2_unaligned_i386_v3.tar.gz
Type: application/x-gzip
Size: 4316634 bytes
Desc: not available
URL: <http://sourceware.org/pipermail/libc-alpha/attachments/20140805/bc76583a/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memcpy_mempcpy_memmove_with_chk_sse2_unaligned_i386_v3.patch
Type: application/octet-stream
Size: 26294 bytes
Desc: not available
URL: <http://sourceware.org/pipermail/libc-alpha/attachments/20140805/bc76583a/attachment.obj>

