This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Hello, Ondrej,

> > > +L(mm_recalc_len):
> > > +/* Compute in %ecx how many bytes are left to copy after
> > > +   the main loop stops.  */
> > > +	movl	%ebx, %ecx
> > > +	subl	%edx, %ecx
> > > +	jmp	L(mm_len_0_or_more_backward)
> > > +
> > > That also looks slow as it adds unpredictable branch. On x64 we read
> > > start and end into registers before loop starts, and write these
> > > registers when it ends.
> > > If you align to 16 bytes instead 64 you need only 4 registers to save
> > > end, 4 working and save 16 bytes at start into stack.
> >
> > Not very clear what do you mean. On x64 where are no memmove and no
> > backward case...
> >
> I was referring to memcpy as same trick could be applied to memmove
> backward case. A forward loop looks like this but there could be problem
> that you run out of registers.
>
> 	movdqu	-16(%rsi,%rdx), %xmm4
> 	movdqu	-32(%rsi,%rdx), %xmm5
> 	movdqu	-48(%rsi,%rdx), %xmm6
> 	movdqu	-64(%rsi,%rdx), %xmm7
> 	lea	(%rdi, %rdx), %r10
> 	movdqu	(%rsi), %xmm8
> 	movq	%rdi, %rcx
> 	subq	%rsi, %rcx
> 	cmpq	%rdx, %rcx
> 	jb	.Lbwd
> 	leaq	16(%rdi), %rdx
> 	andq	$-16, %rdx
> 	movq	%rdx, %rcx
> 	subq	%rdi, %rcx
> 	addq	%rcx, %rsi
> 	movq	%r10, %rcx
> 	subq	%rdx, %rcx
> 	shrq	$6, %rcx
> 	.p2align 4
> .Lloop:
> 	movdqu	(%rsi), %xmm0
> 	movdqu	16(%rsi), %xmm1
> 	movdqu	32(%rsi), %xmm2
> 	movdqu	48(%rsi), %xmm3
> 	movdqa	%xmm0, (%rdx)
> 	addq	$64, %rsi
> 	movdqa	%xmm1, 16(%rdx)
> 	movdqa	%xmm2, 32(%rdx)
> 	movdqa	%xmm3, 48(%rdx)
> 	addq	$64, %rdx
> 	sub	$1, %rcx
> 	jnz	.Lloop
> 	movdqu	%xmm8, (%rdi)
> 	movdqu	%xmm4, -16(%r10)
> 	movdqu	%xmm5, -32(%r10)
> 	movdqu	%xmm6, -48(%r10)
> 	movdqu	%xmm7, -64(%r10)
> 	ret

I have also tried this approach for the memmove backward loop. It gives
about a 2.5% benefit for the x86_64 implementation I am also working on
(tested on Silvermont and Haswell), and neither a benefit nor a regression
in this i686 case (tested on Haswell, Silvermont, Ivy Bridge, Sandy Bridge
and Westmere). I decided to use it anyway because it is simpler. The new
patch and performance test results are attached.
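In C, the head/tail-saving trick described above looks roughly like the
sketch below. This is only an illustration, not the patch code: the function
name copy_head_tail_saved is made up, and it assumes len >= 64 and
non-overlapping buffers, with the head and tail arrays standing in for
%xmm8 and %xmm4-%xmm7 in the assembly.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a forward copy that saves the unaligned head and tail up
   front so no remainder has to be recomputed after the main loop.
   Assumes len >= 64 and that the buffers do not overlap.  */
static void
copy_head_tail_saved (unsigned char *dst, const unsigned char *src,
                      size_t len)
{
  unsigned char head[16], tail[64];

  /* Save the unaligned head and tail first; in the assembly these stay
     in registers (%xmm8 and %xmm4-%xmm7).  */
  memcpy (head, src, 16);
  memcpy (tail, src + len - 64, 64);

  /* Round dst up to the next 16-byte boundary and advance src by the
     same amount, mirroring the leaq/andq/subq sequence above.  */
  size_t skew = 16 - ((uintptr_t) dst & 15);
  unsigned char *d = dst + skew;
  const unsigned char *s = src + skew;

  /* Main loop: whole 64-byte blocks, destination 16-byte aligned.  */
  for (size_t n = (size_t) ((dst + len) - d) >> 6; n != 0; n--)
    {
      memcpy (d, s, 64);
      d += 64;
      s += 64;
    }

  /* Writing the saved head and tail last covers whatever the loop did
     not reach.  */
  memcpy (dst, head, 16);
  memcpy (dst + len - 64, tail, 64);
}

Storing the saved head and tail after the loop is what removes the
L(mm_recalc_len)-style remainder computation and its hard-to-predict
branch; the same idea applies to the backward memmove loop with the
roles of head and tail swapped.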
Attachment:
perf_memmove_sse2_unaligned_i386_v3.tar.gz
Description: GNU Zip compressed data
Attachment:
memcpy_mempcpy_memmove_with_chk_sse2_unaligned_i386_v3.patch
Description: Binary data