This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Hi Ondrej,

2014-07-08 22:54 GMT+04:00 Ondrej Bilka <neleai@seznam.cz>:
> On Tue, Jul 08, 2014 at 03:25:26PM +0400, Andrew Senkevich wrote:
>> >
>> > Does that prefetch improve performance? On x64 it harmed performance,
>> > and 128 bytes looks too small to matter.
>> > +
>> > +    prefetcht0 -128(%edi, %esi)
>> > +
>> > +    movdqu  -64(%edi, %esi), %xmm0
>> > +    movdqu  -48(%edi, %esi), %xmm1
>> > +    movdqu  -32(%edi, %esi), %xmm2
>> > +    movdqu  -16(%edi, %esi), %xmm3
>> > +    movdqa  %xmm0, -64(%edi)
>> > +    movdqa  %xmm1, -48(%edi)
>> > +    movdqa  %xmm2, -32(%edi)
>> > +    movdqa  %xmm3, -16(%edi)
>> > +    leal    -64(%edi), %edi
>> > +    cmp     %edi, %ebx
>> > +    jb      L(mm_main_loop_backward)
>> > +L(mm_main_loop_backward_end):
>> > +    POP (%edi)
>> > +    POP (%esi)
>> > +    jmp     L(mm_recalc_len)
>>
>> Disabling prefetch here and in the case below leads to a 10% degradation
>> on Silvermont on 3 tests. On Haswell performance is almost the same.
>>
> I had Silvermont optimization in my todo list, it needs a separate
> implementation as it behaves differently from most other architectures.

But this implementation is already optimized for Silvermont.

The memcpy improvement on Silvermont on long lengths is:

rand:          +6.2%
rand_L2:       +6.1%
rand_L3:       +3%
rand_nocache:  +2.5%
rand_noicache: +20.8%

and on short lengths:

rand:          +14.1%
rand_L2:       +21.5%
rand_L3:       +8%
rand_nocache:  +14.5%
rand_noicache: +33.9%

This is the data from the tarballs attached the first time. The memmove
improvement on Silvermont is even better, and other architectures also see
a very good performance benefit.

> From the testing that I have done it looks like simply using a rep movsq
> is faster for strings up to around 1024 bytes.

How would that compare with the current performance results?

>> > +L(mm_recalc_len):
>> > +/* Compute in %ecx how many bytes are left to copy after
>> > +   the main loop stops.  */
>> > +    movl    %ebx, %ecx
>> > +    subl    %edx, %ecx
>> > +    jmp     L(mm_len_0_or_more_backward)
>> > +
>> > That also looks slow as it adds an unpredictable branch. On x64 we read
>> > the start and end into registers before the loop starts, and write those
>> > registers out when it ends.
>> > If you align to 16 bytes instead of 64 you need only 4 registers to save
>> > the end, 4 working registers, and you save the 16 bytes at the start on
>> > the stack.
>>
>> It is not very clear what you mean. On x64 there is no memmove and no
>> backward case...
>>
> I was referring to memcpy, as the same trick could be applied to the
> memmove backward case. A forward loop looks like this, but there could be
> a problem that you run out of registers.
>
>     movdqu  -16(%rsi,%rdx), %xmm4
>     movdqu  -32(%rsi,%rdx), %xmm5
>     movdqu  -48(%rsi,%rdx), %xmm6
>     movdqu  -64(%rsi,%rdx), %xmm7
>     lea     (%rdi, %rdx), %r10
>     movdqu  (%rsi), %xmm8
>
>     movq    %rdi, %rcx
>     subq    %rsi, %rcx
>     cmpq    %rdx, %rcx
>     jb      .Lbwd
>
>     leaq    16(%rdi), %rdx
>     andq    $-16, %rdx
>     movq    %rdx, %rcx
>     subq    %rdi, %rcx
>     addq    %rcx, %rsi
>     movq    %r10, %rcx
>     subq    %rdx, %rcx
>     shrq    $6, %rcx
>
>     .p2align 4
> .Lloop:
>     movdqu  (%rsi), %xmm0
>     movdqu  16(%rsi), %xmm1
>     movdqu  32(%rsi), %xmm2
>     movdqu  48(%rsi), %xmm3
>     movdqa  %xmm0, (%rdx)
>     addq    $64, %rsi
>     movdqa  %xmm1, 16(%rdx)
>     movdqa  %xmm2, 32(%rdx)
>     movdqa  %xmm3, 48(%rdx)
>     addq    $64, %rdx
>     sub     $1, %rcx
>     jnz     .Lloop
>     movdqu  %xmm8, (%rdi)
>     movdqu  %xmm4, -16(%r10)
>     movdqu  %xmm5, -32(%r10)
>     movdqu  %xmm6, -48(%r10)
>     movdqu  %xmm7, -64(%r10)
>     ret

Clear. But how much performance benefit do you think could be obtained?
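Coming back to the rep movsq suggestion above: a minimal sketch of that idea
for the 32-bit case would look roughly like the following. This is only an
illustration of the approach (register usage and tail handling are assumed),
not code from the patch:

/* Sketch only: copy %ecx bytes from (%esi) to (%edi), assuming the
   direction flag is clear and the registers are already set up.  */
    movl    %ecx, %eax      /* remember the byte count */
    shrl    $2, %ecx        /* number of 4-byte words */
    rep movsl               /* copy %ecx dwords, advancing %esi/%edi */
    movl    %eax, %ecx
    andl    $3, %ecx        /* 0-3 trailing bytes */
    rep movsb               /* copy the tail */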
>> >
>> > +    movdqu  %xmm0, (%edx)
>> > +    movdqu  %xmm1, 16(%edx)
>> > +    movdqu  %xmm2, 32(%edx)
>> > +    movdqu  %xmm3, 48(%edx)
>> > +    movdqa  %xmm4, (%edi)
>> > +    movaps  %xmm5, 16(%edi)
>> > +    movaps  %xmm6, 32(%edi)
>> > +    movaps  %xmm7, 48(%edi)
>> > Why did you add floating point moves here?
>>
>> Because movaps with an offset is 4 bytes long, which improves
>> instruction alignment and code size.
>> I also used it in more places.
>>
>> > +
>> > +/* We should stop two iterations before the termination
>> > +   (in order not to misprefetch).  */
>> > +    subl    $64, %ecx
>> > +    cmpl    %ebx, %ecx
>> > +    je      L(main_loop_just_one_iteration)
>> > +
>> > +    subl    $64, %ecx
>> > +    cmpl    %ebx, %ecx
>> > +    je      L(main_loop_last_two_iterations)
>> > +
>> > Same comment: prefetching is unlikely to help, so you need to show that
>> > it helps versus a variant where you omit it.
>>
>> Disabling prefetching here gives a degradation of up to -11% on Silvermont;
>> on Haswell there are no significant changes.
>>
>> > +
>> > +    .p2align 4
>> > +L(main_loop_large_page):
>> >
>> > However, here prefetching should help as it is sufficiently large; also
>> > the loads could be nontemporal.
>>
>> Prefetching here gives no significant performance change (all results
>> attached).
>>
>> Could you clarify which nontemporal loads you mean? Unaligned loads are
>> needed here, but I know only of aligned nontemporal loads.
>> It is also not clear why prefetch should help with nontemporal access...
>>
> Try prefetchnta.

prefetchnta also gives no improvement on Silvermont for memcpy (and a -4%
degradation on one test). On Haswell the benefit on small lengths is less
than 0.9%, with a 1.4% degradation on one test on long lengths.

>
>> Attaching the edited patch.
>> The 32-bit build was tested with no new regressions.

Attaching the latest measurements.
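For reference on the movaps point above: the byte encodings below illustrate
why the movaps store form with a small offset is one byte shorter than movdqa
(movdqa carries a 66 operand-size prefix). The encodings are shown for the
disp8 forms used in the hunk above, purely as an illustration:

    movdqa  %xmm4, (%edi)      /* 66 0f 7f 27    - 4 bytes, no displacement */
    movdqa  %xmm5, 16(%edi)    /* 66 0f 7f 6f 10 - 5 bytes with disp8 */
    movaps  %xmm5, 16(%edi)    /* 0f 29 6f 10    - 4 bytes with disp8 */

Both stores move the same 16 aligned bytes; the saving is purely in code size.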
Attachment: results_memcpy_slm.tar.bz2 (BZip2 compressed data)
Attachment: results_memcpy_hsw.tar.bz2 (BZip2 compressed data)