[PATCH 5/5] X86-64: Optimize memmove-vec-unaligned-erms.S
Noah Goldstein
goldstein.w.n@gmail.com
Tue Aug 24 09:12:15 GMT 2021
On Tue, Aug 24, 2021 at 4:29 AM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:
> No bug. This commit optimizes memmove-vec-unaligned-erms.S.
>
> The optimizations are, in descending order of importance, to
> L(less_vec), L(movsb), the 8x forward/backward loops, and various
> target alignments that have minimal code size impact.
>
> The L(less_vec) optimizations are to:
>
> 1. Readjust the branch order to either give hotter paths a fall
>    through case or put fewer branches in their way (see the C
>    sketch after this list).
> 2. Moderately change the size classes to make hot branches hotter
> and thus increase predictability.
> 3. Try to minimize branch aliasing to avoid misses from BPU
>    thrashing.
> 4. 64 byte align the function entry. This is to avoid cases where
>    seemingly unrelated changes end up having severe negative
>    performance impacts.
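>
> As an illustration of 1. and 2., here is a minimal C sketch of the
> reordered dispatch (mine, not code from the patch; it assumes
> VEC_SIZE == 32 and GCC/Clang's __uint128_t, and mirrors the
> overlapping head/tail trick of the COPY_BLOCK macro defined in the
> diff below):
>
>     #include <stddef.h>
>     #include <stdint.h>
>     #include <string.h>
>
>     /* Copy n bytes, sizeof (T) <= n <= 2 * sizeof (T), with one load
>        from the start and one load ending at the last byte (the two
>        may overlap).  */
>     #define COPY_BLOCK_C(T, dst, src, n)                            \
>       do                                                            \
>         {                                                           \
>           T head_, tail_;                                           \
>           memcpy (&head_, (src), sizeof (T));                       \
>           memcpy (&tail_, (src) + (n) - sizeof (T), sizeof (T));    \
>           memcpy ((dst), &head_, sizeof (T));                       \
>           memcpy ((dst) + (n) - sizeof (T), &tail_, sizeof (T));    \
>         }                                                           \
>       while (0)
>
>     /* Dispatch for n <= VEC_SIZE: the hot [16, 32] class falls
>        through with no taken branch; colder classes branch away.  */
>     static void
>     copy_less_vec (char *dst, const char *src, size_t n)
>     {
>       if (n < 8)                  /* [0, 7]: coldest size classes.  */
>         {
>           if (n >= 4)
>             COPY_BLOCK_C (uint32_t, dst, src, n);
>           else if (n >= 2)
>             {
>               /* [2, 3]: 2-byte head plus the last byte.  */
>               memcpy (dst, src, 2);
>               dst[n - 1] = src[n - 1];
>             }
>           else if (n == 1)
>             *dst = *src;
>           return;
>         }
>       if (n < 16)                 /* [8, 15]: one taken branch.  */
>         {
>           COPY_BLOCK_C (uint64_t, dst, src, n);
>           return;
>         }
>       /* [16, 32]: hottest class, fall-through path.  */
>       COPY_BLOCK_C (__uint128_t, dst, src, n);
>     }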
>
> The L(movsb) optimizations are to:
>
> 1. Reduce the number of taken branches needed to determine if
> movsb should be used.
> 2. 64 byte align dst if the CPU has fsrm or if dst and src do
>    not 4k alias (see the C sketch after this list).
> 3. 64 byte align src if the CPU does not have fsrm and dst and src
>    do 4k alias.
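>
> A C sketch of the selection between 2. and 3. (my illustration only;
> the real code keeps dst - src in a register and branches directly,
> and should_align_dst is a hypothetical name):
>
>     #include <stdbool.h>
>     #include <stddef.h>
>     #include <stdint.h>
>
>     #define PAGE_SIZE 4096
>
>     /* Decide which pointer to 64 byte align before "rep movsb".
>        dst and src "4k alias" when their page offsets are nearly
>        equal, which is the slow case for movsb without FSRM.  */
>     static bool
>     should_align_dst (bool cpu_has_fsrm, const char *dst,
>                       const char *src)
>     {
>       size_t diff = (size_t) ((uintptr_t) dst - (uintptr_t) src);
>       bool no_4k_alias = (diff & (PAGE_SIZE - 512)) != 0;
>       /* Align dst if the CPU has FSRM or there is no 4k aliasing;
>          otherwise (no FSRM and 4k aliasing) align src.  */
>       return cpu_has_fsrm || no_4k_alias;
>     }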
>
> The 8x forward/backward loop optimizations are to:
>
> 1. Reduce instructions needed for aligning to VEC_SIZE (see the C
>    sketch after this list).
> 2. Reduce uops and code size of the loops.
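>
> Concretely, 1. replaces the mask/subtract alignment sequence with an
> "inclusive" or-based align. A C sketch of both pointer calculations
> (mine; unsigned arithmetic stands in for the register math):
>
>     #include <stdint.h>
>
>     #define VEC_SIZE 32  /* illustrative; set by the ifunc variant */
>
>     /* Old scheme: compute the negative misalignment offset, then
>        adjust src, dst and len separately.  */
>     static void
>     align_fwd_old (uintptr_t *dst, uintptr_t *src, uintptr_t *len)
>     {
>       uintptr_t off = (*dst & (VEC_SIZE - 1)) - VEC_SIZE;
>       *src -= off;
>       *dst -= off;
>       *len += off;
>     }
>
>     /* New scheme: or dst up to the last byte of its current vector,
>        then a single increment lands on the next aligned vector.  src
>        is carried as src - dst, so it is restored with one lea.  */
>     static void
>     align_fwd_new (uintptr_t *dst, uintptr_t *src)
>     {
>       uintptr_t src_minus_dst = *src - *dst;
>       *dst |= VEC_SIZE - 1;      /* last byte of the current vector */
>       *src = *dst + src_minus_dst + 1;
>       *dst += 1;                 /* first byte of the next vector */
>     }
>
> The new scheme also drops the running length: the loop instead
> compares dst against a precomputed end pointer, which is part of the
> uop and code size savings in 2.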
>
> All tests in string/ passing.
> ---
> See performance data attached.
> Included benchmarks: memcpy-random, memcpy, memmove, memcpy-walk,
> memmove-walk, memcpy-large
>
> The first page is a summary with the ifunc selection version for
> erms/non-erms for each computer. The following 4 sheets then contain
> all the numbers for sse2 and avx for Skylake and for sse2, avx2,
> evex, and avx512 for Tigerlake.
>
> Benchmark CPUs:
>
> Skylake:
>
> https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html
>
> Tigerlake:
>
> https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
>
> All times are geometric mean of N=30.
>
> "Cur" refers to the current implementation "New" refers to this
> patches implementation
>
> Score refers to new/cur (low means improvement, high means
> degradation). Scores are color coded: the more green the better, the
> more red the worse.
>
>
> Some notes on the numbers:
>
> In my opinion most of the benchmarks where src/dst align are in [0,
> 64] have some unpredictable and unfortunate noise from non-obvious
> false dependencies between stores to dst and the next iteration's
> loads from src. For example in the 8x forward case, the store of
> VEC(4) will end up stalling the next iteration's load queue, so if
> size was large enough that the beginning of dst was flushed from L1
> this can have a seemingly random but significant impact on the
> benchmark result.
>
> There are significant performance improvements/degradations in the
> [0, VEC_SIZE] range. I didn't treat these as important as I think in
> this size range the branch pattern indicated by the random tests is
> more important. On the random tests the new implementation performs
> significantly better.
>
> I also added logic to align before L(movsb) (see the C sketch
> below). If you look at the new random benchmarks with fixed size
> this leads to roughly a 10-20% performance improvement for some hot
> sizes. I am not 100% convinced this is needed, as larger copies that
> would go to movsb are generally already aligned, but even in the
> fixed loop cases, especially on Skylake without FSRM, it seems
> aligning before movsb pays off. Let me know if you think this is
> unnecessary.
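>
> A C sketch of that bookkeeping (my illustration; memcpy stands in
> for both the VEC head save/restore and for "rep movsb" itself, and
> MOVSB_ALIGN_TO mirrors the macro added by the patch):
>
>     #include <stdint.h>
>     #include <string.h>
>
>     #define MOVSB_ALIGN_TO 64
>
>     /* Save the unaligned head, round dst up to a 64 byte boundary,
>        bulk-copy the aligned remainder, then store the saved head
>        back over the gap.  Safe because this path requires
>        len > 2 * VEC_SIZE, so the head never extends past the copied
>        region, and the movsb path has already ruled out overlap.  */
>     static void
>     movsb_align_dst (char *dst, const char *src, size_t len)
>     {
>       char head[MOVSB_ALIGN_TO];
>       memcpy (head, src, sizeof head);       /* VMOVU %VEC(0)[, %VEC(1)] */
>       char *adst = (char *) (((uintptr_t) dst + MOVSB_ALIGN_TO - 1)
>                              & ~(uintptr_t) (MOVSB_ALIGN_TO - 1));
>       size_t skip = (size_t) (adst - dst);
>       memcpy (adst, src + skip, len - skip); /* stands in for rep movsb */
>       memcpy (dst, head, sizeof head);       /* restore unaligned head */
>     }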
>
> There are occasional performance degradations at odd spots throughout
> the medium range sizes in the fixed memcpy benchmarks. I think
> generally there is more good than harm here and at the moment I don't
> have an explanation for why these certain configurations seem to
> perform worse. On the plus side, however, it also seems that there are
> unexplained improvements of the same magnitude patterned with the
> degradations (and both are sparse) so I ultimately believe it should
> be acceptable. If this is not the case let me know.
>
> The memmove benchmarks look a bit worse, especially for the erms
> case. Part of this is from the nop cases which I didn't treat as
> important. But part of it is also because, to optimize for what I
> expect to be the common case of no overlap, the overlap case has extra
> branches and overhead. I think this is inevitable when implementing
> memmove and memcpy in the same file, but if this is unacceptable let
> me know.
>
>
> Note: I ran the benchmarks before two changes that made it into the
> final version:
>
> -#if !defined USE_MULTIARCH || !IS_IN (libc)
> -L(nop):
> - ret
> -#else
> + VMOVU %VEC(1), -VEC_SIZE(%rdi, %rdx)
> VZEROUPPER_RETURN
> -#endif
>
>
>
> And
>
> + testl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
> - andl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
>
>
> I don't think either of these should have any impact.
>
> I made the former change because I think it was a bug that could cause
> use of avx2 without vzeroupper, and the latter because andl with a
> memory destination is a read-modify-write of __x86_string_control
> (testl only reads), which I think could cause issues on multicore
> platforms.
>
>
> sysdeps/x86/sysdep.h | 13 +-
> .../multiarch/memmove-vec-unaligned-erms.S | 484 +++++++++++-------
> 2 files changed, 317 insertions(+), 180 deletions(-)
>
> diff --git a/sysdeps/x86/sysdep.h b/sysdeps/x86/sysdep.h
> index cac1d762fb..9226d2c6c9 100644
> --- a/sysdeps/x86/sysdep.h
> +++ b/sysdeps/x86/sysdep.h
> @@ -78,15 +78,18 @@ enum cf_protection_level
> #define ASM_SIZE_DIRECTIVE(name) .size name,.-name;
>
> /* Define an entry point visible from C. */
> -#define ENTRY(name) \
> -  .globl C_SYMBOL_NAME(name); \
> -  .type C_SYMBOL_NAME(name),@function; \
> -  .align ALIGNARG(4); \
> +#define P2ALIGN_ENTRY(name, alignment) \
> +  .globl C_SYMBOL_NAME(name); \
> +  .type C_SYMBOL_NAME(name),@function; \
> +  .align ALIGNARG(alignment); \
>    C_LABEL(name) \
>    cfi_startproc; \
> -  _CET_ENDBR; \
> +  _CET_ENDBR; \
>    CALL_MCOUNT
>
> +#define ENTRY(name) P2ALIGN_ENTRY(name, 4)
> +
> +
> #undef END
> #define END(name) \
>   cfi_endproc; \
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index 9f02624375..75b6efe969 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -165,6 +165,32 @@
> # error Invalid LARGE_LOAD_SIZE
> #endif
>
> +/* Whether to align before movsb. Ultimately we want 64 byte align
> + and it is not worth loading 4x VEC for VEC_SIZE == 16. */
> +#define ALIGN_MOVSB (VEC_SIZE > 16)
> +
> +/* Number of VECs to align movsb to. */
> +#if VEC_SIZE == 64
> +# define MOVSB_ALIGN_TO (VEC_SIZE)
> +#else
> +# define MOVSB_ALIGN_TO (VEC_SIZE * 2)
> +#endif
> +
> +/* Macro for copying inclusive power of 2 range with two register
> + loads. */
> +#define COPY_BLOCK(mov_inst, src_reg, dst_reg, size_reg, len, tmp_reg0, tmp_reg1) \
> + mov_inst (%src_reg), %tmp_reg0; \
> + mov_inst -(len)(%src_reg, %size_reg), %tmp_reg1; \
> + mov_inst %tmp_reg0, (%dst_reg); \
> + mov_inst %tmp_reg1, -(len)(%dst_reg, %size_reg);
> +
> +/* Define all copies used by L(less_vec) for VEC_SIZE of 16, 32, or
> + 64. */
> +#define COPY_4_8 COPY_BLOCK(movl, rsi, rdi, rdx, 4, ecx, esi)
> +#define COPY_8_16 COPY_BLOCK(movq, rsi, rdi, rdx, 8, rcx, rsi)
> +#define COPY_16_32 COPY_BLOCK(vmovdqu, rsi, rdi, rdx, 16, xmm0, xmm1)
> +#define COPY_32_64 COPY_BLOCK(vmovdqu64, rsi, rdi, rdx, 32, ymm16, ymm17)
> +
> #ifndef SECTION
> # error SECTION is not defined!
> #endif
> @@ -198,7 +224,13 @@ L(start):
> movl %edx, %edx
> # endif
> cmp $VEC_SIZE, %RDX_LP
> + /* Based on SPEC2017 distribution both 16 and 32 memcpy calls are
> + really hot so we want them to take the same branch path. */
> +#if VEC_SIZE > 16
> + jbe L(less_vec)
> +#else
> jb L(less_vec)
> +#endif
> cmp $(VEC_SIZE * 2), %RDX_LP
> ja L(more_2x_vec)
> #if !defined USE_MULTIARCH || !IS_IN (libc)
> @@ -206,15 +238,10 @@ L(last_2x_vec):
> #endif
> /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE. */
> VMOVU (%rsi), %VEC(0)
> - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(1)
> + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(1)
> VMOVU %VEC(0), (%rdi)
> - VMOVU %VEC(1), -VEC_SIZE(%rdi,%rdx)
> -#if !defined USE_MULTIARCH || !IS_IN (libc)
> -L(nop):
> - ret
> -#else
> + VMOVU %VEC(1), -VEC_SIZE(%rdi, %rdx)
> VZEROUPPER_RETURN
> -#endif
> #if defined USE_MULTIARCH && IS_IN (libc)
> END (MEMMOVE_SYMBOL (__memmove, unaligned))
>
> @@ -289,7 +316,9 @@ ENTRY (MEMMOVE_CHK_SYMBOL (__memmove_chk, unaligned_erms))
> END (MEMMOVE_CHK_SYMBOL (__memmove_chk, unaligned_erms))
> # endif
>
> -ENTRY (MEMMOVE_SYMBOL (__memmove, unaligned_erms))
> +/* Cache align entry so that branch heavy L(less_vec) maintains good
> + alignment. */
> +P2ALIGN_ENTRY (MEMMOVE_SYMBOL (__memmove, unaligned_erms), 6)
> movq %rdi, %rax
> L(start_erms):
> # ifdef __ILP32__
> @@ -297,123 +326,217 @@ L(start_erms):
> movl %edx, %edx
> # endif
> cmp $VEC_SIZE, %RDX_LP
> + /* Based on SPEC2017 distribution both 16 and 32 memcpy calls are
> + really hot so we want them to take the same branch path. */
> +# if VEC_SIZE > 16
> + jbe L(less_vec)
> +# else
> jb L(less_vec)
> +# endif
> cmp $(VEC_SIZE * 2), %RDX_LP
> ja L(movsb_more_2x_vec)
> L(last_2x_vec):
> - /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE. */
> + /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE. */
> VMOVU (%rsi), %VEC(0)
> - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(1)
> + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(1)
> VMOVU %VEC(0), (%rdi)
> - VMOVU %VEC(1), -VEC_SIZE(%rdi,%rdx)
> + VMOVU %VEC(1), -VEC_SIZE(%rdi, %rdx)
> L(return):
> -#if VEC_SIZE > 16
> +# if VEC_SIZE > 16
> ZERO_UPPER_VEC_REGISTERS_RETURN
> -#else
> +# else
> ret
> +# endif
> #endif
> +#if VEC_SIZE == 64
> +L(copy_8_15):
> + COPY_8_16
> + ret
>
> -L(movsb):
> - cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP
> - jae L(more_8x_vec)
> - cmpq %rsi, %rdi
> - jb 1f
> - /* Source == destination is less common. */
> - je L(nop)
> - leaq (%rsi,%rdx), %r9
> - cmpq %r9, %rdi
> - /* Avoid slow backward REP MOVSB. */
> - jb L(more_8x_vec_backward)
> -# if AVOID_SHORT_DISTANCE_REP_MOVSB
> - andl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
> - jz 3f
> - movq %rdi, %rcx
> - subq %rsi, %rcx
> - jmp 2f
> -# endif
> -1:
> -# if AVOID_SHORT_DISTANCE_REP_MOVSB
> - andl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
> - jz 3f
> - movq %rsi, %rcx
> - subq %rdi, %rcx
> -2:
> -/* Avoid "rep movsb" if RCX, the distance between source and destination,
> - is N*4GB + [1..63] with N >= 0. */
> - cmpl $63, %ecx
> - jbe L(more_2x_vec) /* Avoid "rep movsb" if ECX <= 63. */
> -3:
> -# endif
> - mov %RDX_LP, %RCX_LP
> - rep movsb
> -L(nop):
> +L(copy_33_63):
> + COPY_32_64
> ret
> #endif
> -
> + /* Only worth aligning if near end of 16 byte block and won't get
> + first branch in first decode after jump. */
> + .p2align 4,, 6
> L(less_vec):
> - /* Less than 1 VEC. */
> #if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> # error Unsupported VEC_SIZE!
> #endif
> -#if VEC_SIZE > 32
> - cmpb $32, %dl
> - jae L(between_32_63)
> + /* Second set of branches for smallest copies. */
> + cmpl $(VEC_SIZE / 4), %edx
> + jb L(less_quarter_vec)
> +
> + cmpl $(VEC_SIZE / 2), %edx
> +#if VEC_SIZE == 64
> + /* We branch to [33, 63] instead of [16, 32] to give [16, 32] fall
> + through path as [16, 32] is hotter. */
> + ja L(copy_33_63)
> + COPY_16_32
> +#elif VEC_SIZE == 32
> + /* Branch to [8, 15]. Fall through to [16, 32]. */
> + jb L(copy_8_15)
> + COPY_16_32
> +#else
> + /* Branch to [4, 7]. Fall through to [8, 15]. */
> + jb L(copy_4_7)
> + COPY_8_16
> #endif
> -#if VEC_SIZE > 16
> - cmpb $16, %dl
> - jae L(between_16_31)
> -#endif
> - cmpb $8, %dl
> - jae L(between_8_15)
> - cmpb $4, %dl
> - jae L(between_4_7)
> - cmpb $1, %dl
> - ja L(between_2_3)
> - jb 1f
> + ret
> + /* Align if won't cost too many bytes. */
> + .p2align 4,, 6
> +L(copy_4_7):
> + COPY_4_8
> + ret
> +
> + /* Cold target. No need to align. */
> +L(copy_1):
> movzbl (%rsi), %ecx
> movb %cl, (%rdi)
> -1:
> ret
> +
> + /* Colder copy case for [0, VEC_SIZE / 4 - 1]. */
> +L(less_quarter_vec):
> #if VEC_SIZE > 32
> -L(between_32_63):
> - /* From 32 to 63. No branch when size == 32. */
> - VMOVU (%rsi), %YMM0
> - VMOVU -32(%rsi,%rdx), %YMM1
> - VMOVU %YMM0, (%rdi)
> - VMOVU %YMM1, -32(%rdi,%rdx)
> - VZEROUPPER_RETURN
> + cmpl $8, %edx
> + jae L(copy_8_15)
> #endif
> #if VEC_SIZE > 16
> - /* From 16 to 31. No branch when size == 16. */
> -L(between_16_31):
> - VMOVU (%rsi), %XMM0
> - VMOVU -16(%rsi,%rdx), %XMM1
> - VMOVU %XMM0, (%rdi)
> - VMOVU %XMM1, -16(%rdi,%rdx)
> - VZEROUPPER_RETURN
> -#endif
> -L(between_8_15):
> - /* From 8 to 15. No branch when size == 8. */
> - movq -8(%rsi,%rdx), %rcx
> - movq (%rsi), %rsi
> - movq %rcx, -8(%rdi,%rdx)
> - movq %rsi, (%rdi)
> - ret
> -L(between_4_7):
> - /* From 4 to 7. No branch when size == 4. */
> - movl -4(%rsi,%rdx), %ecx
> - movl (%rsi), %esi
> - movl %ecx, -4(%rdi,%rdx)
> - movl %esi, (%rdi)
> + cmpl $4, %edx
> + jae L(copy_4_7)
> +#endif
> + cmpl $1, %edx
> + je L(copy_1)
> + jb L(copy_0)
> + /* Fall through into copy [2, 3] as it is more common than [0, 1]. */
> + movzwl (%rsi), %ecx
> + movzbl -1(%rsi, %rdx), %esi
> + movw %cx, (%rdi)
> + movb %sil, -1(%rdi, %rdx)
> +L(copy_0):
> ret
> -L(between_2_3):
> - /* From 2 to 3. No branch when size == 2. */
> - movzwl -2(%rsi,%rdx), %ecx
> - movzwl (%rsi), %esi
> - movw %cx, -2(%rdi,%rdx)
> - movw %si, (%rdi)
> +
> + .p2align 4
> +#if VEC_SIZE == 32
> +L(copy_8_15):
> + COPY_8_16
> ret
> + /* COPY_8_16 is exactly 17 bytes so don't want to p2align after as
> + it wastes 15 bytes of code and 1 byte off is fine. */
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +L(movsb):
> + movq %rdi, %rcx
> + subq %rsi, %rcx
> + /* Go to the backward temporal copy if there is any overlap, as
> + backward movsb is slow. */
> + cmpq %rdx, %rcx
> + /* L(more_8x_vec_backward_check_nop) checks for src == dst. */
> + jb L(more_8x_vec_backward_check_nop)
> + /* If above __x86_rep_movsb_stop_threshold most likely is candidate
> + for NT moves as well. */
> + cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP
> + jae L(large_memcpy_2x_check)
> +# if ALIGN_MOVSB
> + VMOVU (%rsi), %VEC(0)
> +# if MOVSB_ALIGN_TO > VEC_SIZE
> + VMOVU VEC_SIZE(%rsi), %VEC(1)
> +# endif
> +# if MOVSB_ALIGN_TO > (VEC_SIZE * 2)
> +# error Unsupported MOVSB_ALIGN_TO
> +# endif
> + /* Store dst for use after rep movsb. */
> + movq %rdi, %r8
> +# endif
> +# if AVOID_SHORT_DISTANCE_REP_MOVSB
> + /* Only avoid short movsb if CPU has FSRM. */
> + testl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
> + jz L(skip_short_movsb_check)
> + /* Avoid "rep movsb" if RCX, the distance between source and
> + destination, is N*4GB + [1..63] with N >= 0. */
> +
> + /* ecx contains dst - src. Early check for backward copy conditions
> + means only case of slow movsb with src = dst + [0, 63] is ecx in
> + [-63, 0]. Use unsigned comparison with -64 check for that case. */
> + cmpl $-64, %ecx
> + ja L(more_8x_vec_forward)
> +# endif
> +# if ALIGN_MOVSB
> + /* Fall through means cpu has FSRM. In that case exclusively align
> + destination. */
> +
> + /* Subtract dst from src. Add back after dst aligned. */
> + subq %rdi, %rsi
> + /* Add dst to len. Subtract back after dst aligned. */
> + leaq (%rdi, %rdx), %rcx
> + /* Exclusively align dst to MOVSB_ALIGN_TO (64). */
> + addq $(MOVSB_ALIGN_TO - 1), %rdi
> + andq $-(MOVSB_ALIGN_TO), %rdi
> + /* Restore src and len adjusted with new values for aligned dst. */
> + addq %rdi, %rsi
> + subq %rdi, %rcx
>
> + rep movsb
> + VMOVU %VEC(0), (%r8)
> +# if MOVSB_ALIGN_TO > VEC_SIZE
> + VMOVU %VEC(1), VEC_SIZE(%r8)
> +# endif
> + VZEROUPPER_RETURN
> +L(movsb_align_dst):
> + /* Subtract dst from src. Add back after dst aligned. */
> + subq %rdi, %rsi
> + /* Add dst to len. Subtract back after dst aligned. -1 because dst
> + is initially aligned to MOVSB_ALIGN_TO - 1. */
> + leaq -(1)(%rdi, %rdx), %rcx
> + /* Inclusively align dst to MOVSB_ALIGN_TO - 1. */
> + orq $(MOVSB_ALIGN_TO - 1), %rdi
> + leaq 1(%rdi, %rsi), %rsi
> + /* Restore src and len adjusted with new values for aligned dst. */
> + subq %rdi, %rcx
> + /* Finish aligning dst. */
> + incq %rdi
> + rep movsb
> + VMOVU %VEC(0), (%r8)
> +# if MOVSB_ALIGN_TO > VEC_SIZE
> + VMOVU %VEC(1), VEC_SIZE(%r8)
> +# endif
> + VZEROUPPER_RETURN
> +
> +L(skip_short_movsb_check):
> + /* If the CPU does not have FSRM there are two options for
> + aligning. Align src if dst and src 4k alias. Otherwise align
> + dst. */
> + testl $(PAGE_SIZE - 512), %ecx
> + jnz L(movsb_align_dst)
> + /* rcx already has dst - src. */
> + movq %rcx, %r9
> + /* Add src to len. Subtract back after src aligned. -1 because src
> + is initially aligned to MOVSB_ALIGN_TO - 1. */
> + leaq -(1)(%rsi, %rdx), %rcx
> + /* Inclusively align src to MOVSB_ALIGN_TO - 1. */
> + orq $(MOVSB_ALIGN_TO - 1), %rsi
> + /* Restore dst and len adjusted with new values for aligned src. */
> + leaq 1(%rsi, %r9), %rdi
> + subq %rsi, %rcx
> + /* Finish aligning src. */
> + incq %rsi
> + rep movsb
> + VMOVU %VEC(0), (%r8)
> +# if MOVSB_ALIGN_TO > VEC_SIZE
> + VMOVU %VEC(1), VEC_SIZE(%r8)
> +# endif
> + VZEROUPPER_RETURN
> +# else
> + /* Not aligning for rep movsb so just copy. */
> + mov %RDX_LP, %RCX_LP
> + rep movsb
> + ret
> +# endif
> +#endif
> + /* Align if doesn't cost too many bytes. */
> + .p2align 4,, 6
> #if defined USE_MULTIARCH && IS_IN (libc)
> L(movsb_more_2x_vec):
> cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> @@ -426,50 +549,60 @@ L(more_2x_vec):
> ja L(more_8x_vec)
> cmpq $(VEC_SIZE * 4), %rdx
> jbe L(last_4x_vec)
> - /* Copy from 4 * VEC + 1 to 8 * VEC, inclusively. */
> + /* Copy from 4 * VEC + 1 to 8 * VEC, inclusively. */
> VMOVU (%rsi), %VEC(0)
> VMOVU VEC_SIZE(%rsi), %VEC(1)
> VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
> VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
> - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(4)
> - VMOVU -(VEC_SIZE * 2)(%rsi,%rdx), %VEC(5)
> - VMOVU -(VEC_SIZE * 3)(%rsi,%rdx), %VEC(6)
> - VMOVU -(VEC_SIZE * 4)(%rsi,%rdx), %VEC(7)
> + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(4)
> + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(5)
> + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(6)
> + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(7)
> VMOVU %VEC(0), (%rdi)
> VMOVU %VEC(1), VEC_SIZE(%rdi)
> VMOVU %VEC(2), (VEC_SIZE * 2)(%rdi)
> VMOVU %VEC(3), (VEC_SIZE * 3)(%rdi)
> - VMOVU %VEC(4), -VEC_SIZE(%rdi,%rdx)
> - VMOVU %VEC(5), -(VEC_SIZE * 2)(%rdi,%rdx)
> - VMOVU %VEC(6), -(VEC_SIZE * 3)(%rdi,%rdx)
> - VMOVU %VEC(7), -(VEC_SIZE * 4)(%rdi,%rdx)
> + VMOVU %VEC(4), -VEC_SIZE(%rdi, %rdx)
> + VMOVU %VEC(5), -(VEC_SIZE * 2)(%rdi, %rdx)
> + VMOVU %VEC(6), -(VEC_SIZE * 3)(%rdi, %rdx)
> + VMOVU %VEC(7), -(VEC_SIZE * 4)(%rdi, %rdx)
> VZEROUPPER_RETURN
> + /* Align if it doesn't cost too much code size. 6 bytes so that
> + after a jump to the target a full mov instruction can always be
> + fetched. */
> + .p2align 4,, 6
> L(last_4x_vec):
> - /* Copy from 2 * VEC + 1 to 4 * VEC, inclusively. */
> + /* Copy from 2 * VEC + 1 to 4 * VEC, inclusively. */
> VMOVU (%rsi), %VEC(0)
> VMOVU VEC_SIZE(%rsi), %VEC(1)
> - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(2)
> - VMOVU -(VEC_SIZE * 2)(%rsi,%rdx), %VEC(3)
> + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(2)
> + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(3)
> VMOVU %VEC(0), (%rdi)
> VMOVU %VEC(1), VEC_SIZE(%rdi)
> - VMOVU %VEC(2), -VEC_SIZE(%rdi,%rdx)
> - VMOVU %VEC(3), -(VEC_SIZE * 2)(%rdi,%rdx)
> + VMOVU %VEC(2), -VEC_SIZE(%rdi, %rdx)
> + VMOVU %VEC(3), -(VEC_SIZE * 2)(%rdi, %rdx)
> + /* Keep nop target close to jmp for 2-byte encoding. */
> +L(nop):
> VZEROUPPER_RETURN
> -
> + /* Align if doesn't cost too much code size. */
> + .p2align 4,, 10
> L(more_8x_vec):
> /* Check if non-temporal move candidate. */
> #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> /* Check non-temporal store threshold. */
> - cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> ja L(large_memcpy_2x)
> #endif
> - /* Entry if rdx is greater than non-temporal threshold but there
> - is overlap. */
> + /* Entry if rdx is greater than non-temporal threshold but there is
> + overlap. */
> L(more_8x_vec_check):
> cmpq %rsi, %rdi
> ja L(more_8x_vec_backward)
> /* Source == destination is less common. */
> je L(nop)
> + /* Entry if rdx is greater than movsb or stop movsb threshold but
> + there is overlap with dst > src. */
> +L(more_8x_vec_forward):
> /* Load the first VEC and last 4 * VEC to support overlapping
> addresses. */
> VMOVU (%rsi), %VEC(4)
> @@ -477,22 +610,18 @@ L(more_8x_vec_check):
> VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(6)
> VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(7)
> VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(8)
> - /* Save start and stop of the destination buffer. */
> - movq %rdi, %r11
> - leaq -VEC_SIZE(%rdi, %rdx), %rcx
> - /* Align destination for aligned stores in the loop. Compute
> - how much destination is misaligned. */
> - movq %rdi, %r8
> - andq $(VEC_SIZE - 1), %r8
> - /* Get the negative of offset for alignment. */
> - subq $VEC_SIZE, %r8
> - /* Adjust source. */
> - subq %r8, %rsi
> - /* Adjust destination which should be aligned now. */
> - subq %r8, %rdi
> - /* Adjust length. */
> - addq %r8, %rdx
> -
> + /* Subtract dst from src. Add back after dst aligned. */
> + subq %rdi, %rsi
> + /* Store end of buffer minus tail in rdx. */
> + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> + /* Save beginning of dst. */
> + movq %rdi, %rcx
> + /* Align dst to VEC_SIZE - 1. */
> + orq $(VEC_SIZE - 1), %rdi
> + /* Restore src adjusted with new value for aligned dst. */
> + leaq 1(%rdi, %rsi), %rsi
> + /* Finish aligning dst. */
> + incq %rdi
> .p2align 4
> L(loop_4x_vec_forward):
> /* Copy 4 * VEC a time forward. */
> @@ -501,23 +630,27 @@ L(loop_4x_vec_forward):
> VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
> VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
> subq $-(VEC_SIZE * 4), %rsi
> - addq $-(VEC_SIZE * 4), %rdx
> VMOVA %VEC(0), (%rdi)
> VMOVA %VEC(1), VEC_SIZE(%rdi)
> VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi)
> VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi)
> subq $-(VEC_SIZE * 4), %rdi
> - cmpq $(VEC_SIZE * 4), %rdx
> + cmpq %rdi, %rdx
> ja L(loop_4x_vec_forward)
> /* Store the last 4 * VEC. */
> - VMOVU %VEC(5), (%rcx)
> - VMOVU %VEC(6), -VEC_SIZE(%rcx)
> - VMOVU %VEC(7), -(VEC_SIZE * 2)(%rcx)
> - VMOVU %VEC(8), -(VEC_SIZE * 3)(%rcx)
> + VMOVU %VEC(5), (VEC_SIZE * 3)(%rdx)
> + VMOVU %VEC(6), (VEC_SIZE * 2)(%rdx)
> + VMOVU %VEC(7), VEC_SIZE(%rdx)
> + VMOVU %VEC(8), (%rdx)
> /* Store the first VEC. */
> - VMOVU %VEC(4), (%r11)
> + VMOVU %VEC(4), (%rcx)
> + /* Keep nop target close to jmp for 2-byte encoding. */
> +L(nop2):
> VZEROUPPER_RETURN
> -
> + /* Entry from the failed movsb case. Still need to test if dst - src == 0. */
> +L(more_8x_vec_backward_check_nop):
> + testq %rcx, %rcx
> + jz L(nop2)
> L(more_8x_vec_backward):
> /* Load the first 4 * VEC and last VEC to support overlapping
> addresses. */
> @@ -525,49 +658,50 @@ L(more_8x_vec_backward):
> VMOVU VEC_SIZE(%rsi), %VEC(5)
> VMOVU (VEC_SIZE * 2)(%rsi), %VEC(6)
> VMOVU (VEC_SIZE * 3)(%rsi), %VEC(7)
> - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(8)
> - /* Save stop of the destination buffer. */
> - leaq -VEC_SIZE(%rdi, %rdx), %r11
> - /* Align destination end for aligned stores in the loop. Compute
> - how much destination end is misaligned. */
> - leaq -VEC_SIZE(%rsi, %rdx), %rcx
> - movq %r11, %r9
> - movq %r11, %r8
> - andq $(VEC_SIZE - 1), %r8
> - /* Adjust source. */
> - subq %r8, %rcx
> - /* Adjust the end of destination which should be aligned now. */
> - subq %r8, %r9
> - /* Adjust length. */
> - subq %r8, %rdx
> -
> - .p2align 4
> + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(8)
> + /* Subtract dst from src. Add back after dst aligned. */
> + subq %rdi, %rsi
> + /* Save beginning of buffer. */
> + movq %rdi, %rcx
> + /* Set dst to beginning of region to copy. -1 for inclusive
> + alignment. */
> + leaq (VEC_SIZE * -4 + -1)(%rdi, %rdx), %rdi
> + /* Align dst. */
> + andq $-(VEC_SIZE), %rdi
> + /* Restore src. */
> + addq %rdi, %rsi
> + /* Don't use multi-byte nop to align. */
> + .p2align 4,, 11
> L(loop_4x_vec_backward):
> /* Copy 4 * VEC a time backward. */
> - VMOVU (%rcx), %VEC(0)
> - VMOVU -VEC_SIZE(%rcx), %VEC(1)
> - VMOVU -(VEC_SIZE * 2)(%rcx), %VEC(2)
> - VMOVU -(VEC_SIZE * 3)(%rcx), %VEC(3)
> - addq $-(VEC_SIZE * 4), %rcx
> - addq $-(VEC_SIZE * 4), %rdx
> - VMOVA %VEC(0), (%r9)
> - VMOVA %VEC(1), -VEC_SIZE(%r9)
> - VMOVA %VEC(2), -(VEC_SIZE * 2)(%r9)
> - VMOVA %VEC(3), -(VEC_SIZE * 3)(%r9)
> - addq $-(VEC_SIZE * 4), %r9
> - cmpq $(VEC_SIZE * 4), %rdx
> - ja L(loop_4x_vec_backward)
> + VMOVU (VEC_SIZE * 3)(%rsi), %VEC(0)
> + VMOVU (VEC_SIZE * 2)(%rsi), %VEC(1)
> + VMOVU (VEC_SIZE * 1)(%rsi), %VEC(2)
> + VMOVU (VEC_SIZE * 0)(%rsi), %VEC(3)
> + addq $(VEC_SIZE * -4), %rsi
> + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi)
> + VMOVA %VEC(1), (VEC_SIZE * 2)(%rdi)
> + VMOVA %VEC(2), (VEC_SIZE * 1)(%rdi)
> + VMOVA %VEC(3), (VEC_SIZE * 0)(%rdi)
> + addq $(VEC_SIZE * -4), %rdi
> + cmpq %rdi, %rcx
> + jb L(loop_4x_vec_backward)
> /* Store the first 4 * VEC. */
> - VMOVU %VEC(4), (%rdi)
> - VMOVU %VEC(5), VEC_SIZE(%rdi)
> - VMOVU %VEC(6), (VEC_SIZE * 2)(%rdi)
> - VMOVU %VEC(7), (VEC_SIZE * 3)(%rdi)
> + VMOVU %VEC(4), (%rcx)
> + VMOVU %VEC(5), VEC_SIZE(%rcx)
> + VMOVU %VEC(6), (VEC_SIZE * 2)(%rcx)
> + VMOVU %VEC(7), (VEC_SIZE * 3)(%rcx)
> /* Store the last VEC. */
> - VMOVU %VEC(8), (%r11)
> + VMOVU %VEC(8), -VEC_SIZE(%rdx, %rcx)
> VZEROUPPER_RETURN
>
> #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> .p2align 4
> + /* Entry if rdx is greater than the stop movsb threshold (usually
> + set to the non-temporal threshold). */
> +L(large_memcpy_2x_check):
> + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + jb L(more_8x_vec_forward)
> L(large_memcpy_2x):
> /* Compute absolute value of difference between source and
> destination. */
> --
> 2.25.1
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: skl-tgl-summary.tar.gz
Type: application/gzip
Size: 932674 bytes
Desc: not available
URL: <https://sourceware.org/pipermail/libc-alpha/attachments/20210824/0a6ffff4/attachment-0001.gz>