The use of vzeroupper in, for example, strcmp on an AVX2-capable machine like Skylake-X causes HTM aborts when used inside transactions. This causes severe performance degradation for some workloads compared to glibc without those multiarch implementations. For one specific benchmark the following hack restores performance (as does removing the VZEROUPPER or replacing it with the far more costly VZEROALL):

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index ee82fa3e19..208b396557 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -127,7 +127,8 @@ L(return):
 	movzbl	(%rsi, %rdx), %edx
 	subl	%edx, %eax
 # endif
-	VZEROUPPER
+	vpxor	%ymm0, %ymm0, %ymm0
+	vpxor	%ymm1, %ymm1, %ymm1
 	ret

	.p2align 4
FWIW, the (proprietary) benchmark regresses by 40% (!) when using the AVX2 strcmp routines, even though strcmp accounts for only about 1% of the overall runtime. So the transaction aborts caused by vzeroupper here are extremely costly.
More correctly, the wcscmp path ends here with higher %ymm regs in use:

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index ee82fa3e19..bd3b6243e2 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -122,13 +122,16 @@ L(wcscmp_return):
 	negl	%eax
 	orl	$1, %eax
 L(return):
+	VZEROUPPER
+	ret
 # else
 	movzbl	(%rdi, %rdx), %eax
 	movzbl	(%rsi, %rdx), %edx
 	subl	%edx, %eax
-# endif
-	VZEROUPPER
+	vpxor	%ymm0, %ymm0, %ymm0
+	vpxor	%ymm1, %ymm1, %ymm1
 	ret
+# endif

 	.p2align 4
 L(return_vec_size):
(In reply to Richard Biener from comment #2)
> More correct, the wcscmp path ends here with higher %ymm regs used
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> index ee82fa3e19..bd3b6243e2 100644
> [...]

These won't remove the AVX-SSE transition penalty. I am re-implementing all
AVX string/memory functions with YMM16-YMM31, which don't need VZEROUPPER.
My current work is on the users/hjl/pr27457/evex branch at

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr27457/evex
On February 27, 2021 3:39:50 AM GMT+01:00, "hjl.tools at gmail dot com" <sourceware-bugzilla@sourceware.org> wrote:
>https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
>--- Comment #3 from H.J. Lu <hjl.tools at gmail dot com> ---
>[...]
>These won't remove the AVX-SSE transition penalty. I am re-implementing
>all AVX string/memory functions with YMM16-YMM31, which don't need
>VZEROUPPER.

It should still avoid any false dependences. But %ymm16 to %ymm31 are only
available with AVX-512; that will make the AVX2 string functions unusable on
non-AVX512 hardware. Are you introducing another set of functions then?

>My current work is on users/hjl/pr27457/evex branch at
>
>https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr27457/evex
(In reply to rguenther from comment #4)
> It should still avoid any false dependences. Ymm16 to ymm31 are only
> available with AVX512, that will make the AVX2 string functions unusable
> on non-avx512 hardware. Are you introducing another set of functions then?

I added another set of AVX/AVX2 functions to support RTM. You can use
libcpu-rt-c.so:

https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region

to try it out.
(In reply to H.J. Lu from comment #5)
> I added another set of AVX/AVX2 functions to support RTM. You can use
> libcpu-rt-c.so:
>
> https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region
>
> to try it out.

The sources do not seem to show the selection logic.
__memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
instruction, which can't be good for general application performance.

Anyway, is there anything we can realistically do here on CPUs with AVX2,
but without AVX-512? VPXOR doesn't avoid the transition penalty according
to the Intel documentation. Switching to 128-bit registers once the CPU
supports RTM is also likely to introduce performance regressions in
non-transactional code.
On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
> --- Comment #6 from Florian Weimer <fweimer at redhat dot com> ---
> The sources do not seem to show the selection logic.
> __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> instruction, which can't be good for general application performance.

The intent was to use %ymm16+ only, which does not cause any transition
penalty even w/o vzeroupper.

> Anyway, is there anything we can realistically do here on CPUs with AVX2, but
> without AVX-512? VPXOR doesn't avoid the transition penalty according to the
> Intel documentation. Switching to 128-bit registers once the CPU supports RTM
> is also likely to introduce performance regressions in non-transactional code.

There's the option to use another path when HTM is available: do xtest and
branch to the non-AVX variants when a transaction is active. I'm not sure
about the overhead of xtest here. Nor am I sure whether Intel has (or plans
any) SKUs with HTM but not AVX-512. Since HTM on Broadwell and Haswell is
crippled due to bugs (and thus usually disabled in firmware), this leaves
Skylake and later, where I don't know of any HTM-but-no-AVX512 SKUs. But
this is Intel, so ...
xtest has a few cycles of latency.

RTM is not disabled in firmware by default on HSX-EX, BDW-EP, BDW-EX, SKX,
CLX, CPX (server SKUs). This is different on client SKUs.

Examples of RTM but no-AVX512 SKUs: HSX-EX, BDW-EP, BDW-EX.

-- 
Roman
(In reply to rguenther from comment #7)
> On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > The sources do not seem to show the selection logic.
> > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > instruction, which can't be good for general application performance.
>
> The intent was to use %ymm16+ only which does not cause any transition
> penalty even w/o vzeroupper.

I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
AVX2, there isn't much choice. I didn't try to reverse-engineer the
corresponding IFUNC selector.

> There's the option to use another path when HTM is available, doing
> xtest and branch to the non-AVX variants when a transaction is active.
> I'm not sure about the overhead of xtest here. Nor am I sure whether
> Intel has (or plans any) SKUs with HTM but not AVX512. Since HTM
> on broadwell + haswell is crippled due to bugs (and thus usually
> disabled in firmware) this leaves Skylake and later where I don't
> know of any HTM but no-AVX512 SKUs. But this is Intel, so ...

Skylake is an overloaded term. There is e.g. Xeon E3-1240 v5, which is
advertised as having the Skylake microarchitecture, but it doesn't have
AVX-512. (I haven't checked RTM status, but I don't see why it wouldn't be
supported.) Xeon E3-1240 v6 is Kaby Lake, but it doesn't have AVX-512
either, and according to what I see, RTM is not disabled by at least one
current distribution kernel/microcode combination.

But I think the gist is that RTM without AVX-512 exists out there even in
server parts.
(In reply to Florian Weimer from comment #6)
> The sources do not seem to show the selection logic.
> __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> instruction, which can't be good for general application performance.

memset-avx2-unaligned-erms-rtm.S has

#define ZERO_UPPER_VEC_REGISTERS_RETURN \
  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST

0000000000000040 <__memset_avx2_unaligned_rtm>:
  40:	f3 0f 1e fa          	endbr64
  44:	c5 f9 6e c6          	vmovd  %esi,%xmm0
  48:	48 89 f8             	mov    %rdi,%rax
  4b:	c4 e2 7d 78 c0       	vpbroadcastb %xmm0,%ymm0
  50:	48 83 fa 20          	cmp    $0x20,%rdx
  54:	0f 82 0b 01 00 00    	jb     165 <__memset_avx2_unaligned_erms_rtm+0xc5>
  5a:	48 83 fa 40          	cmp    $0x40,%rdx
  5e:	77 75                	ja     d5 <__memset_avx2_unaligned_erms_rtm+0x35>
  60:	c5 fe 7f 44 17 e0    	vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  66:	c5 fe 7f 07          	vmovdqu %ymm0,(%rdi)
  6a:	e9 84 00 00 00       	jmp    f3 <__memset_avx2_unaligned_erms_rtm+0x53>
  6f:	90                   	nop
	...

00000000000000a0 <__memset_avx2_unaligned_erms_rtm>:
  a0:	f3 0f 1e fa          	endbr64
  a4:	c5 f9 6e c6          	vmovd  %esi,%xmm0
  a8:	48 89 f8             	mov    %rdi,%rax
  ab:	c4 e2 7d 78 c0       	vpbroadcastb %xmm0,%ymm0
  b0:	48 83 fa 20          	cmp    $0x20,%rdx
  b4:	0f 82 ab 00 00 00    	jb     165 <__memset_avx2_unaligned_erms_rtm+0xc5>
  ba:	48 83 fa 40          	cmp    $0x40,%rdx
  be:	77 0c                	ja     cc <__memset_avx2_unaligned_erms_rtm+0x2c>
  c0:	c5 fe 7f 44 17 e0    	vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  c6:	c5 fe 7f 07          	vmovdqu %ymm0,(%rdi)
  ca:	eb 27                	jmp    f3 <__memset_avx2_unaligned_erms_rtm+0x53>
  cc:	48 3b 15 00 00 00 00 	cmp    0x0(%rip),%rdx        # d3 <__memset_avx2_unaligned_erms_rtm+0x33>
  d3:	77 9f                	ja     74 <__memset_avx2_erms_rtm+0x4>
  d5:	48 81 fa 80 00 00 00 	cmp    $0x80,%rdx
  dc:	77 22                	ja     100 <__memset_avx2_unaligned_erms_rtm+0x60>
  de:	c5 fe 7f 07          	vmovdqu %ymm0,(%rdi)
  e2:	c5 fe 7f 47 20       	vmovdqu %ymm0,0x20(%rdi)
  e7:	c5 fe 7f 44 17 e0    	vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  ed:	c5 fe 7f 44 17 c0    	vmovdqu %ymm0,-0x40(%rdi,%rdx,1)
  f3:	0f 01 d6             	xtest
  f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
  f8:	c5 fc 77             	vzeroall
  fb:	c3                   	ret
  fc:	c5 f8 77             	vzeroupper
  ff:	c3                   	ret
(In reply to Florian Weimer from comment #9)
> (In reply to rguenther from comment #7)
> > The intent was to use %ymm16+ only which does not cause any transition
> > penalty even w/o vzeroupper.
>
> I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> AVX2, there isn't much choice. I didn't try to reverse-engineer the
> corresponding IFUNC selector.

At function exit, there is

  f3:	0f 01 d6             	xtest
  f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
  f8:	c5 fc 77             	vzeroall
  fb:	c3                   	ret
  fc:	c5 f8 77             	vzeroupper
  ff:	c3                   	ret

> [...]
>
> But I think the gist is that RTM without AVX-512 exists out there even in
> server parts.

It is handled by

#define ZERO_UPPER_VEC_REGISTERS_RETURN \
  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
(In reply to Florian Weimer from comment #9)
> I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> AVX2, there isn't much choice. I didn't try to reverse-engineer the
> corresponding IFUNC selector.

memset-evex-unaligned-erms.S has

# define VEC_SIZE	32
# define XMM0		xmm16
# define YMM0		ymm16
# define VEC0		ymm16
# define VEC(i)		VEC##i
# define VMOVU		vmovdqu64
# define VMOVA		vmovdqa64
# define VZEROUPPER

There is no %ymm0 and no vzeroupper.
(In reply to rguenther from comment #7)
> There's the option to use another path when HTM is available, doing
> xtest and branch to the non-AVX variants when a transaction is active.
> I'm not sure about the overhead of xtest here.

The overhead is low. Can you verify it?
(In reply to H.J. Lu from comment #11)
> (In reply to Florian Weimer from comment #9)
> > I still saw %ymm0 usage in the disassembly, if I recall correctly.
> > [...]
>
> At function exit, there is
>
>   f3:	0f 01 d6             	xtest
>   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
>   f8:	c5 fc 77             	vzeroall
>   fb:	c3                   	ret
>   fc:	c5 f8 77             	vzeroupper
>   ff:	c3                   	ret

Note that, according to Agner, vzeroall decodes to 20 uops on Haswell, for
example, while vzeroupper only requires 4. On Skylake it's even worse (34
uops). For short sizes (as in our benchmark, which had 16-31 byte strcmp)
this might be a bigger difference than using the SSE2 variant off an early
xtest result. That said, why not, for HTM + AVX2 CPUs, have an intermediate
dispatcher between the AVX2 and the SSE variants using xtest? That leaves
the actual implementations unchanged and thus with known performance
characteristics.
(In reply to Richard Biener from comment #14)
> [...] That said, why not, for HTM + AVX2 CPUs,
> have an intermediate dispatcher between the AVX2 and the SSE variant
> using xtest? That leaves the actual implementations unchanged and thus
> with known performance characteristics?

It is implemented on the users/hjl/pr27457/wrapper branch:

https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper

There are 2 problems:

1. Many RTM tests failed for other reasons.
2. Even with the vzeroall overhead, the AVX version may still be faster
   than the SSE version.
On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
> --- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> ---
> It is implemented on users/hjl/pr27457/wrapper branch:
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
>
> There are 2 problems:
>
> 1. Many RTM tests failed for other reasons.
> 2. Even with vzeroall overhead, AVX version may still be faster than
> SSE version.

And the SSE version may still be faster than the AVX version with vzeroall.

I guess we should mostly care about optimizing for "modern" CPUs, which
likely means HTM + AVX-512; that should already be optimal on your branches
by using %ymm16+. So we're talking about the "legacy" AVX2 + HTM path.

And there I think we should optimize the path that is _not_ in a
transaction, since that will be 99% of the cases. Which to me means using
the proven, tuned (on their respective ISA subsets) SSE2 and AVX2 variants
and simply switching between them based on xtest. Yeah, so strcmp of a
large string inside a transaction might not run at optimal AVX2 speed. But
it will be faster than before the xtest dispatch, since before that it
would have aborted the transaction.
(In reply to H.J. Lu from comment #11)
> At function exit, there is
>
>   f3:	0f 01 d6             	xtest
>   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
>   f8:	c5 fc 77             	vzeroall
>   fb:	c3                   	ret
>   fc:	c5 f8 77             	vzeroupper
>   ff:	c3                   	ret

Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction,
will the speculative execution of vzeroupper abort the transaction, or does
it only abort the transaction when it retires?
(In reply to rguenther from comment #16)
> On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> > It is implemented on users/hjl/pr27457/wrapper branch:
> >
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> >
> > There are 2 problems:
> >
> > 1. Many RTM tests failed for other reasons.
> > 2. Even with vzeroall overhead, AVX version may still be faster than
> > SSE version.
>
> And the SSE version may still be faster than the AVX version with
> vzeroall.
Here is some data:

Function: strcmp
Variant: default
                                 __strcmp_avx2  __strcmp_sse2_unaligned
length=14, align1=14, align2=14:         11.36          17.50
length=14, align1=14, align2=14:         11.36          15.59
length=14, align1=14, align2=14:         11.43          15.55
length=15, align1=15, align2=15:         11.36          17.42
length=15, align1=15, align2=15:         11.96          17.41
length=15, align1=15, align2=15:         11.36          16.97
length=16, align1=16, align2=16:         11.36          18.58
length=16, align1=16, align2=16:         11.36          17.41
length=16, align1=16, align2=16:         11.43          17.34
length=17, align1=17, align2=17:         11.36          21.37
length=17, align1=17, align2=17:         11.36          18.52
length=17, align1=17, align2=17:         11.36          17.94
length=18, align1=18, align2=18:         11.36          19.73
length=18, align1=18, align2=18:         11.36          19.20
length=18, align1=18, align2=18:         11.36          19.13
length=19, align1=19, align2=19:         11.36          20.38
length=19, align1=19, align2=19:         11.36          19.39
length=19, align1=19, align2=19:         11.36          20.39
length=20, align1=20, align2=20:         11.36          21.53
length=20, align1=20, align2=20:         11.36          20.98
length=20, align1=20, align2=20:         11.36          20.93
length=21, align1=21, align2=21:         11.36          22.83
length=21, align1=21, align2=21:         11.36          22.26
length=21, align1=21, align2=21:         11.36          22.25
length=22, align1=22, align2=22:         11.43          23.37
length=22, align1=22, align2=22:         11.36          22.78
length=22, align1=22, align2=22:         12.29          22.12
length=23, align1=23, align2=23:         11.36          24.63
length=23, align1=23, align2=23:         12.53          23.97
length=23, align1=23, align2=23:         11.36          23.97
length=24, align1=24, align2=24:         11.36          24.52
length=24, align1=24, align2=24:         11.36          43.47
length=24, align1=24, align2=24:         11.36          44.47
length=25, align1=25, align2=25:         11.36          39.50
length=25, align1=25, align2=25:         11.36          48.97
length=25, align1=25, align2=25:         11.36          48.53
length=26, align1=26, align2=26:         11.36          47.87
length=26, align1=26, align2=26:         11.36          47.20
length=26, align1=26, align2=26:         11.36          47.15
length=27, align1=27, align2=27:         11.36          50.90
length=27, align1=27, align2=27:         11.44          49.98
length=27, align1=27, align2=27:         11.36          49.77
length=28, align1=28, align2=28:         11.36          49.74
length=28, align1=28, align2=28:         11.36          48.86
length=28, align1=28, align2=28:         11.36          49.08
length=29, align1=29, align2=29:         11.36          52.74
length=29, align1=29, align2=29:         11.36          54.04
length=29, align1=29, align2=29:         11.36          29.49
length=30, align1=30, align2=30:         11.36          50.91
length=30, align1=30, align2=30:         11.36          51.09
length=30, align1=30, align2=30:         11.36          51.13
length=31, align1=31, align2=31:         12.36          54.33
length=31, align1=31, align2=31:         11.36          53.49
length=31, align1=31, align2=31:         11.36          53.29
length=16, align1=0, align2=0:           11.36          18.02
length=16, align1=0, align2=0:           11.36          18.58
length=16, align1=0, align2=0:           11.36          17.34
length=16, align1=0, align2=0:           11.44          19.88
length=16, align1=0, align2=0:           11.36          16.74
length=16, align1=0, align2=0:           11.36          17.42
length=16, align1=0, align2=3:           11.36          17.34
length=16, align1=3, align2=4:           11.36          17.34
length=32, align1=0, align2=0:           12.29          61.07
length=32, align1=0, align2=0:           12.63          61.08
length=32, align1=0, align2=0:           11.36          60.48
length=32, align1=0, align2=0:           11.36          60.48
length=32, align1=0, align2=0:           11.36          60.40
length=32, align1=0, align2=0:           11.36          60.40
length=32, align1=0, align2=4:           11.36          60.40
length=32, align1=4, align2=5:           12.10          59.72

> I guess we should mostly care about optimizing for "modern" CPUs
> which likely means HTM + AVX512 which should be already optimal
> on your branches by using %ymm16+. So we're talking about
> the "legacy" AVX2 + HTM path.
>
> And there I think we should optimize the path that is _not_ in
> a transaction since that will be 99% of the cases. Which to
> me means using the proven tuned (on their respective ISA subsets)
> SSE2 and AVX2 variants and simply switch between them based on
> xtest. Yeah, so strcmp of a large string inside a transaction

I tried it and I got RTM aborts for other reasons.

> might not run at optimal AVX2 speed. But it will be faster
> than before the xtest dispatch since before that it would have
> aborted the transaction.
Please give my current approach a try.
(In reply to Richard Biener from comment #17)
> Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction,
> will the speculative execution of vzeroupper abort the transaction, or
> does it only abort the transaction when it retires?

My branch includes an RTM test:

  for (i = 0; i < loop; i++)
    {
      if (_xbegin () == _XBEGIN_STARTED)
	{
	  failed |= function ();
	  _xend ();
	}
      else
	{
	  failed |= function ();
	  ++naborts;
	}
    }

It passes with xtest + je.
(In reply to H.J. Lu from comment #18)
> (In reply to rguenther from comment #16)
> > On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> > > It is implemented on users/hjl/pr27457/wrapper branch:
> > >
> > > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> > >
> > > There are 2 problems:
> > >
> > > 1. Many RTM tests failed for other reasons.
> > > 2. Even with vzeroall overhead, AVX version may still be faster than
> > > SSE version.
> >
> > And the SSE version may still be faster than the AVX version with
> > vzeroall.
> Here is some data:
>
> Function: strcmp
> Variant: default
>                                  __strcmp_avx2  __strcmp_sse2_unaligned
> length=14, align1=14, align2=14:         11.36          17.50
> [...]
> length=32, align1=0, align2=4:           11.36          60.40
> length=32, align1=4, align2=5:           12.10          59.72

That's with or without the vzeroall actually executing?

> > I guess we should mostly care about optimizing for "modern" CPUs
> > which likely means HTM + AVX512 which should be already optimal
> > on your branches by using %ymm16+. So we're talking about
> > the "legacy" AVX2 + HTM path.
> >
> > And there I think we should optimize the path that is _not_ in
> > a transaction since that will be 99% of the cases. Which to
> > me means using the proven tuned (on their respective ISA subsets)
> > SSE2 and AVX2 variants and simply switch between them based on
> > xtest. Yeah, so strcmp of a large string inside a transaction
>
> I tried it and I got RTM aborts for other reasons.
>
> > might not run at optimal AVX2 speed. But it will be faster
> > than before the xtest dispatch since before that it would have
> > aborted the transaction.
>
> Please give my current approach a try.

Well, I know that even unconditionally doing vzeroall will fix our observed
regression, since the time is dominated by all the other code inside the
transaction, which is then retried a few times (and always re-fails with
vzeroupper); the strcmp part is just ~1%. I also only have AVX-512 hardware
with HTM, so I can't easily test the AVX2 + HTM path.

That said, I'm fine with the xtest/vzero{all,upper} epilogue. But I have
also received reports of libmicro regressions for strcpy with length 10
(32-byte aligned), hot cache, when using AVX2 vs. SSE2 (SSE2 being faster
by 20%). [note strcpy, not strcmp here] Yeah, stupid benchmark ... but it
likely shows that for small lengths every detail matters in case you want
to shave off the last ns.
(In reply to H.J. Lu from comment #19)
> (In reply to Richard Biener from comment #17)
> > (In reply to H.J. Lu from comment #11)
> > > (In reply to Florian Weimer from comment #9)
> > > > (In reply to rguenther from comment #7)
> > > > > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > > > > The sources do not seem to show the selection logic.
> > > > > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > > > > instruction, which can't be good for general application performance.
> > > > >
> > > > > The intent was to use %ymm16+ only, which does not cause any transition
> > > > > penalty even w/o vzeroupper.
> > > >
> > > > I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> > > > AVX2, there isn't much choice. I didn't try to reverse-engineer the
> > > > corresponding IFUNC selector.
> > >
> > > At function exit, there is
> > >
> > >   f3: 0f 01 d6  xtest
> > >   f6: 74 04     je fc <__memset_avx2_unaligned_erms_rtm+0x5c>
> > >   f8: c5 fc 77  vzeroall
> > >   fb: c3        ret
> > >   fc: c5 f8 77  vzeroupper
> > >   ff: c3        ret
> >
> > Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction,
> > will the speculative execution of vzeroupper abort the transaction, or
> > does it only abort the transaction when retired?
>
> My branch includes an RTM test:
>
>   for (i = 0; i < loop; i++)
>     {
>       if (_xbegin () == _XBEGIN_STARTED)
>         {
>           failed |= function ();
>           _xend ();
>         }
>       else
>         {
>           failed |= function ();
>           ++naborts;
>         }
>     }
>
> It passes with xtest + je.

That's good to hear. Does it still work when you add an unconditional, non-transactional

  failed |= function ();

before the loop? Just trying to make sure we actually mispredict the xtest + je.
(In reply to Richard Biener from comment #21)
> That's good to hear. Does it still work when you add an unconditional,
> non-transactional
>
>   failed |= function ();
>
> before the loop? Just trying to make sure we actually mispredict the
> xtest + je.

I will give it a try.
(In reply to H.J. Lu from comment #22)
> (In reply to Richard Biener from comment #21)
> > That's good to hear. Does it still work when you add an unconditional,
> > non-transactional
> >
> >   failed |= function ();
> >
> > before the loop? Just trying to make sure we actually mispredict the
> > xtest + je.
>
> I will give it a try.

I did:

  for (i = 0; i < loop; i++)
    {
      failed |= function ();
      if (_xbegin () == _XBEGIN_STARTED)
        {
          failed |= function ();
          _xend ();
        }
      else
        {
          failed |= function ();
          ++naborts;
        }
    }

There is no issue.
A patch set is posted at https://sourceware.org/pipermail/libc-alpha/2021-March/123302.html
We have successfully tested the backport to the 2.31 branch, where it mitigates the regression seen.
Any progress? I see the patchset did not get any feedback in the last 10 days.
(In reply to Richard Biener from comment #26)
> Any progress? I see the patchset did not get any feedback in the last 10
> days.

We discussed it at the glibc patch meeting on Monday. I posted the v2 patch:
https://sourceware.org/pipermail/libc-alpha/2021-March/123867.html
Fixed for 2.34 on master so far by 10 commits. The last commit is

commit e4fda4631017e49d4ee5a2755db34289b6860fa4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 7 09:45:23 2021 -0800

    x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions

    Update ifunc-memmove.h to select the function optimized with AVX512
    instructions using ZMM16-ZMM31 registers to avoid RTM abort with
    usable AVX512VL since VZEROUPPER isn't needed at function exit.
The release/2.33/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=902af2f5eee71c3e48fe30d43fd7c61d563e975b

commit 902af2f5eee71c3e48fe30d43fd7c61d563e975b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:20:21 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.32/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=05751d1c5c85b7f55086e81567cd55b025b25625

commit 05751d1c5c85b7f55086e81567cd55b025b25625
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:22:42 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.31/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c0cbb9345ea2d81d017e7725e3ac5250ef870513

commit c0cbb9345ea2d81d017e7725e3ac5250ef870513
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:23:42 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.30/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=de28bb3c612daedb08fa975325c1a293fbca07a9

commit de28bb3c612daedb08fa975325c1a293fbca07a9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:24:44 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.29/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c3535cb6cdd4cbbce22018df09cc69633781d808

commit c3535cb6cdd4cbbce22018df09cc69633781d808
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:25:41 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.28/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2baf5616d5ec5e592d64746253713969eb473f5b

commit 2baf5616d5ec5e592d64746253713969eb473f5b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:49:55 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457
Fixed for 2.34 and all release branches.