The use of vzeroupper in, for example, strcmp on an AVX2-capable machine like Skylake-X causes HTM aborts when used inside transactions. This causes severe performance degradation for some workloads compared to glibc without those multiarch implementations. For one specific benchmark the following hack restores performance (as does removing the VZEROUPPER or replacing it with the far more costly VZEROALL):

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index ee82fa3e19..208b396557 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -127,7 +127,8 @@ L(return):
 	movzbl	(%rsi, %rdx), %edx
 	subl	%edx, %eax
 # endif
-	VZEROUPPER
+	vpxor	%ymm0, %ymm0, %ymm0
+	vpxor	%ymm1, %ymm1, %ymm1
 	ret

	.p2align 4
FWIW, the (proprietary) benchmark regresses by 40% (!) when using the AVX2 strcmp routines, even though strcmp accounts for only about 1% of the overall runtime. So the transaction aborts caused by vzeroupper here are extremely costly.
More correctly, the wcscmp path ends here with higher %ymm regs in use:

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index ee82fa3e19..bd3b6243e2 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -122,13 +122,16 @@ L(wcscmp_return):
 	negl	%eax
 	orl	$1, %eax
 L(return):
+	VZEROUPPER
+	ret
 # else
 	movzbl	(%rdi, %rdx), %eax
 	movzbl	(%rsi, %rdx), %edx
 	subl	%edx, %eax
-# endif
-	VZEROUPPER
+	vpxor	%ymm0, %ymm0, %ymm0
+	vpxor	%ymm1, %ymm1, %ymm1
 	ret
+# endif

 	.p2align 4
 L(return_vec_size):
(In reply to Richard Biener from comment #2)
> More correct, the wcscmp path ends here with higher %ymm regs used
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> index ee82fa3e19..bd3b6243e2 100644
> [...]

These won't remove the AVX-SSE transition penalty. I am re-implementing all
AVX string/memory functions with YMM16-YMM31, which don't need VZEROUPPER.
My current work is on the users/hjl/pr27457/evex branch at

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr27457/evex
On February 27, 2021 3:39:50 AM GMT+01:00, "hjl.tools at gmail dot com" <sourceware-bugzilla@sourceware.org> wrote:
>https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
>--- Comment #3 from H.J. Lu <hjl.tools at gmail dot com> ---
>[...]
>These won't remove the AVX-SSE transition penalty. I am re-implementing
>all AVX string/memory functions with YMM16-YMM31, which don't need
>VZEROUPPER.

It should still avoid any false dependences. But %ymm16 to %ymm31 are only
available with AVX-512; that will make the AVX2 string functions unusable on
non-AVX512 hardware. Are you introducing another set of functions then?

>My current work is on users/hjl/pr27457/evex branch at
>
>https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr27457/evex
(In reply to rguenther from comment #4)
> It should still avoid any false dependences. Ymm16 to ymm31 are only
> available with AVX512, that will make the AVX2 string functions unusable
> on non-avx512 hardware. Are you introducing another set of functions then?

I added another set of AVX/AVX2 functions to support RTM. You can use
libcpu-rt-c.so:

https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region

to try it out.
(In reply to H.J. Lu from comment #5)
> I added another set of AVX/AVX2 functions to support RTM. You can use
> libcpu-rt-c.so:
>
> https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region
>
> to try it out.

The sources do not seem to show the selection logic.
__memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
instruction, which can't be good for general application performance.

Anyway, is there anything we can realistically do here on CPUs with AVX2,
but without AVX-512? VPXOR doesn't avoid the transition penalty according
to the Intel documentation. Switching to 128-bit registers once the CPU
supports RTM is also likely to introduce performance regressions in
non-transactional code.
On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
> --- Comment #6 from Florian Weimer <fweimer at redhat dot com> ---
> The sources do not seem to show the selection logic.
> __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> instruction, which can't be good for general application performance.

The intent was to use %ymm16+ only, which does not cause any transition
penalty even w/o vzeroupper.

> Anyway, is there anything we can realistically do here on CPUs with AVX2, but
> without AVX-512? VPXOR doesn't avoid the transition penalty according to the
> Intel documentation. Switching to 128-bit registers once the CPU supports RTM
> is also likely to introduce performance regressions in non-transactional code.

There's the option to use another path when HTM is available: do xtest and
branch to the non-AVX variants when a transaction is active. I'm not sure
about the overhead of xtest here. Nor am I sure whether Intel has (or plans
any) SKUs with HTM but not AVX-512. Since HTM on Broadwell and Haswell is
crippled due to bugs (and thus usually disabled in firmware), this leaves
Skylake and later, where I don't know of any HTM-but-no-AVX512 SKUs. But
this is Intel, so ...
xtest has a few cycles of latency.

RTM is not disabled in firmware by default on HSX-EX, BDW-EP, BDW-EX, SKX,
CLX, CPX (server SKUs). This is different on client SKUs.

Examples of RTM but no-AVX512 SKUs: HSX-EX, BDW-EP, BDW-EX.

-- 
Roman
(In reply to rguenther from comment #7)
> On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > The sources do not seem to show the selection logic.
> > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > instruction, which can't be good for general application performance.
>
> The intent was to use %ymm16+ only which does not cause any transition
> penalty even w/o vzeroupper.

I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
AVX2, there isn't much choice. I didn't try to reverse-engineer the
corresponding IFUNC selector.

> There's the option to use another path when HTM is available, doing
> xtest and branch to the non-AVX variants when a transaction is active.
> I'm not sure about the overhead of xtest here. Nor am I sure whether
> Intel has (or plans any) SKUs with HTM but not AVX512. Since HTM
> on broadwell + haswell is crippled due to bugs (and thus usually
> disabled in firmware) this leaves Skylake and later where I don't
> know of any HTM but no-AVX512 SKUs. But this is Intel, so ...

Skylake is an overloaded term. There is e.g. Xeon E3-1240 v5, which is
advertised as having the Skylake microarchitecture, but it doesn't have
AVX-512. (I haven't checked RTM status, but I don't see why it wouldn't be
supported.) Xeon E3-1240 v6 is Kaby Lake, but it doesn't have AVX-512
either, and according to what I see, RTM is not disabled by at least one
current distribution kernel/microcode combination.

But I think the gist is that RTM without AVX-512 exists out there even in
server parts.
(In reply to Florian Weimer from comment #6)
> The sources do not seem to show the selection logic.
> __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> instruction, which can't be good for general application performance.

memset-avx2-unaligned-erms-rtm.S has

#define ZERO_UPPER_VEC_REGISTERS_RETURN \
  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST

0000000000000040 <__memset_avx2_unaligned_rtm>:
  40:	f3 0f 1e fa          	endbr64
  44:	c5 f9 6e c6          	vmovd  %esi,%xmm0
  48:	48 89 f8             	mov    %rdi,%rax
  4b:	c4 e2 7d 78 c0       	vpbroadcastb %xmm0,%ymm0
  50:	48 83 fa 20          	cmp    $0x20,%rdx
  54:	0f 82 0b 01 00 00    	jb     165 <__memset_avx2_unaligned_erms_rtm+0xc5>
  5a:	48 83 fa 40          	cmp    $0x40,%rdx
  5e:	77 75                	ja     d5 <__memset_avx2_unaligned_erms_rtm+0x35>
  60:	c5 fe 7f 44 17 e0    	vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  66:	c5 fe 7f 07          	vmovdqu %ymm0,(%rdi)
  6a:	e9 84 00 00 00       	jmp    f3 <__memset_avx2_unaligned_erms_rtm+0x53>
  6f:	90                   	nop
	...

00000000000000a0 <__memset_avx2_unaligned_erms_rtm>:
  a0:	f3 0f 1e fa          	endbr64
  a4:	c5 f9 6e c6          	vmovd  %esi,%xmm0
  a8:	48 89 f8             	mov    %rdi,%rax
  ab:	c4 e2 7d 78 c0       	vpbroadcastb %xmm0,%ymm0
  b0:	48 83 fa 20          	cmp    $0x20,%rdx
  b4:	0f 82 ab 00 00 00    	jb     165 <__memset_avx2_unaligned_erms_rtm+0xc5>
  ba:	48 83 fa 40          	cmp    $0x40,%rdx
  be:	77 0c                	ja     cc <__memset_avx2_unaligned_erms_rtm+0x2c>
  c0:	c5 fe 7f 44 17 e0    	vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  c6:	c5 fe 7f 07          	vmovdqu %ymm0,(%rdi)
  ca:	eb 27                	jmp    f3 <__memset_avx2_unaligned_erms_rtm+0x53>
  cc:	48 3b 15 00 00 00 00 	cmp    0x0(%rip),%rdx        # d3 <__memset_avx2_unaligned_erms_rtm+0x33>
  d3:	77 9f                	ja     74 <__memset_avx2_erms_rtm+0x4>
  d5:	48 81 fa 80 00 00 00 	cmp    $0x80,%rdx
  dc:	77 22                	ja     100 <__memset_avx2_unaligned_erms_rtm+0x60>
  de:	c5 fe 7f 07          	vmovdqu %ymm0,(%rdi)
  e2:	c5 fe 7f 47 20       	vmovdqu %ymm0,0x20(%rdi)
  e7:	c5 fe 7f 44 17 e0    	vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  ed:	c5 fe 7f 44 17 c0    	vmovdqu %ymm0,-0x40(%rdi,%rdx,1)
  f3:	0f 01 d6             	xtest
  f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
  f8:	c5 fc 77             	vzeroall
  fb:	c3                   	ret
  fc:	c5 f8 77             	vzeroupper
  ff:	c3                   	ret
(In reply to Florian Weimer from comment #9)
> (In reply to rguenther from comment #7)
> > The intent was to use %ymm16+ only which does not cause any transition
> > penalty even w/o vzeroupper.
>
> I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> AVX2, there isn't much choice. I didn't try to reverse-engineer the
> corresponding IFUNC selector.

At function exit, there is

  f3:	0f 01 d6             	xtest
  f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
  f8:	c5 fc 77             	vzeroall
  fb:	c3                   	ret
  fc:	c5 f8 77             	vzeroupper
  ff:	c3                   	ret

> [...]
>
> But I think the gist is that RTM without AVX-512 exists out there even in
> server parts.

It is handled by

#define ZERO_UPPER_VEC_REGISTERS_RETURN \
  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
(In reply to Florian Weimer from comment #9)
> I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> AVX2, there isn't much choice. I didn't try to reverse-engineer the
> corresponding IFUNC selector.

memset-evex-unaligned-erms.S has

# define VEC_SIZE	32
# define XMM0		xmm16
# define YMM0		ymm16
# define VEC0		ymm16
# define VEC(i)		VEC##i
# define VMOVU		vmovdqu64
# define VMOVA		vmovdqa64
# define VZEROUPPER

There is no %ymm0 and no vzeroupper.
(In reply to rguenther from comment #7)
> There's the option to use another path when HTM is available, doing
> xtest and branch to the non-AVX variants when a transaction is active.
> I'm not sure about the overhead of xtest here.

The overhead is low. Can you verify it?
(In reply to H.J. Lu from comment #11)
> (In reply to Florian Weimer from comment #9)
> > I still saw %ymm0 usage in the disassembly, if I recall correctly.
> > [...]
>
> At function exit, there is
>
>   f3:	0f 01 d6             	xtest
>   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
>   f8:	c5 fc 77             	vzeroall
>   fb:	c3                   	ret
>   fc:	c5 f8 77             	vzeroupper
>   ff:	c3                   	ret

Note that, according to Agner, vzeroall decodes to 20 uops on Haswell, for
example, while vzeroupper only requires 4. On Skylake it's even worse (34
uops). For short sizes (as in our benchmark, which had 16-31 byte strcmp)
this might be a bigger difference than using the SSE2 variant off an early
xtest result. That said, why not, for HTM + AVX2 CPUs, have an intermediate
dispatcher between the AVX2 and the SSE variants using xtest? That leaves
the actual implementations unchanged and thus with known performance
characteristics.
(In reply to Richard Biener from comment #14)
> [...] That said, why not, for HTM + AVX2 CPUs,
> have an intermediate dispatcher between the AVX2 and the SSE variant
> using xtest? That leaves the actual implementations unchanged and thus
> with known performance characteristics?

It is implemented on the users/hjl/pr27457/wrapper branch:

https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper

There are 2 problems:

1. Many RTM tests failed for other reasons.
2. Even with the vzeroall overhead, the AVX version may still be faster
   than the SSE version.
On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
> --- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> ---
> It is implemented on users/hjl/pr27457/wrapper branch:
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
>
> There are 2 problems:
>
> 1. Many RTM tests failed for other reasons.
> 2. Even with vzeroall overhead, AVX version may still be faster than
> SSE version.

And the SSE version may still be faster than the AVX version with vzeroall.

I guess we should mostly care about optimizing for "modern" CPUs, which
likely means HTM + AVX-512; that should already be optimal on your branches
by using %ymm16+. So we're talking about the "legacy" AVX2 + HTM path.

And there I think we should optimize the path that is _not_ in a
transaction, since that will be 99% of the cases. Which to me means using
the proven, tuned (on their respective ISA subsets) SSE2 and AVX2 variants
and simply switching between them based on xtest. Yeah, so strcmp of a
large string inside a transaction might not run at optimal AVX2 speed. But
it will be faster than before the xtest dispatch, since before that it
would have aborted the transaction.
(In reply to H.J. Lu from comment #11)
> At function exit, there is
>
>   f3:	0f 01 d6             	xtest
>   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
>   f8:	c5 fc 77             	vzeroall
>   fb:	c3                   	ret
>   fc:	c5 f8 77             	vzeroupper
>   ff:	c3                   	ret

Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction,
will the speculative execution of vzeroupper abort the transaction, or does
it only abort the transaction when it retires?
(In reply to rguenther from comment #16)
> On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> > It is implemented on users/hjl/pr27457/wrapper branch:
> >
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> >
> > There are 2 problems:
> >
> > 1. Many RTM tests failed for other reasons.
> > 2. Even with vzeroall overhead, AVX version may still be faster than
> > SSE version.
>
> And the SSE version may still be faster than the AVX version with
> vzeroall.
Here is some data:

Function: strcmp
Variant: default
                                 __strcmp_avx2  __strcmp_sse2_unaligned
length=14, align1=14, align2=14:         11.36          17.50
length=14, align1=14, align2=14:         11.36          15.59
length=14, align1=14, align2=14:         11.43          15.55
length=15, align1=15, align2=15:         11.36          17.42
length=15, align1=15, align2=15:         11.96          17.41
length=15, align1=15, align2=15:         11.36          16.97
length=16, align1=16, align2=16:         11.36          18.58
length=16, align1=16, align2=16:         11.36          17.41
length=16, align1=16, align2=16:         11.43          17.34
length=17, align1=17, align2=17:         11.36          21.37
length=17, align1=17, align2=17:         11.36          18.52
length=17, align1=17, align2=17:         11.36          17.94
length=18, align1=18, align2=18:         11.36          19.73
length=18, align1=18, align2=18:         11.36          19.20
length=18, align1=18, align2=18:         11.36          19.13
length=19, align1=19, align2=19:         11.36          20.38
length=19, align1=19, align2=19:         11.36          19.39
length=19, align1=19, align2=19:         11.36          20.39
length=20, align1=20, align2=20:         11.36          21.53
length=20, align1=20, align2=20:         11.36          20.98
length=20, align1=20, align2=20:         11.36          20.93
length=21, align1=21, align2=21:         11.36          22.83
length=21, align1=21, align2=21:         11.36          22.26
length=21, align1=21, align2=21:         11.36          22.25
length=22, align1=22, align2=22:         11.43          23.37
length=22, align1=22, align2=22:         11.36          22.78
length=22, align1=22, align2=22:         12.29          22.12
length=23, align1=23, align2=23:         11.36          24.63
length=23, align1=23, align2=23:         12.53          23.97
length=23, align1=23, align2=23:         11.36          23.97
length=24, align1=24, align2=24:         11.36          24.52
length=24, align1=24, align2=24:         11.36          43.47
length=24, align1=24, align2=24:         11.36          44.47
length=25, align1=25, align2=25:         11.36          39.50
length=25, align1=25, align2=25:         11.36          48.97
length=25, align1=25, align2=25:         11.36          48.53
length=26, align1=26, align2=26:         11.36          47.87
length=26, align1=26, align2=26:         11.36          47.20
length=26, align1=26, align2=26:         11.36          47.15
length=27, align1=27, align2=27:         11.36          50.90
length=27, align1=27, align2=27:         11.44          49.98
length=27, align1=27, align2=27:         11.36          49.77
length=28, align1=28, align2=28:         11.36          49.74
length=28, align1=28, align2=28:         11.36          48.86
length=28, align1=28, align2=28:         11.36          49.08
length=29, align1=29, align2=29:         11.36          52.74
length=29, align1=29, align2=29:         11.36          54.04
length=29, align1=29, align2=29:         11.36          29.49
length=30, align1=30, align2=30:         11.36          50.91
length=30, align1=30, align2=30:         11.36          51.09
length=30, align1=30, align2=30:         11.36          51.13
length=31, align1=31, align2=31:         12.36          54.33
length=31, align1=31, align2=31:         11.36          53.49
length=31, align1=31, align2=31:         11.36          53.29
length=16, align1=0, align2=0:           11.36          18.02
length=16, align1=0, align2=0:           11.36          18.58
length=16, align1=0, align2=0:           11.36          17.34
length=16, align1=0, align2=0:           11.44          19.88
length=16, align1=0, align2=0:           11.36          16.74
length=16, align1=0, align2=0:           11.36          17.42
length=16, align1=0, align2=3:           11.36          17.34
length=16, align1=3, align2=4:           11.36          17.34
length=32, align1=0, align2=0:           12.29          61.07
length=32, align1=0, align2=0:           12.63          61.08
length=32, align1=0, align2=0:           11.36          60.48
length=32, align1=0, align2=0:           11.36          60.48
length=32, align1=0, align2=0:           11.36          60.40
length=32, align1=0, align2=0:           11.36          60.40
length=32, align1=0, align2=4:           11.36          60.40
length=32, align1=4, align2=5:           12.10          59.72

> I guess we should mostly care about optimizing for "modern" CPUs
> which likely means HTM + AVX512 which should be already optimal
> on your branches by using %ymm16+. So we're talking about
> the "legacy" AVX2 + HTM path.
>
> And there I think we should optimize the path that is _not_ in
> a transaction since that will be 99% of the cases. Which to
> me means using the proven tuned (on their respective ISA subsets)
> SSE2 and AVX2 variants and simply switch between them based on
> xtest. Yeah, so strcmp of a large string inside a transaction

I tried it and I got RTM aborts for other reasons.

> might not run at optimal AVX2 speed. But it will be faster
> than before the xtest dispatch since before that it would have
> aborted the transaction.
Please give my current approach a try.
(In reply to Richard Biener from comment #17)
> Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction,
> will the speculative execution of vzeroupper abort the transaction, or
> does it only abort the transaction when it retires?

My branch includes an RTM test:

  for (i = 0; i < loop; i++)
    {
      if (_xbegin () == _XBEGIN_STARTED)
	{
	  failed |= function ();
	  _xend ();
	}
      else
	{
	  failed |= function ();
	  ++naborts;
	}
    }

It passes with xtest + je.
(In reply to H.J. Lu from comment #18)
> (In reply to rguenther from comment #16)
> > On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> > > It is implemented on users/hjl/pr27457/wrapper branch:
> > >
> > > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> > >
> > > There are 2 problems:
> > >
> > > 1. Many RTM tests failed for other reasons.
> > > 2. Even with vzeroall overhead, AVX version may still be faster than
> > > SSE version.
> >
> > And the SSE version may still be faster than the AVX version with
> > vzeroall.
> Here is some data:
>
> Function: strcmp
> Variant: default
>                                  __strcmp_avx2  __strcmp_sse2_unaligned
> length=14, align1=14, align2=14:         11.36          17.50
> [...]
> length=32, align1=0, align2=4:           11.36          60.40
> length=32, align1=4, align2=5:           12.10          59.72

That's with or without the vzeroall actually executing?

> > I guess we should mostly care about optimizing for "modern" CPUs
> > which likely means HTM + AVX512 which should be already optimal
> > on your branches by using %ymm16+. So we're talking about
> > the "legacy" AVX2 + HTM path.
> >
> > And there I think we should optimize the path that is _not_ in
> > a transaction since that will be 99% of the cases. Which to
> > me means using the proven tuned (on their respective ISA subsets)
> > SSE2 and AVX2 variants and simply switch between them based on
> > xtest. Yeah, so strcmp of a large string inside a transaction
>
> I tried it and I got RTM aborts for other reasons.
>
> > might not run at optimal AVX2 speed. But it will be faster
> > than before the xtest dispatch since before that it would have
> > aborted the transaction.
>
> Please give my current approach a try.

Well, I know that even unconditionally doing vzeroall will fix our observed
regression, since the time is dominated by all the other code inside the
transaction, which is then retried a few times (and always re-fails with
vzeroupper); the strcmp part is just ~1%. I also only have AVX-512 hardware
with HTM, so I can't easily test the AVX2 + HTM path.

That said, I'm fine with the xtest/vzero{all,upper} epilogue. But I have
also received reports of libmicro regressions for strcpy with length 10
(32-byte aligned), hot cache, when using AVX2 vs. SSE2 (SSE2 being faster
by 20%). [note strcpy, not strcmp here] Yeah, stupid benchmark ... but it
likely shows that for small lengths every detail matters in case you want
to shave off the last ns.
(In reply to H.J. Lu from comment #19)
> (In reply to Richard Biener from comment #17)
> > (In reply to H.J. Lu from comment #11)
> > > (In reply to Florian Weimer from comment #9)
> > > > (In reply to rguenther from comment #7)
> > > > > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > > > > The sources do not seem to show the selection logic.
> > > > > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > > > > instruction, which can't be good for general application performance.
> > > > >
> > > > > The intent was to use %ymm16+ only, which does not cause any transition
> > > > > penalty even w/o vzeroupper.
> > > >
> > > > I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> > > > AVX2, there isn't much choice. I didn't try to reverse-engineer the
> > > > corresponding IFUNC selector.
> > >
> > > At function exit, there is
> > >
> > >   f3: 0f 01 d6  xtest
> > >   f6: 74 04     je fc <__memset_avx2_unaligned_erms_rtm+0x5c>
> > >   f8: c5 fc 77  vzeroall
> > >   fb: c3        ret
> > >   fc: c5 f8 77  vzeroupper
> > >   ff: c3        ret
> >
> > Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction,
> > will the speculative execution of vzeroupper abort the transaction, or
> > does it only abort the transaction when retired?
>
> My branch includes an RTM test:
>
>   for (i = 0; i < loop; i++)
>     {
>       if (_xbegin () == _XBEGIN_STARTED)
>         {
>           failed |= function ();
>           _xend ();
>         }
>       else
>         {
>           failed |= function ();
>           ++naborts;
>         }
>     }
>
> It passes with xtest + je.

That's good to hear. Does it still work when you add an unconditional, non-transactional

  failed |= function ();

before the loop? Just trying to make sure we actually mispredict the xtest + je.
(In reply to Richard Biener from comment #21)
> That's good to hear. Does it still work when you add an unconditional,
> non-transactional
>
>   failed |= function ();
>
> before the loop? Just trying to make sure we actually mispredict the
> xtest + je.

I will give it a try.
(In reply to H.J. Lu from comment #22)
> (In reply to Richard Biener from comment #21)
> > That's good to hear. Does it still work when you add an unconditional,
> > non-transactional
> >
> >   failed |= function ();
> >
> > before the loop? Just trying to make sure we actually mispredict the
> > xtest + je.
>
> I will give it a try.

I did:

  for (i = 0; i < loop; i++)
    {
      failed |= function ();
      if (_xbegin () == _XBEGIN_STARTED)
        {
          failed |= function ();
          _xend ();
        }
      else
        {
          failed |= function ();
          ++naborts;
        }
    }

There is no issue.
A patch set is posted at https://sourceware.org/pipermail/libc-alpha/2021-March/123302.html
We have successfully tested the backport to the 2.31 branch, where it mitigates the regression seen.
Any progress? I see the patchset did not get any feedback in the last 10 days.
(In reply to Richard Biener from comment #26)
> Any progress? I see the patchset did not get any feedback in the last 10
> days.

We discussed it at the glibc patch meeting on Monday. I posted the v2 patch:
https://sourceware.org/pipermail/libc-alpha/2021-March/123867.html
Fixed for 2.34 on master so far by 10 commits. The last commit is

commit e4fda4631017e49d4ee5a2755db34289b6860fa4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 7 09:45:23 2021 -0800

    x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions

    Update ifunc-memmove.h to select the function optimized with AVX512
    instructions using ZMM16-ZMM31 registers to avoid RTM abort with
    usable AVX512VL since VZEROUPPER isn't needed at function exit.
The release/2.33/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=902af2f5eee71c3e48fe30d43fd7c61d563e975b

commit 902af2f5eee71c3e48fe30d43fd7c61d563e975b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:20:21 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.32/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=05751d1c5c85b7f55086e81567cd55b025b25625

commit 05751d1c5c85b7f55086e81567cd55b025b25625
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:22:42 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.31/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c0cbb9345ea2d81d017e7725e3ac5250ef870513

commit c0cbb9345ea2d81d017e7725e3ac5250ef870513
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:23:42 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.30/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=de28bb3c612daedb08fa975325c1a293fbca07a9

commit de28bb3c612daedb08fa975325c1a293fbca07a9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:24:44 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.29/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c3535cb6cdd4cbbce22018df09cc69633781d808

commit c3535cb6cdd4cbbce22018df09cc69633781d808
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:25:41 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

The release/2.28/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2baf5616d5ec5e592d64746253713969eb473f5b

commit 2baf5616d5ec5e592d64746253713969eb473f5b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:49:55 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457
Fixed for 2.34 and all release branches.