Bug 20495 - x86_64 performance degradation due to AVX/SSE transition penalty
Summary: x86_64 performance degradation due to AVX/SSE transition penalty
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: math
Version: 2.23
Importance: P2 normal
Target Milestone: 2.25
Assignee: Andrew Senkevich
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-19 17:55 UTC by Adam Stylinski
Modified: 2017-04-20 14:58 UTC
CC: 3 users

See Also:
Host:
Target:
Build:
Last reconfirmed: 2016-08-25 00:00:00
fweimer: security-


Attachments
Slower AVX example (482 bytes, text/x-csrc)
2016-08-23 01:22 UTC, Adam Stylinski
attachment-83016-0.html (615 bytes, text/html)
2016-08-25 22:10 UTC, Adam Stylinski

Description Adam Stylinski 2016-08-19 17:55:29 UTC
I noticed fairly recently that after a lot of AVX-vectorized code I was using the scalar sincosf in libm for the remainder.  I found it particularly odd that this code was being outperformed by SSE4 variants, and later traced the issue to the fact that sincosf uses legacy SSE instructions inside what was otherwise AVX-accelerated code.  This causes a sizable AVX/SSE transition penalty, since the code calling sincosf was using ymm* (AVX) registers.  I added a hardcoded vzeroupper() to avoid the penalty, but having a VEX-encoded version to jump to might benefit some workloads.

Arguably I shouldn't be utilizing the scalar instructions at all for the remainder, and that's probably one other possible solution, but I thought I'd file this all the same.
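For illustration, here is a minimal sketch of the scenario and workaround described above, assuming an AVX main loop followed by a scalar sincosf() tail; the function name and the placeholder AVX work are made up for this example, and this is not the attached test case:

#define _GNU_SOURCE            /* for sincosf() */
#include <immintrin.h>
#include <math.h>

/* AVX main loop followed by a scalar remainder.  The explicit
   _mm256_zeroupper() is the hardcoded workaround mentioned above: it
   clears the upper halves of the ymm registers before the legacy-SSE
   code inside sincosf() runs. */
void process(const float *in, float *s, float *c, int n)
{
    int i;

    /* AVX part: 8 floats per iteration, uses the ymm registers. */
    for (i = 0; i + 8 <= n; i += 8)
    {
        __m256 v = _mm256_loadu_ps(&in[i]);
        _mm256_storeu_ps(&s[i], _mm256_mul_ps(v, v));   /* placeholder work */
    }

    /* Workaround: flush the upper ymm halves before calling scalar libm. */
    _mm256_zeroupper();

    /* Scalar remainder: sincosf() in glibc is implemented with SSE. */
    for (; i < n; ++i)
        sincosf(in[i], &s[i], &c[i]);
}

(Compile with something like gcc -O3 -mavx; the exact flags are illustrative.)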
Comment 1 Andrew Senkevich 2016-08-22 12:32:35 UTC
Can you add the testcase and compilation commands?
Comment 2 Adam Stylinski 2016-08-23 01:22:13 UTC
Created attachment 9469 [details]
Slower AVX example

Compile this with -mavx and then with -msse4.  The -msse4 version will be faster because it does not hit the SSE-to-AVX register transition penalty.
Comment 3 Adam Stylinski 2016-08-23 01:22:47 UTC
Just attached a pretty basic scalar version; it's about twice as slow with -mavx as with -msse4.
Comment 4 Adam Stylinski 2016-08-23 01:31:51 UTC
It may help to increase the iterations a bit more by modifying numvals to be 200*9000 or so.

After doing that, even with -O3, it's clear we're hitting that penalty:

adam@eggsbenedict ~ $ gcc -mavx test_sincos.c -o test.out -lm -O3
adam@eggsbenedict ~ $ time ./test.out 

real	0m0.174s
user	0m0.171s
sys	0m0.002s
adam@eggsbenedict ~ $ gcc -msse4 test_sincos.c -o test.out -lm -O3
adam@eggsbenedict ~ $ time ./test.out 

real	0m0.069s
user	0m0.064s
sys	0m0.004s
Comment 5 Andrew Senkevich 2016-08-25 19:13:30 UTC
Thanks.

Investigation shows that it is a degradation between the 2.22 and 2.23 releases.

The simplified test case is:

#include <math.h>

#define NUM 9000*1000

float valsA[NUM];
float a[NUM];

int main(void)
{
    int i;

    for (i = 0; i < NUM; ++i)
    {
        a[i] += sinf(valsA[i]);
    }

    return (int)a[NUM-1];
}
Comment 6 Andrew Senkevich 2016-08-25 19:20:30 UTC
Looks like the performance drop is caused by

commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e

Save and restore vector registers in x86-64 ld.so

because it adds a save/restore of the YMM registers (with AVX instructions) before the call to sinf(), which is implemented in SSE.
Comment 7 H.J. Lu 2016-08-25 19:31:57 UTC
See:

https://sourceware.org/ml/libc-alpha/2016-08/msg00741.html

But it only works for machines supporting XGETBV with ECX == 1.
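For reference, support for XGETBV with ECX == 1 is advertised by CPUID leaf 0xD, sub-leaf 1 (bit 2 of EAX).  A small illustrative probe (not glibc's actual detection code) might look like this:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.(EAX=0DH, ECX=1):EAX bit 2 indicates support for XGETBV
       with ECX == 1, which returns the XINUSE state-component bitmap. */
    if (__get_cpuid_max(0, 0) >= 0xD)
    {
        __cpuid_count(0xD, 1, eax, ebx, ecx, edx);
        if (eax & (1u << 2))
        {
            puts("XGETBV with ECX == 1 is supported");
            return 0;
        }
    }
    puts("XGETBV with ECX == 1 is not supported");
    return 0;
}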
Comment 8 Andrew Senkevich 2016-08-25 20:03:33 UTC
(In reply to Adam Stylinski from comment #2)
> Created attachment 9469 [details]
> Slower AVX example
> 
> Compile this with -mavx & then compile it with -msse4.  The -msse4 version
> will be faster as it's not hitting the SSE to AVX register transition
> penalty.

You can also avoid the slowdown by using "-Wl,-z,now".
Comment 9 H.J. Lu 2016-08-25 21:27:50 UTC
Dup

*** This bug has been marked as a duplicate of bug 20508 ***
Comment 10 Adam Stylinski 2016-08-25 22:09:58 UTC
Created attachment 9476 [details]
attachment-83016-0.html

Interesting, I will have to try that link flag. What is the reason for that
working? Everything I can find about that flag is for hardening binaries to
prevent others from writing into the global offset table.

On Aug 25, 2016 4:03 PM, "andrew.n.senkevich at gmail dot com" <
sourceware-bugzilla@sourceware.org> wrote:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=20495
>
> --- Comment #8 from Andrew Senkevich <andrew.n.senkevich at gmail dot
com> ---
> (In reply to Adam Stylinski from comment #2)
> > Created attachment 9469 [details]
> > Slower AVX example
> >
> > Compile this with -mavx & then compile it with -msse4.  The -msse4
version
> > will be faster as it's not hitting the SSE to AVX register transition
> > penalty.
>
> You can avoid slowdown also using "-Wl,-z,now".
>
Comment 11 Andrew Senkevich 2016-08-26 13:32:29 UTC
(In reply to Adam Stylinski from comment #10)
> Created attachment 9476 [details]
> attachment-83016-0.html
> 
> Interesting, I will have to try that link flag. What is the reason for that
> working? Everything I can find about that flag is for hardening binaries to
> prevent others from writing into the global offset table.
> 
> On Aug 25, 2016 4:03 PM, "andrew.n.senkevich at gmail dot com" <
> sourceware-bugzilla@sourceware.org> wrote:
> >
> > https://sourceware.org/bugzilla/show_bug.cgi?id=20495
> >
> > --- Comment #8 from Andrew Senkevich <andrew.n.senkevich at gmail dot
> com> ---
> > (In reply to Adam Stylinski from comment #2)
> > > Created attachment 9469 [details]
> > > Slower AVX example
> > >
> > > Compile this with -mavx & then compile it with -msse4.  The -msse4
> version
> > > will be faster as it's not hitting the SSE to AVX register transition
> > > penalty.
> >
> > You can avoid slowdown also using "-Wl,-z,now".
> >

The current issue is caused by AVX instructions in _dl_runtime_resolve() during the first call to the SSE sinf().

From man ld:
-z now 
When generating an executable or shared library, mark it to tell the dynamic linker to resolve all symbols when the program is started, or when the shared library is linked to using dlopen, instead of deferring function call resolution to the point when the function is first called.

So it may or may not help.
It helps for the reduced test case but not for the original one.
Comment 12 Sourceware Commits 2016-08-26 16:00:58 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been created
        at  99143e37c1186c765da5b6e892ddaff0b3719f9f (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=99143e37c1186c765da5b6e892ddaff0b3719f9f

commit 99143e37c1186c765da5b6e892ddaff0b3719f9f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 23 09:09:32 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.

-----------------------------------------------------------------------
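The strategy in the commit message above, expressed as a rough C-style sketch (the real implementation is x86-64 assembly in sysdeps/x86_64/dl-trampoline.h; the helper xgetbv1() and the printed strings are made up for illustration, and the sketch assumes the CPU supports XGETBV with ECX == 1):

#include <stdint.h>
#include <stdio.h>

/* XGETBV with ECX == 1 returns XINUSE, the bitmap of in-use state
   components (same component numbering as XCR0). */
static uint64_t xgetbv1(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (1));
    return ((uint64_t) hi << 32) | lo;
}

/* Component 2 = AVX (upper YMM halves); components 5-7 = AVX-512 state. */
#define XINUSE_UPPER_VEC  ((1u << 2) | (7u << 5))

int main(void)
{
    /* If the upper halves of the vector registers are not in use,
       preserving %xmm0-%xmm7 with 128-bit VEX-encoded loads/stores is
       enough (reloading zero-extends the registers), so no SSE
       transition penalty is triggered.  Otherwise the full YMM/ZMM
       registers have to be saved and restored. */
    if ((xgetbv1() & XINUSE_UPPER_VEC) == 0)
        puts("preserve %xmm0-%xmm7 with 128-bit VEX moves");
    else
        puts("preserve the full %ymm0-%ymm7 / %zmm0-%zmm7");
    return 0;
}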
Comment 13 Florian Weimer 2016-08-26 18:14:52 UTC
This is not an exact duplicate; it needs a separate fix.
Comment 14 H.J. Lu 2016-08-26 19:14:54 UTC
(In reply to Florian Weimer from comment #13)
> This is not an exact duplicate, it needs a separate fix.

The fix for PR 20508 is extended to cover this:

https://sourceware.org/ml/libc-alpha/2016-08/msg00803.html
Comment 15 Sourceware Commits 2016-08-30 22:04:43 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been created
        at  fdb9777e1d770446972f46a80ebfa59d522a93f1 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fdb9777e1d770446972f46a80ebfa59d522a93f1

commit fdb9777e1d770446972f46a80ebfa59d522a93f1
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 23 09:09:32 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.

-----------------------------------------------------------------------
Comment 16 Sourceware Commits 2016-09-06 16:10:47 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604 (commit)
      from  a0d47f487fe250c63cc21e9608b85bc02dc2a006 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604

commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                      |   25 ++++++++++
 sysdeps/x86/cpu-features.c     |   14 +++++
 sysdeps/x86/cpu-features.h     |    6 ++
 sysdeps/x86_64/dl-machine.h    |   24 ++++++++-
 sysdeps/x86_64/dl-trampoline.S |   20 ++++++++
 sysdeps/x86_64/dl-trampoline.h |  104 +++++++++++++++++++++++++++++++++++++++-
 6 files changed, 190 insertions(+), 3 deletions(-)
Comment 17 Joseph Myers 2016-11-23 17:03:11 UTC
Is this fixed by the September commit to master, or not?
Comment 18 Adam Stylinski 2016-11-23 17:37:14 UTC
Not sure; how do I point my GCC at the new glibc master without actually replacing my distro's glibc?

If anyone else could test this, the example I posted is pretty basic:

adam@Crushinator:~$ gcc -mavx test_sincos.c -o test.out -lm -O3
adam@Crushinator:~$ time ./test.out

real	0m0.083s
user	0m0.080s
sys	0m0.000s

adam@Crushinator:~$ gcc -msse4 testsincos.c -o test.out -lm -O3
adam@Crushinator:~$ time ./test.out

real	0m0.029s
user	0m0.027s
sys	0m0.000s

Just bump NUMVALS up to something arbitrarily high.
Comment 19 Florian Weimer 2016-11-23 17:46:57 UTC
(In reply to Adam Stylinski from comment #18)
> Not sure, how do I point my GCC to target the new master of glibc without
> actually replacing my distro's glibc?  

If you build glibc from source, you'll get a script, testrun.sh, which runs its argument against the newly built glibc:

https://sourceware.org/glibc/wiki/Testing/Builds
Comment 20 Adam Stylinski 2016-11-23 18:00:01 UTC
(In reply to Florian Weimer from comment #19)
> (In reply to Adam Stylinski from comment #18)
> > Not sure, how do I point my GCC to target the new master of glibc without
> > actually replacing my distro's glibc?  
> 
> If you build glibc from source, you'll get a script, testrun.sh, which runs
> its argument against the newly built glibc:
> 
> https://sourceware.org/glibc/wiki/Testing/Builds

Yes, this definitely fixes it.
Comment 21 Adam Stylinski 2016-11-23 18:03:26 UTC
Marking as fixed.  Hoping 2.25 makes it into my distro soon.
Comment 22 Sourceware Commits 2016-11-30 22:01:19 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, release/2.24/master has been updated
       via  4b8790c81c1a7b870a43810ec95e08a2e501123d (commit)
      from  2d16e81babd1d7b66d10cec0bc6d6d86a7e0c95e (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4b8790c81c1a7b870a43810ec95e08a2e501123d

commit 4b8790c81c1a7b870a43810ec95e08a2e501123d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                      |   25 ++++++++++
 sysdeps/x86/cpu-features.c     |   14 +++++
 sysdeps/x86/cpu-features.h     |    6 ++
 sysdeps/x86_64/dl-machine.h    |   24 ++++++++-
 sysdeps/x86_64/dl-trampoline.S |   20 ++++++++
 sysdeps/x86_64/dl-trampoline.h |  104 +++++++++++++++++++++++++++++++++++++++-
 6 files changed, 190 insertions(+), 3 deletions(-)
Comment 23 Sourceware Commits 2016-12-08 19:27:44 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, gentoo/2.24 has been updated
       via  b73ec923c79ab493a9265930a45800391329571a (commit)
       via  04c5f782796052de9d06975061eb3376ccbcbdb1 (commit)
       via  9b34c1494d8e61bb3d718e2ea83b856030476737 (commit)
       via  2afb8a945ddc104c5ef9aa61f32427c19b681232 (commit)
       via  df13b9c22a0fb690a0ab9dd4af163ae3c459d975 (commit)
       via  b4391b0c7def246a4503db1af683122681c12a56 (commit)
       via  0d5f4a32a34f048b35360a110a0e6d1c87e3eced (commit)
       via  0ab02a62e42e63b058e7a4e160dbe51762ef2c46 (commit)
       via  901db98f36690e4743feefd985c6ba2d7fd19813 (commit)
      from  caafe2b2612be88046d7bad4da42dbc2b07fbcd7 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b73ec923c79ab493a9265930a45800391329571a

commit b73ec923c79ab493a9265930a45800391329571a
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix trunc for big input values
    
    The alpha specific version of trunc and truncf always add and subtract
    0x1.0p23 or 0x1.0p52 even for big values. This causes this kind of
    errors in the testsuite:
    
      Failure: Test: trunc_towardzero (0x1p107)
      Result:
       is:          1.6225927682921334e+32   0x1.fffffffffffffp+106
       should be:   1.6225927682921336e+32   0x1.0000000000000p+107
       difference:  1.8014398509481984e+16   0x1.0000000000000p+54
       ulp       :  0.5000
       max.ulp   :  0.0000
    
    Change this by returning the input value when its absolute value is
    greater than 0x1.0p23 or 0x1.0p52. NaN have to go through the add and
    subtract operations to get possibly silenced.
    
    Finally remove the code to handle inexact exception, trunc should never
    generate such an exception.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_trunc.c (__trunc): Return the input value
    	when its absolute value is greater than 0x1.0p52.
    	[_IEEE_FP_INEXACT] Remove.
    	* sysdeps/alpha/fpu/s_truncf.c (__truncf): Return the input value
    	when its absolute value is greater than 0x1.0p23.
    	[_IEEE_FP_INEXACT] Remove.
    
    (cherry picked from commit b74d259fe793499134eb743222cd8dd7c74a31ce)
    (cherry picked from commit e6eab16cc302e6c42f79e1af02ce98ebb9a783bc)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=04c5f782796052de9d06975061eb3376ccbcbdb1

commit 04c5f782796052de9d06975061eb3376ccbcbdb1
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix rint on sNaN input
    
    The alpha version of rint wrongly return sNaN for sNaN input. Fix that
    by checking for NaN and by returning the input value added with itself
    in that case.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_rint.c (__rint): Add argument with itself
    	when it is a NaN.
    	* sysdeps/alpha/fpu/s_rintf.c (__rintf): Likewise.
    
    (cherry picked from commit cb7f9d63b921ea1a1cbb4ab377a8484fd5da9a2b)
    (cherry picked from commit 8eb9a92e0522f2d4f2d4167df919d066c85d3408)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9b34c1494d8e61bb3d718e2ea83b856030476737

commit 9b34c1494d8e61bb3d718e2ea83b856030476737
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix floor on sNaN input
    
    The alpha version of floor wrongly return sNaN for sNaN input. Fix that
    by checking for NaN and by returning the input value added with itself
    in that case.
    
    Finally remove the code to handle inexact exception, floor should never
    generate such an exception.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_floor.c (__floor): Add argument with itself
    	when it is a NaN.
    	[_IEEE_FP_INEXACT] Remove.
    	* sysdeps/alpha/fpu/s_floorf.c (__floorf): Likewise.
    
    (cherry picked from commit 65cc568cf57156e5230db9a061645e54ff028a41)
    (cherry picked from commit 1912cc082df4739c2388c375f8d486afdaa7d49b)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2afb8a945ddc104c5ef9aa61f32427c19b681232

commit 2afb8a945ddc104c5ef9aa61f32427c19b681232
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix ceil on sNaN input
    
    The alpha version of ceil wrongly return sNaN for sNaN input. Fix that
    by checking for NaN and by returning the input value added with itself
    in that case.
    
    Finally remove the code to handle inexact exception, ceil should never
    generate such an exception.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_ceil.c (__ceil): Add argument with itself
    	when it is a NaN.
    	[_IEEE_FP_INEXACT] Remove.
    	* sysdeps/alpha/fpu/s_ceilf.c (__ceilf): Likewise.
    
    (cherry picked from commit 062e53c195b4a87754632c7d51254867247698b4)
    (cherry picked from commit 3eff6f84311d2679a58a637e3be78b4ced275762)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=df13b9c22a0fb690a0ab9dd4af163ae3c459d975

commit df13b9c22a0fb690a0ab9dd4af163ae3c459d975
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b4391b0c7def246a4503db1af683122681c12a56

commit b4391b0c7def246a4503db1af683122681c12a56
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)
    (cherry picked from commit 4b8790c81c1a7b870a43810ec95e08a2e501123d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0d5f4a32a34f048b35360a110a0e6d1c87e3eced

commit 0d5f4a32a34f048b35360a110a0e6d1c87e3eced
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Thu Nov 24 12:10:13 2016 +0100

    x86_64: fix static build of __memcpy_chk for compilers defaulting to PIC/PIE
    
    When glibc is compiled with gcc 6.2 that has been configured with
    to default to PIC/PIE, the static version of __memcpy_chk is not built,
    as the test is done on PIC instead of SHARED. Fix the test to check for
    SHARED, like it is done for similar functions like memmove_chk.
    
    Changelog:
    	* sysdeps/x86_64/memcpy_chk.S (__memcpy_chk): Check for SHARED
    	instead of PIC.
    
    (cherry picked from commit 380ec16d62f459d5a28cfc25b7b20990c45e1cc9)
    (cherry picked from commit 2d16e81babd1d7b66d10cec0bc6d6d86a7e0c95e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ab02a62e42e63b058e7a4e160dbe51762ef2c46

commit 0ab02a62e42e63b058e7a4e160dbe51762ef2c46
Author: Maciej W. Rozycki <macro@imgtec.com>
Date:   Thu Nov 17 19:15:51 2016 +0000

    MIPS: Add `.insn' to ensure a text label is defined as code not data
    
    Avoid a build error with microMIPS compilation and recent versions of
    GAS which complain if a branch targets a label which is marked as data
    rather than microMIPS code:
    
    ../sysdeps/mips/mips32/crti.S: Assembler messages:
    ../sysdeps/mips/mips32/crti.S:72: Error: branch to a symbol in another ISA mode
    make[2]: *** [.../csu/crti.o] Error 1
    
    as commit 9d862524f6ae ("MIPS: Verify the ISA mode and alignment of
    branch and jump targets") closed a hole in branch processing, making
    relocation calculation respect the ISA mode of the symbol referred.
    This allowed diagnosing the situation where an attempt is made to pass
    control from code assembled for one ISA mode to code assembled for a
    different ISA mode and either relaxing the branch to a cross-mode jump
    or if that is not possible, then reporting this as an error rather than
    letting such code build and then fail unpredictably at the run time.
    
    This however requires the correct annotation of branch targets as code,
    because the ISA mode is not relevant for data symbols and is therefore
    not recorded for them.  The `.insn' pseudo-op is used for this purpose
    and has been supported by GAS since:
    
    Wed Feb 12 14:36:29 1997  Ian Lance Taylor  <ian@cygnus.com>
    
    	* config/tc-mips.c (mips_pseudo_table): Add "insn".
    	(s_insn): New static function.
    	* doc/c-mips.texi: Document .insn.
    
    so there has been no reason to avoid it where required.  More recently
    this pseudo-op has been documented, by the microMIPS architecture
    specification[1][2], as required for the correct interpretation of any
    code label which is not followed by an actual instruction in an assembly
    source.
    
    Use it in our crti.S files then, to mark that the trailing label there
    with no instructions following is indeed not a code bug and the branch
    is legitimate.
    
    References:
    
    [1] "MIPS Architecture for Programmers, Volume II-B: The microMIPS32
        Instruction Set", MIPS Technologies, Inc., Document Number: MD00582,
        Revision 5.04, January 15, 2014, Section 7.1 "Assembly-Level
        Compatibility", p. 533
    
    [2] "MIPS Architecture for Programmers, Volume II-B: The microMIPS64
        Instruction Set", MIPS Technologies, Inc., Document Number: MD00594,
        Revision 5.04, January 15, 2014, Section 8.1 "Assembly-Level
        Compatibility", p. 623
    
    2016-11-23  Matthew Fortune  <Matthew.Fortune@imgtec.com>
                Maciej W. Rozycki  <macro@imgtec.com>
    
    	* sysdeps/mips/mips32/crti.S (_init): Add `.insn' pseudo-op at
    	`.Lno_weak_fn' label.
    	* sysdeps/mips/mips64/n32/crti.S (_init): Likewise.
    	* sysdeps/mips/mips64/n64/crti.S (_init): Likewise.
    
    (cherry picked from commit cfaf1949ff1f8336b54c43796d0e2531bc8a40a2)
    (cherry picked from commit 65a2b63756a4d622b938910d582d8b807c471c9a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=901db98f36690e4743feefd985c6ba2d7fd19813

commit 901db98f36690e4743feefd985c6ba2d7fd19813
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Mon Nov 21 11:06:15 2016 -0200

    Fix writes past the allocated array bounds in execvpe (BZ#20847)
    
    This patch fixes an invalid write out or stack allocated buffer in
    2 places at execvpe implementation:
    
      1. On 'maybe_script_execute' function where it allocates the new
         argument list and it does not account that a minimum of argc
         plus 3 elements (default shell path, script name, arguments,
         and ending null pointer) should be considered.  The straightforward
         fix is just to take account of the correct list size on argument
         copy.
    
      2. On '__execvpe' where the executable file name lenght may not
         account for ending '\0' and thus subsequent path creation may
         write past array bounds because it requires to add the terminating
         null.  The fix is to change how to calculate the executable name
         size to add the final '\0' and adjust the rest of the code
         accordingly.
    
    As described in GCC bug report 78433 [1], these issues were masked off by
    GCC because it allocated several bytes more than necessary so that many
    off-by-one bugs went unnoticed.
    
    Checked on x86_64 with a latest GCC (7.0.0 20161121) with -O3 on CFLAGS.
    
    	[BZ #20847]
    	* posix/execvpe.c (maybe_script_execute): Remove write past allocated
    	array bounds.
    	(__execvpe): Likewise.
    
    [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78433
    
    (cherry picked from commit d174436712e3cabce70d6cd771f177b6fe0e097b)

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                      |   25 ++++++++++
 posix/execvpe.c                |   15 ++++--
 sysdeps/alpha/fpu/s_ceil.c     |    7 +--
 sysdeps/alpha/fpu/s_ceilf.c    |    7 +--
 sysdeps/alpha/fpu/s_floor.c    |    7 +--
 sysdeps/alpha/fpu/s_floorf.c   |    7 +--
 sysdeps/alpha/fpu/s_rint.c     |    3 +
 sysdeps/alpha/fpu/s_rintf.c    |    3 +
 sysdeps/alpha/fpu/s_trunc.c    |    7 +--
 sysdeps/alpha/fpu/s_truncf.c   |    7 +--
 sysdeps/mips/mips32/crti.S     |    1 +
 sysdeps/mips/mips64/n32/crti.S |    1 +
 sysdeps/mips/mips64/n64/crti.S |    1 +
 sysdeps/x86/cpu-features.c     |   14 +++++
 sysdeps/x86/cpu-features.h     |    6 ++
 sysdeps/x86_64/dl-machine.h    |   24 ++++++++-
 sysdeps/x86_64/dl-trampoline.S |   20 ++++++++
 sysdeps/x86_64/dl-trampoline.h |  104 +++++++++++++++++++++++++++++++++++++++-
 sysdeps/x86_64/memcpy_chk.S    |    2 +-
 19 files changed, 228 insertions(+), 33 deletions(-)
Comment 24 Sourceware Commits 2017-04-20 14:58:40 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/pr21258/2.23 has been created
        at  883cadc5543ffd3a4537498b44c782ded8a4a4e8 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=883cadc5543ffd3a4537498b44c782ded8a4a4e8

commit 883cadc5543ffd3a4537498b44c782ded8a4a4e8
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 21 10:59:31 2017 -0700

    x86-64: Improve branch predication in _dl_runtime_resolve_avx512_opt [BZ #21258]
    
    On Skylake server, _dl_runtime_resolve_avx512_opt is used to preserve
    the first 8 vector registers.  The code layout is
    
      if only %xmm0 - %xmm7 registers are used
         preserve %xmm0 - %xmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %zmm0 - %zmm7 registers
    
    Branch predication always executes the fallthrough code path to preserve
    %zmm0 - %zmm7 registers speculatively, even though only %xmm0 - %xmm7
    registers are used.  This leads to lower CPU frequency on Skylake
    server.  This patch changes the fallthrough code path to preserve
    %xmm0 - %xmm7 registers instead:
    
      if whole %zmm0 - %zmm7 registers are used
        preserve %zmm0 - %zmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %xmm0 - %xmm7 registers
    
    Tested on Skylake server.
    
    	[BZ #21258]
    	* sysdeps/x86_64/dl-trampoline.S (_dl_runtime_resolve_opt):
    	Define only if _dl_runtime_resolve is defined to
    	_dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_opt):
    	Fallthrough to _dl_runtime_resolve_sse_vex.
    
    (cherry picked from commit c15f8eb50cea7ad1a4ccece6e0982bf426d52c00)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83037ea1d9e84b1b44ed307f01cbb5eeac24e22d

commit 83037ea1d9e84b1b44ed307f01cbb5eeac24e22d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 23 09:09:32 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_Use_dl_runtime_resolve_slow): Likewise.
    	(index_Use_dl_runtime_resolve_opt): Likewise.
    	(index_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)

-----------------------------------------------------------------------