Bug 20508

Summary: _dl_runtime_resolve_avx/_dl_runtime_profile_avx512 cause transition penalty
Product: glibc
Reporter: H.J. Lu <hjl.tools>
Component: dynamic-link
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED
Severity: normal
CC: fweimer, kungfujesus06, markus
Priority: P2
Flags: fweimer: security-
Version: 2.25
Target Milestone: 2.25
Host:
Target: x86-64
Build:
Last reconfirmed:

Description H.J. Lu 2016-08-23 15:40:38 UTC
There is a transition penalty when SSE instructions are mixed with 256-bit
AVX or 512-bit AVX512 instructions.  Since _dl_runtime_resolve_avx and
_dl_runtime_profile_avx512 save and restore 256-bit YMM/512-bit ZMM
registers, SSE instructions incur a transition penalty under lazy
binding.
Comment 1 Florian Weimer 2016-08-23 17:29:10 UTC
I also expect that the present state of affairs makes all context switches slower because the kernel has to save and restore the AVX-512F state.
Comment 2 H.J. Lu 2016-08-23 17:57:10 UTC
(In reply to Florian Weimer from comment #1)
> I also expect that the present state of affairs makes all context switches
> slower because the kernel has to save and restore the AVX-512F state.

Context switches may not be impacted since XSAVEC and XSAVEOPT track
upper bits of vector registers.  But SSE transition tracks only YMM/ZMM
load instructions, not the bits in vector registers.
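
For illustration only (this is not glibc code): a minimal C sketch of how
XGETBV with ECX == 1 can be queried and interpreted.  It assumes GCC or
Clang on x86 with <cpuid.h>.  The helper names read_xinuse and
classify_vector_state are hypothetical.  The bit positions follow the XCR0
layout (bit 2 for the upper 128 bits of YMM registers, bit 6 for the upper
256 bits of ZMM0-ZMM15), and support for ECX == 1 is assumed to be reported
in CPUID leaf 0xD, sub-leaf 1, EAX bit 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* State-component bits, same positions as in XCR0. */
#define XSTATE_YMM_HI128 (1u << 2)  /* upper 128 bits of YMM registers */
#define XSTATE_ZMM_HI256 (1u << 6)  /* upper 256 bits of ZMM0-ZMM15 */

enum vector_state { STATE_XMM_ONLY, STATE_YMM, STATE_ZMM };

/* Decide how wide a save/restore must be for a given in-use mask. */
enum vector_state classify_vector_state(uint32_t xinuse)
{
    if (xinuse & XSTATE_ZMM_HI256)
        return STATE_ZMM;
    if (xinuse & XSTATE_YMM_HI128)
        return STATE_YMM;
    return STATE_XMM_ONLY;
}

/* Store the low 32 bits of XGETBV(ECX=1) in *out and return 1, or
   return 0 if the CPU or OS does not support the query.  */
int read_xinuse(uint32_t *out)
{
    assert(out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* XGETBV itself needs OSXSAVE: CPUID.1:ECX[27]. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return 0;
    /* XGETBV with ECX == 1: CPUID.(EAX=0DH,ECX=1):EAX[2]. */
    if (!__get_cpuid_count(0xd, 1, &eax, &ebx, &ecx, &edx)
        || !(eax & (1u << 2)))
        return 0;

    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (1));
    (void) hi;  /* bits 2 and 6 both live in the low half */
    *out = lo;
    return 1;
#else
    return 0;
#endif
}
```

A caller would first check the return value of read_xinuse and only then
pass the mask to classify_vector_state; when the query is unavailable, the
only safe choice is to assume the full register width is live.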
Comment 3 Sourceware Commits 2016-08-23 18:02:34 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been created
        at  d963b835c1e0fe430a88168a81c4c69dcd9ad00c (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d963b835c1e0fe430a88168a81c4c69dcd9ad00c

commit d963b835c1e0fe430a88168a81c4c69dcd9ad00c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 23 09:09:32 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_opt [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_opt): New.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d4e9985c033d90c310d7798f2d1f0634a64cedff

commit d4e9985c033d90c310d7798f2d1f0634a64cedff
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Aug 18 14:52:42 2016 -0700

    X86-64: Correct CFA in _dl_runtime_resolve
    
    When stack is re-aligned in _dl_runtime_resolve, there is no need to
    adjust CFA when allocating register save area on stack.
    
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve): Don't
    	adjust CFA when allocating register save area on re-aligned
    	stack.

-----------------------------------------------------------------------
Comment 4 H.J. Lu 2016-08-25 21:27:50 UTC
*** Bug 20495 has been marked as a duplicate of this bug. ***
Comment 5 H.J. Lu 2016-08-25 21:28:55 UTC
There is also a transition penalty on AVX machines.
Comment 6 Sourceware Commits 2016-08-26 16:00:58 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been deleted
       was  d963b835c1e0fe430a88168a81c4c69dcd9ad00c

- Log -----------------------------------------------------------------
d963b835c1e0fe430a88168a81c4c69dcd9ad00c X86-64: Add _dl_runtime_resolve_avx[512]_opt [BZ #20508]
-----------------------------------------------------------------------
Comment 7 Sourceware Commits 2016-08-26 16:00:58 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been created
        at  99143e37c1186c765da5b6e892ddaff0b3719f9f (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=99143e37c1186c765da5b6e892ddaff0b3719f9f

commit 99143e37c1186c765da5b6e892ddaff0b3719f9f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 23 09:09:32 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.

-----------------------------------------------------------------------
Comment 8 Sourceware Commits 2016-08-30 22:04:35 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been deleted
       was  99143e37c1186c765da5b6e892ddaff0b3719f9f

- Log -----------------------------------------------------------------
99143e37c1186c765da5b6e892ddaff0b3719f9f X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
-----------------------------------------------------------------------
Comment 9 Sourceware Commits 2016-08-30 22:04:43 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been created
        at  fdb9777e1d770446972f46a80ebfa59d522a93f1 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fdb9777e1d770446972f46a80ebfa59d522a93f1

commit fdb9777e1d770446972f46a80ebfa59d522a93f1
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 23 09:09:32 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.

-----------------------------------------------------------------------
Comment 10 Sourceware Commits 2016-09-06 16:10:47 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604 (commit)
      from  a0d47f487fe250c63cc21e9608b85bc02dc2a006 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604

commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                      |   25 ++++++++++
 sysdeps/x86/cpu-features.c     |   14 +++++
 sysdeps/x86/cpu-features.h     |    6 ++
 sysdeps/x86_64/dl-machine.h    |   24 ++++++++-
 sysdeps/x86_64/dl-trampoline.S |   20 ++++++++
 sysdeps/x86_64/dl-trampoline.h |  104 +++++++++++++++++++++++++++++++++++++++-
 6 files changed, 190 insertions(+), 3 deletions(-)
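
For illustration only (this is not the code from sysdeps/x86_64/dl-machine.h):
a minimal C sketch of the trampoline selection described in the ChangeLog
above.  The struct cpu_flags type, its field names, and the enum are
hypothetical stand-ins for the cpu-features bits, and the order of the
checks (AVX512 before AVX, "opt" before "slow") is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the cpu-features bits. */
struct cpu_flags {
    bool avx512f_usable;
    bool avx_usable;
    bool use_resolve_opt;   /* XGETBV supports ECX == 1 */
    bool use_resolve_slow;
};

/* One value per trampoline; the comments give the assumed symbol. */
enum resolver {
    RESOLVE_SSE,          /* plain SSE trampoline */
    RESOLVE_AVX,          /* _dl_runtime_resolve_avx */
    RESOLVE_AVX_SLOW,     /* _dl_runtime_resolve_avx_slow */
    RESOLVE_AVX_OPT,      /* _dl_runtime_resolve_avx_opt */
    RESOLVE_AVX512,       /* full 512-bit save/restore */
    RESOLVE_AVX512_OPT    /* _dl_runtime_resolve_avx512_opt */
};

/* Pick the lazy-binding trampoline for the given feature bits. */
enum resolver pick_resolver(const struct cpu_flags *f)
{
    assert(f != NULL);
    if (f->avx512f_usable)
        /* No avx512_slow variant exists, per the commit message. */
        return f->use_resolve_opt ? RESOLVE_AVX512_OPT : RESOLVE_AVX512;
    if (f->avx_usable) {
        if (f->use_resolve_opt)
            return RESOLVE_AVX_OPT;
        if (f->use_resolve_slow)
            return RESOLVE_AVX_SLOW;
        return RESOLVE_AVX;
    }
    return RESOLVE_SSE;
}
```

The point of the sketch is that the choice is made once, from feature bits
set during startup, so the per-call cost of the "opt" variants is the
XGETBV check itself rather than any repeated CPUID query.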
Comment 11 H.J. Lu 2016-09-06 16:22:16 UTC
Fixed for 2.25.
Comment 12 Sourceware Commits 2016-11-30 22:01:19 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, release/2.24/master has been updated
       via  4b8790c81c1a7b870a43810ec95e08a2e501123d (commit)
      from  2d16e81babd1d7b66d10cec0bc6d6d86a7e0c95e (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4b8790c81c1a7b870a43810ec95e08a2e501123d

commit 4b8790c81c1a7b870a43810ec95e08a2e501123d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                      |   25 ++++++++++
 sysdeps/x86/cpu-features.c     |   14 +++++
 sysdeps/x86/cpu-features.h     |    6 ++
 sysdeps/x86_64/dl-machine.h    |   24 ++++++++-
 sysdeps/x86_64/dl-trampoline.S |   20 ++++++++
 sysdeps/x86_64/dl-trampoline.h |  104 +++++++++++++++++++++++++++++++++++++++-
 6 files changed, 190 insertions(+), 3 deletions(-)
Comment 13 markus 2016-12-06 09:12:40 UTC
May I ask for a high priority to get this fixed in the stable versions?

With the Dolphin emulator, we've seen slowdowns of up to 80% under seemingly random compilation settings and usage patterns because of this issue: https://forums.dolphin-emu.org/Thread-dolphin-uses-ffmpeg-to-play-show-the-videos-of-the-games

We mix AVX128 and SSE in our just-in-time compiler, so we are hit hard by this penalty, especially as this thread runs almost no C++ code, so there was a good chance that VZEROUPPER was never called at all. Debugging this performance issue was a big pain.
Comment 14 H.J. Lu 2016-12-06 15:41:36 UTC
(In reply to markus from comment #13)
> May I ask for a high priority to get this fixed in the stable versions?
> 

It has been backported to 2.24 branch.
Comment 15 Sourceware Commits 2016-12-08 19:27:43 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, gentoo/2.24 has been updated
       via  b73ec923c79ab493a9265930a45800391329571a (commit)
       via  04c5f782796052de9d06975061eb3376ccbcbdb1 (commit)
       via  9b34c1494d8e61bb3d718e2ea83b856030476737 (commit)
       via  2afb8a945ddc104c5ef9aa61f32427c19b681232 (commit)
       via  df13b9c22a0fb690a0ab9dd4af163ae3c459d975 (commit)
       via  b4391b0c7def246a4503db1af683122681c12a56 (commit)
       via  0d5f4a32a34f048b35360a110a0e6d1c87e3eced (commit)
       via  0ab02a62e42e63b058e7a4e160dbe51762ef2c46 (commit)
       via  901db98f36690e4743feefd985c6ba2d7fd19813 (commit)
      from  caafe2b2612be88046d7bad4da42dbc2b07fbcd7 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b73ec923c79ab493a9265930a45800391329571a

commit b73ec923c79ab493a9265930a45800391329571a
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix trunc for big input values
    
    The alpha-specific versions of trunc and truncf always add and subtract
    0x1.0p23 or 0x1.0p52 even for big values. This causes errors like the
    following in the testsuite:
    
      Failure: Test: trunc_towardzero (0x1p107)
      Result:
       is:          1.6225927682921334e+32   0x1.fffffffffffffp+106
       should be:   1.6225927682921336e+32   0x1.0000000000000p+107
       difference:  1.8014398509481984e+16   0x1.0000000000000p+54
       ulp       :  0.5000
       max.ulp   :  0.0000
    
    Change this by returning the input value when its absolute value is
    greater than 0x1.0p23 or 0x1.0p52. NaN have to go through the add and
    subtract operations to get possibly silenced.
    
    Finally remove the code to handle inexact exception, trunc should never
    generate such an exception.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_trunc.c (__trunc): Return the input value
    	when its absolute value is greater than 0x1.0p52.
    	[_IEEE_FP_INEXACT] Remove.
    	* sysdeps/alpha/fpu/s_truncf.c (__truncf): Return the input value
    	when its absolute value is greater than 0x1.0p23.
    	[_IEEE_FP_INEXACT] Remove.
    
    (cherry picked from commit b74d259fe793499134eb743222cd8dd7c74a31ce)
    (cherry picked from commit e6eab16cc302e6c42f79e1af02ce98ebb9a783bc)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=04c5f782796052de9d06975061eb3376ccbcbdb1

commit 04c5f782796052de9d06975061eb3376ccbcbdb1
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix rint on sNaN input
    
    The alpha version of rint wrongly returns sNaN for sNaN input. Fix that
    by checking for NaN and by returning the input value added with itself
    in that case.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_rint.c (__rint): Add argument with itself
    	when it is a NaN.
    	* sysdeps/alpha/fpu/s_rintf.c (__rintf): Likewise.
    
    (cherry picked from commit cb7f9d63b921ea1a1cbb4ab377a8484fd5da9a2b)
    (cherry picked from commit 8eb9a92e0522f2d4f2d4167df919d066c85d3408)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9b34c1494d8e61bb3d718e2ea83b856030476737

commit 9b34c1494d8e61bb3d718e2ea83b856030476737
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix floor on sNaN input
    
    The alpha version of floor wrongly returns sNaN for sNaN input. Fix that
    by checking for NaN and by returning the input value added with itself
    in that case.
    
    Finally remove the code to handle inexact exception, floor should never
    generate such an exception.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_floor.c (__floor): Add argument with itself
    	when it is a NaN.
    	[_IEEE_FP_INEXACT] Remove.
    	* sysdeps/alpha/fpu/s_floorf.c (__floorf): Likewise.
    
    (cherry picked from commit 65cc568cf57156e5230db9a061645e54ff028a41)
    (cherry picked from commit 1912cc082df4739c2388c375f8d486afdaa7d49b)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2afb8a945ddc104c5ef9aa61f32427c19b681232

commit 2afb8a945ddc104c5ef9aa61f32427c19b681232
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Tue Aug 2 09:18:59 2016 +0200

    alpha: fix ceil on sNaN input
    
    The alpha version of ceil wrongly returns sNaN for sNaN input. Fix that
    by checking for NaN and by returning the input value added with itself
    in that case.
    
    Finally remove the code to handle inexact exception, ceil should never
    generate such an exception.
    
    Changelog:
    	* sysdeps/alpha/fpu/s_ceil.c (__ceil): Add argument with itself
    	when it is a NaN.
    	[_IEEE_FP_INEXACT] Remove.
    	* sysdeps/alpha/fpu/s_ceilf.c (__ceilf): Likewise.
    
    (cherry picked from commit 062e53c195b4a87754632c7d51254867247698b4)
    (cherry picked from commit 3eff6f84311d2679a58a637e3be78b4ced275762)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=df13b9c22a0fb690a0ab9dd4af163ae3c459d975

commit df13b9c22a0fb690a0ab9dd4af163ae3c459d975
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b4391b0c7def246a4503db1af683122681c12a56

commit b4391b0c7def246a4503db1af683122681c12a56
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    	(index_arch_Use_dl_runtime_resolve_opt): Likewise.
    	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)
    (cherry picked from commit 4b8790c81c1a7b870a43810ec95e08a2e501123d)
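
The selection described in the commit message above can be sketched in C.
This is only an illustration, not the glibc trampoline code (which is
written in assembly): pick_save_width and the enum are hypothetical names,
and the bitmap argument is assumed to be the value that XGETBV returns in
EAX when ECX == 1.  The bit positions are the architectural XSAVE
state-component numbers.

```c
#include <stdint.h>

/* XSAVE state-component bits, as reported by XGETBV.  */
#define XSTATE_SSE        (1u << 1)  /* XMM registers */
#define XSTATE_AVX        (1u << 2)  /* upper 128 bits of YMM0-YMM15 */
#define XSTATE_ZMM_HI256  (1u << 6)  /* upper 256 bits of ZMM0-ZMM15 */

/* Bytes of each vector register that must be preserved.  */
enum save_width { SAVE_XMM = 16, SAVE_YMM = 32, SAVE_ZMM = 64 };

/* Hypothetical helper: given the in-use bitmap (assumed to come from
   XGETBV with ECX == 1), pick the narrowest save/restore width that
   still preserves every non-zero bit.  */
static enum save_width
pick_save_width (uint32_t xinuse, int cpu_has_avx512)
{
  if (cpu_has_avx512 && (xinuse & XSTATE_ZMM_HI256) != 0)
    return SAVE_ZMM;   /* upper 256 bits of some ZMM register in use */
  if ((xinuse & XSTATE_AVX) != 0)
    return SAVE_YMM;   /* upper 128 bits of some YMM register in use */
  return SAVE_XMM;     /* only the lower 128 bits can be non-zero */
}
```

When the result is SAVE_XMM, 128-bit VEX-encoded stores and loads are
enough, which is the case the commit message says avoids the SSE
transition penalty.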

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0d5f4a32a34f048b35360a110a0e6d1c87e3eced

commit 0d5f4a32a34f048b35360a110a0e6d1c87e3eced
Author: Aurelien Jarno <aurelien@aurel32.net>
Date:   Thu Nov 24 12:10:13 2016 +0100

    x86_64: fix static build of __memcpy_chk for compilers defaulting to PIC/PIE
    
    When glibc is compiled with a gcc 6.2 that has been configured to
    default to PIC/PIE, the static version of __memcpy_chk is not built,
    because the test is done on PIC instead of SHARED.  Fix the test to
    check for SHARED, as is done for similar functions like memmove_chk.
    
    Changelog:
    	* sysdeps/x86_64/memcpy_chk.S (__memcpy_chk): Check for SHARED
    	instead of PIC.
    
    (cherry picked from commit 380ec16d62f459d5a28cfc25b7b20990c45e1cc9)
    (cherry picked from commit 2d16e81babd1d7b66d10cec0bc6d6d86a7e0c95e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ab02a62e42e63b058e7a4e160dbe51762ef2c46

commit 0ab02a62e42e63b058e7a4e160dbe51762ef2c46
Author: Maciej W. Rozycki <macro@imgtec.com>
Date:   Thu Nov 17 19:15:51 2016 +0000

    MIPS: Add `.insn' to ensure a text label is defined as code not data
    
    Avoid a build error with microMIPS compilation and recent versions of
    GAS which complain if a branch targets a label which is marked as data
    rather than microMIPS code:
    
    ../sysdeps/mips/mips32/crti.S: Assembler messages:
    ../sysdeps/mips/mips32/crti.S:72: Error: branch to a symbol in another ISA mode
    make[2]: *** [.../csu/crti.o] Error 1
    
    as commit 9d862524f6ae ("MIPS: Verify the ISA mode and alignment of
    branch and jump targets") closed a hole in branch processing, making
    relocation calculation respect the ISA mode of the symbol referred.
    This allowed diagnosing the situation where an attempt is made to pass
    control from code assembled for one ISA mode to code assembled for a
    different ISA mode and either relaxing the branch to a cross-mode jump
    or if that is not possible, then reporting this as an error rather than
    letting such code build and then fail unpredictably at the run time.
    
    This however requires the correct annotation of branch targets as code,
    because the ISA mode is not relevant for data symbols and is therefore
    not recorded for them.  The `.insn' pseudo-op is used for this purpose
    and has been supported by GAS since:
    
    Wed Feb 12 14:36:29 1997  Ian Lance Taylor  <ian@cygnus.com>
    
    	* config/tc-mips.c (mips_pseudo_table): Add "insn".
    	(s_insn): New static function.
    	* doc/c-mips.texi: Document .insn.
    
    so there has been no reason to avoid it where required.  More recently
    this pseudo-op has been documented, by the microMIPS architecture
    specification[1][2], as required for the correct interpretation of any
    code label which is not followed by an actual instruction in an assembly
    source.
    
    Use it in our crti.S files then, to mark that the trailing label there
    with no instructions following is indeed not a code bug and the branch
    is legitimate.
    
    References:
    
    [1] "MIPS Architecture for Programmers, Volume II-B: The microMIPS32
        Instruction Set", MIPS Technologies, Inc., Document Number: MD00582,
        Revision 5.04, January 15, 2014, Section 7.1 "Assembly-Level
        Compatibility", p. 533
    
    [2] "MIPS Architecture for Programmers, Volume II-B: The microMIPS64
        Instruction Set", MIPS Technologies, Inc., Document Number: MD00594,
        Revision 5.04, January 15, 2014, Section 8.1 "Assembly-Level
        Compatibility", p. 623
    
    2016-11-23  Matthew Fortune  <Matthew.Fortune@imgtec.com>
                Maciej W. Rozycki  <macro@imgtec.com>
    
    	* sysdeps/mips/mips32/crti.S (_init): Add `.insn' pseudo-op at
    	`.Lno_weak_fn' label.
    	* sysdeps/mips/mips64/n32/crti.S (_init): Likewise.
    	* sysdeps/mips/mips64/n64/crti.S (_init): Likewise.
    
    (cherry picked from commit cfaf1949ff1f8336b54c43796d0e2531bc8a40a2)
    (cherry picked from commit 65a2b63756a4d622b938910d582d8b807c471c9a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=901db98f36690e4743feefd985c6ba2d7fd19813

commit 901db98f36690e4743feefd985c6ba2d7fd19813
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Mon Nov 21 11:06:15 2016 -0200

    Fix writes past the allocated array bounds in execvpe (BZ#20847)
    
    This patch fixes invalid writes past stack-allocated buffers in two
    places in the execvpe implementation:
    
      1. In the 'maybe_script_execute' function, which allocates the new
         argument list without accounting for the minimum of argc plus
         3 elements (default shell path, script name, arguments, and
         ending null pointer) that it needs.  The straightforward fix is
         to use the correct list size when copying the arguments.
    
      2. In '__execvpe', where the executable file name length may not
         account for the ending '\0', so the subsequent path creation may
         write past the array bounds because it needs to add the
         terminating null.  The fix is to change how the executable name
         size is calculated so that it includes the final '\0' and to
         adjust the rest of the code accordingly.
    
    As described in GCC bug report 78433 [1], these issues were masked off by
    GCC because it allocated several bytes more than necessary so that many
    off-by-one bugs went unnoticed.
    
    Checked on x86_64 with a recent GCC (7.0.0 20161121) with -O3 in CFLAGS.
    
    	[BZ #20847]
    	* posix/execvpe.c (maybe_script_execute): Remove write past allocated
    	array bounds.
    	(__execvpe): Likewise.
    
    [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78433
    
    (cherry picked from commit d174436712e3cabce70d6cd771f177b6fe0e097b)
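
The two sizing rules from the commit message above can be sketched as
follows.  This is an illustration under stated assumptions, not the code
in posix/execvpe.c: build_shell_argv and join_path are hypothetical
helpers, they use malloc rather than stack allocation, and the slot
counting is this sketch's own (argc here excludes the terminating NULL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build the vector used to re-run FILE through
   SHELL.  Returns a malloc'ed, NULL-terminated vector that the caller
   frees, or NULL on allocation failure.  */
static char **
build_shell_argv (const char *shell, const char *file, char *const argv[])
{
  assert (shell != NULL && file != NULL && argv != NULL);

  size_t argc = 0;
  while (argv[argc] != NULL)
    argc++;

  /* Slots: shell, file, the script arguments (argv[0] is dropped), and
     the terminating NULL.  An empty argv still needs 3 slots.  */
  size_t nargs = argc > 0 ? argc - 1 : 0;
  char **new_argv = malloc ((2 + nargs + 1) * sizeof *new_argv);
  if (new_argv == NULL)
    return NULL;

  new_argv[0] = (char *) shell;
  new_argv[1] = (char *) file;
  if (nargs > 0)
    memcpy (new_argv + 2, argv + 1, nargs * sizeof *new_argv);
  new_argv[2 + nargs] = NULL;
  return new_argv;
}

/* Hypothetical helper: form "DIR/FILE".  The file length includes the
   terminating '\0', so the final copy cannot run past the buffer.  */
static char *
join_path (const char *dir, size_t dir_len, const char *file)
{
  size_t file_len = strlen (file) + 1;
  char *buf = malloc (dir_len + 1 + file_len);
  if (buf == NULL)
    return NULL;
  memcpy (buf, dir, dir_len);
  buf[dir_len] = '/';
  memcpy (buf + dir_len + 1, file, file_len);
  return buf;
}
```

Counting the terminator once, where the length is computed, keeps every
later size expression free of ad hoc "+ 1" adjustments.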

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                      |   25 ++++++++++
 posix/execvpe.c                |   15 ++++--
 sysdeps/alpha/fpu/s_ceil.c     |    7 +--
 sysdeps/alpha/fpu/s_ceilf.c    |    7 +--
 sysdeps/alpha/fpu/s_floor.c    |    7 +--
 sysdeps/alpha/fpu/s_floorf.c   |    7 +--
 sysdeps/alpha/fpu/s_rint.c     |    3 +
 sysdeps/alpha/fpu/s_rintf.c    |    3 +
 sysdeps/alpha/fpu/s_trunc.c    |    7 +--
 sysdeps/alpha/fpu/s_truncf.c   |    7 +--
 sysdeps/mips/mips32/crti.S     |    1 +
 sysdeps/mips/mips64/n32/crti.S |    1 +
 sysdeps/mips/mips64/n64/crti.S |    1 +
 sysdeps/x86/cpu-features.c     |   14 +++++
 sysdeps/x86/cpu-features.h     |    6 ++
 sysdeps/x86_64/dl-machine.h    |   24 ++++++++-
 sysdeps/x86_64/dl-trampoline.S |   20 ++++++++
 sysdeps/x86_64/dl-trampoline.h |  104 +++++++++++++++++++++++++++++++++++++++-
 sysdeps/x86_64/memcpy_chk.S    |    2 +-
 19 files changed, 228 insertions(+), 33 deletions(-)
Comment 16 Sourceware Commits 2017-02-05 15:57:31 UTC

The annotated tag, glibc-2.25 has been created
        at  be176490b818b65b5162c332eb6b581690b16e5c (tag)
   tagging  db0242e3023436757bbc7c488a779e6e3343db04 (commit)
  replaces  glibc-2.24
 tagged by  Siddhesh Poyarekar
        on  Sun Feb 5 21:19:00 2017 +0530

- Log -----------------------------------------------------------------
The GNU C Library
=================

The GNU C Library version 2.25 is now available.

The GNU C Library is used as *the* C library in the GNU system and
in GNU/Linux systems, as well as many other systems that use Linux
as the kernel.

The GNU C Library is primarily designed to be a portable
and high performance C library.  It follows all relevant
standards including ISO C11 and POSIX.1-2008.  It is also
internationalized and has one of the most complete
internationalization interfaces known.

The GNU C Library webpage is at http://www.gnu.org/software/libc/

Packages for the 2.25 release may be downloaded from:
        http://ftpmirror.gnu.org/libc/
        http://ftp.gnu.org/gnu/libc/

The mirror list is at http://www.gnu.org/order/ftp.html

NEWS for version 2.25
=====================

* The feature test macro __STDC_WANT_LIB_EXT2__, from ISO/IEC TR
  24731-2:2010, is supported to enable declarations of functions from that
  TR.  Note that not all functions from that TR are supported by the GNU C
  Library.

* The feature test macro __STDC_WANT_IEC_60559_BFP_EXT__, from ISO/IEC TS
  18661-1:2014, is supported to enable declarations of functions and macros
  from that TS.  Note that not all features from that TS are supported by
  the GNU C Library.

* The feature test macro __STDC_WANT_IEC_60559_FUNCS_EXT__, from ISO/IEC TS
  18661-4:2015, is supported to enable declarations of functions and macros
  from that TS.  Note that most features from that TS are not supported by
  the GNU C Library.

* The nonstandard feature selection macros _REENTRANT and _THREAD_SAFE are
  now treated as compatibility synonyms for _POSIX_C_SOURCE=199506L.
  Since the GNU C Library defaults to a much newer revision of POSIX, this
  will only affect programs that specifically request an old conformance
  mode.  For instance, a program compiled with -std=c89 -D_REENTRANT will
  see a change in the visible declarations, but a program compiled with
  just -D_REENTRANT, or -std=c99 -D_POSIX_C_SOURCE=200809L -D_REENTRANT,
  will not.

  Some C libraries once required _REENTRANT and/or _THREAD_SAFE to be
  defined by all multithreaded code, but glibc has not required this for
  many years.

* The inclusion of <sys/sysmacros.h> by <sys/types.h> is deprecated.  This
  means that in a future release, the macros “major”, “minor”, and “makedev”
  will only be available from <sys/sysmacros.h>.

  These macros are part of neither POSIX nor XSI, and their names frequently
  collide with user code; see for instance glibc bug 19239 and Red Hat bug
  130601.  <stdlib.h> includes <sys/types.h> under _GNU_SOURCE, and C++ code
  presently cannot avoid being compiled under _GNU_SOURCE, exacerbating the
  problem.

* New <fenv.h> features from TS 18661-1:2014 are added to libm: the
  fesetexcept, fetestexceptflag, fegetmode and fesetmode functions, the
  femode_t type and the FE_DFL_MODE and FE_SNANS_ALWAYS_SIGNAL macros.

* Integer width macros from TS 18661-1:2014 are added to <limits.h>:
  CHAR_WIDTH, SCHAR_WIDTH, UCHAR_WIDTH, SHRT_WIDTH, USHRT_WIDTH, INT_WIDTH,
  UINT_WIDTH, LONG_WIDTH, ULONG_WIDTH, LLONG_WIDTH, ULLONG_WIDTH; and to
  <stdint.h>: INT8_WIDTH, UINT8_WIDTH, INT16_WIDTH, UINT16_WIDTH,
  INT32_WIDTH, UINT32_WIDTH, INT64_WIDTH, UINT64_WIDTH, INT_LEAST8_WIDTH,
  UINT_LEAST8_WIDTH, INT_LEAST16_WIDTH, UINT_LEAST16_WIDTH,
  INT_LEAST32_WIDTH, UINT_LEAST32_WIDTH, INT_LEAST64_WIDTH,
  UINT_LEAST64_WIDTH, INT_FAST8_WIDTH, UINT_FAST8_WIDTH, INT_FAST16_WIDTH,
  UINT_FAST16_WIDTH, INT_FAST32_WIDTH, UINT_FAST32_WIDTH, INT_FAST64_WIDTH,
  UINT_FAST64_WIDTH, INTPTR_WIDTH, UINTPTR_WIDTH, INTMAX_WIDTH,
  UINTMAX_WIDTH, PTRDIFF_WIDTH, SIG_ATOMIC_WIDTH, SIZE_WIDTH, WCHAR_WIDTH,
  WINT_WIDTH.

* New <math.h> features are added from TS 18661-1:2014:

  - Signaling NaN macros: SNANF, SNAN, SNANL.

  - Nearest integer functions: roundeven, roundevenf, roundevenl, fromfp,
    fromfpf, fromfpl, ufromfp, ufromfpf, ufromfpl, fromfpx, fromfpxf,
    fromfpxl, ufromfpx, ufromfpxf, ufromfpxl.

  - llogb functions: the llogb, llogbf and llogbl functions, and the
    FP_LLOGB0 and FP_LLOGBNAN macros.

  - Max-min magnitude functions: fmaxmag, fmaxmagf, fmaxmagl, fminmag,
    fminmagf, fminmagl.

  - Comparison macros: iseqsig.

  - Classification macros: iscanonical, issubnormal, iszero.

  - Total order functions: totalorder, totalorderf, totalorderl,
    totalordermag, totalordermagf, totalordermagl.

  - Canonicalize functions: canonicalize, canonicalizef, canonicalizel.

  - NaN functions: getpayload, getpayloadf, getpayloadl, setpayload,
    setpayloadf, setpayloadl, setpayloadsig, setpayloadsigf, setpayloadsigl.

* The functions strfromd, strfromf, and strfroml, from ISO/IEC TS 18661-1:2014,
  are added to libc.  They convert a floating-point number into a string.

* Most of glibc can now be built with the stack smashing protector enabled.
  It is recommended to build glibc with --enable-stack-protector=strong.
  Implemented by Nick Alcock (Oracle).

* The function explicit_bzero, from OpenBSD, has been added to libc.  It is
  intended to be used instead of memset() to erase sensitive data after use;
  the compiler will not optimize out calls to explicit_bzero even if they
  are "unnecessary" (in the sense that no _correct_ program can observe the
  effects of the memory clear).

* On ColdFire, MicroBlaze, Nios II and SH3, the float_t type is now defined
  to float instead of double.  This does not affect the ABI of any libraries
  that are part of the GNU C Library, but may affect the ABI of other
  libraries that use this type in their interfaces.

* On x86_64, when compiling with -mfpmath=387 or -mfpmath=sse+387, the
  float_t and double_t types are now defined to long double instead of float
  and double.  These options are not the default, and this does not affect
  the ABI of any libraries that are part of the GNU C Library, but it may
  affect the ABI of other libraries that use this type in their interfaces,
  if they are compiled or used with those options.

* The getentropy and getrandom functions, and the <sys/random.h> header file
  have been added.

* The buffer size for byte-oriented stdio streams is now limited to 8192
  bytes by default.  Previously, on Linux, the default buffer size on most
  file systems was 4096 bytes (and thus remains unchanged), except on
  network file systems, where the buffer size was unpredictable and could be
  as large as several megabytes.

* The <sys/quota.h> header now includes the <linux/quota.h> header.  Support
  for the Linux quota interface which predates kernel version 2.4.22 has
  been removed.

* The malloc_get_state and malloc_set_state functions have been removed.
  Already-existing binaries that dynamically link to these functions will
  get a hidden implementation in which malloc_get_state is a stub.  As far
  as we know, these functions are used only by GNU Emacs and this change
  will not adversely affect already-built Emacs executables.  Any undumped
  Emacs executables, which normally exist only during an Emacs build, should
  be rebuilt by re-running “./configure; make” in the Emacs build tree.

* The “ip6-dotint” and “no-ip6-dotint” resolver options, and the
  corresponding RES_NOIP6DOTINT flag from <resolv.h> have been removed.
  “no-ip6-dotint” had already been the default, and support for the
  “ip6-dotint” option was removed from the Internet in 2006.

* The "ip6-bytestring" resolver option and the corresponding RES_USEBSTRING
  flag from <resolv.h> have been removed.  The option relied on a
  backwards-incompatible DNS extension which was never deployed on the
  Internet.

* The flags RES_AAONLY, RES_PRIMARY, RES_NOCHECKNAME, RES_KEEPTSIG,
  RES_BLAST defined in the <resolv.h> header file have been deprecated.
  They were already unimplemented.

* The "inet6" option in /etc/resolv.conf and the RES_USE_INET6 flag for
  _res.flags are deprecated.  The flag was standardized in RFC 2133, but
  removed again from the IETF name lookup interface specification in RFC
  2553.  Applications should use getaddrinfo instead.

* DNSSEC-related declarations and definitions have been removed from the
  <arpa/nameser.h> header file, and libresolv will no longer attempt to
  decode the data part of DNSSEC record types.  Previous versions of glibc
  only implemented minimal support for the previous version of DNSSEC, which
  is incompatible with the currently deployed version.

* The resource record type classification macros ns_t_qt_p, ns_t_mrr_p,
  ns_t_rr_p, ns_t_udp_p, ns_t_xfr_p have been removed from the
  <arpa/nameser.h> header file because the distinction between RR types and
  meta-RR types is not officially standardized, subject to revision, and
  thus not suitable for encoding in a macro.

* The types res_sendhookact, res_send_qhook, res_send_rhook, and the qhook
  and rhook members of the res_state type in <resolv.h> have been removed.
  The glibc stub resolver did not support these hooks, but the header file
  did not reflect that.

* For multi-arch support it is recommended to use a GCC which has
  been built with support for GNU indirect functions.  This ensures
  that correct debugging information is generated for functions
  selected by IFUNC resolvers.  This support can either be enabled by
  configuring GCC with '--enable-gnu-indirect-function', or by
  enabling it by default by setting 'default_gnu_indirect_function'
  variable for a particular architecture in the GCC source file
  'gcc/config.gcc'.

* GDB pretty printers have been added for mutex and condition variable
  structures in POSIX Threads. When installed and loaded in gdb these pretty
  printers show various pthread variables in human-readable form when read
  using the 'print' or 'display' commands in gdb.

* A tunables feature has been added to allow tweaking of the runtime for an
  application program.  This feature can be enabled with the
  '--enable-tunables' configure flag.  The GNU C Library manual has details on
  usage and README.tunables has instructions on adding new tunables to the
  library.

* A new version of the condition variable functions has been implemented in
  the NPTL implementation of POSIX Threads to provide stronger ordering
  guarantees.

* A new version of the pthread_rwlock functions has been implemented.  It uses
  a more scalable algorithm, primarily by no longer using a critical section
  to make state changes.
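
As a small illustration of the explicit_bzero entry in the list above, the
following sketch wipes a scratch buffer after use.  It assumes a glibc 2.25
or later <string.h>; check_and_wipe is a hypothetical helper, not part of
any library, and its comparison makes no constant-time claims.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare the secret held in SCRATCH against
   EXPECTED, then erase SCRATCH.  A plain memset here could be optimized
   away because SCRATCH is not read again; explicit_bzero is not.  */
static int
check_and_wipe (char *scratch, size_t size, const char *expected)
{
  int ok = strncmp (scratch, expected, size) == 0;
  explicit_bzero (scratch, size);
  return ok;
}
```

The caller still has to make sure no other copies of the secret remain,
for example in registers or in buffers owned by other code.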

Security related changes:

* On ARM EABI (32-bit), generating a backtrace for execution contexts which
  have been created with makecontext could fail to terminate due to a
  missing .cantunwind annotation.  This has been observed to lead to a hang
  (denial of service) in some Go applications compiled with gccgo.  Reported
  by Andreas Schwab.  (CVE-2016-6323)

* The DNS stub resolver functions would crash due to a NULL pointer
  dereference when processing a query with a valid DNS question type which
  was used internally in the implementation.  The stub resolver now uses a
  question type which is outside the range of valid question type values.
  (CVE-2015-5180)

Contributors
============

This release was made possible by the contributions of many people.
The maintainers are grateful to everyone who has contributed
changes or bug reports.  These include:

Adhemerval Zanella
Alan Modra
Alexandre Oliva
Andreas Schwab
Andrew Senkevich
Aurelien Jarno
Brent W. Baccala
Carlos O'Donell
Chris Metcalf
Chung-Lin Tang
DJ Delorie
David S. Miller
Denis Kaganovich
Dmitry V. Levin
Ernestas Kulik
Florian Weimer
Gabriel F T Gomes
Gabriel F. T. Gomes
H.J. Lu
Jakub Jelinek
James Clarke
James Greenhalgh
Jim Meyering
John David Anglin
Joseph Myers
Maciej W. Rozycki
Mark Wielaard
Martin Galvan
Martin Pitt
Mike Frysinger
Märt Põder
Nick Alcock
Paul E. Murphy
Paul Murphy
Rajalakshmi Srinivasaraghavan
Rasmus Villemoes
Rical Jasan
Richard Henderson
Roland McGrath
Samuel Thibault
Siddhesh Poyarekar
Stefan Liebler
Steve Ellcey
Svante Signell
Szabolcs Nagy
Tom Tromey
Torvald Riegel
Tulio Magno Quites Machado Filho
Wilco Dijkstra
Yury Norov
Zack Weinberg

Adhemerval Zanella (69):
      Fix test-skeleton C99 designated initialization
      nptl: Consolidate sem_open implementations
      nptl: Set sem_open as a non cancellation point (BZ #15765)
      nptl: Remove sparc sem_wait
      nptl: Fix sem_wait and sem_timedwait cancellation (BZ#18243)
      rt: Set shm_open as a non cancellation point (BZ #18243)
      nptl: Consolidate sem_init implementations
      posix: Correctly enable/disable cancellation on Linux posix_spawn
      posix: Correctly block/unblock all signals on Linux posix_spawn
      Add INTERNAL_SYSCALL_CALL
      posix: Fix open file action for posix_spawn on Linux
      Remove C++ style comments from string3.h
      libio: Multiple fixes for open_{w}memstream (BZ#18241 and BZ#20181)
      Fix tst-memstream3 build failure
      Consolidate fallocate{64} implementations
      Consolidate posix_fallocate{64} implementations
      Consolidate posix_fadvise implementations
      Fix iseqsig for ports that do not support FE_INVALID
      Consolidate Linux sync_file_range implementations
      Fix posix_fadvise64 build on mips64n64
      Fix Linux fallocate tests for EOPNOTSUPP
      Fix Linux sh4 pread/pwrite argument passing
      Fix sparc build due missing __WORDSIZE_TIME64_COMPAT32 definition
      Consolidate lseek/lseek64/llseek implementations
      Consolidate Linux ftruncate implementations
      Consolidate Linux truncate implementations
      Consolidate Linux access implementation
      Fix sh4 build with __ASSUME_ST_INO_64_BIT redefinition
      New internal function __access_noerrno
      Consolidate Linux setrlimit and getrlimit implementation
      Fix hurd __access_noerrno implementation.
      Fix writes past the allocated array bounds in execvpe (BZ#20847)
      Remove cached PID/TID in clone
      powerpc: Remove stpcpy internal clash with IFUNC
      powerpc: Remove stpcpy internal clash with IFUNC
      Fix writes past the allocated array bounds in execvpe (BZ#20847)
      Consolidate rename Linux implementation
      Consolidate renameat Linux implementation
      Fix powerpc64/power7 memchr for large input sizes
      Fix typos and missing closing bracket in test-memchr.c
      Adjust benchtests to new support library.
      benchtests: Add fmax/fmin benchmarks
      benchtests: Add fmaxf/fminf benchmarks
      Fix x86_64 memchr for large input sizes
      powerpc: Remove f{max,min}{f} assembly implementations
      Add __ASSUME_DIRECT_SYSVIPC_SYSCALL for Linux
      Refactor Linux ipc_priv header
      Consolidate Linux msgctl implementation
      Consolidate Linux msgrcv implementation
      Use msgsnd syscall for Linux implementation
      Use msgget syscall for Linux implementation
      Add SYSV message queue test
      Consolidate Linux semctl implementation
      Use semget syscall for Linux implementation
      Use semop syscall for Linux implementation
      Consolidate Linux semtimedop implementation
      Add SYSV semaphore test
      Use shmat syscall for Linux implementation
      Consolidate Linux shmctl implementation
      Use shmdt syscall for linux implementation
      Use shmget syscall for linux implementation
      Add SYSV shared memory test
      Fix i686 memchr for large input sizes
      Fix test-sysvsem on some platforms
      Fix x86 strncat optimized implementation for large sizes
      Remove duplicate strcat implementations
      Use fortify macros for b{zero,copy} along decl from strings.h
      Move fortified explicit_bzero back to string3
      Add missing bugzilla reference in previous ChangeLog entry

Alan Modra (1):
      powerpc32: make PLT call in _mcount compatible with -msecure-plt (bug 20554)

Alexandre Oliva (2):
      [PR19826] fix non-LE TLS in static programs
      Bug 20915: Do not initialize DTV of other threads.

Andreas Schwab (11):
      arm: mark __startcontext as .cantunwind (bug 20435)
      Properly initialize glob structure with GLOB_BRACE|GLOB_DOOFFS (bug 20707)
      Fix multiple definitions of mk[o]stemp[s]64
      Get rid of __elision_available
      Fix testsuite timeout handling
      powerpc: remove _dl_platform_string and _dl_powerpc_platforms
      Fix assertion failure on test timeout
      Fix ChangeLog typo
      Revert "Fix ChangeLog typo"
      m68k: fix 64bit atomic ops
      Fix missing test dependency

Andrew Senkevich (4):
      x86_64: Call finite scalar versions in vectorized log, pow, exp (bz #20033).
      Install libm.a as linker script (bug 20539).
      Better design of libm.a installation rule.
      Disable TSX on some Haswell processors.

Aurelien Jarno (14):
      alpha: fix ceil on sNaN input
      alpha: fix floor on sNaN input
      alpha: fix rint on sNaN input
      alpha: fix trunc for big input values
      powerpc: fix ifunc-sel.h with GCC 6
      powerpc: fix ifunc-sel.h fix asm constraints and clobber list
      sparc64: add a VIS3 version of ceil, floor and trunc
      sparc: build with -mvis on sparc32/sparcv9 and sparc64
      sparc: remove fdim sparc specific implementations
      sparc32/sparcv9: add a VIS3 version of fdim
      Set NODELETE flag after checking for NULL pointer
      conform tests: call perl with '-I.'
      gconv.h: fix build with GCC 7
      x86_64: fix static build of __memcpy_chk for compilers defaulting to PIC/PIE

Brent W. Baccala (1):
      hurd: Fix spurious port deallocation

Carlos O'Donell (17):
      Open development for 2.25.
      Update PO files.
      Bug 20292 - Simplify and test _dl_addr_inside_object
      Bug 20689: Fix FMA and AVX2 detection on Intel
      Fix atomic_fetch_xor_release.
      Add missing include for stdlib.h.
      Fix building tst-linkall-static.
      Add include/crypt.h.
      Bug 20729: Fix building with -Os.
      Bug 20729: Include libc-internal.h where required.
      Bug 20729: Fix build failures on ppc64 and other arches.
      Remove out of date PROJECTS file.
      Bug 20918 - Building with --enable-nss-crypt fails tst-linkall-static
      Bug 11941: ld.so: Improper assert map->l_init_called in dlclose
      Add deferred cancellation regression test for getpwuid_r.
      Fix failing pretty printer tests when CPPFLAGS has optimizations.
      Bug 20116: Fix use after free in pthread_create()

Chris Metcalf (6):
      Make sure tilepro uses kernel atomics for atomic_store
      Make tile's set_dataplane API compatibility-only
      tile: create new math-tests.h header
      build-many-glibcs: Revert -fno-isolate-erroneous-paths options for tilepro
      tile: pass __IPC_64 as zero for SysV IPC calls
      tile: Check for pointer add overflow in memchr

Chung-Lin Tang (1):
      Add ipc_priv.h header for Nios II to set __IPC_64 to zero.

DJ Delorie (1):
      * elf/dl-tunables.c (tunable_set_val_if_valid_range): Split into ...

David S. Miller (4):
      Fix wide-char testsuite SIGBUS on platforms such as Sparc.
      Fix sNaN handling in nearbyint on 32-bit sparc.
      Fix a sparc header conformtest failure.
      sparc: Remove optimized math routines which cause testsuite failures.

Denis Kaganovich (1):
      configure: accept __stack_chk_fail_local for ssp support too [BZ #20662]

Dmitry V. Levin (1):
      Fix typos in the spelling of "implementation"

Ernestas Kulik (1):
      localedata: lt_LT: use hyphens in d_fmt [BZ #20497]

Florian Weimer (100):
      malloc: Preserve arena free list/thread count invariant [BZ #20370]
      malloc: Run tests without calling mallopt [BZ #19469]
      Add support for referencing specific symbol versions
      elf: dl-minimal malloc needs to respect fundamental alignment
      elf: Avoid using memalign for TLS allocations [BZ #17730]
      elf: Do not use memalign for TCB/TLS blocks allocation [BZ #17730]
      x86: Use sysdep.o from libc.a in static libraries
      Add missing reference to bug 20452
      nptl/tst-tls3-malloc: Force freeing of thread stacks
      Add NEWS entry for CVE-2016-6323
      Add CVE-2016-6323 missing from NEWS entry
      Do not override objects in libc.a in other static libraries [BZ #20452]
      nptl/tst-once5: Reduce time to expected failure
      argp: Do not override GCC keywords with macros [BZ #16907]
      string: More tests for strcmp, strcasecmp, strncmp, strncasecmp
      nptl: Avoid expected SIGALRM in most tests [BZ #20432]
      Correct incorrect bug number in changelog
      malloc: Simplify static malloc interposition [BZ #20432]
      Base <sys/quota.h> on Linux kernel headers [BZ #20525]
      vfprintf: Avoid creating a VLA which complicates stack management
      vfscanf: Avoid multiple reads of multi-byte character width
      malloc: Automated part of conversion to __libc_lock
      resolv: Remove _LIBC_REENTRANT
      Remove the ptw-% patterns
      inet: Add __inet6_scopeid_pton function [BZ #20611]
      sysd-rules: Cut down the number of rtld-% pattern rules
      Remove remnants of .og patterns
      sln: Preprocessor cleanups
      Generate .op pattern rules for profiling builds only
      Avoid running $(CXX) during build to obtain header file paths
      Add test case for O_TMPFILE handling in open, openat
      manual: Clarify the documentation of strverscmp [BZ #20524]
      Remove obsolete DNSSEC support [BZ #20591]
      resolv: Remove the BIND_4_COMPAT macro
      <arpa/nameser.h>, <arpa/nameser_compat.h>: Remove versions
      <arpa/nameser.h>: Remove RR type classification macros [BZ #20592]
      malloc: Manual part of conversion to __libc_lock
      resolv: Remove unsupported hook functions from the API [BZ #20016]
      test-skeleton.c: Remove unintended #include <stdarg.h>.
      tst-open-tmpfile: Add checks for open64, openat64, linkat
      manual: Clarify NSS error reporting
      resolv: Deprecate unimplemented flags
      resolv: Remove RES_NOIP6DOTINT and its implementation
      resolv: Remove RES_USEBSTRING and its implementation [BZ #20629]
      resolv: Compile without -Wno-write-strings
      math: Define iszero as a function template for C++ [BZ #20715]
      math.h: Wrap C++ bits in extern "C++"
      iconv: Avoid writable data and relocations in IBM charsets
      iconv: Avoid writable data and relocations in ISO646
      malloc: Remove malloc_get_state, malloc_set_state [BZ #19473]
      malloc: Use accessors for chunk metadata access
      sysmalloc: Initialize previous size field of mmaped chunks
      Add test for linking against most static libraries
      i386: Support CFLAGS which imply -fno-omit-frame-pointer [BZ #20729]
      crypt: Use internal names for the SHA-2 block functions
      malloc: Update comments about chunk layout
      nptl: Document the reason why __kind in pthread_mutex_t is part of the ABI
      s390x: Add hidden definition for __sigsetjmp
      elf: Assume TLS is initialized in _dl_map_object_from_fd
      powerpc: Remove unintended __longjmp symbol from ABI
      powerpc: Add hidden definition for __sigsetjmp
      gconv: Adjust GBK to support the Euro sign
      libio: Limit buffer size to 8192 bytes [BZ #4099]
      Implement _dl_catch_error, _dl_signal_error in libc.so [BZ #16628]
      ld.so: Remove __libc_memalign
      aarch64: Use explicit offsets in _dl_tlsdesc_dynamic
      elf/tst-tls-manydynamic: New test
      support: Introduce new subdirectory for test infrastructure
      inet: Make IN6_IS_ADDR_UNSPECIFIED etc. usable with POSIX [BZ #16421]
      debug: Additional compiler barriers for backtrace tests [BZ #20956]
      Add getentropy, getrandom, <sys/random.h> [BZ #17252]
      Expose linking against libsupport as make dependency
      nptl/tst-cancel7: Add missing case label
      Add missing bug number to ChangeLog
      Do not require memset elimination in explicit_bzero test
      Remove unused function _dl_tls_setup
      scripts/test_printers_common.py: Log GDB error message
      rpcinfo: Remove traces of unbuilt helper program
      sunrpc: Always obtain AF_INET addresses from NSS [BZ #20964]
      resolv: Remove processing of unimplemented "spoof" host.conf options
      Declare getentropy in <unistd.h> [BZ #17252]
      support: Add support for delayed test failure reporting
      Add file missing from ChangeLog in previous commit
      Fix various typos in the ChangeLog
      resolv: Turn historic name lookup functions into compat symbols
      getentropy: Declare it in <unistd.h> for __USE_MISC [BZ #17252]
      support: Helper functions for entering namespaces
      support: Use support_record_failure consistently
      support: Implement --verbose option for test programs
      resolv: Add beginnings of a libresolv test suite
      resolv: Deprecate the "inet6" option and RES_USE_INET6 [BZ #19582]
      resolv: Deprecate RES_BLAST
      tunables: Use correct unused attribute
      CVE-2015-5180: resolv: Fix crash with internal QTYPE [BZ #18784]
      Update DNS RR type definitions [BZ #20593]
      malloc: Run tunables tests only if tunables are enabled
      support: Use %td for pointer difference in xwrite
      support: struct netent portability fix for support_format_netent
      string/tst-strcoll-overflow: Do not accept timeout as test result
      nptl: Add tst-robust-fork

Gabriel F T Gomes (1):
      Fix warning caused by unused-result in bug-atexit3-lib.cc

Gabriel F. T. Gomes (10):
      Add strfromd, strfromf, and strfroml functions
      Use read_int in vfscanf
      Use write_message instead of write
      Write messages to stdout and use write_message instead of write
      Make w_log1p type-generic
      Fix arg used as literal suffix in tst-strfrom.h
      Make w_scalbln type-generic
      Replace use of snprintf with strfrom in libm tests
      Fix typo in manual for iseqsig
      Move wrappers to libm-compat-calls-auto

H.J. Lu (8):
      X86: Change bit_YMM_state to (1 << 2)
      X86-64: Correct CFA in _dl_runtime_resolve
      X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
      X86: Don't assert on older Intel CPUs [BZ #20647]
      Check IFUNC definition in unrelocated shared library [BZ #20019]
      X86_64: Don't use PLT nor GOT in static archives [BZ #20750]
      Add VZEROUPPER to memset-vec-unaligned-erms.S [BZ #21081]
      Allow IFUNC relocation against unrelocated shared library

Jakub Jelinek (1):
      * soft-fp/op-common.h (_FP_MUL, _FP_FMA, _FP_DIV): Add

James Clarke (1):
      Bug 21053: sh: Reduce namespace pollution from sys/ucontext.h

James Greenhalgh (1):
      [soft-fp] Add support for various half-precision conversion routines.

Jim Meyering (1):
      assert.h: allow gcc to detect assert(a = 1) errors

John David Anglin (1):
      hppa: Optimize atomic_compare_and_exchange_val_acq

Joseph Myers (181):
      Support __STDC_WANT_LIB_EXT2__ feature test macro.
      Define PF_QIPCRTR, AF_QIPCRTR from Linux 4.7 in bits/socket.h.
      Define UDP_ENCAP_* from Linux 4.7 in netinet/udp.h.
      Support __STDC_WANT_IEC_60559_BFP_EXT__ feature test macro.
      Fix typo in last arith.texi change.
      Support __STDC_WANT_IEC_60559_FUNCS_EXT__ feature test macro.
      Also handle __STDC_WANT_IEC_60559_BFP_EXT__ in <tgmath.h>.
      Do not call __nan in scalb functions.
      Fix math.h comment about bits/mathdef.h.
      Add tests for fegetexceptflag, fesetexceptflag.
      Fix powerpc fesetexceptflag clearing FE_INVALID (bug 20455).
      Fix test-fexcept when "inexact" implicitly raised.
      Add comment from sysdeps/powerpc/fpu/fraiseexcpt.c to fsetexcptflg.c.
      Add fesetexcept.
      Add fesetexcept: aarch64.
      Add fesetexcept: alpha.
      Add fesetexcept: arm.
      Add fesetexcept: hppa.
      Add fesetexcept: ia64.
      Add fesetexcept: m68k.
      Add fesetexcept: mips.
      Add fesetexcept: powerpc.
      Add fesetexcept: s390.
      Add fesetexcept: sh.
      Add fesetexcept: sparc.
      Fix soft-fp extended.h unpacking (GCC bug 77265).
      Add fetestexceptflag.
      Add femode_t functions.
      Add femode_t functions: aarch64.
      Add femode_t functions: alpha.
      Add femode_t functions: arm.
      Add femode_t functions: hppa.
      Add femode_t functions: ia64.
      Add femode_t functions: m68k.
      Add femode_t functions: mips.
      Add femode_t functions: powerpc.
      Add femode_t functions: s390.
      Add femode_t functions: sh.
      Add femode_t functions: sparc.
      Add e500 version of fetestexceptflag.
      Add <limits.h> integer width macros.
      Add <stdint.h> integer width macros.
      Add issubnormal.
      Add iszero.
      Fix iszero for excess precision.
      Add iscanonical.
      Fix ldbl-128ibm iscanonical for -mlong-double-64.
      Use __builtin_fma more in dbl-64 code.
      Add TCP_REPAIR_WINDOW from Linux 4.8.
      Fix LONG_WIDTH, ULONG_WIDTH include ordering issue.
      Add iseqsig.
      Make iseqsig handle excess precision.
      Avoid M_NAN + M_NAN in complex functions.
      Add totalorder, totalorderf, totalorderl.
      Add more totalorder tests.
      Clean up some complex functions raising FE_INVALID.
      Add totalordermag, totalordermagf, totalordermagl.
      Define HIGH_ORDER_BIT_IS_SET_FOR_SNAN to 0 or 1.
      Add getpayload, getpayloadf, getpayloadl.
      Stop powerpc copysignl raising "invalid" for sNaN argument (bug 20718).
      Use VSQRT instruction for ARM sqrt (bug 20660).
      Use -fno-builtin for sqrt benchmark.
      Fix cmpli usage in power6 memset.
      Add getpayloadl to libnldbl.
      Add canonicalize, canonicalizef, canonicalizel.
      Make strtod raise "inexact" exceptions (bug 19380).
      Add SNAN, SNANF, SNANL macros.
      Correct clog10 documentation (bug 19673).
      Fix linknamespace parallel test failures.
      Handle tilegx* machine names.
      Add localplt.data for MIPS.
      XFAIL check-execstack for MIPS.
      Make MIPS <sys/user.h> self-contained.
      Do not hardcode platform names in manual/libm-err-tab.pl (bug 14139).
      Fix alpha sqrt fegetenv namespace (bug 20768).
      Handle tests-unsupported if run-built-tests = no.
      Do not generate UNRESOLVED results for run-built-tests = no.
      Make check-installed-headers.sh ignore sys/sysctl.h for x32.
      Update nios2 localplt.data.
      Update alpha localplt.data.
      Add localplt.data for hppa.
      Add localplt.data for sh.
      Fix rpcgen buffer overrun (bug 20790).
      Refactor some libm type-generic macros.
      Make SH <sys/user.h> self-contained.
      Ignore -Wmaybe-uninitialized in stdlib/bug-getcontext.c.
      Add script to build many glibc configurations.
      Make tilegx32 install libraries in lib32 directories.
      Fix build-many-glibcs.py style issues.
      Make SH ucontext always match current kernels.
      Fix SH4 register-dump.h for soft-float.
      Fix crypt snprintf namespace (bug 20829).
      Enable linknamespace testing for libdl and libcrypt.
      Make Alpha <sys/user.h> self-contained.
      Actually use newly built host libraries in build-many-glibcs.py.
      Quote shell commands in logs from build-many-glibcs.py.
      Add setpayload, setpayloadf, setpayloadl.
      Make build-many-glibcs.py use -fno-isolate-erroneous-paths options for tilepro.
      Fix default float_t definition (bug 20855).
      Fix x86_64 -mfpmath=387 float_t, double_t (bug 20787).
      Fix SH4 FP_ILOGB0 (bug 20859).
      More NEWS entries / fixes for float_t / double_t changes.
      Refactor float_t, double_t information into bits/flt-eval-method.h.
      Make build-many-glibcs.py track component versions requested and used.
      Add setpayloadsig, setpayloadsigf, setpayloadsigl.
      Make build-many-glibcs.py re-exec itself if changed by checkout.
      Make build-many-glibcs.py store more information about builds.
      Do not include asm/cachectl.h in nios2 sys/cachectl.h.
      Fix sysdeps/ia64/fpu/libm-symbols.h for inclusion in testcases.
      Work around IA64 tst-setcontext2.c compile failure.
      Make ilogb wrappers type-generic.
      Refactor FP_FAST_* into bits/fp-fast.h.
      Add build-many-glibcs.py bot-cycle action.
      Make build-many-glibcs.py support running as a bot.
      Refactor FP_ILOGB* out of bits/mathdef.h.
      Add missing hidden_def (__sigsetjmp).
      Make ldbl-128 getpayload, setpayload functions use _Float128.
      Add llogb, llogbf, llogbl.
      Fix pow (qNaN, 0) result with -lieee (bug 20919), remove dead parts of wrappers.
      Fix sysdeps/ieee754 pow handling of sNaN arguments (bug 20916).
      Fix x86_64/x86 powl handling of sNaN arguments (bug 20916).
      Fix hypot sNaN handling (bug 20940).
      Fix typo in last ChangeLog message.
      Add build-many-glibcs.py option to strip installed shared libraries.
      Fix tests-printers handling for cross compiling.
      Use Linux 4.9 (headers) in build-many-glibcs.py.
      Add [BZ #19398] marker to ChangeLog entry.
      Include <linux/falloc.h> in bits/fcntl-linux.h.
      Refactor long double information into bits/long-double.h.
      Fix generic fmax, fmin sNaN handling (bug 20947).
      Fix powerpc fmax, fmin sNaN handling (bug 20947).
      Fix x86, x86_64 fmax, fmin sNaN handling, add tests (bug 20947).
      Make build-many-glibcs.py flush stdout before execv.
      Define FE_SNANS_ALWAYS_SIGNAL.
      Document sNaN argument error handling.
      Add fmaxmag, fminmag functions.
      Add preprocessor indentation for llogb macro in tgmath.h.
      Add roundeven, roundevenf, roundevenl.
      Update miscellaneous files from upstream sources.
      Fix nss_nisplus build with mainline GCC (bug 20978).
      Update NEWS feature test macro description of TS 18661-1 support.
      Fix tst-support_record_failure-2 for run-built-tests = no.
      Define __intmax_t, __uintmax_t in bits/types.h.
      Add fromfp functions.
      Update copyright dates with scripts/update-copyrights.
      Update copyright dates not handled by scripts/update-copyrights.
      Update config.guess and config.sub to current versions.
      Make build-many-glibcs.py use binutils 2.28 branch by default.
      Correct MIPS math-tests.h condition for sNaN payload preservation.
      Fix math/test-nearbyint-except for no-exceptions configurations.
      Add build-many-glibcs.py powerpc-linux-gnu-power4 build.
      Fix MIPS n32 lseek, lseek64 (bug 21019).
      Fix elf/tst-ldconfig-X for cross testing.
      Fix math/test-fenvinline for no-exceptions configurations.
      Update i386 libm-test-ulps.
      Fix MicroBlaze __backtrace get_frame_size namespace (bug 21022).
      Make MIPS soft-fp preserve NaN payloads for NAN2008.
      Fix MicroBlaze bits/setjmp.h for C++.
      Update libm-test XFAILs for ibm128 format.
      Fix malloc/ tests for GCC 7 -Walloc-size-larger-than=.
      Fix string/tester.c for GCC 7 -Wstringop-overflow=.
      Fix MIPS n64 readahead (bug 21026).
      Increase some test timeouts.
      Make fallback fesetexceptflag always succeed (bug 21028).
      Update MicroBlaze localplt.data.
      Fix math/test-fenv for no-exceptions / no-rounding-modes configurations.
      Improve libm-test XFAILing for ibm128-libgcc.
      XFAIL libm-test.inc tests as needed for ibm128.
      Fix elf/sotruss-lib format-truncation error.
      Fix ld-address format-truncation error.
      Fix testsuite build for GCC 7 -Wformat-truncation.
      Make endian-conversion macros always return correct types (bug 16458).
      Make fallback fegetexceptflag work with generic fetestexceptflag.
      Fix MIPS o32 posix_fadvise.
      Make soft-float powerpc swapcontext restore the signal mask (bug 21045).
      Update install.texi latest GCC version known to work.
      Avoid parallel GCC install in build-many-glibcs.py.
      Fix ARM fpu_control.h for assemblers requiring VFP insn names (bug 21047).
      Restore clock_* librt exports for MicroBlaze (bug 21061).
      Update README.libm-test.
      Remove very old libm-test-ulps entries.

Maciej W. Rozycki (2):
      MIPS: Add `.insn' to ensure a text label is defined as code not data
      MIPS: Use R_MICROMIPS_JALR rather than R_MIPS_JALR in microMIPS code

Mark Wielaard (1):
      Reduce memory size of tsearch red-black tree.

Martin Galvan (3):
      Add pretty printers for the NPTL lock types
      Add -B to python invocation to avoid generating pyc files
      Fix up tabs/spaces mismatches

Martin Pitt (1):
      locales: en_CA: update d_fmt [BZ #9842]

Mike Frysinger (5):
      localedata: change M$ to Microsoft
      ChangeLog: change Winblowz to Windows
      ChangeLog: fix date
      localedata: GBK: add mapping for 0x80->Euro sign [BZ #20864]
      localedata: bs_BA: fix yesexpr/noexpr [BZ #20974]

Märt Põder (1):
      locales: et_EE: locale has wrong {p,n}_cs_precedes value [BZ #20459]

Nick Alcock (14):
      Move all tests out of the csu subdirectory
      x86_64: tst-quad1pie, tst-quad2pie: compile with -fPIE [BZ #7065]
      Configure support for --enable-stack-protector [BZ #7065]
      Initialize the stack guard earlier when linking statically [BZ #7065]
      Do not stack-protect ifunc resolvers [BZ #7065]
      Disable stack protector in early static initialization [BZ #7065]
      Compile the dynamic linker without stack protection [BZ #7065]
      Ignore __stack_chk_fail* in the rtld mapfile computation [BZ #7065]
      Work even with compilers which enable -fstack-protector by default [BZ #7065]
      PLT avoidance for __stack_chk_fail [BZ #7065]
      Link a non-libc-using test with -fno-stack-protector [BZ #7065]
      Drop explicit stack-protection of pieces of the system [BZ #7065]
      Do not stack-protect sigreturn stubs [BZ #7065]
      Enable -fstack-protector=* when requested by configure [BZ #7065]

Paul E. Murphy (28):
      Remove tacit double usage in ldbl-128
      Refactor part of math Makefile
      Unify drift between _Complex function type variants
      Improve gen-libm-test.pl LIT() application
      Support for type-generic libm function implementations libm
      ldbl-128: Remove unused sqrtl declaration in e_asinl.c
      Add tst-wcstod-round
      Prepare to convert _Complex cosine functions
      Convert _Complex cosine functions to generated code
      Merge common usage of mul_split function
      Prepare to convert _Complex sine functions
      Convert _Complex sine functions to generated code
      Prepare to convert _Complex tangent functions
      Convert _Complex tangent functions to generated code
      sparcv9: Restore fdiml@GLIBC_2.1
      Prepare to convert remaining _Complex functions
      Convert remaining complex function to generated files
      ldbl-128: Rename 'long double' to '_Float128'
      ldbl-128: Cleanup e_gammal_r.c after _Float128 rename
      Make common fdim implementation generic.
      Make common nextdown implementation generic.
      Make common fmax implementation generic.
      Make common fmin implementation generic.
      Remove unneeded stubs for k_rem_pio2l.
      ldbl-128: Use L(x) macro for long double constants
      Make ldexpF generic.
      Remove __nan{f,,l} macros
      Build s_nan* objects from a generic template

Paul Murphy (1):
      powerpc: Cleanup fenv_private.h

Rajalakshmi Srinivasaraghavan (5):
      Refactor strtod tests
      Add tests for strfrom functions
      powerpc: strcmp optimization for power9
      powerpc: strncmp optimization for power9
      powerpc64: strchr/strchrnul optimization for power8

Rasmus Villemoes (1):
      linux: spawni.c: simplify error reporting to parent

Rical Jasan (28):
      Manual typos: Input/Output on Streams
      Manual typos: Low-Level Input/Output
      Manual typos: File System Interface
      Manual typos: Sockets
      Manual typos: Low-Level Terminal Interface
      Manual typos: Syslog
      Manual typos: Mathematics
      Manual typos: Arithmetic Functions
      Manual typos: Date and Time
      Manual typos: Resource Usage and Limitation
      Manual typos: Non-Local Exits
      Manual typos: Signal Handling
      Manual typos: The Basic Program/System Interface
      Manual typos: Processes
      Manual typos: Job Control
      Manual typos: Users and Groups
      Manual typos: System Management
      Manual typos: System Configuration Parameters
      Manual typos: DES Encryption and Password Handling
      Manual typos: Debugging support
      Manual typos: POSIX Threads
      Manual typos: Internal probes
      Manual typos: C Language Facilities in the Library
      Manual typos: Installing
      Manual typos: Library Maintenance
      Manual typos: Contributors to
      manual: Remove non-existent mount options S_IMMUTABLE and S_APPEND [BZ #11235]
      manual: Convert @tables of variables to @vtables.

Richard Henderson (1):
      alpha: Use saturating arithmetic in memchr

Roland McGrath (3):
      NaCl: Fix compile error in clock function.
      Fix generic wait3 after union wait_status removal.
      NaCl: Fix compile error for __dup after libc_hidden_proto addition.

Samuel Thibault (12):
      Fix recvmsg returning SIGLOST on PF_LOCAL sockets
      mach: Add more allowed external headers
      hurd: fix pathconf visibility
      hurd: fix fcntl visibility
      Fix exc2signal.c template
      mach: Fix old-style function definition.
      Fix old-style function definition
      hurdmalloc: Run fork handler as late as possible [BZ #19431]
      hurd: Fix stack pointer corruption in syscall
      hurd: Fix unused variable warning
      hurd: fix using hurd/signal.h in C++ programs
      hurd: fix using hurd.h in C++ programs

Siddhesh Poyarekar (47):
      Consolidate reduce_and_compute code
      Add fall through comments
      Use fabs(x) instead of branching on signedness of input to sin and cos
      Consolidate input partitioning into do_cos and do_sin
      Use do_sin for sin(x) where 0.25 < |x| < 0.855469
      Inline all support functions for sin and cos
      Remove __libc_csu_irel declaration
      Add tests-static to tests in malloc/Makefile
      consolidate sign checks for slow2
      Use copysign instead of ternary conditions for positive constants
      Use copysign instead of ternary for some sin/cos input ranges
      Make the quadrant shift K a bool in do_sincos_* functions
      Check n instead of k1 to decide on sign of sin/cos result
      Manual typos: System Databases and Name Service Switch
      Make quadrant shift a boolean in reduce_and_compute in s_sin.c
      Adjust calls to do_sincos_1 and do_sincos_2 in s_sincos.c
      Update comments for some functions in s_sin.c
      Add note on MALLOC_MMAP_* environment variables
      Document the M_ARENA_* mallopt parameters
      Remove references to sbrk to grow/shrink arenas
      Remove redundant definitions of M_ARENA_* macros
      Static inline functions for mallopt helpers
      Regenerate ULPs for aarch64
      Add ChangeLog for previous commit
      Link benchset tests against libsupport
      Add configure check for python program
      Fix pretty printer tests for run-built-tests == no
      Add framework for tunables
      Initialize tunable list with the GLIBC_TUNABLES environment variable
      Enhance --enable-tunables to select tunables frontend at build time
      User manual documentation for tunables
      Add NEWS item for tunables
      tunables: Avoid getenv calls and disable glibc.malloc.check by default
      Regenerate libc.pot
      Update translations from the Translation Project
      Merge translations from the Translation Project
      Fix typo in NEWS
      Merge translations from the Translation Project
      Fix environment traversal when an envvar value is empty
      Add target to incorporate translations from translations.org
      tunables: Fix environment variable processing for setuid binaries (bz #21073)
      Drop GLIBC_TUNABLES for setxid programs when tunables is disabled (bz #21073)
      tunables: Fail tests correctly when setgid does not work
      Add missing NEWS items
      Add list of bugs fixed in 2.25
      Add more contributors to contrib.texi
      Update for 2.25 release

Stefan Liebler (22):
      Get rid of array-bounds warning in __kernel_rem_pio2[f] with gcc 6.1 -O3.
      S390: Do not set FE_INEXACT with feraiseexcept (FE_OVERFLOW|FE_UNDERFLOW).
      S390: Support PLT and GOT references in check-localplt.
      S390: Regenerate ULPs
      Add configure check to test if gcc supports attribute ifunc.
      Use gcc attribute ifunc in libc_ifunc macro instead of inline assembly due to false debuginfo.
      s390: Refactor ifunc resolvers due to false debuginfo.
      i386, x86: Use libc_ifunc macro for time, gettimeofday.
      ppc: Use libc_ifunc macro for time, gettimeofday.
      Use libc_ifunc macro for clock_* symbols in librt.
      Use libc_ifunc macro for system in libpthread.
      Use libc_ifunc macro for vfork in libpthread.
      Use libc_ifunc macro for siglongjmp, longjmp in libpthread.
      S390: Fix fp comparison not raising FE_INVALID.
      Fix new testcase elf/tst-latepthread on s390x.
      S390: Regenerate ULPs.
      S390: Use C11-like atomics instead of plain memory accesses in lock elision code.
      S390: Use own tbegin macro instead of __builtin_tbegin.
      S390: Use new __libc_tbegin_retry macro in elision-lock.c.
      S390: Optimize lock-elision by decrementing adapt_count at unlock.
      S390: Fix FAIL in test string/tst-xbzero-opt [BZ #21006]
      S390: Adjust lock elision code after review.

Steve Ellcey (14):
      Fix -Wformat-length warning in tst-setgetname.c
      Fix warning from latest GCC in tst-printf.c
      Fix -Wformat-length warning in time/tst-strptime2.c
      Define wordsize.h macros everywhere
      Speed up math/test-tgmath2.c
      Document do_test in test-skeleton.c
      Define __ASSUME_ST_INO_64_BIT on all platforms.
      Add definitions to sysdeps/tile/tilepro/bits/wordsize.h.
      Always define XSTAT_IS_XSTAT64
      Allow [f]statfs64 to alias [f]statfs
      Fix for [f]statfs64/[f]statfs aliasing patch
      Partial ILP32 support for aarch64.
      Use XSTAT_IS_XSTAT64 in generic xstat functions
      Add comments to check-c++-types.sh.

Svante Signell (1):
      hurd: Fix adjtime call with OLDDELTA == NULL

Szabolcs Nagy (1):
      Make build-many-glibcs.py work on python3.2

Tom Tromey (1):
      Update and install proc_service.h [BZ #20311]

Torvald Riegel (12):
      Add atomic_exchange_relaxed.
      Add atomic operations required by the new condition variable.
      Fix incorrect double-checked locking related to _res_hconf.initialized.
      Use C11-like atomics instead of plain memory accesses in x86 lock elision.
      Robust mutexes: Fix lost wake-up.
      New condvar implementation that provides stronger ordering guarantees.
      Fix pthread_cond_t on sparc for new condvar.
      New pthread rwlock that is more scalable.
      robust mutexes: Fix broken x86 assembly by removing it
      Clear list of acquired robust mutexes in the child process after forking.
      Add compiler barriers around modifications of the robust mutex list.
      Fix mutex pretty printer test and pretty printer output.

Tulio Magno Quites Machado Filho (9):
      powerpc: Fix POWER9 implies
      powerpc: Installed-header hygiene
      powerpc: Regenerate ULPs
      powerpc: Fix TOC stub on powerpc64 clone()
      Document a behavior of an elided pthread_rwlock_unlock
      powerpc: Fix powerpc32/power7 memchr for large input sizes
      powerpc: Fix write-after-destroy in lock elision [BZ #20822]
      powerpc: Regenerate ULPs
      powerpc: Fix adapt_count update in __lll_unlock_elision

Wilco Dijkstra (4):
      An optimized memchr was missing for AArch64.  This version is similar to
      Improve generic rawmemchr for targets that don't have an
      Improve strtok and strtok_r performance.  Instead of calling strpbrk which
      This patch cleans up the strsep implementation and improves performance.

Yury Norov (1):
      * sysdeps/unix/sysv/linux/fxstat.c: Remove useless cast.

Zack Weinberg (20):
      Add utility macros for clang detection, and deprecation with messages.
      Minimize sysdeps code involved in defining major/minor/makedev.
      Deprecate inclusion of <sys/sysmacros.h> by <sys/types.h>
      Add tests for fortification of bcopy and bzero.
      Installed-header hygiene (BZ#20366): Simple self-contained fixes.
      Installed-header hygiene (BZ#20366): obsolete BSD u_* types.
      Installed-header hygiene (BZ#20366): conditionally defined structures.
      Installed-header hygiene (BZ#20366): time.h types.
      Installed-header hygiene (BZ#20366): stack_t.
      Installed header hygiene (BZ#20366): Test of installed headers.
      Minor correction to the "installed header hygiene" patches.
      Minor corrections to scripts/check-installed-headers.sh.
      [BZ #19239] Issue deprecation warnings on macro expansion.
      Fix typo in string/bits/string2.h.
      Fix build-and-build-again bug in sunrpc tests.
      Forgot to add the ChangeLog to the previous commit, doh.
      Correct comments in string.h re strcoll_l, strxfrm_l.
      Minor problems exposed by compiling C++ tests under _ISOMAC.
      Make _REENTRANT and _THREAD_SAFE aliases for _POSIX_C_SOURCE=199506L.
      New string function explicit_bzero (from OpenBSD).

steve ellcey-CA Eng-Software (1):
      Fix warnings from latest GCC.

-----------------------------------------------------------------------
Comment 17 Sourceware Commits 2017-04-06 15:03:14 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/x86/xgetbv has been deleted
       was  fdb9777e1d770446972f46a80ebfa59d522a93f1

- Log -----------------------------------------------------------------
fdb9777e1d770446972f46a80ebfa59d522a93f1 X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
-----------------------------------------------------------------------
Comment 18 Sourceware Commits 2017-04-20 14:58:40 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/pr21258/2.23 has been created
        at  883cadc5543ffd3a4537498b44c782ded8a4a4e8 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=883cadc5543ffd3a4537498b44c782ded8a4a4e8

commit 883cadc5543ffd3a4537498b44c782ded8a4a4e8
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 21 10:59:31 2017 -0700

    x86-64: Improve branch predication in _dl_runtime_resolve_avx512_opt [BZ #21258]
    
    On Skylake server, _dl_runtime_resolve_avx512_opt is used to preserve
    the first 8 vector registers.  The code layout is
    
      if only %xmm0 - %xmm7 registers are used
         preserve %xmm0 - %xmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %zmm0 - %zmm7 registers
    
    Branch prediction always executes the fallthrough code path to preserve
    %zmm0 - %zmm7 registers speculatively, even though only %xmm0 - %xmm7
    registers are used.  This leads to lower CPU frequency on Skylake
    server.  This patch changes the fallthrough code path to preserve
    %xmm0 - %xmm7 registers instead:
    
      if whole %zmm0 - %zmm7 registers are used
         preserve %zmm0 - %zmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %xmm0 - %xmm7 registers
    
    Tested on Skylake server.
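
The reordered layout above can be sketched in C as a pure function over the state-component bitmap returned by XGETBV with ECX == 1. This is only an illustration, not the trampoline's assembly: the names `save_width`, `pick_save_width`, `XSTATE_YMM_HI128` and `XSTATE_ZMM_HI256` are hypothetical, and the bit positions (2 and 6) are assumed from the XCR0 layout documented in the Intel SDM.

```c
#include <stdint.h>

/* Hypothetical names; bit positions assumed from the XCR0 layout.  */
#define XSTATE_YMM_HI128 (UINT64_C (1) << 2) /* upper 128 bits of YMM regs */
#define XSTATE_ZMM_HI256 (UINT64_C (1) << 6) /* upper 256 bits of ZMM regs */

enum save_width { SAVE_XMM = 128, SAVE_YMM = 256, SAVE_ZMM = 512 };

/* Mirror the patched branch order: test the widest state first and let
   the common SSE-only case be the fallthrough path.  */
enum save_width
pick_save_width (uint64_t xinuse)
{
  if (xinuse & XSTATE_ZMM_HI256)
    return SAVE_ZMM;            /* whole %zmm0 - %zmm7 in use */
  if (xinuse & XSTATE_YMM_HI128)
    return SAVE_YMM;            /* only %ymm0 - %ymm7 in use */
  return SAVE_XMM;              /* fallthrough: only %xmm0 - %xmm7 */
}
```

With this ordering, a caller that only ever touches XMM registers never reaches the ZMM path, which is the property the patch is after.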
    
    	[BZ #21258]
    	* sysdeps/x86_64/dl-trampoline.S (_dl_runtime_resolve_opt):
    	Define only if _dl_runtime_resolve is defined to
    	_dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_opt):
    	Fallthrough to _dl_runtime_resolve_sse_vex.
    
    (cherry picked from commit c15f8eb50cea7ad1a4ccece6e0982bf426d52c00)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83037ea1d9e84b1b44ed307f01cbb5eeac24e22d

commit 83037ea1d9e84b1b44ed307f01cbb5eeac24e22d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 23 09:09:32 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is a transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is a transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid the SSE transition penalty, if only the lower 128 bits of the
    first 8 vector registers are non-zero, we can preserve %xmm0 - %xmm7
    registers with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids the SSE transition penalty caused by
    _dl_runtime_resolve_avx and _dl_runtime_profile_avx512 when only the
    lower 128 bits of vector registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
    
    	[BZ #20495]
    	[BZ #20508]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
    	processors, set Use_dl_runtime_resolve_slow and set
    	Use_dl_runtime_resolve_opt if XGETBV supports ECX == 1.
    	* sysdeps/x86/cpu-features.h (bit_Use_dl_runtime_resolve_opt):
    	New.
    	(bit_Use_dl_runtime_resolve_slow): Likewise.
    	(index_Use_dl_runtime_resolve_opt): Likewise.
    	(index_Use_dl_runtime_resolve_slow): Likewise.
    	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
    	_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
    	if Use_dl_runtime_resolve_opt is set.  Use
    	_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
    	* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
    	(_dl_runtime_resolve_opt): New.  Defined for AVX and AVX512.
    	(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
    	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
    	New.
    	(_dl_runtime_resolve_opt): Likewise.
    	(_dl_runtime_profile): Define only if _dl_runtime_profile is
    	defined.
    
    (cherry picked from commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604)
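
The "XGETBV with ECX == 1" check that the commit message above relies on can be sketched from user space as follows. This is a minimal sketch, assuming GCC or Clang on x86-64 with <cpuid.h>, and it is not the code added to sysdeps/x86/cpu-features.c. The function names are hypothetical. The CPUID bits used are OSXSAVE (CPUID.1:ECX bit 27) and the XGETBV-with-ECX==1 bit (CPUID.(EAX=0DH,ECX=1):EAX bit 2).

```c
#include <cpuid.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if XGETBV with ECX == 1 may be
   executed on this processor, 0 otherwise.  */
static int
xgetbv_ecx1_supported (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* CPUID.1:ECX bit 27 (OSXSAVE): the OS has enabled XGETBV.  */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
    return 0;
  /* CPUID.(EAX=0DH,ECX=1):EAX bit 2: XGETBV accepts ECX == 1.  */
  if (!__get_cpuid_count (0xd, 1, &eax, &ebx, &ecx, &edx))
    return 0;
  return (eax >> 2) & 1;
}

/* Execute XGETBV with ECX == 1.  Call only after the check above.  */
static uint64_t
xgetbv_ecx1 (void)
{
  uint32_t lo, hi;

  __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (1));
  return ((uint64_t) hi << 32) | lo;
}

/* Return 1 if the upper 128 bits of the YMM registers are reported as
   in use, 0 if they are not, and -1 if this cannot be determined.  */
static int
ymm_upper_in_use (void)
{
  if (!xgetbv_ecx1_supported ())
    return -1;
  return (int) ((xgetbv_ecx1 () >> 2) & 1);
}
```

A result of -1 corresponds to the case the commit message handles with _dl_runtime_resolve_avx_slow, where the processor cannot report which register portions are in use.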

-----------------------------------------------------------------------