Bug 18880 - Wrong selector in x86_64/multiarch/memcpy.S
Summary: Wrong selector in x86_64/multiarch/memcpy.S
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: string
Version: 2.23
Importance: P2 normal
Target Milestone: 2.24
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 19586
 
Reported: 2015-08-28 11:46 UTC by H.J. Lu
Modified: 2017-01-04 12:38 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
0001-x86_64-fixing-and-updating-memcpy-IFUNC-selection-or.patch (811 bytes, application/octet-stream)
2016-03-02 11:08 UTC, Amit Pawar

Description H.J. Lu 2015-08-28 11:46:37 UTC
x86_64/multiarch/memcpy.S has

ENTRY(__new_memcpy)
	.type	__new_memcpy, @gnu_indirect_function
	LOAD_RTLD_GLOBAL_RO_RDX
	leaq	__memcpy_avx_unaligned(%rip), %rax
	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
	jz 1f
	ret
1:	leaq	__memcpy_sse2(%rip), %rax
	HAS_ARCH_FEATURE (Slow_BSF)
	jnz	2f
	leaq	__memcpy_sse2_unaligned(%rip), %rax
	ret
2:	HAS_CPU_FEATURE (SSSE3)
	jz 3f
	leaq    __memcpy_ssse3(%rip), %rax
3:	ret
END(__new_memcpy)

But the Slow_BSF feature has nothing to do with any memcpy implementation.
The selector should be

ENTRY(__new_memcpy)
	.type	__new_memcpy, @gnu_indirect_function
	LOAD_RTLD_GLOBAL_RO_RDX
	leaq	__memcpy_avx_unaligned(%rip), %rax
	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
	jz 1f
	ret
1:	leaq	__memcpy_sse2_unaligned(%rip), %rax
	HAS_ARCH_FEATURE (Fast_Unaligned_Load)
	jz	2f
	ret
2:	leaq	__memcpy_sse2(%rip), %rax
	HAS_CPU_FEATURE (SSSE3)
	jz 3f
	leaq    __memcpy_ssse3(%rip), %rax
3:	ret
END(__new_memcpy)
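
For readers who prefer C to assembly, the difference between the two selectors above can be sketched as follows. This is only an illustration, not glibc code: the struct, the enum and the function names are hypothetical stand-ins for the feature bits and the `__memcpy_*` symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the feature bits the selector tests. */
struct features {
    bool avx_fast_unaligned_load;
    bool fast_unaligned_load;
    bool slow_bsf;
    bool ssse3;
};

/* Hypothetical tags for the __memcpy_* implementations. */
enum impl { IMPL_AVX_UNALIGNED, IMPL_SSE2_UNALIGNED, IMPL_SSE2, IMPL_SSSE3 };

/* Mirrors the first listing: the SSE2 choice hinges on Slow_BSF. */
static enum impl select_old(const struct features *f)
{
    assert(f != NULL);
    if (f->avx_fast_unaligned_load)
        return IMPL_AVX_UNALIGNED;
    if (!f->slow_bsf)
        return IMPL_SSE2_UNALIGNED;
    return f->ssse3 ? IMPL_SSSE3 : IMPL_SSE2;
}

/* Mirrors the second listing: the choice hinges on Fast_Unaligned_Load. */
static enum impl select_proposed(const struct features *f)
{
    assert(f != NULL);
    if (f->avx_fast_unaligned_load)
        return IMPL_AVX_UNALIGNED;
    if (f->fast_unaligned_load)
        return IMPL_SSE2_UNALIGNED;
    return f->ssse3 ? IMPL_SSSE3 : IMPL_SSE2;
}
```

With these definitions, a processor that has SSSE3 but neither Slow_BSF nor Fast_Unaligned_Load gets the unaligned SSE2 variant from the old logic and the SSSE3 variant from the proposed one.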
Comment 1 H.J. Lu 2015-08-28 12:01:36 UTC
Also __memcpy_ssse3_back isn't used.
Comment 2 cvs-commit@gcc.gnu.org 2015-08-28 13:06:51 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/unaligned has been created
        at  9db00f75ae25af0c043de52786739dcdf52e53f5 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9db00f75ae25af0c043de52786739dcdf52e53f5

commit 9db00f75ae25af0c043de52786739dcdf52e53f5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 25 11:01:20 2015 -0700

    Make strcmp with unaligned load/store the default
    
    Since strcmp_sse2_unaligned performs better on current Intel and AMD
    processors, this patch makes it the default.
    
    	* sysdeps/x86_64/strcmp.S: Moved to ...
    	* sysdeps/x86_64/multiarch/strcmp-sse2.S:  Here.  Remove
    	"#if !IS_IN (libc)".  Remove libc_hidden_builtin_def (STRCMP).
    	(STRCMP): Defined to __strcmp_sse2 if not defined.
    	* sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S: Moved to ...
    	* sysdeps/x86_64/strcmp.S: Here.  Remove "#if IS_IN (libc)".
    	Add .text.  Add libc_hidden_builtin_def (strcmp).
    	(__strcmp_sse2_unaligned): Renamed to ...
    	(strcmp): This.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	strcmp-sse2.
    	* sysdeps/x86_64/multiarch/strcasecmp_l-ssse3.S: Include
    	strcmp-sse2.S instead of ../strcmp.S.
    	* sysdeps/x86_64/multiarch/strcmp-ssse3.S: Likewise.
    	* sysdeps/x86_64/multiarch/strncase_l-ssse3.S: Likewise.
    	* sysdeps/x86_64/multiarch/strncmp-ssse3.S: Likewise.
    	* sysdeps/x86_64/multiarch/strcmp.S
    	[USE_AS_STRCMP] (STRCMP_SSE2): Set to __strcmp_sse2_unaligned.
    	[USE_AS_STRCMP] (STRCMP): Load __strcmp_sse2 instead of
    	STRCMP_SSE2.
    	[USE_AS_STRCMP] (strcmp): Defined __strcmp_sse2_unaligned if
    	in libc.
    	[!USE_AS_STRCMP]: Include strcmp-sse2.S instead of ../strcmp.S.
    	* sysdeps/x86_64/strcasecmp_l.S: Include multiarch/strcmp-sse2.S
    	instead of strcmp.S.  Add libc_hidden_builtin_def (STRCMP).
    	* sysdeps/x86_64/strncase_l.S: Likewise.
    	* sysdeps/x86_64/strncmp.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e05a252da92a4dd15d4be40a855d31bd864804e9

commit e05a252da92a4dd15d4be40a855d31bd864804e9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Aug 28 05:40:35 2015 -0700

    Correct x86-64 memcpy/mempcpy multiarch selector
    
    For x86-64 memcpy/mempcpy, we choose the best implementation by the
    order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    In libc.a and ld.so, we choose __memcpy_sse2_unaligned which is optimized
    for current Intel and AMD x86-64 processors.
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Moved to ...
    	* sysdeps/x86_64/memcpy.S: Here.  Remove "#if !IS_IN (libc)".
    	Add libc_hidden_builtin_def and versioned_symbol.
    	(__memcpy_chk): New.
    	(__memcpy_sse2_unaligned): Renamed to ...
    	(memcpy): This.  Support USE_AS_MEMPCPY.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	mempcpy-sse2.
    	* sysdeps/x86_64/memcpy.S: Moved to ...
    	sysdeps/x86_64/multiarch/memcpy-sse2.S: Here.
    	(__memcpy_chk): Renamed to ...
    	(__memcpy_chk_sse2): This.
    	(memcpy): Renamed to ...
    	(__memcpy_sse2): This.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Properly
    	select the best implementation.
    	(ENTRY): Replace __memcpy_sse2 with __memcpy_sse2_unaligned.
    	(END): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	(ENTRY_CHK): Replace __memcpy_chk_sse2 with
    	__memcpy_chk_sse2_unaligned.
    	(END_CHK): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Properly
    	select the best implementation.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Properly
    	select the best implementation.
    	(ENTRY): Replace __mempcpy_sse2 with __mempcpy_sse2_unaligned.
    	(END): Likewise.
    	(libc_hidden_def): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	(ENTRY_CHK): Replace __mempcpy_chk_sse2 with
    	__mempcpy_chk_sse2_unaligned.
    	(END_CHK): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Properly
    	select the best implementation.

-----------------------------------------------------------------------
Comment 3 Amit Pawar 2016-02-26 16:16:45 UTC
Memcpy preferred order on AMD processors
1. SSSE3-Fast_Copy_Backward
2. SSSE3
3. AVX
4. SSE2-Fast_Unaligned_Load
5. SSE2.

Mempcpy preferred order on AMD processors
1. SSSE3-Fast_Copy_Backward
2. SSSE3
3. AVX_Unaligned
4. SSE2

Please suggest how to include this.
Comment 4 H.J. Lu 2016-02-26 16:20:47 UTC
(In reply to Amit Pawar from comment #3)
> Memcpy preferred order on AMD processors
> 1. SSSE3-Fast_Copy_Backward
> 2. SSSE3
> 3. AVX
> 4. SSE2-Fast_Unaligned_Load
> 5. SSE2.
> 
> Mempcpy preferred order on AMD processors
> 1. SSSE3-Fast_Copy_Backward
> 2. SSSE3
> 3. AVX_Unaligned
> 4. SSE2
> 
> Please suggest how to include this.

Please fix the WRONG selector first and leave out AMD processors for now.
Comment 5 Amit Pawar 2016-03-02 11:08:00 UTC
Created attachment 9057 [details]
0001-x86_64-fixing-and-updating-memcpy-IFUNC-selection-or.patch

> Please fix the WRONG selector first and leave out AMD processors for now.
As per your suggestion this attached patch contains the fix for BZ #18880. Please commit it if it is OK.

Thanks,
Amit Pawar
Comment 6 H.J. Lu 2016-03-03 02:59:50 UTC
(In reply to Amit Pawar from comment #5)
> Created attachment 9057 [details]
> 0001-x86_64-fixing-and-updating-memcpy-IFUNC-selection-or.patch
> 
> > Please fix the WRONG selector first and leave out AMD processors for now.
> As per your suggestion this attached patch contains the fix for BZ #18880.
> Please commit it if it is OK.

Please change label 3 to 2 and submit to glibc mailing list.
Comment 7 cvs-commit@gcc.gnu.org 2016-03-04 16:41:29 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8 (commit)
      from  4b230f6a60f3bb9cae92306d016535f40578ff2e (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8

commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  Existing selection
    order is updated with following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                         |    8 ++++++++
 sysdeps/x86_64/multiarch/memcpy.S |   27 ++++++++++++++-------------
 2 files changed, 22 insertions(+), 13 deletions(-)
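
The five-step selection order in the commit message above can be written as a plain decision chain. The following is a minimal sketch under assumptions: the real selector is an IFUNC resolver written in assembly, and the struct, enum and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bits named in the commit message. */
struct cpu_bits {
    bool avx_fast_unaligned_load;
    bool fast_unaligned_load;
    bool ssse3;
    bool fast_copy_backward;
};

/* Hypothetical tags for the five __memcpy_* implementations. */
enum memcpy_impl {
    MEMCPY_AVX_UNALIGNED,
    MEMCPY_SSE2_UNALIGNED,
    MEMCPY_SSE2,
    MEMCPY_SSSE3_BACK,
    MEMCPY_SSSE3
};

/* Apply steps 1 to 5 in order; the first matching condition wins. */
static enum memcpy_impl select_memcpy(const struct cpu_bits *b)
{
    assert(b != NULL);
    if (b->avx_fast_unaligned_load)
        return MEMCPY_AVX_UNALIGNED;   /* step 1 */
    if (b->fast_unaligned_load)
        return MEMCPY_SSE2_UNALIGNED;  /* step 2 */
    if (!b->ssse3)
        return MEMCPY_SSE2;            /* step 3 */
    if (b->fast_copy_backward)
        return MEMCPY_SSSE3_BACK;      /* step 4 */
    return MEMCPY_SSSE3;               /* step 5 */
}
```

Because the checks run in order, Fast_Copy_Backward only matters on processors that have SSSE3 and lack both unaligned-load bits.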
Comment 8 H.J. Lu 2016-03-04 16:41:35 UTC
Fixed on master so far.
Comment 9 cvs-commit@gcc.gnu.org 2016-04-02 17:13:49 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created
        at  4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe

commit 4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=997e6c0db2c351f4a7b688c3134c1f77a0aa49de

commit 997e6c0db2c351f4a7b688c3134c1f77a0aa49de
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    It also fixes the placement of __mempcpy_erms and __memmove_erms.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Change function suffix from unaligned_2 to
    	unaligned.  Provide alias for __memcpy_chk in libc.a.  Provide
    	alias for memcpy in libc.a and ld.so.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ff8c6a7b53c5bb28ac3d3e0ae8da8099491b16c

commit 0ff8c6a7b53c5bb28ac3d3e0ae8da8099491b16c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.
    	Properly check USE_MULTIARCH on __memset symbols.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
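
Key feature 1 of the memset commit above (overlapping stores instead of a branch on the exact size) can be illustrated in portable C. This is a sketch under assumptions, not the glibc implementation: a 16-byte array stands in for one SSE2 register, `memcpy` of `VEC_SIZE` bytes stands in for one unaligned vector store, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* stand-in for a 16-byte vector register */

/* Fill n bytes at dst with c, for VEC_SIZE <= n <= 2 * VEC_SIZE.
   Two fixed-size stores cover every length in that range: one at the
   start and one ending exactly at dst + n.  They overlap in the middle
   unless n == 2 * VEC_SIZE, so no per-length branch is needed.  */
static void set_vec_to_2x_vec(unsigned char *dst, int c, size_t n)
{
    unsigned char vec[VEC_SIZE];

    assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
    memset(vec, c, sizeof vec);                 /* broadcast the byte */
    memcpy(dst, vec, VEC_SIZE);                 /* first VEC_SIZE bytes */
    memcpy(dst + n - VEC_SIZE, vec, VEC_SIZE);  /* last VEC_SIZE bytes */
}
```

For n = 23, for example, the two stores cover bytes 0..15 and 7..22; the overlapping bytes are simply written twice with the same value.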

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same codes when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up by large vector
    register size.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load  all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
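
Key feature 2 of the memmove commit above explains why small sizes need no overlap check: if every source byte is loaded before any byte is stored, overlapping buffers cannot corrupt the copy. A minimal C sketch of that idea follows, assuming two 16-byte arrays as stand-ins for vector registers and a hypothetical function name. The commit describes sizes up to 8 vectors; only the 1 to 2 vector case is shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* stand-in for a 16-byte vector register */

/* Move n bytes from src to dst, for VEC_SIZE <= n <= 2 * VEC_SIZE.
   Both "registers" are loaded before anything is stored, so the result
   is correct even when the two buffers overlap.  */
static void move_vec_to_2x_vec(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
    unsigned char head[VEC_SIZE], tail[VEC_SIZE];

    assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
    memcpy(head, src, VEC_SIZE);                 /* load first vector */
    memcpy(tail, src + n - VEC_SIZE, VEC_SIZE);  /* load last vector */
    memcpy(dst, head, VEC_SIZE);                 /* store first vector */
    memcpy(dst + n - VEC_SIZE, tail, VEC_SIZE);  /* store last vector */
}
```

The same function handles dst above src and dst below src, which is the property that lets memcpy be an alias of memmove in this size range.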

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
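
The ERMS feature bit mentioned in the commit above is reported in CPUID leaf 7, subleaf 0, EBX bit 9. A user-space check can be sketched as follows, assuming a GCC or Clang toolchain that provides `__get_cpuid_count` in `<cpuid.h>`. The function names are hypothetical, and this is not how glibc itself reads the bit (it uses its own cpu-features code).

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* ERMS: CPUID.(EAX=07H, ECX=0):EBX bit 9. */
#define ERMS_EBX_BIT (1u << 9)

/* Decode the ERMS bit from an EBX value returned by CPUID leaf 7. */
static bool ebx_has_erms(unsigned int ebx)
{
    return (ebx & ERMS_EBX_BIT) != 0;
}

/* Query the running processor; reports false on non-x86 targets or
   when CPUID leaf 7 is not supported.  */
static bool cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ebx_has_erms(ebx);
#else
    return false;
#endif
}
```

Keeping the bit decoding separate from the CPUID call makes the decoding testable on any machine, regardless of what the host processor reports.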

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229

commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6

commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date:   Fri Mar 25 11:11:42 2016 +0100

    tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
    
    	[BZ #19860]
    	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
    	zero if the compiler does not provide the AVX512F bit.
    
    (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa

commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483

commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4

commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of the cpuid
    array on the feature array, especially in assembly code.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
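
The build-time check described in this commit message can be illustrated with a small sketch.  The macro values, the struct mini_cpu_features type and the extra `cf` parameter below are assumptions made for illustration; the real definitions live in sysdeps/x86/cpu-features.h and differ in detail.  The point is only that token pasting with a `cpu` or `arch` infix makes a mismatched macro expand to an undefined name.

```c
#include <assert.h>

/* Hypothetical miniature of the renamed macros; values are made up. */
#define index_cpu_SSSE3            0
#define bit_cpu_SSSE3              (1u << 9)
#define index_arch_Fast_Rep_String 0
#define bit_arch_Fast_Rep_String   (1u << 0)

/* Hypothetical stand-in for struct cpu_features: one word per array. */
struct mini_cpu_features {
    unsigned int cpuid[1];
    unsigned int feature[1];
};

/* Each macro pastes its own infix, so it can only name bits that were
   defined for its own array. */
#define HAS_CPU_FEATURE(cf, name) \
    (((cf)->cpuid[index_cpu_##name] & (bit_cpu_##name)) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
    (((cf)->feature[index_arch_##name] & (bit_arch_##name)) != 0)

static int uses_fast_rep_string(const struct mini_cpu_features *cf)
{
    /* HAS_CPU_FEATURE (cf, Fast_Rep_String) would fail to compile here:
       it expands to index_cpu_Fast_Rep_String, which is not defined. */
    return HAS_ARCH_FEATURE(cf, Fast_Rep_String);
}
```

With the old unprefixed names, both macros would have accepted Fast_Rep_String and silently tested the wrong array.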

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9

commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date:   Tue Mar 8 12:31:13 2016 -0800

    Fix tst-audit10 build when -mavx512f is not supported.
    
    (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c

commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Mar 7 16:00:25 2016 +0100

    tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]
    
    This ensures that GCC will not use unsupported instructions before
    the run-time check to ensure support.
    
    (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e

commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29

commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing selection
    order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
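
The five-step selection order in this commit message can be restated as a C sketch.  This is not the selector itself, which is written in assembly in sysdeps/x86_64/multiarch/memcpy.S; the struct copy_features type and the select_memcpy function are hypothetical names, and the function returns the variant name as a string rather than a function pointer.

```c
#include <stdbool.h>

/* Hypothetical summary of the feature bits the selector consults. */
struct copy_features {
    bool avx_fast_unaligned_load;
    bool fast_unaligned_load;
    bool ssse3;
    bool fast_copy_backward;
};

/* Return the name of the variant the documented order would pick. */
static const char *select_memcpy(const struct copy_features *f)
{
    if (f->avx_fast_unaligned_load)
        return "__memcpy_avx_unaligned";   /* step 1 */
    if (f->fast_unaligned_load)
        return "__memcpy_sse2_unaligned";  /* step 2 */
    if (!f->ssse3)
        return "__memcpy_sse2";            /* step 3 */
    if (f->fast_copy_backward)
        return "__memcpy_ssse3_back";      /* step 4 */
    return "__memcpy_ssse3";               /* step 5 */
}
```

Note that Slow_BSF no longer appears anywhere in the decision, which is the substance of the fix.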

-----------------------------------------------------------------------
Comment 10 cvs-commit@gcc.gnu.org 2016-04-02 19:31:09 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  7962f7b04a6374b36d1df15c0c7c8f5747e2e85f (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7962f7b04a6374b36d1df15c0c7c8f5747e2e85f

commit 7962f7b04a6374b36d1df15c0c7c8f5747e2e85f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=40d52d834531b7a4315b68155ee3daec3cdceb46

commit 40d52d834531b7a4315b68155ee3daec3cdceb46
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    It also fixes the placement of __mempcpy_erms and __memmove_erms.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Change function suffix from unaligned_2 to
    	unaligned.  Provide alias for __memcpy_chk in libc.a.  Provide
    	alias for memcpy in libc.a and ld.so.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a61bbdcc906231982398239ec38f193a7522af5b

commit a61bbdcc906231982398239ec38f193a7522af5b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.
    	Properly check USE_MULTIARCH on __memset symbols.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times the vector register size and REP_STOSB_THRESHOLD, which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times the vector register size, fully unroll the loop.
    3. For size > 4 times the vector register size, store 4 times the vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
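
Key feature 1 of this commit message, the overlapping store, can be sketched in C.  This is an illustration under assumptions, not the assembly from memset-vec-unaligned-erms.S: VEC_SIZE is fixed at 16, a 16-byte memset stands in for one unaligned vector store, and fill_vec_to_2vec is a hypothetical name.  For any size between one and two vectors, two fixed-size stores cover the range without branching on the exact length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* stand-in for one vector register */

/* Fill n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE, with two stores. */
static void fill_vec_to_2vec(unsigned char *dst, int c, size_t n)
{
    assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
    /* First store covers dst[0 .. VEC_SIZE).  */
    memset(dst, c, VEC_SIZE);
    /* Second store ends exactly at dst + n; it may overlap the first,
       which is harmless because both write the same byte value.  */
    memset(dst + n - VEC_SIZE, c, VEC_SIZE);
}
```

The same idea extends to larger size classes by adding more store pairs, one anchored at the start and one at the end.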

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times the vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times the vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times the vector register size and REP_MOVSB_THRESHOLD, which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times the vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times the vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    the vector register size at a time.
    5. Otherwise, forward copy 8 times the vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times the vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
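
Key features 1 and 2 of this commit message explain why small sizes need no overlap check: every source byte is loaded before any destination byte is written.  A minimal C sketch of that idea follows, assuming VEC_SIZE of 16 and using local buffers in place of vector registers; move_vec_to_2vec is a hypothetical name and covers only the one-to-two-vector size class, not the assembly in memmove-vec-unaligned-erms.S.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* stand-in for one vector register */

/* Move n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE.  The source and
   destination ranges may overlap in either direction.  */
static void move_vec_to_2vec(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    unsigned char head[VEC_SIZE], tail[VEC_SIZE];  /* "registers" */

    assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
    /* Load the first and last vector of the source; the two loads may
       overlap each other.  */
    memcpy(head, src, VEC_SIZE);
    memcpy(tail, src + n - VEC_SIZE, VEC_SIZE);
    /* Only now write the destination, so no source byte is clobbered
       before it has been read.  */
    memcpy(dst, head, VEC_SIZE);
    memcpy(dst + n - VEC_SIZE, tail, VEC_SIZE);
}
```

Because the result is correct for overlapping ranges, the same code can serve both memcpy and memmove for this size class.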

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
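
For readers outside glibc, the CPUID feature bit mentioned in this commit message can be queried from user code.  The sketch below is not how glibc does it; it assumes a GCC or Clang toolchain that provides `__get_cpuid_count` in <cpuid.h>, and the helper names are hypothetical.  ERMS is reported in CPUID leaf 7, subleaf 0, register EBX, bit 9.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* ERMS is bit 9 of EBX for CPUID leaf 7, subleaf 0. */
#define ERMS_EBX_BIT (1u << 9)

/* Pure helper: test the ERMS bit in an EBX value. */
static bool ebx_has_erms(unsigned int ebx)
{
    return (ebx & ERMS_EBX_BIT) != 0;
}

/* Query the running processor; reports false on non-x86 builds or when
   CPUID leaf 7 is not available.  */
static bool cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ebx_has_erms(ebx);
#else
    return false;
#endif
}
```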

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    to Intel CPUs, as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.
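
The family and model decoding described in this commit message can be sketched as a standalone C function.  The struct cpu_sig type and decode_signature name are hypothetical; the bit positions are the standard CPUID leaf 1 EAX layout (model in bits 7:4, family in bits 11:8, extended model in bits 19:16, extended family in bits 27:20).  This only covers the family == 0x0f adjustment named above, not any further per-vendor handling.

```c
/* Hypothetical holder for the decoded CPUID leaf 1 signature. */
struct cpu_sig {
    unsigned int family;
    unsigned int model;
    unsigned int extended_model;  /* already shifted into the high nibble */
};

static struct cpu_sig decode_signature(unsigned int eax)
{
    struct cpu_sig s;

    s.family = (eax >> 8) & 0x0f;
    s.model = (eax >> 4) & 0x0f;
    /* Bits 19:16, moved down so they form the high nibble of the model. */
    s.extended_model = (eax >> 12) & 0xf0;

    if (s.family == 0x0f) {
        /* Add the extended family (bits 27:20) and extended model. */
        s.family += (eax >> 20) & 0xff;
        s.model += s.extended_model;
    }
    return s;
}
```

For example, a signature of 0x00600F12 has base family 0x0f and extended family 0x06, so the sketch reports family 0x15.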

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of the cpuid
    array on the feature array, especially in assembly code.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing selection
    order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
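
The five-step selection order in this commit message can be written as a
C sketch.  The real selector is the assembly in memcpy.S; struct features
and select_memcpy below are hypothetical names used only to show the order
of the checks.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the feature bits tested by the selector.  */
struct features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the implementation picked by the listed order.  */
static const char *
select_memcpy (const struct features *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Each check is reached only if all earlier ones failed, so Slow_BSF plays
no part in the decision.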

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows an improvement of up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to the Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT maps into the lower 2GB of
    the address space, not the lower 4GB.  Prefer_MAP_32BIT_EXEC reduces
    the bits available for address space layout randomization (ASLR); it
    is always disabled for SUID programs and can only be enabled by setting
    the environment variable LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.
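
The "try MAP_32BIT first" behaviour described above might look roughly
like the following.  This is a sketch under stated assumptions, not the
code in the new mmap.c: mmap_prefer_32bit and its prefer_32bit parameter
are hypothetical, and the sketch only handles anonymous mappings with no
address hint.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Try MAP_32BIT first for executable mappings, then fall back to an
   ordinary mmap.  prefer_32bit is a hypothetical stand-in for the
   Prefer_MAP_32BIT_EXEC bit.  */
static void *
mmap_prefer_32bit (size_t len, int prot, int flags, int prefer_32bit)
{
#ifdef MAP_32BIT
  if (prefer_32bit && (prot & PROT_EXEC) != 0)
    {
      void *p = mmap (NULL, len, prot, flags | MAP_32BIT, -1, 0);
      if (p != MAP_FAILED)
        return p;
    }
#else
  (void) prefer_32bit;   /* MAP_32BIT is not available on this target.  */
#endif
  return mmap (NULL, len, prot, flags, -1, 0);
}
```

If the low address range is exhausted, the first call fails and the second
call places the mapping anywhere, so the preference never makes mmap fail
by itself.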

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 11 cvs-commit@gcc.gnu.org 2016-04-05 17:06:12 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created
        at  9910c54c2e97b6c36f8593097e53d5e09f837a69 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9910c54c2e97b6c36f8593097e53d5e09f837a69

commit 9910c54c2e97b6c36f8593097e53d5e09f837a69
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3429d9dd330a5c140cb37e77e7c388a71fdb44f1

commit 3429d9dd330a5c140cb37e77e7c388a71fdb44f1
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Change function suffix from unaligned_2 to
    	unaligned.  Provide alias for __memcpy_chk in libc.a.  Provide
    	alias for memcpy in libc.a and ld.so.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c36cac64f6855f1f4ff007beaca3cb766e694ec

commit 7c36cac64f6855f1f4ff007beaca3cb766e694ec
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.
    	Properly check USE_MULTIARCH on __memset symbols.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=69b122e1149e158c382c2b0bdd4591a4a19cb505

commit 69b122e1149e158c382c2b0bdd4591a4a19cb505
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memmove on large data
    
    memcpy/memmove benchmarks with large data show a regression on
    Haswell machines.  Using non-temporal stores in memmove on large
    data can improve performance significantly.  This patch adds a
    threshold for non-temporal stores, set to 4 times the shared cache
    size.  When the size is above the threshold, non-temporal stores
    are used.
    
    For sizes below 8 times the vector register width, we load all data
    into registers and store them together.  Only forward and backward
    loops, which move 4 vector registers at a time, are used to support
    overlapping addresses.  For the forward loop, we load the last 4 vector
    register widths of data and the first vector register width of data
    into vector registers before the loop and store them after the loop.
    For the backward loop, we load the first 4 vector register widths of
    data and the last vector register width of data into vector registers
    before the loop and store them after the loop.
    
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	4 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(PREFETCHNT): New.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.  Rewrite to use forward and backward loops, which move
    	4 vector registers at a time, to support overlapping addresses
    	and use non temporal store if size is above the threshold.
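
The threshold rule from this commit message reduces to a small piece of
arithmetic, sketched below.  The function names are hypothetical; in glibc
the value is stored in __x86_shared_non_temporal_threshold by
init_cacheinfo.

```c
#include <stdbool.h>
#include <stddef.h>

/* The threshold is 4 times the shared cache size, per the commit
   message.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 4;
}

/* Non-temporal stores are used only when size is above the threshold.  */
static bool
use_non_temporal_store (size_t size, size_t shared_cache_size)
{
  return size > non_temporal_threshold (shared_cache_size);
}
```

A copy of exactly the threshold size still uses ordinary stores, since the
comparison is strict.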

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb

commit 9a93bdbaff81edf67c5486c84f8098055e355abb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483

commit 5118e532600549ad0f56cb9b1a179b8eab70c483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292

commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc

commit a96379797a7eecc1b709cad7b68981eb698783dc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times the vector register size and REP_STOSB_THRESHOLD, which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
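
Key feature 1 above (overlapping stores instead of a branch) can be
illustrated in C.  This is a sketch, not the assembly in
memset-vec-unaligned-erms.S: VEC_SIZE is assumed to be 16, and store_vec
and memset_overlap are hypothetical helpers, with an ordinary memset
standing in for a single unaligned vector store.

```c
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16   /* Assumed vector width for this sketch.  */

/* Stand-in for one unaligned vector store.  */
static void
store_vec (unsigned char *p, int c)
{
  memset (p, c, VEC_SIZE);
}

/* Fill n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE, with two stores that
   may overlap in the middle, so no branch on the exact size is needed.
   Returns 0 on success, -1 if n is outside the supported range.  */
static int
memset_overlap (unsigned char *dst, int c, size_t n)
{
  if (n < VEC_SIZE || n > 2 * VEC_SIZE)
    return -1;
  store_vec (dst, c);                  /* first VEC_SIZE bytes */
  store_vec (dst + n - VEC_SIZE, c);   /* last VEC_SIZE bytes */
  return 0;
}
```

For n = 23 the two stores cover bytes 0..15 and 7..22; the bytes written
twice get the same value, so the overlap is harmless.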

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
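
Key features 1 and 2 above explain why small sizes need no overlap check:
every load completes before the first store.  The sketch below shows the
idea for two vectors only.  VEC_SIZE is assumed to be 16, memmove_small is
a hypothetical helper, and the local buffers stand in for vector
registers.

```c
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16   /* Assumed vector width for this sketch.  */

/* Copy n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE.  Both vectors are loaded
   before either is stored, so the result is correct even if source and
   destination overlap; the two vectors themselves may overlap in the
   middle.  Returns 0 on success, -1 if n is out of range.  */
static int
memmove_small (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[VEC_SIZE], tail[VEC_SIZE];

  if (n < VEC_SIZE || n > 2 * VEC_SIZE)
    return -1;
  memcpy (head, src, VEC_SIZE);                  /* load first vector */
  memcpy (tail, src + n - VEC_SIZE, VEC_SIZE);   /* load last vector */
  memcpy (dst, head, VEC_SIZE);                  /* store first vector */
  memcpy (dst + n - VEC_SIZE, tail, VEC_SIZE);   /* store last vector */
  return 0;
}
```

The same pattern, extended to 8 vectors, covers every size that the commit
message handles without an overlap check.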

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
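
For reference, the ERMS feature bit is reported in CPUID leaf 7, subleaf
0, EBX bit 9.  A user-space check might look like the sketch below, which
assumes a GCC-compatible compiler providing <cpuid.h>.  cpu_has_erms and
ERMS_BIT are hypothetical names and are unrelated to the bit_cpu_ERMS,
index_cpu_ERMS and reg_ERMS macros added by the patch.

```c
#if defined (__x86_64__) || defined (__i386__)
# include <cpuid.h>
#endif

#define ERMS_BIT (1u << 9)   /* CPUID.(EAX=7,ECX=0):EBX bit 9 */

/* Return 1 if the processor reports ERMS, 0 otherwise (including on
   non-x86 targets, where the check is compiled out).  */
static int
cpu_has_erms (void)
{
#if defined (__x86_64__) || defined (__i386__)
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return (ebx & ERMS_BIT) != 0;
#else
  return 0;
#endif
}
```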

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)
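
The effect of making memcpy an alias of memmove can be shown in C with the
GCC alias attribute.  The patch does this in assembly; the sketch below is
only an analogy, it assumes GCC or Clang on an ELF target, and
sketch_memmove and sketch_memcpy are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* The single real implementation.  */
void *
sketch_memmove (void *dst, const void *src, size_t n)
{
  return memmove (dst, src, n);
}

/* A second symbol for the same code; no extra function body is
   emitted.  */
void *sketch_memcpy (void *dst, const void *src, size_t n)
  __attribute__ ((alias ("sketch_memmove")));
```

Both symbols resolve to the same address, which is where the code size
saving comes from.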

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)
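
mempcpy differs from memcpy only in its return value (the end of the
destination rather than its start), which is why the two can share most of
their code.  A C sketch of that sharing, with hypothetical names
copy_common, sketch_memcpy2 and sketch_mempcpy:

```c
#include <stddef.h>
#include <string.h>

/* Shared copy body: the caller chooses the value to return.  */
static void *
copy_common (void *dst, const void *src, size_t n, void *ret)
{
  memcpy (dst, src, n);
  return ret;
}

/* memcpy entry point: returns the start of the destination.  */
static void *
sketch_memcpy2 (void *dst, const void *src, size_t n)
{
  return copy_common (dst, src, n, dst);
}

/* mempcpy entry point: returns one past the last byte written.  */
static void *
sketch_mempcpy (void *dst, const void *src, size_t n)
{
  return copy_common (dst, src, n, (char *) dst + n);
}
```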

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229

commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6

commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date:   Fri Mar 25 11:11:42 2016 +0100

    tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
    
    	[BZ #19860]
    	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
    	zero if the compiler does not provide the AVX512F bit.
    
    (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa

commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483

commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4

commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of cpuid
    array on feature array, especially in assembly codes.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
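
A minimal C sketch of why the rename catches the mistake at build time. The struct layout, the bit positions and the two-argument macro form are simplified assumptions for illustration, not the definitions in cpu-features.h. Because each macro pastes its own prefix onto the feature name, HAS_CPU_FEATURE (cf, Fast_Rep_String) would expand to the undeclared index_cpu_Fast_Rep_String and fail to compile.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for struct cpu_features.  */
struct cpu_features_sketch
{
  unsigned int cpuid[1];    /* raw CPUID bits */
  unsigned int feature[1];  /* derived tuning bits */
};

/* Only the cpu_ variant exists for SSSE3, only the arch_ variant
   for Fast_Rep_String.  Bit values are illustrative.  */
#define index_cpu_SSSE3             0
#define bit_cpu_SSSE3               (1u << 9)
#define index_arch_Fast_Rep_String  0
#define bit_arch_Fast_Rep_String    (1u << 0)

#define HAS_CPU_FEATURE(cf, name) \
  (((cf)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
  (((cf)->feature[index_arch_##name] & bit_arch_##name) != 0)

/* Correct usage: Fast_Rep_String lives in the feature array.  */
static int
use_rep_string (const struct cpu_features_sketch *cf)
{
  assert (cf != 0);
  return HAS_ARCH_FEATURE (cf, Fast_Rep_String);
}
```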

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9

commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date:   Tue Mar 8 12:31:13 2016 -0800

    Fix tst-audit10 build when -mavx512f is not supported.
    
    (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c

commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Mar 7 16:00:25 2016 +0100

    tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]
    
    This ensures that GCC will not use unsupported instructions before
    the run-time check to ensure support.
    
    (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e

commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29

commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing selection
    order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
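
The five-step selection order in this commit can be sketched in C as follows. The real selector is assembly in sysdeps/x86_64/multiarch/memcpy.S; the struct memcpy_features and select_memcpy names here are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical snapshot of the bits the selector consults.  */
struct memcpy_features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the variant chosen by the order listed above.  */
static const char *
select_memcpy (const struct memcpy_features *f)
{
  assert (f != NULL);
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";      /* step 1 */
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";     /* step 2 */
  if (!f->ssse3)
    return "__memcpy_sse2";               /* step 3 */
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";         /* step 4 */
  return "__memcpy_ssse3";                /* step 5 */
}
```

Note that Slow_BSF no longer appears anywhere in the decision, which is the point of the fix.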

-----------------------------------------------------------------------
Comment 12 cvs-commit@gcc.gnu.org 2016-04-05 21:17:40 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64

commit f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d5af940569c5c48835acdf6c8c47451e1e92c817

commit d5af940569c5c48835acdf6c8c47451e1e92c817
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Change function suffix from unaligned_2 to
    	unaligned.  Provide alias for __memcpy_chk in libc.a.  Provide
    	alias for memcpy in libc.a and ld.so.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3ad9dce564d95ac817f86cad1bb4f0bc29c58f5f

commit 3ad9dce564d95ac817f86cad1bb4f0bc29c58f5f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.
    	Properly check USE_MULTIARCH on __memset symbols.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e4908bee33dc0aed48835c1884387b5e942963

commit e1e4908bee33dc0aed48835c1884387b5e942963
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memmove on large data
    
    memcpy/memmove benchmarks with large data show that there is a
    regression with large data on Haswell machines.  Using non-temporal
    stores in memmove on large data can improve performance significantly.
    This patch adds a threshold for using non-temporal stores, which is 4
    times the shared cache size.  When size is above the threshold,
    non-temporal stores will be used.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	4 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(PREFETCHNT): New.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.  Rewrite to use forward and backward loops, which move
    	4 vector registers at a time, to support overlapping addresses
    	and use non temporal store if size is above the threshold.
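
A minimal C sketch of the threshold rule described in this commit, assuming the shared cache size is already known. The function names are hypothetical; the actual variable is __x86_shared_non_temporal_threshold, set in init_cacheinfo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold is 4 times the shared cache size, per the commit text.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 4;
}

/* Non-temporal stores are used only when size is strictly above
   the threshold.  */
static bool
use_non_temporal_store (size_t size, size_t shared_cache_size)
{
  assert (shared_cache_size > 0);
  return size > non_temporal_threshold (shared_cache_size);
}
```

With an 8 MB shared cache, for example, copies larger than 32 MB would take the non-temporal path under this rule.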

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove because that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same codes when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
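
Key feature 1 above (overlapping stores instead of a branch) can be sketched in portable C. This is an illustration only, assuming a 16-byte vector size and using memcpy of a 16-byte buffer as a stand-in for one unaligned vector store; memset_vec_to_2vec is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16

/* Fill n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE, with two stores and no
   branch on the exact length: the second store is anchored at the end
   of the range and overlaps the first whenever n < 2 * VEC_SIZE.  */
static void
memset_vec_to_2vec (unsigned char *dst, int c, size_t n)
{
  unsigned char vec[VEC_SIZE];
  assert (n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  memset (vec, c, sizeof vec);                 /* broadcast c into the "register" */
  memcpy (dst, vec, VEC_SIZE);                 /* store at the start */
  memcpy (dst + n - VEC_SIZE, vec, VEC_SIZE);  /* store ending exactly at dst + n */
}
```

Writing some bytes twice is cheaper than branching on n, which is why the same two stores serve every length in the range.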

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same codes when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up by large vector
    register size.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
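
Key features 1 and 2 above explain why small sizes need no overlap check: every load completes before any store. A portable C sketch for sizes between one and two vector widths, assuming a 16-byte vector size and using local buffers as stand-ins for vector registers (memmove_vec_to_2vec is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16

/* Copy n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE.  Both source vectors
   are loaded before either store, so the result is correct even when
   src and dst overlap in either direction.  */
static void
memmove_vec_to_2vec (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[VEC_SIZE], tail[VEC_SIZE];
  assert (n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  memcpy (head, src, VEC_SIZE);                 /* load first vector */
  memcpy (tail, src + n - VEC_SIZE, VEC_SIZE);  /* load last vector, may overlap head */
  memcpy (dst, head, VEC_SIZE);                 /* store first vector */
  memcpy (dst + n - VEC_SIZE, tail, VEC_SIZE);  /* store last vector */
}
```

The same idea extends to 8 vector widths by loading all vectors first, at the cost of more registers.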

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
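
For reference, ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9. The sketch below checks that bit from user code with the compiler-provided <cpuid.h> helper __get_cpuid_count (GCC 7 or later, and Clang); the function names and the BIT_ERMS_SKETCH macro are hypothetical and are not the glibc definitions added by this commit.

```c
#include <stdbool.h>

#if defined (__x86_64__) || defined (__i386__)
# include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):EBX bit 9 is Enhanced REP MOVSB/STOSB.  */
#define BIT_ERMS_SKETCH (1u << 9)

/* Pure helper: test the ERMS bit in a leaf-7 EBX value.  */
static bool
ebx_has_erms (unsigned int leaf7_ebx)
{
  return (leaf7_ebx & BIT_ERMS_SKETCH) != 0;
}

/* Query the running CPU; reports false on non-x86 targets or when
   leaf 7 is not supported.  */
static bool
cpu_has_erms (void)
{
#if defined (__x86_64__) || defined (__i386__)
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return ebx_has_erms (ebx);
#endif
  return false;
}
```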

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs, as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.
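
The family and model update described in this commit can be sketched in C as follows. This is a minimal illustration based on the standard CPUID leaf 1 EAX layout, not the glibc source; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded CPUID leaf 1 signature.  The struct name is hypothetical.  */
struct cpu_signature {
  unsigned int family;
  unsigned int model;
  unsigned int extended_model;
};

/* Decode family and model from the EAX value returned by CPUID leaf 1.
   Following the commit message, the extended fields are folded in only
   when the base family is 0x0f.  */
static struct cpu_signature decode_signature(uint32_t eax)
{
  struct cpu_signature sig;
  sig.family = (eax >> 8) & 0x0f;           /* bits 11:8 */
  sig.model = (eax >> 4) & 0x0f;            /* bits 7:4 */
  sig.extended_model = (eax >> 12) & 0xf0;  /* bits 19:16, pre-shifted */
  if (sig.family == 0x0f) {
    sig.family += (eax >> 20) & 0xff;       /* extended family, bits 27:20 */
    sig.model += sig.extended_model;
  }
  return sig;
}
```

For example, an EAX value of 0x00660F01 decodes to family 0x15 and model 0x60 under this scheme.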

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to misuse bits and indices of the
    cpuid array on the feature array, especially in assembly code.  For
    example, sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
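
The build-time check that the renaming buys can be illustrated with a small C sketch. The macros below are simplified two-argument stand-ins, not the glibc definitions, and the bit positions are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only; not necessarily the values glibc uses.  */
#define bit_cpu_SSSE3            (1u << 9)
#define bit_arch_Fast_Rep_String (1u << 0)

/* Simplified forms that take the word to test explicitly.  Token pasting
   builds the bit name, so the prefix must match the array being tested.  */
#define HAS_CPU_FEATURE(word, name)  (((word) & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(word, name) (((word) & bit_arch_##name) != 0)

static int uses_fast_rep_string(uint32_t arch_word)
{
  /* HAS_CPU_FEATURE (arch_word, Fast_Rep_String) would expand to the
     undefined name bit_cpu_Fast_Rep_String and fail to compile.  */
  return HAS_ARCH_FEATURE(arch_word, Fast_Rep_String);
}
```

Because no bit_cpu_Fast_Rep_String exists, the bcopy.S style mistake quoted above turns into a compile error instead of silently testing the wrong array.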

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.
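
The difference between overriding and OR-ing a flag word can be sketched as below. The flag value and function names are hypothetical and only illustrate the idea behind this fix.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit position, chosen for illustration only.  */
#define FLAG_PREFER_MAP_32BIT_EXEC (1u << 5)

/* Plain assignment: every other bit in the word is lost.  */
static uint32_t set_flag_overriding(uint32_t word)
{
  (void) word;
  return FLAG_PREFER_MAP_32BIT_EXEC;
}

/* OR: the flag is turned on and the other bits are preserved.  */
static uint32_t set_flag_preserving(uint32_t word)
{
  return word | FLAG_PREFER_MAP_32BIT_EXEC;
}
```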

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing selection
    order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
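
The five-step selection order listed in this commit can be sketched in C as follows. The real selector is written in assembly in memcpy.S; the struct of booleans and the function name here are hypothetical stand-ins for the feature bits it tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the feature bits tested by the selector.  */
struct memcpy_features {
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the implementation chosen by the selection order
   in the commit message.  */
static const char *select_memcpy(const struct memcpy_features *f)
{
  assert(f != NULL);
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Note that Fast_Copy_Backward is only consulted once SSSE3 is known to be available, which is what makes __memcpy_ssse3_back reachable.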

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows an improvement of up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT will map to the lower 2GB, not
    the lower 4GB, of the address space.  Prefer_MAP_32BIT_EXEC reduces the
    bits available for address space layout randomization (ASLR), so it is
    always disabled for SUID programs and can only be enabled by setting the
    environment variable LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 13 cvs-commit@gcc.gnu.org 2016-04-06 20:03:07 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created
        at  c51eab61e17e7575265f1e36bd0293e224500f52 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c51eab61e17e7575265f1e36bd0293e224500f52

commit c51eab61e17e7575265f1e36bd0293e224500f52
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f9478e6530ab0ede00f705e456445aeff283560

commit 7f9478e6530ab0ede00f705e456445aeff283560
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68248ecc51b4725e794236c495effde76d4be61c

commit 68248ecc51b4725e794236c495effde76d4be61c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=095d851c67b7ea5edb536ead965c73fce34b2edd

commit 095d851c67b7ea5edb536ead965c73fce34b2edd
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memmove on large data
    
    memcpy/memmove benchmarks show a regression with large data on Haswell
    machines.  Using non-temporal stores in memmove on large data can improve
    performance significantly.  This patch adds a threshold for using
    non-temporal stores, set to 4 times the shared cache size.  When the
    size is above the threshold, non-temporal stores are used.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	4 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(PREFETCHNT): New.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH_SIZE): New.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold.
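
The threshold rule in this commit can be sketched in C as below. The function names are hypothetical, and the strict "greater than" comparison is an assumption drawn from the wording "above the threshold".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold described in the commit: 4 times the shared cache size.  */
static size_t non_temporal_threshold(size_t shared_cache_size)
{
  return shared_cache_size * 4;
}

/* Assumption: "above the threshold" means strictly greater than it.  */
static bool use_non_temporal_store(size_t size, size_t threshold)
{
  return size > threshold;
}
```

With an 8 MiB shared cache, for instance, this gives a 32 MiB threshold, so only copies larger than that would bypass the cache.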

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d

commit 0932dd8b56db46dd421a4855fb5dee9de092538d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d

commit da2da79262814ba4ead3ee487549949096d8ad2d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb

commit 9a93bdbaff81edf67c5486c84f8098055e355abb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483

commit 5118e532600549ad0f56cb9b1a179b8eab70c483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292

commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc

commit a96379797a7eecc1b709cad7b68981eb698783dc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
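
Key feature 1 of this commit, the overlapping store, can be sketched in C for the case where the size lies between one and two vector widths. This is an illustration only: a 16-byte memset stands in for one unaligned vector store, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16u

/* Stand-in for a single unaligned vector store of VEC_SIZE bytes.  */
static void store_vec(unsigned char *p, int c)
{
  memset(p, c, VEC_SIZE);
}

/* Fill n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE, using exactly two
   stores.  The second store is anchored at the end of the buffer and may
   overlap the first, so no branch on the exact size is needed.  */
static void fill_vec_to_2x(unsigned char *dst, int c, size_t n)
{
  assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  store_vec(dst, c);
  store_vec(dst + n - VEC_SIZE, c);
}
```

For n = 21, the first store covers bytes 0 to 15 and the second covers bytes 5 to 20, so every byte is written and nothing outside the range is touched.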

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
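
Key features 1 and 2 of this commit explain why small sizes need no overlap check: all loads finish before any store begins. A minimal C sketch for sizes between one and two vector widths follows. Local arrays stand in for vector registers, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16u

/* Move n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE.  Both source chunks
   are loaded into temporaries (standing in for vector registers) before
   anything is stored, so the result is correct even when the source and
   destination ranges overlap.  */
static void move_vec_to_2x(unsigned char *dst, const unsigned char *src,
                           size_t n)
{
  unsigned char head[VEC_SIZE];
  unsigned char tail[VEC_SIZE];

  assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  memcpy(head, src, VEC_SIZE);                 /* load first vector */
  memcpy(tail, src + n - VEC_SIZE, VEC_SIZE);  /* load last vector */
  memcpy(dst, head, VEC_SIZE);                 /* store first vector */
  memcpy(dst + n - VEC_SIZE, tail, VEC_SIZE);  /* store last vector */
}
```

The two loads and the two stores overlap each other in the middle of the range, which is the same trick the overlapping load and store in feature 1 relies on.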

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of the code.
    This reduces the code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229

commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6

commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date:   Fri Mar 25 11:11:42 2016 +0100

    tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
    
    	[BZ #19860]
    	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
    	zero if the compiler does not provide the AVX512F bit.
    
    (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa

commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483

commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4

commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to mistakenly use bits and indices
    of the cpuid array on the feature array, especially in assembly code.
    For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
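
The build-time check described in the commit above can be illustrated with
a minimal C sketch.  This is not the glibc header: the struct layout, the
index values and the helper macro signatures are assumptions made for
illustration (only the index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_*
naming scheme comes from the commit message).  Because each macro pastes a
different prefix, passing an arch feature name to the cpu macro expands to
an undefined identifier and fails to compile.

```c
/* Hypothetical, simplified stand-in for struct cpu_features.  */
struct x86_features
{
  unsigned int cpuid[1];    /* raw CPUID bits (illustrative layout) */
  unsigned int feature[1];  /* derived "arch" preference bits */
};

/* Names follow the prefixed scheme; the values are illustrative.  */
#define index_cpu_SSSE3             0
#define bit_cpu_SSSE3               (1u << 9)
#define index_arch_Fast_Rep_String  0
#define bit_arch_Fast_Rep_String    (1u << 0)

/* Each macro pastes its own prefix onto the feature name.  */
#define HAS_CPU_FEATURE(f, name) \
  (((f)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(f, name) \
  (((f)->feature[index_arch_##name] & bit_arch_##name) != 0)

/* HAS_CPU_FEATURE (f, Fast_Rep_String) would not compile here, since
   index_cpu_Fast_Rep_String and bit_cpu_Fast_Rep_String are undefined.  */
```

With unprefixed index_*/bit_* names, the same mistake would have compiled
and silently tested the wrong array.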

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9

commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date:   Tue Mar 8 12:31:13 2016 -0800

    Fix tst-audit10 build when -mavx512f is not supported.
    
    (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c

commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Mar 7 16:00:25 2016 +0100

    tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]
    
    This ensures that GCC will not use unsupported instructions before
    the run-time check to ensure support.
    
    (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e

commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29

commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
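
The selection order listed in the commit above can be sketched in C as
follows.  This is only an illustration of the decision sequence: the real
selector is written in assembly in sysdeps/x86_64/multiarch/memcpy.S, and
the struct, its field names and the function select_memcpy are
hypothetical.  The returned strings are the implementation names from the
commit message.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the feature bits the selector tests.  */
struct copy_features
{
  bool avx_fast_unaligned_load;  /* AVX_Fast_Unaligned_Load */
  bool fast_unaligned_load;      /* Fast_Unaligned_Load */
  bool ssse3;                    /* SSSE3 */
  bool fast_copy_backward;       /* Fast_Copy_Backward */
};

/* Return the name of the implementation chosen by the 5-step order.  */
const char *
select_memcpy (const struct copy_features *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";   /* step 1 */
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";  /* step 2 */
  if (!f->ssse3)
    return "__memcpy_sse2";            /* step 3 */
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";      /* step 4 */
  return "__memcpy_ssse3";             /* step 5 */
}
```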

-----------------------------------------------------------------------
Comment 14 cvs-commit@gcc.gnu.org 2016-04-06 20:13:12 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  34f2cbf8ca6ee99f36229315fb03c27e3acd805d (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=34f2cbf8ca6ee99f36229315fb03c27e3acd805d

commit 34f2cbf8ca6ee99f36229315fb03c27e3acd805d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=912d6a93556739773b511766c2ca95fb293f5566

commit 912d6a93556739773b511766c2ca95fb293f5566
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7dfa91a07593740cf3ad71060300b1cc38ac2910

commit 7dfa91a07593740cf3ad71060300b1cc38ac2910
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.
    	Properly check USE_MULTIARCH on __memset symbols.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5b3e44eeb5ae74fb4a1c353db7e8a5ee18ccdb10

commit 5b3e44eeb5ae74fb4a1c353db7e8a5ee18ccdb10
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memmove on large data
    
    memcpy/memmove benchmarks show a regression with large data on
    Haswell machines.  Using non-temporal stores in memmove on large
    data can improve performance significantly.  This patch adds a
    threshold for non-temporal stores, set to 4 times the shared cache
    size.  When the size is above the threshold, non-temporal stores
    are used.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	4 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(PREFETCHNT): New.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(PREFETCHNT): Likewise.
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH_SIZE): New.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold.
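
The threshold rule in the commit above amounts to a single comparison.  A
minimal sketch, assuming the shared cache size is already known in bytes;
the function names are hypothetical, and only the factor of 4 and the
"above the threshold" condition come from the commit message:

```c
#include <stdbool.h>
#include <stddef.h>

/* Threshold is 4 times the shared cache size, per the commit message.  */
size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 4;
}

/* Non-temporal stores are used only when the copy size is above the
   threshold; smaller copies keep using regular stores.  */
bool
use_non_temporal_store (size_t copy_size, size_t shared_cache_size)
{
  return copy_size > non_temporal_threshold (shared_cache_size);
}
```

For example, with an 8 MiB shared cache the threshold would be 32 MiB, so
a 32 MiB copy still uses regular stores and anything larger switches to
non-temporal stores.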

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times the vector register size and REP_STOSB_THRESHOLD, which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
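
Key feature 1 in the commit above (overlapping store to avoid a branch) can
be sketched in portable C.  This is not the glibc code, which uses vector
registers in assembly: here a 16-byte buffer and memcpy stand in for a
vector register and an unaligned vector store, and the function name
fill_overlapping is hypothetical.  For any size from VEC_SIZE to
2 * VEC_SIZE, two fixed-size stores cover the whole range, and the second
simply overlaps the first when the size is not an exact multiple.

```c
#include <stddef.h>
#include <string.h>

enum { VEC_SIZE = 16 };  /* stand-in for one SSE2 register width */

/* Fill N bytes at DST with C, for VEC_SIZE <= N <= 2 * VEC_SIZE,
   using exactly two VEC_SIZE-byte stores and no size-dependent branch.  */
void
fill_overlapping (unsigned char *dst, int c, size_t n)
{
  unsigned char vec[VEC_SIZE];

  memset (vec, c, VEC_SIZE);                    /* broadcast C into "register" */
  memcpy (dst, vec, VEC_SIZE);                  /* store the first VEC_SIZE bytes */
  memcpy (dst + n - VEC_SIZE, vec, VEC_SIZE);   /* store the last VEC_SIZE bytes */
}
```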

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
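For readers outside glibc, the same CPUID bit can be queried from user code. The sketch below is not the glibc implementation; it assumes a GCC or Clang toolchain that provides `<cpuid.h>`, and the function and macro names are hypothetical. ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
# include <cpuid.h>
#endif

/* ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9.  */
#define CPUID_LEAF7_EBX_ERMS (1u << 9)

bool
cpu_has_erms (void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;

  /* __get_cpuid_count returns 0 when the requested leaf is not
     supported by the processor.  */
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return (ebx & CPUID_LEAF7_EBX_ERMS) != 0;
#else
  return false;  /* not an x86 build */
#endif
}
```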
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
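The decoding described above can be sketched as follows. This is an illustration based on the standard CPUID leaf 1 EAX layout, not a copy of get_common_indeces; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct cpu_id
{
  unsigned int family;
  unsigned int model;
  unsigned int extended_model;
};

/* Decode family and model from the EAX value of CPUID leaf 1.  */
struct cpu_id
decode_family_model (uint32_t eax)
{
  struct cpu_id id;

  id.family = (eax >> 8) & 0x0f;           /* base family, bits 11:8 */
  id.model = (eax >> 4) & 0x0f;            /* base model, bits 7:4 */
  id.extended_model = (eax >> 12) & 0xf0;  /* bits 19:16, pre-shifted to the high nibble */
  if (id.family == 0x0f)
    {
      id.family += (eax >> 20) & 0xff;     /* extended family, bits 27:20 */
      id.model += id.extended_model;
    }
  return id;
}
```

For example, an EAX value of 0x00660F01 decodes to family 0x15 and model 0x60 under this scheme.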
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of cpuid
    array on feature array, especially in assembly codes.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
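The build-time check relies on token pasting: each macro only ever forms names with its own prefix. A minimal C sketch of the idea follows, with hypothetical bit positions, a simplified struct, and macros that take an explicit pointer argument (the real glibc macros do not).

```c
/* Hypothetical bits and indices, chosen only for this sketch.  */
#define bit_cpu_SSSE3              (1u << 9)
#define index_cpu_SSSE3            0
#define bit_arch_Fast_Rep_String   (1u << 0)
#define index_arch_Fast_Rep_String 0

struct cpu_features_sketch
{
  unsigned int cpuid[1];    /* raw CPUID bits */
  unsigned int feature[1];  /* derived tuning bits */
};

#define HAS_CPU_FEATURE(cf, name) \
  (((cf)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
  (((cf)->feature[index_arch_##name] & bit_arch_##name) != 0)

/* HAS_CPU_FEATURE (cf, Fast_Rep_String) would expand to the undefined
   names index_cpu_Fast_Rep_String and bit_cpu_Fast_Rep_String, so the
   mistake is rejected by the compiler instead of silently testing the
   wrong array.  */
```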
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
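The selection order above can be expressed as a plain C function. This is a sketch for illustration only: the real selector is an IFUNC resolver written in assembly, and the struct, its field names, and the use of strings in place of function addresses are assumptions made here.

```c
#include <stdbool.h>

/* Hypothetical flat view of the feature bits the selector tests.  */
struct memcpy_features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the implementation chosen by the selection
   order listed in the commit message.  */
const char *
select_memcpy (const struct memcpy_features *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Note that Slow_BSF no longer appears anywhere in the decision, which is the point of the fix.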
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows improvement up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT will map to lower 2GB, not
    lower 4GB, address.  Prefer_MAP_32BIT_EXEC reduces bits available for
    address space layout randomization (ASLR), which is always disabled for
    SUID programs and can only be enabled by setting environment variable,
    LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 15 cvs-commit@gcc.gnu.org 2016-04-07 19:44:45 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created
        at  2a1cca399be415d6c5a556af2018e5fb726d9a37 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2a1cca399be415d6c5a556af2018e5fb726d9a37

commit 2a1cca399be415d6c5a556af2018e5fb726d9a37
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b361e72f264a06e856d97cbbf1cedbf2f7dd73bf

commit b361e72f264a06e856d97cbbf1cedbf2f7dd73bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c97c370612496379176be8e33c19dc4f80b7f01c

commit c97c370612496379176be8e33c19dc4f80b7f01c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=121270b79236d7c5802e8d9af2d27952cb9efae9

commit 121270b79236d7c5802e8d9af2d27952cb9efae9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Using non-temporal
    stores in memcpy on large data can improve performance significantly.  This
    patch adds a threshold to use non temporal store which is 6 times of
    shared cache size.  When size is above the threshold, non temporal
    store will be used.
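The threshold rule described above amounts to a simple comparison. The sketch below assumes the shared cache size is already known in bytes; the function names are hypothetical and the real code keeps the value in __x86_shared_non_temporal_threshold.

```c
#include <stdbool.h>
#include <stddef.h>

/* Threshold as described in the commit message: 6 times the shared
   cache size.  */
size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* True when a copy of n bytes is above the threshold and would
   therefore use non-temporal stores.  */
bool
use_non_temporal_store (size_t n, size_t shared_cache_size)
{
  return n > non_temporal_threshold (shared_cache_size);
}
```

With an 8 MiB shared cache, for instance, only copies larger than 48 MiB would take the non-temporal path under this rule.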
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d

commit 0932dd8b56db46dd421a4855fb5dee9de092538d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d

commit da2da79262814ba4ead3ee487549949096d8ad2d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb

commit 9a93bdbaff81edf67c5486c84f8098055e355abb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483

commit 5118e532600549ad0f56cb9b1a179b8eab70c483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292

commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc

commit a96379797a7eecc1b709cad7b68981eb698783dc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times the vector register size and REP_STOSB_THRESHOLD, which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
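
Key feature 1 of the commit above ("Use overlapping store to avoid branch") can be illustrated with a minimal C sketch. This is not the glibc assembly: the 16-byte VEC_SIZE and the helper name fill_vec_to_2x_vec are assumptions chosen for illustration, and a byte array stands in for a vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed vector width for illustration: 16 bytes, as with SSE2.  */
enum { VEC_SIZE = 16 };

/* Hypothetical helper: fill n bytes at dst with byte c, for
   VEC_SIZE <= n <= 2 * VEC_SIZE.  One VEC_SIZE-wide store is anchored
   at the start and one at the end.  They overlap whenever
   n < 2 * VEC_SIZE, so no branch on the exact length is needed.  */
static void
fill_vec_to_2x_vec (unsigned char *dst, int c, size_t n)
{
  unsigned char vec[VEC_SIZE];

  assert (n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  memset (vec, c, VEC_SIZE);                   /* stand-in for a broadcast */
  memcpy (dst, vec, VEC_SIZE);                 /* first VEC_SIZE bytes */
  memcpy (dst + n - VEC_SIZE, vec, VEC_SIZE);  /* last VEC_SIZE bytes */
}
```

The same pair of stores handles every length from VEC_SIZE to 2 * VEC_SIZE; bytes in the overlap are simply written twice with the same value.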

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times the vector register size and REP_MOVSB_THRESHOLD, which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
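
Key features 1 and 2 of the commit above (overlapping loads and stores, and loading all of the source into registers before storing) are what make small copies safe without an overlap check. A minimal C sketch follows, assuming a 16-byte VEC_SIZE and a hypothetical helper name; local byte arrays stand in for vector registers, and only the VEC_SIZE to 2 * VEC_SIZE case is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed vector width for illustration: 16 bytes, as with SSE2.  */
enum { VEC_SIZE = 16 };

/* Hypothetical helper: move n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE.
   Both loads complete before either store is issued, so the result is
   correct even when [src, src + n) and [dst, dst + n) overlap.  */
static void
move_vec_to_2x_vec (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[VEC_SIZE], tail[VEC_SIZE];

  assert (n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  memcpy (head, src, VEC_SIZE);                 /* load first VEC_SIZE bytes */
  memcpy (tail, src + n - VEC_SIZE, VEC_SIZE);  /* load last VEC_SIZE bytes */
  memcpy (dst, head, VEC_SIZE);                 /* store first VEC_SIZE bytes */
  memcpy (dst + n - VEC_SIZE, tail, VEC_SIZE);  /* store last VEC_SIZE bytes */
}
```

Because no store happens until the whole source range has been read, the direction of the overlap does not matter for this size class.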

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
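
For reference, the ERMS feature bit is reported in CPUID leaf 7, subleaf 0, register EBX, bit 9. The following C sketch shows one way to test it from user code. It is not the glibc cpu-features code: the helper names are hypothetical, and it assumes a GCC or Clang toolchain whose <cpuid.h> provides __get_cpuid_count.

```c
#include <assert.h>
#include <stdint.h>

/* ERMS: CPUID.(EAX=7, ECX=0):EBX, bit 9.  */
#define ERMS_EBX_BIT 9

/* Hypothetical helper: decode the ERMS bit from a leaf-7 EBX value.  */
static int
leaf7_ebx_has_erms (uint32_t ebx)
{
  return (ebx >> ERMS_EBX_BIT) & 1;
}

#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
# include <cpuid.h>

/* Query the running CPU.  Returns 0 if leaf 7 is not supported.  */
static int
cpu_has_erms (void)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return leaf7_ebx_has_erms (ebx);
}
#else
/* Fallback for other targets: ERMS is an x86 feature.  */
static int
cpu_has_erms (void)
{
  return 0;
}
#endif
```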

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229

commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6

commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date:   Fri Mar 25 11:11:42 2016 +0100

    tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
    
    	[BZ #19860]
    	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
    	zero if the compiler does not provide the AVX512F bit.
    
    (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa

commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483

commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4

commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to mistakenly use bits and indices
    of the cpuid array on the feature array, especially in assembly code.
    For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
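
The build-time check described in the commit above relies on token pasting: once every bit name carries the array it belongs to, using a name with the wrong accessor expands to an undefined identifier. A reduced C sketch of the idea follows. The struct layout, its field names and the bit values are hypothetical and much simpler than glibc's struct cpu_features.

```c
#include <assert.h>

/* Hypothetical, much reduced feature record: one word of raw CPUID
   bits and one word of derived "arch" preference bits.  */
struct cpu_features_sketch
{
  unsigned int cpuid_word;
  unsigned int arch_word;
};

/* Each bit name says which word it belongs to.  The bit positions
   here are illustrative only.  */
#define bit_cpu_SSSE3             (1u << 9)
#define bit_arch_Fast_Rep_String  (1u << 0)

#define HAS_CPU_FEATURE(f, name) \
  (((f)->cpuid_word & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(f, name) \
  (((f)->arch_word & bit_arch_##name) != 0)

/* HAS_CPU_FEATURE (f, Fast_Rep_String) would expand to the undefined
   identifier bit_cpu_Fast_Rep_String and fail to compile, which is
   the class of mistake the renaming is meant to catch.  */
```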

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9

commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date:   Tue Mar 8 12:31:13 2016 -0800

    Fix tst-audit10 build when -mavx512f is not supported.
    
    (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c

commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Mar 7 16:00:25 2016 +0100

    tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]
    
    This ensures that GCC will not use unsupported instructions before
    the run-time check to ensure support.
    
    (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e

commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29

commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
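
The five-step selection order in the commit above can be restated as a plain C decision function. This is only a sketch of the ordering: the real selector is the assembly IFUNC resolver in sysdeps/x86_64/multiarch/memcpy.S, and the struct, enum and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the feature bits named in the commit.  */
struct memcpy_features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* One enumerator per implementation listed in the commit message.  */
enum memcpy_impl
{
  MEMCPY_AVX_UNALIGNED,   /* __memcpy_avx_unaligned */
  MEMCPY_SSE2_UNALIGNED,  /* __memcpy_sse2_unaligned */
  MEMCPY_SSE2,            /* __memcpy_sse2 */
  MEMCPY_SSSE3_BACK,      /* __memcpy_ssse3_back */
  MEMCPY_SSSE3            /* __memcpy_ssse3 */
};

/* Apply the checks in the order given in the commit message.  */
static enum memcpy_impl
select_memcpy (const struct memcpy_features *f)
{
  assert (f != NULL);
  if (f->avx_fast_unaligned_load)
    return MEMCPY_AVX_UNALIGNED;
  if (f->fast_unaligned_load)
    return MEMCPY_SSE2_UNALIGNED;
  if (!f->ssse3)
    return MEMCPY_SSE2;
  if (f->fast_copy_backward)
    return MEMCPY_SSSE3_BACK;
  return MEMCPY_SSSE3;
}
```

Note that Slow_BSF does not appear anywhere in the decision, which is the point of the fix.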

-----------------------------------------------------------------------
Comment 16 cvs-commit@gcc.gnu.org 2016-04-07 23:42:07 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  8b65cadefc53cc42e1970e0817336fe96a7aa396 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8b65cadefc53cc42e1970e0817336fe96a7aa396

commit 8b65cadefc53cc42e1970e0817336fe96a7aa396
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c4b1dec2c115ba19192fdb143f25cfc1ac76c94a

commit c4b1dec2c115ba19192fdb143f25cfc1ac76c94a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    	not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d9b78b7787695ab0fddbaabeb3ef07c730e94a4

commit 1d9b78b7787695ab0fddbaabeb3ef07c730e94a4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2e9ef7960d01bc9bb36f2f3e7c9c567f11e56da9

commit 2e9ef7960d01bc9bb36f2f3e7c9c567f11e56da9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Non-temporal store in
    memcpy on large data can improve performance significantly.  This
    patch adds a threshold to use non-temporal store which is 6 times of
    shared cache size.  When size is above the threshold, non-temporal
    store will be used.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold.
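
The threshold rule stated in the commit above is simple arithmetic, sketched here in C. The function names are hypothetical; in glibc the value is held in __x86_shared_non_temporal_threshold and set by init_cacheinfo, and the comparison itself is done in the assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold as described in the commit: 6 times the shared cache size.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  /* Guard against overflow in the multiplication.  */
  assert (shared_cache_size <= (size_t) -1 / 6);
  return shared_cache_size * 6;
}

/* Non-temporal stores are used only when size is above the threshold.  */
static int
use_non_temporal_store (size_t size, size_t shared_cache_size)
{
  return size > non_temporal_threshold (shared_cache_size);
}
```

With an assumed 8 MiB shared cache, the threshold works out to 48 MiB, so only copies larger than that would take the non-temporal path.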

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same codes when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
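The overlapping-store idea from the "Key features" list above can be sketched in C. This is only an illustration, not the code from memset-vec-unaligned-erms.S: `VEC_SIZE`, `store_vec` and `set_vec_to_2vec` are hypothetical names, and a 16-byte `memset` call stands in for one unaligned vector store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* assumed: SSE2 vector register width in bytes */

/* Stand-in for one unaligned vector store (hypothetical helper).  */
static void store_vec(unsigned char *p, int c)
{
  memset(p, c, VEC_SIZE);
}

/* Fill n bytes, with VEC_SIZE <= n <= 2 * VEC_SIZE, using two stores
   and no branch on the exact size: the second store is anchored at the
   end of the buffer and overlaps the first one.  */
static void set_vec_to_2vec(unsigned char *dst, int c, size_t n)
{
  assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  store_vec(dst, c);
  store_vec(dst + n - VEC_SIZE, c);
}
```

Bytes in the overlapping region are simply written twice, which costs less than branching on the remaining length.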

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same codes when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up by large vector
    register size.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
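Key feature 2 above (load all sources into registers, then store them together) is what lets small copies skip the overlap check. A minimal C sketch of that idea for sizes between one and two vectors follows. It is not the assembly from memmove-vec-unaligned-erms.S: `vec_t`, `load_vec`, `store_vec` and `move_vec_to_2vec` are hypothetical names, and a 16-byte struct stands in for a vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* assumed: SSE2 vector register width in bytes */

/* Stand-in for a vector register (hypothetical type).  */
typedef struct { unsigned char b[VEC_SIZE]; } vec_t;

static vec_t load_vec(const unsigned char *p)
{
  vec_t v;
  memcpy(v.b, p, VEC_SIZE);
  return v;
}

static void store_vec(unsigned char *p, vec_t v)
{
  memcpy(p, v.b, VEC_SIZE);
}

/* Move n bytes, with VEC_SIZE <= n <= 2 * VEC_SIZE.  Both loads finish
   before the first store, so the result is correct even when source
   and destination overlap, and no overlap check is needed.  */
static void move_vec_to_2vec(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
  assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  vec_t head = load_vec(src);
  vec_t tail = load_vec(src + n - VEC_SIZE);
  store_vec(dst, head);
  store_vec(dst + n - VEC_SIZE, tail);
}
```

The same pattern extends to 8 vectors by loading 8 temporaries first. Beyond that size the registers run out, which is where the commit message's forward and backward loops take over.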

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
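For orientation, the ERMS feature bit is reported in CPUID leaf 7 (subleaf 0), register EBX, bit 9. The sketch below only shows the bit test. The struct and function names are hypothetical and are not the glibc `bit_cpu_ERMS`/`index_cpu_ERMS`/`reg_ERMS` definitions, and the caller is assumed to have already executed CPUID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the four registers returned by CPUID.  */
struct cpuid_regs {
  uint32_t eax, ebx, ecx, edx;
};

/* ERMS is bit 9 of EBX for CPUID leaf 7, subleaf 0.  */
#define SKETCH_BIT_ERMS (UINT32_C(1) << 9)

/* Return true if the leaf 7 register values advertise ERMS.  */
static bool has_erms(const struct cpuid_regs *leaf7)
{
  assert(leaf7 != NULL);
  return (leaf7->ebx & SKETCH_BIT_ERMS) != 0;
}
```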

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.
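The family and model update described above can be sketched as follows, using the standard CPUID leaf 1 EAX layout (model in bits 7:4, family in bits 11:8, extended model in bits 19:16, extended family in bits 27:20). `decode_family_model` is a hypothetical name and this is not the body of get_common_indeces; it applies the extended fields only when family == 0x0f, as the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode family, model and extended model from CPUID leaf 1 EAX.
   The extended model is returned already shifted into bits 7:4 so it
   can be added directly to the base model.  */
static void decode_family_model(uint32_t eax, unsigned *family,
                                unsigned *model, unsigned *extended_model)
{
  assert(family != NULL && model != NULL && extended_model != NULL);
  *family = (eax >> 8) & 0x0f;
  *model = (eax >> 4) & 0x0f;
  *extended_model = (eax >> 12) & 0xf0;
  if (*family == 0x0f) {
    /* Same rule for Intel and AMD once the base family saturates.  */
    *family += (eax >> 20) & 0xff;
    *model += *extended_model;
  }
}
```

For the illustrative signature 0x00660F01 this gives family 0x15 and model 0x60. For 0x000306C3 the base family is 6, so this step leaves family and model unchanged and only reports the extended model.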

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of cpuid
    array on feature array, especially in assembly codes.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
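The build-time check works through token pasting: each accessor macro builds a name that exists only for its own array. The sketch below is a simplified stand-in for cpu-features.h. The struct layout, the index values and the bit chosen for Fast_Rep_String are assumptions made for illustration, and `fast_rep_string_usable` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct cpu_features (assumed layout).  */
struct cpu_features_sketch {
  unsigned int cpuid[1];    /* raw CPUID bits */
  unsigned int feature[1];  /* derived tuning bits */
};

/* A CPUID bit gets only _cpu_ names...  */
#define bit_cpu_SSSE3 (1u << 9)
#define index_cpu_SSSE3 0
/* ...and a tuning bit gets only _arch_ names (value is illustrative).  */
#define bit_arch_Fast_Rep_String (1u << 0)
#define index_arch_Fast_Rep_String 0

#define HAS_CPU_FEATURE(cf, name) \
  (((cf)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
  (((cf)->feature[index_arch_##name] & bit_arch_##name) != 0)

/* HAS_CPU_FEATURE (cf, Fast_Rep_String) would not compile here, since
   index_cpu_Fast_Rep_String and bit_cpu_Fast_Rep_String are undefined.  */
static int fast_rep_string_usable(const struct cpu_features_sketch *cf)
{
  assert(cf != NULL);
  return HAS_ARCH_FEATURE(cf, Fast_Rep_String);
}
```

With the old unprefixed names, the wrong macro expanded to a valid expression that silently tested a bit in the wrong array.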

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  Existing selection
    order is updated with following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
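The five-step selection order in this commit message reads as a simple priority chain. The C sketch below restates it for clarity; the real selector is the assembly IFUNC resolver in sysdeps/x86_64/multiarch/memcpy.S, and the struct, enum and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the feature bits the selector tests.  */
struct memcpy_features {
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

enum memcpy_impl {
  MEMCPY_AVX_UNALIGNED,
  MEMCPY_SSE2_UNALIGNED,
  MEMCPY_SSE2,
  MEMCPY_SSSE3_BACK,
  MEMCPY_SSSE3
};

/* Apply the selection order from the commit message, first match wins.  */
static enum memcpy_impl select_memcpy(const struct memcpy_features *f)
{
  assert(f != NULL);
  if (f->avx_fast_unaligned_load)
    return MEMCPY_AVX_UNALIGNED;   /* 1 */
  if (f->fast_unaligned_load)
    return MEMCPY_SSE2_UNALIGNED;  /* 2 */
  if (!f->ssse3)
    return MEMCPY_SSE2;            /* 3 */
  if (f->fast_copy_backward)
    return MEMCPY_SSSE3_BACK;      /* 4 */
  return MEMCPY_SSSE3;             /* 5 */
}
```

Note that Slow_BSF no longer appears anywhere in the chain, which is the defect reported in this bug.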

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows average improvement more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows improvement up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT will map to lower 2GB, not
    lower 4GB, address.  Prefer_MAP_32BIT_EXEC reduces bits available for
    address space layout randomization (ASLR), which is always disabled for
    SUID programs and can only be enabled by setting environment variable,
    LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 17 cvs-commit@gcc.gnu.org 2016-04-08 18:37:40 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  09104b0b6fc150112f5e282c096f739a2f49fb6e (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=09104b0b6fc150112f5e282c096f739a2f49fb6e

commit 09104b0b6fc150112f5e282c096f739a2f49fb6e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ec598c844faca4fc87e8c1ec067c94109ba58402

commit ec598c844faca4fc87e8c1ec067c94109ba58402
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ff9d413f34efc46e4160ee4a3b30ddc04fb37518

commit ff9d413f34efc46e4160ee4a3b30ddc04fb37518
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=76945cf3a33403b5dff551d48cb68a6729848740

commit 76945cf3a33403b5dff551d48cb68a6729848740
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Non-temporal stores
    in memcpy on large data can improve performance significantly.  This
    patch adds a threshold for using non-temporal stores, which is 6 times
    the shared cache size.  When the size is above the threshold,
    non-temporal stores are used, but they are avoided if the destination
    and source overlap, since the destination may be in cache when the
    source is loaded.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.
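
The decision described in the commit above can be sketched in C. This is only an illustration under stated assumptions: the helper names (non_temporal_threshold, regions_overlap, use_non_temporal_store) are hypothetical and do not appear in glibc, where the logic lives in assembly and the threshold is the variable __x86_shared_non_temporal_threshold.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __x86_shared_non_temporal_threshold: the
   commit sets the real value to 6 times the shared cache size.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* True if [dst, dst + n) and [src, src + n) share at least one byte.  */
static bool
regions_overlap (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;
  return d < s ? s - d < n : d - s < n;
}

/* Use non-temporal stores only for copies above the threshold whose
   destination does not overlap the source.  */
static bool
use_non_temporal_store (const void *dst, const void *src, size_t n,
                        size_t shared_cache_size)
{
  return n > non_temporal_threshold (shared_cache_size)
         && !regions_overlap (dst, src, n);
}
```

The overlap test comes second because it only matters for sizes that already exceed the threshold.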

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove because doing so breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same codes when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
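
Key feature 1 above (overlapping stores to avoid a branch) can be illustrated with a small C sketch. It assumes a 16-byte vector and covers only sizes from one to two vector widths. The names VEC_SIZE, store_vec and fill_vec_to_2x_vec are hypothetical, and store_vec merely stands in for a single unaligned vector store; the real implementation is assembly.

```c
#include <stddef.h>
#include <string.h>

enum { VEC_SIZE = 16 };  /* assumed vector register width in bytes */

/* Write VEC_SIZE copies of the fill byte at p.  Stands in for one
   unaligned vector store.  */
static void
store_vec (unsigned char *p, unsigned char c)
{
  unsigned char v[VEC_SIZE];
  for (int i = 0; i < VEC_SIZE; i++)
    v[i] = c;
  memcpy (p, v, VEC_SIZE);
}

/* Fill n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE.  One store covers
   the first VEC_SIZE bytes and one covers the last VEC_SIZE bytes.  The
   two stores overlap in the middle when n < 2 * VEC_SIZE, so no branch
   on the exact size is needed.  */
static void
fill_vec_to_2x_vec (unsigned char *dst, unsigned char c, size_t n)
{
  store_vec (dst, c);
  store_vec (dst + n - VEC_SIZE, c);
}
```

Writing some middle bytes twice is harmless for memset, since both stores write the same value.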

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same codes when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up by large vector
    register size.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
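
Key features 1 and 2 above explain why small sizes need no overlap check: every load completes before any store is issued. A minimal C sketch, assuming a 16-byte vector and sizes from one to two vector widths, is shown here. The names VEC_SIZE and move_vec_to_2x_vec are hypothetical, and the local arrays stand in for vector registers.

```c
#include <stddef.h>
#include <string.h>

enum { VEC_SIZE = 16 };  /* assumed vector register width in bytes */

/* Move n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE.  The head and the
   tail of the source are both loaded before anything is stored, so the
   result is correct even when dst and src overlap.  The head and tail
   overlap each other when n < 2 * VEC_SIZE, which removes any branch
   on the exact size.  */
static void
move_vec_to_2x_vec (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[VEC_SIZE];
  unsigned char tail[VEC_SIZE];

  memcpy (head, src, VEC_SIZE);                 /* load first vector */
  memcpy (tail, src + n - VEC_SIZE, VEC_SIZE);  /* load last vector */
  memcpy (dst, head, VEC_SIZE);                 /* store first vector */
  memcpy (dst + n - VEC_SIZE, tail, VEC_SIZE);  /* store last vector */
}
```

The commit applies the same idea with more registers for sizes up to 8 times the vector register size.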

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.
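
For illustration, the family and model decoding described in the commit above can be sketched in C from the standard layout of CPUID leaf 1 EAX. The struct and function names are hypothetical, and this is not the body of get_common_indeces. Following the commit message, the extended fields are applied only when family == 0x0f.

```c
#include <stdint.h>

struct cpu_signature
{
  unsigned int family;
  unsigned int model;
  unsigned int extended_model;  /* kept pre-shifted into bits 7:4 */
};

/* Decode family and model from the EAX value of CPUID leaf 1.  */
static struct cpu_signature
decode_cpuid_eax (uint32_t eax)
{
  struct cpu_signature s;

  s.family = (eax >> 8) & 0x0f;           /* bits 11:8 */
  s.model = (eax >> 4) & 0x0f;            /* bits 7:4 */
  s.extended_model = (eax >> 12) & 0xf0;  /* bits 19:16, shifted to 7:4 */

  if (s.family == 0x0f)
    {
      s.family += (eax >> 20) & 0xff;     /* extended family, bits 27:20 */
      s.model += s.extended_model;
    }
  return s;
}
```

For example, an EAX value of 0x00600F12 decodes to family 0x15 and model 0x01.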

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of cpuid
    array on feature array, especially in assembly codes.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
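
The build-time check described in the commit above relies on token pasting. A simplified C sketch follows. The struct layout and the bit values are illustrative assumptions rather than the real cpu-features.h definitions, and the macros take an explicit pointer argument that the glibc macros do not.

```c
#include <stdbool.h>

/* Simplified stand-in for struct cpu_features: one word per array.  */
struct cpu_features_sketch
{
  unsigned int cpuid[1];    /* raw CPUID bits */
  unsigned int feature[1];  /* derived "arch" preference bits */
};

/* Each feature is defined for exactly one of the two arrays.  */
#define index_cpu_ERMS 0
#define bit_cpu_ERMS (1u << 9)
#define index_arch_Fast_Rep_String 0
#define bit_arch_Fast_Rep_String (1u << 0)

#define HAS_CPU_FEATURE(f, name) \
  (((f)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(f, name) \
  (((f)->feature[index_arch_##name] & bit_arch_##name) != 0)

/* HAS_CPU_FEATURE (f, Fast_Rep_String) would expand to the undefined
   name index_cpu_Fast_Rep_String and fail to compile, which is the
   error the rename is meant to catch.  */
static bool
has_fast_rep_string (const struct cpu_features_sketch *f)
{
  return HAS_ARCH_FEATURE (f, Fast_Rep_String);
}

static bool
has_erms (const struct cpu_features_sketch *f)
{
  return HAS_CPU_FEATURE (f, ERMS);
}
```

Before the rename both macros pasted the same index_ and bit_ prefixes, so the wrong macro compiled and tested a bit in the wrong array.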

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.
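
The difference between assigning and OR-ing the bit, as fixed in the commit above, can be shown with a short C sketch. The bit position and the function names are assumptions made for illustration; only the name bit_Prefer_MAP_32BIT_EXEC comes from the commit.

```c
/* Illustrative bit position; the real value is defined in
   sysdeps/x86/cpu-features.h.  */
#define bit_Prefer_MAP_32BIT_EXEC (1u << 16)

/* Buggy form: assignment discards every other bit in the word.  */
static unsigned int
set_by_assignment (unsigned int feature_word)
{
  feature_word = bit_Prefer_MAP_32BIT_EXEC;
  return feature_word;
}

/* Fixed form: OR keeps the bits that were already set.  */
static unsigned int
set_by_or (unsigned int feature_word)
{
  feature_word |= bit_Prefer_MAP_32BIT_EXEC;
  return feature_word;
}
```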

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  Existing selection
    order is updated with following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
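
The selection order listed in the commit above can be written as a plain C decision chain. This is a sketch only: the real selector is the assembly IFUNC resolver in memcpy.S, and the struct, its fields and the function name here are hypothetical. The function returns the implementation name as a string so that the order is easy to check.

```c
#include <stdbool.h>

/* Hypothetical flags standing in for the feature bits named in the
   commit message.  */
struct memcpy_features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the memcpy variant chosen by the updated order.  */
static const char *
select_memcpy (const struct memcpy_features *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Slow_BSF does not appear in the chain, which matches the point of the bug report: that bit is unrelated to any memcpy implementation.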

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows average improvement more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows improvement up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT will map to lower 2GB, not
    lower 4GB, address.  Prefer_MAP_32BIT_EXEC reduces bits available for
    address space layout randomization (ASLR), which is always disabled for
    SUID programs and can only be enabled by setting environment variable,
    LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 18 cvs-commit@gcc.gnu.org 2016-04-08 19:18:56 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  157c57198e893b4882d1feb98de2b0721ee408fc (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc

commit 157c57198e893b4882d1feb98de2b0721ee408fc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b

commit f817b9d36215ab60d58cc744d22773b4961a2c9b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0

commit 122600f4b380b00ce0f682039fe59af4bd0edbc0
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3

commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Using non-temporal
    store in memcpy on large data can improve performance significantly.
    This patch adds a threshold for non-temporal store, set to 6 times the
    shared cache size.  When size is above the threshold, non-temporal
    store is used, except when destination and source overlap, since the
    destination may be in cache when the source is loaded.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same codes when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same codes when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up by large vector
    register size.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to mistakenly use bits and indices
    of cpuid array on feature array, especially in assembly codes.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  Existing selection
    order is updated with following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
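
The selection order listed in this commit can be read as a chain of early returns. The C sketch below is only an illustration: the real selector is written in assembly in sysdeps/x86_64/multiarch/memcpy.S, and the struct copy_features type and select_memcpy function are hypothetical names introduced here.

```c
#include <stdbool.h>

/* Hypothetical feature flags.  The real selector tests bits in glibc's
   cpu_features data from assembly; this struct exists only for the sketch.  */
struct copy_features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the implementation that the five-step order in the
   commit message would choose for the given flags.  */
const char *
select_memcpy (const struct copy_features *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";   /* step 1 */
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";  /* step 2 */
  if (!f->ssse3)
    return "__memcpy_sse2";            /* step 3 */
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";      /* step 4 */
  return "__memcpy_ssse3";             /* step 5 */
}
```

Writing the order this way makes it easy to see that Slow_BSF plays no part in the decision.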

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows improvement up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT will map to lower 2GB, not
    lower 4GB, address.  Prefer_MAP_32BIT_EXEC reduces bits available for
    address space layout randomization (ASLR), which is always disabled for
    SUID programs and can only be enabled by setting environment variable,
    LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 19 cvs-commit@gcc.gnu.org 2016-04-08 20:33:04 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created
        at  9e1ddc1180ca0619d12b620b227726233a48b9bc (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9e1ddc1180ca0619d12b620b227726233a48b9bc

commit 9e1ddc1180ca0619d12b620b227726233a48b9bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3443d7810db1092ac70a0fde7b85732a2e00cdc3

commit 3443d7810db1092ac70a0fde7b85732a2e00cdc3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d2a372d44dc05201242d0fd5551df9c3174806c

commit 1d2a372d44dc05201242d0fd5551df9c3174806c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fa066d5f5ff996990869bbbad08435f02d18bb3

commit 9fa066d5f5ff996990869bbbad08435f02d18bb3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Using non-temporal
    stores in memcpy on large data can improve performance significantly.
    This patch adds a threshold for using non-temporal stores, set to 6
    times the shared cache size.  When the size is above the threshold,
    non-temporal stores are used, except when the destination and source
    overlap, since the destination may be in cache when the source is
    loaded.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.
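
The decision described in this commit (size above 6 times the shared cache size, and no overlap between destination and source) can be sketched in C as follows. This is a reading of the commit message only; the function names are hypothetical, and the actual check is done in assembly in memmove-vec-unaligned-erms.S and may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold as described in the commit message: 6 times the shared
   cache size.  */
size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* True if [dst, dst + n) and [src, src + n) share at least one byte.  */
bool
regions_overlap (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;
  return d < s ? s - d < n : d - s < n;
}

/* Non-temporal stores are used only for copies above the threshold whose
   source and destination do not overlap.  */
bool
use_non_temporal_store (const void *dst, const void *src, size_t n,
                        size_t threshold)
{
  return n > threshold && !regions_overlap (dst, src, n);
}
```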

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d

commit 0932dd8b56db46dd421a4855fb5dee9de092538d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d

commit da2da79262814ba4ead3ee487549949096d8ad2d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb

commit 9a93bdbaff81edf67c5486c84f8098055e355abb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483

commit 5118e532600549ad0f56cb9b1a179b8eab70c483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292

commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc

commit a96379797a7eecc1b709cad7b68981eb698783dc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since doing so breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same codes when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
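
Key feature 1 of this commit, using an overlapping store to avoid a branch, can be illustrated with a small C sketch. It assumes a 16-byte vector width and covers only sizes from one to two vector widths; store_vec and set_vec_to_2x_vec are hypothetical helpers that stand in for the vector instructions used by the real assembly.

```c
#include <stddef.h>
#include <string.h>

enum { VEC_SIZE = 16 };  /* Assumed vector width (SSE2-sized).  */

/* Stand-in for one unaligned vector store of VEC_SIZE bytes.  */
static void
store_vec (unsigned char *p, const unsigned char vec[VEC_SIZE])
{
  memcpy (p, vec, VEC_SIZE);
}

/* Fill n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE, with exactly two
   stores.  The second store is anchored at the end of the buffer and
   overlaps the first, so no branch on the exact size is needed.  */
void
set_vec_to_2x_vec (unsigned char *dst, int c, size_t n)
{
  unsigned char vec[VEC_SIZE];
  memset (vec, c, VEC_SIZE);            /* Broadcast the byte.  */
  store_vec (dst, vec);                 /* First VEC_SIZE bytes.  */
  store_vec (dst + n - VEC_SIZE, vec);  /* Last VEC_SIZE bytes.  */
}
```

Some bytes in the middle are written twice, which is harmless because both stores write the same value.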

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same codes when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up by large vector
    register size.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
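
Key features 1 and 2 of this commit explain why small sizes need no overlap check: every load completes before the first store. The C sketch below shows the idea for sizes from one to two vector widths only, assuming a 16-byte vector; move_vec_to_2x_vec is a hypothetical name, and the temporary arrays stand in for vector registers.

```c
#include <stddef.h>
#include <string.h>

enum { VEC_SIZE = 16 };  /* Assumed vector width (SSE2-sized).  */

/* Move n bytes, where VEC_SIZE <= n <= 2 * VEC_SIZE.  Both loads happen
   before either store, so the result is correct even when source and
   destination overlap.  The two vectors themselves may overlap, which
   removes the need to branch on the exact size.  */
void
move_vec_to_2x_vec (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[VEC_SIZE];
  unsigned char tail[VEC_SIZE];

  memcpy (head, src, VEC_SIZE);                 /* Load first vector.  */
  memcpy (tail, src + n - VEC_SIZE, VEC_SIZE);  /* Load last vector.  */
  memcpy (dst, head, VEC_SIZE);                 /* Store first vector.  */
  memcpy (dst + n - VEC_SIZE, tail, VEC_SIZE);  /* Store last vector.  */
}
```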

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
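
For reference, the ERMS feature bit is reported in CPUID leaf 7, subleaf 0, register EBX, bit 9. The sketch below reads it with the compiler's <cpuid.h> helper rather than with the glibc macros added by this commit; cpu_has_erms and ERMS_EBX_BIT are hypothetical names, and the code assumes a GCC or Clang version that provides __get_cpuid_count.

```c
#include <stdbool.h>
#if defined (__x86_64__) || defined (__i386__)
# include <cpuid.h>
#endif

/* ERMS: CPUID.(EAX=7, ECX=0):EBX bit 9.  */
#define ERMS_EBX_BIT (1u << 9)

/* Return true if the processor reports Enhanced REP MOVSB/STOSB.
   Always returns false on non-x86 targets.  */
bool
cpu_has_erms (void)
{
#if defined (__x86_64__) || defined (__i386__)
  unsigned int eax, ebx, ecx, edx;

  /* __get_cpuid_count returns 0 if leaf 7 is not supported.  */
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return (ebx & ERMS_EBX_BIT) != 0;
#else
  return false;
#endif
}
```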

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229

commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6

commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date:   Fri Mar 25 11:11:42 2016 +0100

    tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
    
    	[BZ #19860]
    	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
    	zero if the compiler does not provide the AVX512F bit.
    
    (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa

commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483

commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4

commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of cpuid
    array on feature array, especially in assembly codes.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such error at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
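
The build-time check described in this commit relies on token pasting: each accessor pastes its own prefix onto the feature name, so a name defined for only one array cannot be looked up in the other. A minimal sketch in C, assuming a hypothetical two-array `struct features` and illustrative bit values (not glibc's actual layout or macro signatures):

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct cpu_features: one array of raw CPUID
   bits and one array of derived "arch" feature bits.  */
struct features
{
  unsigned int cpu[1];
  unsigned int arch[1];
};

/* Illustrative values only; the real definitions live in
   sysdeps/x86/cpu-features.h.  */
#define index_cpu_SSSE3             0
#define bit_cpu_SSSE3               (1u << 9)
#define index_arch_Fast_Rep_String  0
#define bit_arch_Fast_Rep_String    (1u << 0)

/* Each accessor pastes its own prefix onto the feature name.  */
#define HAS_CPU_FEATURE(f, name) \
  (((f)->cpu[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(f, name) \
  (((f)->arch[index_arch_##name] & bit_arch_##name) != 0)

static bool
has_fast_rep_string (const struct features *f)
{
  /* HAS_CPU_FEATURE (f, Fast_Rep_String) would not compile here,
     because index_cpu_Fast_Rep_String is never defined.  */
  return HAS_ARCH_FEATURE (f, Fast_Rep_String);
}
```

With the old unprefixed names, the mistaken `HAS_CPU_FEATURE (Fast_Rep_String)` compiled silently and tested an unrelated bit in the wrong array.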

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9

commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date:   Tue Mar 8 12:31:13 2016 -0800

    Fix tst-audit10 build when -mavx512f is not supported.
    
    (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c

commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Mar 7 16:00:25 2016 +0100

    tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]
    
    This ensures that GCC will not use unsupported instructions before
    the run-time check to ensure support.
    
    (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e

commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29

commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
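
The five-step selection order in this commit can be read as a chain of early returns. The sketch below restates it in C; the real selector is written in assembly in sysdeps/x86_64/multiarch/memcpy.S, and the `struct memcpy_features` flags and `enum memcpy_impl` names here are hypothetical stand-ins for the feature bits and symbols the commit names.

```c
#include <stdbool.h>

/* Hypothetical flags mirroring the feature bits named in the commit.  */
struct memcpy_features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Hypothetical tags, one per memcpy implementation in the list.  */
enum memcpy_impl
{
  MEMCPY_AVX_UNALIGNED,
  MEMCPY_SSE2_UNALIGNED,
  MEMCPY_SSE2,
  MEMCPY_SSSE3_BACK,
  MEMCPY_SSSE3
};

static enum memcpy_impl
select_memcpy (const struct memcpy_features *f)
{
  if (f->avx_fast_unaligned_load)   /* 1. __memcpy_avx_unaligned */
    return MEMCPY_AVX_UNALIGNED;
  if (f->fast_unaligned_load)       /* 2. __memcpy_sse2_unaligned */
    return MEMCPY_SSE2_UNALIGNED;
  if (!f->ssse3)                    /* 3. __memcpy_sse2 */
    return MEMCPY_SSE2;
  if (f->fast_copy_backward)        /* 4. __memcpy_ssse3_back */
    return MEMCPY_SSSE3_BACK;
  return MEMCPY_SSSE3;              /* 5. __memcpy_ssse3 */
}
```

Slow_BSF does not appear anywhere in the chain, which is the point of the fix.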

-----------------------------------------------------------------------
Comment 20 cvs-commit@gcc.gnu.org 2016-06-06 20:37:30 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  b60dda5f2385aaca873069f9fb28645b82a1b711 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b60dda5f2385aaca873069f9fb28645b82a1b711

commit b60dda5f2385aaca873069f9fb28645b82a1b711
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 27 15:16:22 2016 -0700

    Count number of logical processors sharing L2 cache
    
    For Intel processors, when there are both L2 and L3 caches, SMT level
    type should be used to count number of available logical processors
    sharing L2 cache.  If there is only L2 cache, core level type should
    be used to count number of available logical processors sharing L2
    cache.  Number of available logical processors sharing L2 cache should
    be used for non-inclusive L2 and L3 caches.
    
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of
    	available logical processors with SMT level type sharing L2
    	cache for Intel processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed46697862f2b0c2db726cc4c772e6003914bd72

commit ed46697862f2b0c2db726cc4c772e6003914bd72
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 20 14:41:14 2016 -0700

    Remove special L2 cache case for Knights Landing
    
    L2 cache is shared by 2 cores on Knights Landing, which has 4 threads
    per core:
    
    https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing
    
    So L2 cache is shared by 8 threads on Knights Landing as reported by
    CPUID.  We should remove special L2 cache case for Knights Landing.
    
    	[BZ #18185]
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads
    	sharing L2 cache to 2 for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=07f943915311f6f92e5a031911d32c5e7458bfd5

commit 07f943915311f6f92e5a031911d32c5e7458bfd5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 10:02:36 2016 -0700

    Correct Intel processor level type mask from CPUID
    
    Intel CPUID with EAX == 11 returns:
    
    ECX Bits 07 - 00: Level number. Same value in ECX input.
        Bits 15 - 08: Level type.
        ^^^^^^^^^^^^^^^^^^^^^^^^ This is level type.
        Bits 31 - 16: Reserved.
    
    Intel processor level type mask should be 0xff00, not 0xff0.
    
    	[BZ #20119]
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel
    	processor level type mask for CPUID with EAX == 11.
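
The difference between the two masks can be checked with a few lines of C. This sketch only illustrates the bit layout quoted above; the helper names are hypothetical and it does not reproduce the code in init_cacheinfo.

```c
/* Extract the level type (bits 15-8) from the ECX value returned by
   CPUID with EAX == 11, using the corrected mask.  */
static unsigned int
level_type (unsigned int ecx)
{
  return (ecx & 0xff00) >> 8;
}

/* Apply the old mask.  0xff0 selects bits 11-4, which mixes the high
   half of the level number with the low half of the level type.  */
static unsigned int
old_mask_bits (unsigned int ecx)
{
  return ecx & 0xff0;
}
```

For example, with ECX == 0x0201 (level number 1, level type 2), `level_type` yields 2, while the old mask yields 0x200. With a level type of 0x10 or above in the upper nibble only, the old mask would drop it entirely.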

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=201aebf739482fbb730d10eb7cf8335629bb4de4

commit 201aebf739482fbb730d10eb7cf8335629bb4de4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 09:09:00 2016 -0700

    Check the HTT bit before counting logical threads
    
    Skip counting logical threads for Intel processors if the HTT bit is 0
    which indicates there is only a single logical processor.
    
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting
    	logical threads if the HTT bit is 0.
    	* sysdeps/x86/cpu-features.h (bit_cpu_HTT): New.
    	(index_cpu_HTT): Likewise.
    	(reg_HTT): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=dff8bcdab5968ac53e52ef06cabe8d921b429d22

commit dff8bcdab5968ac53e52ef06cabe8d921b429d22
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 08:49:45 2016 -0700

    Remove alignments on jump targets in memset
    
    X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which
    increases code size but does not necessarily improve performance.  The
    memset benchtest data of align vs no align on various Intel and AMD
    processors
    
    https://sourceware.org/bugzilla/attachment.cgi?id=9277
    
    shows that aligning jump targets isn't necessary.
    
    	[BZ #20115]
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset):
    	Remove alignments on jump targets.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aba9d000bf8441d77f0557af360e3aea7525d03e

commit aba9d000bf8441d77f0557af360e3aea7525d03e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 13 08:29:22 2016 -0700

    Call init_cpu_features only if SHARED is defined
    
    In static executable, since init_cpu_features is called early from
    __libc_start_main, there is no need to call it again in dl_platform_init.
    
    	[BZ #20072]
    	* sysdeps/i386/dl-machine.h (dl_platform_init): Call
    	init_cpu_features only if SHARED is defined.
    	* sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6118b2d23016ec790b99b9331c3d7a45d588134e

commit 6118b2d23016ec790b99b9331c3d7a45d588134e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 13 07:18:25 2016 -0700

    Support non-inclusive caches on Intel processors
    
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and support
    	non-inclusive caches on Intel processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8642c9a553d8ce8a3a0496ed11fed5a575d338c5

commit 8642c9a553d8ce8a3a0496ed11fed5a575d338c5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed May 11 05:49:09 2016 -0700

    Remove x86 ifunc-defines.sym and rtld-global-offsets.sym
    
    Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym.  Remove
    x86 ifunc-defines.sym and rtld-global-offsets.sym.  No code changes on
    i686 and x86-64.
    
    	* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
    	Remove ifunc-defines.sym.
    	* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
    	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
    	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
    	* sysdeps/x86/Makefile (gen-as-const-headers): Remove
    	rtld-global-offsets.sym.
    	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
    	* sysdeps/x86/cpu-features-offsets.sym: This.
    	* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
    	instead of <ifunc-defines.h> and <rtld-global-offsets.h>.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3038902f233a5e0028a6424685b410f6c201040f

commit 3038902f233a5e0028a6424685b410f6c201040f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun May 8 08:49:02 2016 -0700

    Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86
    
    Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86.  No code changes on x86
    and x86_64.
    
    	* sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c>
    	instead of <sysdeps/x86_64/cacheinfo.c>.
    	* sysdeps/x86_64/cacheinfo.c: Moved to ...
    	* sysdeps/x86/cacheinfo.c: Here.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=df2b390bba18903d62c8910e808bfb0dce7f033c

commit df2b390bba18903d62c8910e808bfb0dce7f033c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 15 05:22:53 2016 -0700

    Detect Intel Goldmont and Airmont processors
    
    Updated from the model numbers of Goldmont and Airmont processors in
    Intel64 And IA-32 Processor Architectures Software Developer's Manual
    Volume 3 Revision 058.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel
    	Goldmont and Airmont processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc

commit 157c57198e893b4882d1feb98de2b0721ee408fc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b

commit f817b9d36215ab60d58cc744d22773b4961a2c9b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0

commit 122600f4b380b00ce0f682039fe59af4bd0edbc0
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3

commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machine.  Non-temporal store in
    memcpy on large data can improve performance significantly.  This
    patch adds a threshold to use non temporal store which is 6 times of
    shared cache size.  When size is above the threshold, non temporal
    store will be used, but avoid non-temporal store if there is overlap
    between destination and source since destination may be in cache when
    source is loaded.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.
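
The decision described in this commit (size above 6 times the shared cache size, and no overlap between destination and source) can be sketched in C as follows. The real logic lives in assembly in memmove-vec-unaligned-erms.S; the function names and the way the cache size is passed in are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The threshold is 6 times the shared cache size, as stated in the
   commit message.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* True if [dst, dst + n) and [src, src + n) share at least one byte.  */
static bool
regions_overlap (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;
  return d < s ? s - d < n : d - s < n;
}

/* Use non-temporal stores only for copies above the threshold whose
   source and destination do not overlap.  */
static bool
use_non_temporal_store (const void *dst, const void *src, size_t n,
			size_t shared_cache_size)
{
  return n > non_temporal_threshold (shared_cache_size)
	 && !regions_overlap (dst, src, n);
}
```

The overlap test matters because, when the regions overlap, the destination may already be in cache from loading the source, so bypassing the cache would not help.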

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same codes when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
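
Key feature 1, the overlapping store, can be illustrated in portable C. This is a sketch under assumptions: `VEC_SIZE` is fixed at 16, `store_vec` is a hypothetical helper that stands in for one unaligned vector store, and only the case of a size between 1 and 2 times the vector register size is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16	/* Assumption: 16-byte (SSE2) vector registers.  */

/* Stands in for a single unaligned vector store of VEC_SIZE bytes.  */
static void
store_vec (unsigned char *p, int c)
{
  memset (p, c, VEC_SIZE);
}

/* Fill n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE, with exactly two stores.
   The second store is anchored at the end of the buffer and overlaps
   the first whenever n < 2 * VEC_SIZE, so no branch on the exact size
   is needed.  */
static void
memset_vec_to_2x (unsigned char *dst, int c, size_t n)
{
  assert (n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  store_vec (dst, c);
  store_vec (dst + n - VEC_SIZE, c);
}
```

Writing a few bytes twice is cheaper than branching to a tail loop, which is why the same trick is applied at larger sizes as well.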

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
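
Item 2 is why small sizes need no overlap check: once every source byte is held in registers, the stores cannot clobber data that is still to be read. A minimal C sketch of that idea follows. It is not the code from the commit; vec_t, load_vec, store_vec and move_vec_to_2x_vec are hypothetical names, and a small struct stands in for a vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* assumed vector width in bytes (the 16-byte case) */

/* Hypothetical stand-in for a vector register. */
typedef struct { unsigned char b[VEC_SIZE]; } vec_t;

static vec_t load_vec(const unsigned char *p)
{
    vec_t v;
    memcpy(v.b, p, VEC_SIZE);  /* models one unaligned vector load */
    return v;
}

static void store_vec(unsigned char *p, vec_t v)
{
    memcpy(p, v.b, VEC_SIZE);  /* models one unaligned vector store */
}

/* Move n bytes for VEC_SIZE <= n <= 2 * VEC_SIZE.  Both loads complete
   before the first store, so the result is correct even when the source
   and destination ranges overlap.  */
static void move_vec_to_2x_vec(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
    assert(n >= VEC_SIZE && n <= 2 * VEC_SIZE);
    vec_t head = load_vec(src);
    vec_t tail = load_vec(src + n - VEC_SIZE);
    store_vec(dst, head);
    store_vec(dst + n - VEC_SIZE, tail);
}
```

With 8 registers instead of 2, the same scheme covers sizes up to 8 times the vector size.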
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
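
For reference, the ERMS flag is reported in CPUID leaf 7, sub-leaf 0, register EBX, bit 9. The check can be sketched as below; BIT_CPU_ERMS_SKETCH and ebx_has_erms are hypothetical names for this illustration, and the caller is assumed to have already obtained the EBX value from CPUID.

```c
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 9 is the ERMS flag. */
#define BIT_CPU_ERMS_SKETCH (1u << 9)

/* Return nonzero if the given leaf-7 EBX value advertises ERMS. */
static int ebx_has_erms(uint32_t leaf7_ebx)
{
    return (leaf7_ebx & BIT_CPU_ERMS_SKETCH) != 0;
}
```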
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    to Intel CPUs, as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
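
The decoding described above can be sketched as follows. This is an illustration of the CPUID leaf 1 EAX layout (model in bits 7:4, family in bits 11:8, extended model in bits 19:16, extended family in bits 27:20), not a copy of get_common_indeces; struct cpu_signature and decode_signature are hypothetical names.

```c
#include <stdint.h>

struct cpu_signature {
    unsigned int family;
    unsigned int model;
    unsigned int extended_model;  /* already shifted into bits 7:4 */
};

/* Decode the EAX value returned by CPUID leaf 1. */
static struct cpu_signature decode_signature(uint32_t eax)
{
    struct cpu_signature s;
    s.family = (eax >> 8) & 0x0f;
    s.model = (eax >> 4) & 0x0f;
    s.extended_model = (eax >> 12) & 0xf0;
    if (s.family == 0x0f) {
        /* Base family 0x0f: add the extended family and fold the
           extended model into the model, for Intel and AMD alike.  */
        s.family += (eax >> 20) & 0xff;
        s.model += s.extended_model;
    }
    return s;
}
```

For example, an EAX value of 0x00660F01 decodes to family 0x15 and model 0x60 under this scheme.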
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to mistakenly use bits and indices
    of the cpuid array on the feature array, especially in assembly code.
    For example, sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.
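
A simplified C sketch of why the rename turns this mistake into a build failure: each accessor macro pastes its own prefix onto the feature name, so a name defined for only one array does not expand for the other. The struct layout, the two-argument macro form and the bit values below are assumptions made for illustration and differ from the real cpu-features.h.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct cpu_features. */
struct cpu_features_sketch {
    uint32_t cpuid_ecx[1];  /* raw CPUID bits */
    uint32_t feature[1];    /* derived architecture preferences */
};

/* A CPUID-reported feature: only the _cpu_ names are defined. */
#define index_cpu_SSSE3 0
#define bit_cpu_SSSE3 (1u << 9)

/* A derived preference: only the _arch_ names are defined. */
#define index_arch_Fast_Rep_String 0
#define bit_arch_Fast_Rep_String (1u << 0)

#define HAS_CPU_FEATURE(cf, name) \
    (((cf)->cpuid_ecx[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
    (((cf)->feature[index_arch_##name] & bit_arch_##name) != 0)

/* HAS_CPU_FEATURE (cf, Fast_Rep_String) would expand to
   index_cpu_Fast_Rep_String, which is undefined, so the compiler
   rejects it instead of silently testing the wrong array.  */
```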
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
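
The difference is the usual one between assigning a mask and OR-ing it in. A minimal sketch, where the bit position and the function name are assumptions for illustration only:

```c
#include <stdint.h>

#define BIT_PREFER_MAP_32BIT_EXEC_SKETCH (1u << 16)  /* illustrative position */

/* Turn the preference on while leaving every other bit of the
   feature word as it was.  A plain assignment of the mask would
   clear the other bits.  */
static uint32_t enable_prefer_map_32bit_exec(uint32_t feature_word)
{
    return feature_word | BIT_PREFER_MAP_32BIT_EXEC_SKETCH;
}
```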
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing selection
    order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
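
The selection order above reads as a simple decision chain. The C sketch below mirrors it for illustration; the real selector is the assembly IFUNC resolver in memcpy.S, and the struct, enum and function names here are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical flags mirroring the feature bits named in the list. */
struct memcpy_features {
    bool avx_fast_unaligned_load;
    bool fast_unaligned_load;
    bool ssse3;
    bool fast_copy_backward;
};

enum memcpy_impl {
    MEMCPY_AVX_UNALIGNED,
    MEMCPY_SSE2_UNALIGNED,
    MEMCPY_SSE2,
    MEMCPY_SSSE3_BACK,
    MEMCPY_SSSE3
};

/* Apply the five rules in order; the first match wins. */
static enum memcpy_impl select_memcpy(const struct memcpy_features *f)
{
    if (f->avx_fast_unaligned_load)
        return MEMCPY_AVX_UNALIGNED;
    if (f->fast_unaligned_load)
        return MEMCPY_SSE2_UNALIGNED;
    if (!f->ssse3)
        return MEMCPY_SSE2;
    if (f->fast_copy_backward)
        return MEMCPY_SSSE3_BACK;
    return MEMCPY_SSSE3;
}
```

Note that Slow_BSF no longer appears anywhere in the chain, which is the point of the fix.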
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows an improvement of up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT will map to the lower 2GB,
    not the lower 4GB, of the address space.  Because Prefer_MAP_32BIT_EXEC
    reduces the bits available for address space layout randomization
    (ASLR), it is always disabled for SUID programs and can only be enabled
    by setting the environment variable LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 21 Adhemerval Zanella 2017-01-04 12:38:46 UTC
Fixed by 14a1d7cc4c4fd5ee8.