Bug 19928 - memmove-vec-unaligned-erms.S is slow with large data size
Summary: memmove-vec-unaligned-erms.S is slow with large data size
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: string
Version: 2.24
Importance: P2 normal
Target Milestone: 2.24
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-08 19:15 UTC by H.J. Lu
Modified: 2016-06-06 20:36 UTC (History)
0 users

See Also:
Host:
Target: x86-64
Build:
Last reconfirmed:
fweimer: security-


Attachments
bench-memcpy data on Intel Haswell machine with large data size (9.78 KB, application/octet-stream)
2016-04-08 19:15 UTC, H.J. Lu
memcpy performance data on various Intel and AMD processors (373.18 KB, application/octet-stream)
2016-04-12 15:08 UTC, H.J. Lu
memmove performance data on various Intel and AMD processors (350.22 KB, application/octet-stream)
2016-04-12 15:08 UTC, H.J. Lu

Description H.J. Lu 2016-04-08 19:15:44 UTC
Created attachment 9171 [details]
bench-memcpy data on Intel Haswell machine with large data size

The large memcpy micro benchmark in glibc shows a regression with large
data on Haswell: memmove-vec-unaligned-erms.S doesn't use non-temporal
stores for large data sizes.  Benchmark data shows that the threshold for
using non-temporal stores is approximately 6 times the shared cache size.
However, non-temporal stores aren't a win for large data sizes when the
destination and source overlap, since the destination may already be in
cache when the source is loaded.
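
The policy described above can be sketched in C as follows.  This is only
an illustration under stated assumptions, not the glibc assembly: the names
regions_overlap and use_non_temporal_store and the shared_cache_size
parameter are hypothetical, and the factor of 6 is the approximate threshold
taken from the benchmark data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if [dst, dst + n) and [src, src + n)
   share at least one byte.  */
static bool regions_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t distance = d > s ? d - s : s - d;
    return distance < n;
}

/* Hypothetical policy: use non-temporal stores only for copies larger
   than 6 times the shared cache size, and only when the two regions
   do not overlap.  */
static bool use_non_temporal_store(const void *dst, const void *src,
                                   size_t n, size_t shared_cache_size)
{
    size_t threshold = 6 * shared_cache_size;
    return n > threshold && !regions_overlap(dst, src, n);
}
```

The overlap test runs only on the large-copy path, so small copies are not
slowed down by it.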
Comment 1 Sourceware Commits 2016-04-08 19:18:55 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  157c57198e893b4882d1feb98de2b0721ee408fc (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc

commit 157c57198e893b4882d1feb98de2b0721ee408fc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b

commit f817b9d36215ab60d58cc744d22773b4961a2c9b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0

commit 122600f4b380b00ce0f682039fe59af4bd0edbc0
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3

commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Using non-temporal
    stores in memcpy on large data can improve performance significantly.
    This patch adds a threshold for using non-temporal stores, which is 6
    times the shared cache size.  When the size is above the threshold,
    non-temporal stores are used, except when the destination and source
    overlap, since the destination may be in cache when the source is
    loaded.
    
    For sizes below 8 times the vector register width, we load all data
    into registers and store them together.  Only forward and backward
    loops, which move 4 vector registers at a time, are used to support
    overlapping addresses.  For the forward loop, we load the last 4 vector
    register widths of data and the first vector register width of data
    into vector registers before the loop and store them after the loop.
    For the backward loop, we load the first 4 vector register widths of
    data and the last vector register width of data into vector registers
    before the loop and store them after the loop.
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.
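
The forward loop described in this commit message can be sketched in C.
This is a simplified model, not the assembly in
memmove-vec-unaligned-erms.S: vec_t, load, store and copy_forward are
hypothetical names, VEC is fixed at 16 bytes for illustration, and plain
memcpy calls on a small struct stand in for vector register loads and
stores.  The sketch assumes n > 8 * VEC and that the destination does not
start above the source inside the source range.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VEC = 16 };  /* assumed vector register width in bytes */

typedef struct { unsigned char b[VEC]; } vec_t;

static vec_t load(const unsigned char *p)
{
    vec_t v;
    memcpy(v.b, p, VEC);
    return v;
}

static void store(unsigned char *p, vec_t v)
{
    memcpy(p, v.b, VEC);
}

/* Forward copy of n > 8 * VEC bytes; safe when dst <= src or when the
   regions do not overlap.  */
static void copy_forward(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    /* Load the first vector and the last 4 vectors before any store
       can overwrite them.  */
    vec_t head = load(src);
    vec_t tail[4];
    for (int k = 0; k < 4; k++)
        tail[k] = load(src + n - (size_t)(4 - k) * VEC);

    /* Start the loop at the next VEC-aligned destination address; the
       skipped bytes are covered by the head store.  */
    size_t i = VEC - (size_t)((uintptr_t)dst % VEC);
    while (n - i > 4 * VEC) {
        vec_t v[4];
        for (int k = 0; k < 4; k++)
            v[k] = load(src + i + (size_t)k * VEC);
        for (int k = 0; k < 4; k++)
            store(dst + i + (size_t)k * VEC, v[k]);
        i += 4 * VEC;
    }

    /* At most 4 * VEC bytes remain; the saved tail covers them.  */
    for (int k = 0; k < 4; k++)
        store(dst + n - (size_t)(4 - k) * VEC, tail[k]);
    store(dst, head);
}
```

The backward loop mirrors this: it saves the first 4 vectors and the last
vector, then walks from the end of the buffer toward the start.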

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove since it breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times the vector register size and REP_STOSB_THRESHOLD, which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
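
The overlapping-store idea in item 1 can be sketched in C for one size
class.  This is a minimal illustration, not the glibc implementation:
set_vec_to_2vec is a hypothetical name, VEC is fixed at 16 bytes, and a
byte array stands in for a vector register holding the broadcast fill
value.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 16 };  /* assumed vector register width in bytes */

/* Fill n bytes, where VEC <= n <= 2 * VEC, with exactly two stores and
   no branch on the exact size.  */
static void set_vec_to_2vec(unsigned char *dst, int c, size_t n)
{
    unsigned char v[VEC];
    for (size_t i = 0; i < VEC; i++)
        v[i] = (unsigned char)c;    /* stands in for a broadcast */

    memcpy(dst, v, VEC);            /* first VEC bytes */
    memcpy(dst + n - VEC, v, VEC);  /* last VEC bytes; overlaps the
                                       first store when n < 2 * VEC */
}
```

Writing some bytes twice is harmless for memset, and it replaces a chain of
size comparisons with two unconditional stores.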
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times the vector register size, there is no check for
    address overlap between source and destination.  Since the overhead
    of the overlap check is small when size > 8 times the vector register
    size, memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times the vector register size and REP_MOVSB_THRESHOLD, which
    is 2KB for the 16-byte vector register size and is scaled up for larger
    vector register sizes.
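
The size range shared by the two variants can be written out as a small C
sketch.  Both function names are hypothetical, and two details are
assumptions rather than statements from the commit message: that the
threshold scales linearly with the vector size, and that both ends of the
range are inclusive.

```c
#include <stddef.h>

/* Assumed scaling: 2KB for 16-byte vectors, proportionally larger for
   32-byte and 64-byte vectors.  */
static size_t rep_movsb_threshold(size_t vec_size)
{
    return 2048 * (vec_size / 16);
}

/* Nonzero when size is in the range where the rep movsb variant and
   the plain variant run the same code (inclusive bounds assumed).  */
static int in_shared_range(size_t size, size_t vec_size)
{
    return size >= 2 * vec_size && size <= rep_movsb_threshold(vec_size);
}
```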
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times the vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times the vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    the vector register size at a time.
    5. Otherwise, forward copy 8 times the vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times the vector register size at a
    time.
    7. Skip when address of destination == address of source.
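
Items 1 and 2 explain why small sizes need no overlap check: every load
completes before the first store.  A minimal C sketch for one size class
follows.  It is an illustration only; move_vec_to_2vec is a hypothetical
name, VEC is fixed at 16 bytes, and local byte arrays stand in for vector
registers.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 16 };  /* assumed vector register width in bytes */

/* Move n bytes, where VEC <= n <= 2 * VEC.  Correct for any overlap
   between src and dst because both loads happen before either store.  */
static void move_vec_to_2vec(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    unsigned char first[VEC], last[VEC];

    memcpy(first, src, VEC);            /* load first VEC bytes */
    memcpy(last, src + n - VEC, VEC);   /* load last VEC bytes; overlaps
                                           the first load if n < 2 * VEC */
    memcpy(dst, first, VEC);            /* store first VEC bytes */
    memcpy(dst + n - VEC, last, VEC);   /* store last VEC bytes */
}
```

The same pattern extends to larger size classes by loading more vectors
from each end before storing any of them.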
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
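
For reference, the ERMS feature bit is reported in CPUID leaf 7, subleaf 0,
EBX bit 9.  The sketch below reads it from user code with the compiler's
<cpuid.h> helpers.  It does not use glibc's cpu-features machinery, and
ebx_has_erms and cpu_has_erms are hypothetical names chosen for this
example.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* ERMS: CPUID.(EAX=07H, ECX=0):EBX bit 9.  */
#define ERMS_EBX_BIT (1u << 9)

static bool ebx_has_erms(uint32_t ebx)
{
    return (ebx & ERMS_EBX_BIT) != 0;
}

static bool cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 7 must be supported before it can be queried.  */
    if (__get_cpuid_max(0, 0) < 7)
        return false;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return ebx_has_erms(ebx);
#else
    return false;  /* not an x86 target */
#endif
}
```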

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs, as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.
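
The decoding this commit describes can be sketched in C from the standard
layout of CPUID leaf 1 EAX.  The struct and function names below are
hypothetical and only loosely modeled on get_common_indeces; the sketch
covers just the family == 0x0f adjustment named in the commit message.

```c
#include <stdint.h>

struct cpu_signature {
    unsigned int family;
    unsigned int model;
    unsigned int extended_model;
};

/* Decode family and model from CPUID leaf 1 EAX.  */
static struct cpu_signature decode_signature(uint32_t eax)
{
    struct cpu_signature s;

    s.family = (eax >> 8) & 0x0f;           /* bits 11:8 */
    s.model = (eax >> 4) & 0x0f;            /* bits 7:4 */
    s.extended_model = (eax >> 12) & 0xf0;  /* bits 19:16, already
                                               shifted left by 4 */
    if (s.family == 0x0f) {
        s.family += (eax >> 20) & 0xff;     /* extended family,
                                               bits 27:20 */
        s.model += s.extended_model;
    }
    return s;
}
```

With this decoding, an EAX value of 0x00800F11 yields family 0x17 and
model 0x01.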

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to mistakenly use bits and indices
    of the cpuid array on the feature array, especially in assembly code.
    For example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
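
The build-time check described above relies on token pasting: each feature name is defined only with the prefix of the array it belongs to, so using it with the wrong accessor expands to an undefined identifier.  The following is a simplified, hypothetical model of that idea; the struct layout, the extra `cf` macro argument and the bit values are assumptions for illustration and do not reproduce sysdeps/x86/cpu-features.h.

```c
/* Hypothetical, simplified stand-in for struct cpu_features.  */
struct cpu_features_sketch
{
  unsigned int cpuid[4];    /* eax, ebx, ecx, edx of one CPUID leaf */
  unsigned int feature[1];  /* derived "arch" feature bits */
};

/* Fast_Rep_String lives in the feature array, so only the _arch_
   names are defined for it.  */
#define index_arch_Fast_Rep_String 0
#define bit_arch_Fast_Rep_String (1u << 0)

/* SSE2 lives in the cpuid array, so only the _cpu_ names are defined.  */
#define index_cpu_SSE2 3
#define bit_cpu_SSE2 (1u << 26)

#define HAS_CPU_FEATURE(cf, name) \
  (((cf)->cpuid[index_cpu_##name] & (bit_cpu_##name)) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
  (((cf)->feature[index_arch_##name] & (bit_arch_##name)) != 0)
```

With these definitions, `HAS_ARCH_FEATURE (cf, Fast_Rep_String)` compiles, while `HAS_CPU_FEATURE (cf, Fast_Rep_String)` fails because `index_cpu_Fast_Rep_String` is not defined anywhere.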

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.
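
The difference between assigning the bit and OR-ing it in can be shown with a small sketch.  The bit position and function names below are hypothetical and chosen only for illustration.

```c
/* Hypothetical bit position, not the value used by glibc.  */
#define BIT_PREFER_MAP_32BIT_EXEC_SKETCH (1u << 16)

/* Plain assignment: every bit that was already set is lost.  */
unsigned int
enable_by_assignment_sketch (unsigned int feature_word)
{
  (void) feature_word;
  return BIT_PREFER_MAP_32BIT_EXEC_SKETCH;
}

/* OR: the existing bits are kept and the new bit is added.  */
unsigned int
enable_by_or_sketch (unsigned int feature_word)
{
  return feature_word | BIT_PREFER_MAP_32BIT_EXEC_SKETCH;
}
```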

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is replaced with the following:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
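
The five-step selection order listed in this commit message can be sketched as a C function.  The real selector is written in assembly in memcpy.S; the struct, enum and function names here are hypothetical, and the booleans stand in for the corresponding cpu_features bits.

```c
#include <stdbool.h>

/* Hypothetical view of the bits the selector looks at.  */
struct memcpy_cpu_bits_sketch
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* One enumerator per implementation named in the commit message.  */
enum memcpy_impl_sketch
{
  MEMCPY_AVX_UNALIGNED,
  MEMCPY_SSE2_UNALIGNED,
  MEMCPY_SSE2,
  MEMCPY_SSSE3_BACK,
  MEMCPY_SSSE3
};

/* Apply the checks in the same order as the numbered list.  */
enum memcpy_impl_sketch
select_memcpy_sketch (const struct memcpy_cpu_bits_sketch *bits)
{
  if (bits->avx_fast_unaligned_load)
    return MEMCPY_AVX_UNALIGNED;
  if (bits->fast_unaligned_load)
    return MEMCPY_SSE2_UNALIGNED;
  if (!bits->ssse3)
    return MEMCPY_SSE2;
  if (bits->fast_copy_backward)
    return MEMCPY_SSSE3_BACK;
  return MEMCPY_SSSE3;
}
```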

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows an improvement of up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to the Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT will map to lower 2GB, not
    lower 4GB, address.  Prefer_MAP_32BIT_EXEC reduces bits available for
    address space layout randomization (ASLR), which is always disabled for
    SUID programs and can only be enabled by setting environment variable,
    LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.
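
The "try MAP_32BIT first" behaviour described in this commit message can be sketched with the public mmap interface.  This is an illustration only: the function name and the `prefer_map_32bit_exec` parameter are hypothetical, the sketch is limited to anonymous private mappings, and the MAP_32BIT attempt is compiled in only where the flag exists (Linux on x86-64).

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Map LEN bytes of anonymous memory.  If the caller prefers low
   addresses for executable pages, try MAP_32BIT first and fall back to
   an ordinary mapping when that attempt fails.  */
void *
mmap_prefer_32bit_sketch (size_t len, int prot, int prefer_map_32bit_exec)
{
  int flags = MAP_PRIVATE | MAP_ANONYMOUS;

#ifdef MAP_32BIT
  if (prefer_map_32bit_exec && (prot & PROT_EXEC) != 0)
    {
      void *low = mmap (NULL, len, prot, flags | MAP_32BIT, -1, 0);
      if (low != MAP_FAILED)
        return low;
    }
#else
  (void) prefer_map_32bit_exec;
#endif

  return mmap (NULL, len, prot, flags, -1, 0);
}
```

The fallback matters because the low 2GB region is small; a failed MAP_32BIT request should not make the whole mapping fail.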

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Comment 2 Sourceware Commits 2016-04-08 19:22:15 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created
        at  fe38127f6d289dd6eaa6425acb108b7b384ddc4b (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe38127f6d289dd6eaa6425acb108b7b384ddc4b

commit fe38127f6d289dd6eaa6425acb108b7b384ddc4b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2c5fc8567a694ae6115b25db787673fb8dc140a5

commit 2c5fc8567a694ae6115b25db787673fb8dc140a5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed37fe74cfe0d9f68a8023b7f73a5805f4a5a206

commit ed37fe74cfe0d9f68a8023b7f73a5805f4a5a206
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96b5fbcbc09df10b093221d6b55eaa5e7e8c044f

commit 96b5fbcbc09df10b093221d6b55eaa5e7e8c044f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machine.  non-temporal store in
    memcpy on large data can improve performance significantly.  This
    patch adds a threshold to use non temporal store which is 6 times of
    shared cache size.  When size is above the threshold, non temporal
    store will be used, but avoid non-temporal store if there is overlap
    between destination and source since destination may be in cache when
    source is loaded.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.
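
The decision described in this commit message (non-temporal stores only above a threshold of 6 times the shared cache size, and only when source and destination do not overlap) can be sketched in C.  The real logic is in assembly; the function names are hypothetical, and treating "above the threshold" as a strict greater-than comparison is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold as described: 6 times the shared cache size.  */
size_t
non_temporal_threshold_sketch (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* True if [dst, dst + n) and [src, src + n) share at least one byte.  */
bool
ranges_overlap_sketch (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;
  return d < s ? s - d < n : d - s < n;
}

/* Use non-temporal stores only for large, non-overlapping copies.  */
bool
use_non_temporal_store_sketch (const void *dst, const void *src,
                               size_t n, size_t threshold)
{
  return n > threshold && !ranges_overlap_sketch (dst, src, n);
}
```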

-----------------------------------------------------------------------
Comment 3 Sourceware Commits 2016-04-08 20:33:03 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created
        at  9e1ddc1180ca0619d12b620b227726233a48b9bc (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9e1ddc1180ca0619d12b620b227726233a48b9bc

commit 9e1ddc1180ca0619d12b620b227726233a48b9bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3443d7810db1092ac70a0fde7b85732a2e00cdc3

commit 3443d7810db1092ac70a0fde7b85732a2e00cdc3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d2a372d44dc05201242d0fd5551df9c3174806c

commit 1d2a372d44dc05201242d0fd5551df9c3174806c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fa066d5f5ff996990869bbbad08435f02d18bb3

commit 9fa066d5f5ff996990869bbbad08435f02d18bb3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machine.  non-temporal store in
    memcpy on large data can improve performance significantly.  This
    patch adds a threshold to use non temporal store which is 6 times of
    shared cache size.  When size is above the threshold, non temporal
    store will be used, but avoid non-temporal store if there is overlap
    between destination and source since destination may be in cache when
    source is loaded.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.
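
The forward loop described in this commit message can be modelled in portable C.  This is a sketch under stated assumptions, not the assembly implementation: the vector width is fixed at 16 bytes, local byte arrays stand in for vector registers, the destination alignment step is omitted, and memmove is used as a stand-in for every path other than the forward loop.  All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum { VEC_SKETCH = 16 };  /* assumed vector register width in bytes */

/* Copy N bytes from SRC to DST, where the ranges may overlap.  */
void
forward_copy_sketch (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[VEC_SKETCH];       /* first vector of the source */
  unsigned char tail[4 * VEC_SKETCH];   /* last 4 vectors of the source */
  unsigned char block[4 * VEC_SKETCH];  /* 4 "registers" used by the loop */
  size_t off;

  /* Small sizes and copies that need the backward direction are outside
     this sketch; memmove stands in for them.  */
  if (n <= 8 * VEC_SKETCH || dst > src)
    {
      memmove (dst, src, n);
      return;
    }

  /* Load both ends before any store can overwrite them.  */
  memcpy (head, src, VEC_SKETCH);
  memcpy (tail, src + n - 4 * VEC_SKETCH, 4 * VEC_SKETCH);

  /* Move 4 vectors per iteration, loading all 4 before storing.  */
  for (off = VEC_SKETCH; off + 4 * VEC_SKETCH < n; off += 4 * VEC_SKETCH)
    {
      memcpy (block, src + off, 4 * VEC_SKETCH);
      memcpy (dst + off, block, 4 * VEC_SKETCH);
    }

  /* Store the saved ends after the loop.  */
  memcpy (dst + n - 4 * VEC_SKETCH, tail, 4 * VEC_SKETCH);
  memcpy (dst, head, VEC_SKETCH);
}
```

Because the destination is not above the source on this path, each store only touches source bytes that have already been loaded, which is why the ends must be saved before the loop starts.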

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d

commit 0932dd8b56db46dd421a4855fb5dee9de092538d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the
    default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d

commit da2da79262814ba4ead3ee487549949096d8ad2d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S to make the SSE2 version the
    default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb

commit 9a93bdbaff81edf67c5486c84f8098055e355abb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483

commit 5118e532600549ad0f56cb9b1a179b8eab70c483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292

commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc

commit a96379797a7eecc1b709cad7b68981eb698783dc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times of vector register size and REP_STOSB_THRESHOLD which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
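
A minimal C sketch of key feature 1 (overlapping stores), not the actual
assembly from the commit.  It assumes a 16-byte vector and uses memcpy
from a small buffer to stand in for a vector register store; the helper
name set_vec_to_2x_vec is hypothetical.  For any size between one and two
vector widths, one store at the start and one store ending at the last
byte cover the whole range, so no branch on the exact size is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* assumed: SSE2-sized vector register */

/* Hypothetical helper: fill n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE,
   with two possibly overlapping VEC_SIZE-byte stores.  */
static void
set_vec_to_2x_vec (unsigned char *dst, int c, size_t n)
{
  unsigned char vec[VEC_SIZE];

  assert (n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  memset (vec, c, VEC_SIZE);                   /* stands in for a broadcast */
  memcpy (dst, vec, VEC_SIZE);                 /* store at the start */
  memcpy (dst + n - VEC_SIZE, vec, VEC_SIZE);  /* store ending at dst + n */
}
```

The two stores overlap by 2 * VEC_SIZE - n bytes, which is harmless
because both write the same value.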
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
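
Key features 1 and 2 can be sketched in C as follows.  This is an
illustration under stated assumptions, not the code from the commit: the
vector width is taken as 16 bytes, local buffers stand in for vector
registers, and the name move_vec_to_2x_vec is hypothetical.  Because every
load happens before any store, the copy is correct even when source and
destination overlap, which is why small sizes need no overlap check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16  /* assumed: SSE2-sized vector register */

/* Hypothetical helper: move n bytes, VEC_SIZE <= n <= 2 * VEC_SIZE.
   Both vectors are loaded before either is stored.  */
static void
move_vec_to_2x_vec (unsigned char *dst, const unsigned char *src, size_t n)
{
  unsigned char head[VEC_SIZE], tail[VEC_SIZE];

  assert (n >= VEC_SIZE && n <= 2 * VEC_SIZE);
  memcpy (head, src, VEC_SIZE);                 /* load first vector */
  memcpy (tail, src + n - VEC_SIZE, VEC_SIZE);  /* load last vector */
  memcpy (dst, head, VEC_SIZE);                 /* store first vector */
  memcpy (dst + n - VEC_SIZE, tail, VEC_SIZE);  /* store last vector */
}
```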
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
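
For reference, the ERMS feature bit is CPUID.(EAX=7,ECX=0):EBX bit 9.
The sketch below reads it from user code with the GCC/Clang <cpuid.h>
helper __get_cpuid_count; it is not how glibc's cpu-features code is
written, and the function names are hypothetical.  On non-x86 targets the
probe simply reports 0.

```c
#include <assert.h>
#if defined (__x86_64__) || defined (__i386__)
# include <cpuid.h>  /* GCC/Clang helper header, x86 only */
#endif

#define ERMS_EBX_BIT 9  /* CPUID.(EAX=7,ECX=0):EBX bit 9 */

/* Extract the ERMS bit from a leaf-7 EBX value.  */
static int
ebx_has_erms (unsigned int ebx)
{
  return (ebx >> ERMS_EBX_BIT) & 1;
}

/* Return 1 if the running CPU reports ERMS, 0 otherwise.  */
static int
cpu_has_erms (void)
{
#if defined (__x86_64__) || defined (__i386__)
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return ebx_has_erms (ebx);
#endif
  return 0;
}

/* Small usage check of the bit extraction.  */
static void
erms_checks (void)
{
  assert (ebx_has_erms (1u << ERMS_EBX_BIT) == 1);
  assert (ebx_has_erms (0) == 0);
}
```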
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229

commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6

commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date:   Fri Mar 25 11:11:42 2016 +0100

    tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
    
    	[BZ #19860]
    	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
    	zero if the compiler does not provide the AVX512F bit.
    
    (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa

commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483

commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4

commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to mistakenly use bits and indices
    of the cpuid array on the feature array, especially in assembly code.  For
    example,
    sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.
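
The build-time check relies on token pasting.  The following is a
simplified sketch with hypothetical stand-ins (struct cpu_features_sketch,
the index and bit values, and the extra cf argument are all assumptions
for illustration, not the real cpu-features.h definitions).  Each accessor
pastes its own prefix onto the feature name, so asking the cpuid accessor
for an arch-only feature refers to a macro that does not exist.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for struct cpu_features.  */
struct cpu_features_sketch
{
  unsigned int cpuid[4];    /* raw CPUID register values */
  unsigned int feature[1];  /* derived "arch" feature bits */
};

/* Illustrative values only.  */
#define index_cpu_SSE2              3
#define bit_cpu_SSE2                (1u << 26)
#define index_arch_Fast_Rep_String  0
#define bit_arch_Fast_Rep_String    (1u << 0)

#define HAS_CPU_FEATURE(cf, name) \
  (((cf)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
  (((cf)->feature[index_arch_##name] & bit_arch_##name) != 0)

/* HAS_CPU_FEATURE (cf, Fast_Rep_String) would expand to the undefined
   index_cpu_Fast_Rep_String, so the mix-up fails to compile instead of
   silently testing the wrong array.  */
static int
fast_rep_string_p (const struct cpu_features_sketch *cf)
{
  return HAS_ARCH_FEATURE (cf, Fast_Rep_String);
}

/* Small usage check.  */
static void
feature_checks (void)
{
  struct cpu_features_sketch cf = { { 0, 0, 0, 0 }, { 0 } };
  assert (!fast_rep_string_p (&cf));
  cf.feature[index_arch_Fast_Rep_String] = bit_arch_Fast_Rep_String;
  assert (fast_rep_string_p (&cf));
}
```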
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9

commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date:   Tue Mar 8 12:31:13 2016 -0800

    Fix tst-audit10 build when -mavx512f is not supported.
    
    (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c

commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Mar 7 16:00:25 2016 +0100

    tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]
    
    This ensures that GCC will not use unsupported instructions before
    the run-time check to ensure support.
    
    (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e

commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29

commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing selection
    order is replaced with the following selection order:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
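
The selection order above can be read as a chain of checks.  The C sketch
below mirrors the five steps; the real selector is written in assembly in
memcpy.S, and the struct memcpy_flags type and select_memcpy function are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag set; each field mirrors one feature bit.  */
struct memcpy_flags
{
  int avx_fast_unaligned_load;
  int fast_unaligned_load;
  int ssse3;
  int fast_copy_backward;
};

/* Return the name of the implementation chosen by the listed order.  */
static const char *
select_memcpy (const struct memcpy_flags *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}

/* Small usage check: SSSE3 with fast backward copy, no fast unaligned load.  */
static void
select_checks (void)
{
  struct memcpy_flags f = { 0, 0, 1, 1 };
  assert (strcmp (select_memcpy (&f), "__memcpy_ssse3_back") == 0);
}
```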
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)

-----------------------------------------------------------------------
Comment 4 H.J. Lu 2016-04-12 15:08:00 UTC
Created attachment 9184 [details]
memcpy performance data on various Intel and AMD processors
Comment 5 H.J. Lu 2016-04-12 15:08:50 UTC
Created attachment 9185 [details]
memmove performance data on various Intel and AMD processors
Comment 6 Sourceware Commits 2016-04-12 15:33:26 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  a057f5f8cd1becc5ae8b51220283095bc808d72a (commit)
      from  b39d84adff832bddc3e2fc4a1878a7fba6bbb2a1 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a057f5f8cd1becc5ae8b51220283095bc808d72a

commit a057f5f8cd1becc5ae8b51220283095bc808d72a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 12 08:10:31 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Using non-temporal
    stores in memcpy on large data can improve performance significantly.
    This patch adds a threshold for non-temporal stores, which is 6 times
    the shared cache size.  When size is above the threshold, non-temporal
    stores are used, except when destination and source overlap, since
    the destination may be in cache when the source is loaded.
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
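
The decision described in the first paragraph can be sketched in C.  This
is a simplified model, not the assembly in memmove-vec-unaligned-erms.S:
addresses are passed as integers, and the function names are hypothetical.
Only the factor of 6 and the "no overlap" condition come from the commit
message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold from the commit message: 6 times the shared cache size.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* Nonzero if [dst, dst + n) and [src, src + n) share any byte.  */
static int
ranges_overlap (uintptr_t dst, uintptr_t src, size_t n)
{
  return dst < src ? src - dst < n : dst - src < n;
}

/* Nonzero if the copy is large enough and the buffers are disjoint.  */
static int
use_non_temporal_store (uintptr_t dst, uintptr_t src, size_t n,
			size_t shared_cache_size)
{
  return n > non_temporal_threshold (shared_cache_size)
	 && !ranges_overlap (dst, src, n);
}

/* Small usage check with an assumed 8 MiB shared cache.  */
static void
non_temporal_checks (void)
{
  size_t cache = (size_t) 8 << 20, n = (size_t) 64 << 20;
  /* Large and disjoint: non-temporal stores apply.  */
  assert (use_non_temporal_store (0x10000000u + n, 0x10000000u, n, cache));
  /* Large but overlapping: regular stores.  */
  assert (!use_non_temporal_store (0x10000000u + n / 2, 0x10000000u, n, cache));
  /* Disjoint but below the threshold: regular stores.  */
  assert (!use_non_temporal_store (0x20000000u, 0x10000000u, cache, cache));
}
```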
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
    	times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                          |   26 ++
 sysdeps/x86_64/cacheinfo.c                         |    8 +
 .../x86_64/multiarch/memmove-avx-unaligned-erms.S  |    1 +
 .../multiarch/memmove-avx512-unaligned-erms.S      |    1 +
 .../x86_64/multiarch/memmove-sse2-unaligned-erms.S |    6 +-
 .../x86_64/multiarch/memmove-vec-unaligned-erms.S  |  389 +++++++++++---------
 6 files changed, 260 insertions(+), 171 deletions(-)
Comment 7 H.J. Lu 2016-04-15 16:24:35 UTC
Fixed for 2.24.
Comment 8 Sourceware Commits 2016-06-06 20:36:56 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  b60dda5f2385aaca873069f9fb28645b82a1b711 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b60dda5f2385aaca873069f9fb28645b82a1b711

commit b60dda5f2385aaca873069f9fb28645b82a1b711
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 27 15:16:22 2016 -0700

    Count number of logical processors sharing L2 cache
    
    For Intel processors, when there are both L2 and L3 caches, SMT level
    type should be used to count number of available logical processors
    sharing L2 cache.  If there is only L2 cache, core level type should
    be used to count number of available logical processors sharing L2
    cache.  Number of available logical processors sharing L2 cache should
    be used for non-inclusive L2 and L3 caches.
    
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of
    	available logical processors with SMT level type sharing L2
    	cache for Intel processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed46697862f2b0c2db726cc4c772e6003914bd72

commit ed46697862f2b0c2db726cc4c772e6003914bd72
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 20 14:41:14 2016 -0700

    Remove special L2 cache case for Knights Landing
    
    L2 cache is shared by 2 cores on Knights Landing, which has 4 threads
    per core:
    
    https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing
    
    So L2 cache is shared by 8 threads on Knights Landing as reported by
    CPUID.  We should remove special L2 cache case for Knights Landing.
    
    	[BZ #18185]
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads
    	sharing L2 cache to 2 for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=07f943915311f6f92e5a031911d32c5e7458bfd5

commit 07f943915311f6f92e5a031911d32c5e7458bfd5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 10:02:36 2016 -0700

    Correct Intel processor level type mask from CPUID
    
    Intel CPUID with EAX == 11 returns:
    
    ECX Bits 07 - 00: Level number. Same value in ECX input.
        Bits 15 - 08: Level type.
        ^^^^^^^^^^^^^^^^^^^^^^^^ This is level type.
        Bits 31 - 16: Reserved.
    
    Intel processor level type mask should be 0xff00, not 0xff0.
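
A short C sketch of the field extraction described above.  The helper
names are hypothetical and this is not the code in cacheinfo.c.  The last
check shows one way the old 0xff0 mask goes wrong: it drops bits 15..12 of
the level type (and would also pick up bits 7..4 of the level number).

```c
#include <assert.h>

#define LEVEL_NUMBER_MASK 0x00ffu  /* ECX bits 07 - 00 */
#define LEVEL_TYPE_MASK   0xff00u  /* ECX bits 15 - 08 */

/* Level number field of CPUID leaf 11 ECX.  */
static unsigned int
cpuid_level_number (unsigned int ecx)
{
  return ecx & LEVEL_NUMBER_MASK;
}

/* Level type field of CPUID leaf 11 ECX.  */
static unsigned int
cpuid_level_type (unsigned int ecx)
{
  return (ecx & LEVEL_TYPE_MASK) >> 8;
}

/* Small usage check.  */
static void
level_type_checks (void)
{
  assert (cpuid_level_number (0x0201u) == 1);
  assert (cpuid_level_type (0x0201u) == 2);
  /* With the old mask, a type value of 0x12 is indistinguishable from 2.  */
  assert ((0x1200u & 0xff0u) == 0x200u);
  assert (cpuid_level_type (0x1200u) == 0x12);
}
```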
    
    	[BZ #20119]
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel
    	processor level type mask for CPUID with EAX == 11.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=201aebf739482fbb730d10eb7cf8335629bb4de4

commit 201aebf739482fbb730d10eb7cf8335629bb4de4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 09:09:00 2016 -0700

    Check the HTT bit before counting logical threads
    
    Skip counting logical threads for Intel processors if the HTT bit is 0
    which indicates there is only a single logical processor.
    
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting
    	logical threads if the HTT bit is 0.
    	* sysdeps/x86/cpu-features.h (bit_cpu_HTT): New.
    	(index_cpu_HTT): Likewise.
    	(reg_HTT): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=dff8bcdab5968ac53e52ef06cabe8d921b429d22

commit dff8bcdab5968ac53e52ef06cabe8d921b429d22
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 08:49:45 2016 -0700

    Remove alignments on jump targets in memset
    
    X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which
    increases code size but does not necessarily improve performance.  The
    memset benchtest data of align vs no align on various Intel and AMD
    processors
    
    https://sourceware.org/bugzilla/attachment.cgi?id=9277
    
    shows that aligning jump targets isn't necessary.
    
    	[BZ #20115]
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset):
    	Remove alignments on jump targets.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aba9d000bf8441d77f0557af360e3aea7525d03e

commit aba9d000bf8441d77f0557af360e3aea7525d03e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 13 08:29:22 2016 -0700

    Call init_cpu_features only if SHARED is defined
    
    In static executable, since init_cpu_features is called early from
    __libc_start_main, there is no need to call it again in dl_platform_init.
    
    	[BZ #20072]
    	* sysdeps/i386/dl-machine.h (dl_platform_init): Call
    	init_cpu_features only if SHARED is defined.
    	* sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6118b2d23016ec790b99b9331c3d7a45d588134e

commit 6118b2d23016ec790b99b9331c3d7a45d588134e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 13 07:18:25 2016 -0700

    Support non-inclusive caches on Intel processors
    
    	* sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and support
    	non-inclusive caches on Intel processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8642c9a553d8ce8a3a0496ed11fed5a575d338c5

commit 8642c9a553d8ce8a3a0496ed11fed5a575d338c5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed May 11 05:49:09 2016 -0700

    Remove x86 ifunc-defines.sym and rtld-global-offsets.sym
    
    Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym.  Remove
    x86 ifunc-defines.sym and rtld-global-offsets.sym.  No code changes on
    i686 and x86-64.
    
    	* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
    	Remove ifunc-defines.sym.
    	* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
    	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
    	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
    	* sysdeps/x86/Makefile (gen-as-const-headers): Remove
    	rtld-global-offsets.sym.
    	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
    	* sysdeps/x86/cpu-features-offsets.sym: This.
    	* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
    	instead of <ifunc-defines.h> and <rtld-global-offsets.h>.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3038902f233a5e0028a6424685b410f6c201040f

commit 3038902f233a5e0028a6424685b410f6c201040f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun May 8 08:49:02 2016 -0700

    Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86
    
    Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86.  No code changes on x86
    and x86_64.
    
    	* sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c>
    	instead of <sysdeps/x86_64/cacheinfo.c>.
    	* sysdeps/x86_64/cacheinfo.c: Moved to ...
    	* sysdeps/x86/cacheinfo.c: Here.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=df2b390bba18903d62c8910e808bfb0dce7f033c

commit df2b390bba18903d62c8910e808bfb0dce7f033c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 15 05:22:53 2016 -0700

    Detect Intel Goldmont and Airmont processors
    
    Updated from the model numbers of Goldmont and Airmont processors in
    Intel64 And IA-32 Processor Architectures Software Developer's Manual
    Volume 3 Revision 058.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel
    	Goldmont and Airmont processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc

commit 157c57198e893b4882d1feb98de2b0721ee408fc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c
    
    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.
    
    	* sysdeps/x86_64/memcopy.h: New file.
    	* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b

commit f817b9d36215ab60d58cc744d22773b4961a2c9b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
    
    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.
    
    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.
    
    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.
    
    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.
    
    	[BZ #19776]
    	* sysdeps/x86_64/memcpy.S: Make it dummy.
    	* sysdeps/x86_64/mempcpy.S: Likewise.
    	* sysdeps/x86_64/memmove.S: New file.
    	* sysdeps/x86_64/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    	* sysdeps/x86_64/memmove.c: Removed.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
    	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-sse2-unaligned, memmove-avx-unaligned,
    	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Replace
    	__memmove_chk_avx512_unaligned_2 with
    	__memmove_chk_avx512_unaligned.  Remove
    	__memmove_chk_avx_unaligned_2.  Replace
    	__memmove_chk_sse2_unaligned_2 with
    	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
    	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
    	with __memmove_avx512_unaligned.  Replace
    	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    	with __memcpy_chk_avx512_unaligned.  Remove
    	__memcpy_chk_avx_unaligned_2.  Replace
    	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    	Replace __memcpy_avx512_unaligned_2 with
    	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
    	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
    	with __mempcpy_chk_avx512_unaligned.  Remove
    	__mempcpy_chk_avx_unaligned_2.  Replace
    	__mempcpy_chk_sse2_unaligned_2 with
    	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
    	Replace __mempcpy_avx512_unaligned_2 with
    	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    	Replace __mempcpy_sse2_unaligned_2 with
    	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../memcpy.S.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    	__memcpy_chk_avx512_unaligned_erms and
    	__memcpy_chk_avx512_unaligned.  Use
    	__memcpy_chk_avx_unaligned_erms and
    	__memcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __memcpy_chk_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	Change function suffix from unaligned_2 to unaligned.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
    	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    	(ENTRY): Removed.
    	(END): Likewise.
    	(ENTRY_CHK): Likewise.
    	(libc_hidden_builtin_def): Likewise.
    	Don't include ../mempcpy.S.
    	(mempcpy): New.  Add a weak alias.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    	__mempcpy_chk_avx512_unaligned_erms and
    	__mempcpy_chk_avx512_unaligned.  Use
    	__mempcpy_chk_avx_unaligned_erms and
    	__mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
    	Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0

commit 122600f4b380b00ce0f682039fe59af4bd0edbc0
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets
    
    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.
    
    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    	into ...
    	* sysdeps/x86_64/memset.S: This.
    	(__bzero): Removed.
    	(__memset_tail): Likewise.
    	(__memset_chk): Likewise.
    	(memset): Likewise.
    	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
    	defined.
    	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    	(__memset_zero_constant_len_parameter): Check SHARED instead of
    	PIC.
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memset-avx2 and memset-sse2-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(__bzero): Enabled.
    	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
    	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	(memset): Removed.
    	(__memset_chk): Likewise.
    	(MEMSET_SYMBOL): New.
    	(libc_hidden_builtin_def): Replace __memset_sse2 with
    	__memset_sse2_unaligned.
    	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    	__memset_chk_sse2 and __memset_chk_avx2 with
    	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    	Use __memset_chk_sse2_unaligned_erms or
    	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
    	__memset_chk_avx512_unaligned_erms and
    	__memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3

commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data
    
    The large memcpy micro benchmark in glibc shows that there is a
    regression with large data on Haswell machines.  Using non-temporal
    stores in memcpy on large data can improve performance significantly.
    This patch adds a threshold for using non-temporal stores, set to 6
    times the shared cache size.  When the size is above the threshold,
    non-temporal stores are used, except when destination and source
    overlap, since the destination may be in cache when the source is
    loaded.
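
The decision described above can be sketched in C.  This is only an illustration of the rule as the commit message states it, not the assembly in memmove-vec-unaligned-erms.S; the helper names non_temporal_threshold, ranges_overlap and use_non_temporal_store are hypothetical and do not appear in glibc.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold as stated in the commit message: 6 times the shared cache size.
   (Hypothetical helper; glibc stores the value in
   __x86_shared_non_temporal_threshold.)  */
static size_t non_temporal_threshold(size_t shared_cache_size)
{
    return shared_cache_size * 6;
}

/* True if [dst, dst + n) and [src, src + n) share at least one byte.  */
static bool ranges_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t distance = d > s ? d - s : s - d;
    return distance < n;
}

/* Non-temporal stores are chosen only when the size is above the threshold
   and destination and source do not overlap.  */
static bool use_non_temporal_store(const void *dst, const void *src,
                                   size_t n, size_t threshold)
{
    return n > threshold && !ranges_overlap(dst, src, n);
}
```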
    
    For size below 8 vector register width, we load all data into registers
    and store them together.  Only forward and backward loops, which move 4
    vector registers at a time, are used to support overlapping addresses.
    For forward loop, we load the last 4 vector register width of data and
    the first vector register width of data into vector registers before the
    loop and store them after the loop.  For backward loop, we load the first
    4 vector register width of data and the last vector register width of
    data into vector registers before the loop and store them after the loop.
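
A rough C model of the forward loop described above, assuming a 16-byte vector (VEC) and using small byte arrays in place of vector registers.  The function name forward_copy is hypothetical, and the real code is assembly that also handles alignment and prefetching, which this sketch leaves out.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 16 };  /* assumed vector register width in bytes */

/* Forward copy for n > 8 * VEC where dst is below src or the ranges do not
   overlap.  The first VEC bytes and the last 4 * VEC bytes of the source
   are loaded before the loop and stored after it, so loop stores that
   clobber the source tail do no harm.  */
static void forward_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char head[VEC];
    unsigned char tail[4 * VEC];
    unsigned char chunk[4 * VEC];
    size_t off = VEC;

    memcpy(head, src, VEC);
    memcpy(tail, src + n - 4 * VEC, 4 * VEC);

    /* Move 4 vectors per iteration; each chunk is fully loaded before it
       is stored.  */
    while (off < n - 4 * VEC) {
        memcpy(chunk, src + off, 4 * VEC);
        memcpy(dst + off, chunk, 4 * VEC);
        off += 4 * VEC;
    }

    memcpy(dst + n - 4 * VEC, tail, 4 * VEC);
    memcpy(dst, head, VEC);
}
```

The backward loop is the mirror image: the first 4 vectors and the last vector are saved up front and the loop walks from the end toward the start.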
    
    	[BZ #19928]
    	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
    	New.
    	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
    	6 times of shared cache size.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
    	(VMOVNT): New.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
    	(VMOVNT): Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
    	(VMOVNT): Likewise.
    	(VMOVU): Changed to movups for smaller code sizes.
    	(VMOVA): Changed to movaps for smaller code sizes.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
    	comments.
    	(PREFETCH): New.
    	(PREFETCH_SIZE): Likewise.
    	(PREFETCHED_LOAD_SIZE): Likewise.
    	(PREFETCH_ONE_SET): Likewise.
    	Rewrite to use forward and backward loops, which move 4 vector
    	registers at a time, to support overlapping addresses and use
    	non temporal store if size is above the threshold and there is
    	no overlap between destination and source.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S
    
    Prepare memmove-vec-unaligned-erms.S so that the SSE2 version can
    become the default memcpy, mempcpy and memmove.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    	(MEMCPY_SYMBOL): New.
    	(MEMPCPY_SYMBOL): Likewise.
    	(MEMMOVE_CHK_SYMBOL): Likewise.
    	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
    	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
    	Provide alias for memcpy in libc.a and ld.so.
    
    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S
    
    Prepare memset-vec-unaligned-erms.S so that the SSE2 version can
    become the default memset.
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
    	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    	Disabled for now.
    	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    	symbols.  Properly check USE_MULTIARCH on __memset symbols.
    
    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
    	32-bit displacement to avoid long nop between instructions.
    
    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S
    
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
    	a comment on VMOVU and VMOVA.
    
    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
    
    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.
    
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
    	if not in libc.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S
    
    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since that breaks __memmove_chk.
    
    Don't check source == destination first since it is less common.
    
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
    	with unaligned_erms.
    	(__memmove_erms): Skip if source == destination.
    	(__memmove_unaligned_erms): Don't check source == destination
    	first.
    
    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String
    
    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.
    
    	[BZ #19762]
    	* sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
    	HAS_ARCH_FEATURE with Fast_Rep_String.
    	* sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
    	Likewise.
    	* sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
    	* sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
    	Likewise.
    
    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors
    
    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    	bit_arch_Fast_Copy_Backward for Intel Core processors.
    
    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb
    
    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times the vector register size and REP_STOSB_THRESHOLD, which defaults
    to 2KB.
    
    Key features:
    
    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.
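
Feature 1 (overlapping stores) can be illustrated with a small C sketch, assuming a 16-byte vector and a size between 1 and 2 vector widths.  The function name set_vec_to_2vec is hypothetical; the actual implementation is assembly using vector registers.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 16 };  /* assumed vector register width in bytes */

/* Fill n bytes, VEC <= n <= 2 * VEC, with two stores and no branch on n:
   one store at the start and one ending exactly at dst + n.  The two
   stores overlap in the middle whenever n < 2 * VEC.  */
static void set_vec_to_2vec(unsigned char *dst, int c, size_t n)
{
    unsigned char v[VEC];

    memset(v, c, VEC);              /* stands in for the broadcast register */
    memcpy(dst, v, VEC);            /* first VEC bytes */
    memcpy(dst + n - VEC, v, VEC);  /* last VEC bytes */
}
```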
    
    	[BZ #19881]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    	memset-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    	__memset_sse2_unaligned_erms, __memset_erms,
    	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb
    
    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times of vector register size, there is no check for
    address overlap between source and destination.  Since overhead for
    overlap check is small when size > 8 times of vector register size,
    memcpy is an alias of memmove.
    
    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times of vector register size and REP_MOVSB_THRESHOLD, which
    is 2KB for 16-byte vector register size and scaled up for larger vector
    register sizes.
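
One reading of the scaling described above is that the threshold grows linearly with the vector width.  The sketch below assumes that reading; the function name rep_movsb_threshold is hypothetical (the commit message names only the REP_MOVSB_THRESHOLD macro).

```c
#include <stddef.h>

/* 2KB for a 16-byte vector, scaled linearly for wider vectors
   (assumption: linear scaling, which the commit message does not spell out).  */
static size_t rep_movsb_threshold(size_t vec_size)
{
    return 2048 * (vec_size / 16);
}
```

Under this assumption the threshold would be 2KB for SSE2, 4KB for AVX and 8KB for AVX512.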
    
    Key features:
    
    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.
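
Features 1 and 2 explain why small sizes need no overlap check: every load finishes before the first store.  A minimal C sketch for a size between 1 and 2 vector widths, assuming a 16-byte vector; the function name move_vec_to_2vec is hypothetical and byte arrays stand in for vector registers.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 16 };  /* assumed vector register width in bytes */

/* Move n bytes, VEC <= n <= 2 * VEC.  Both loads happen before either
   store, so the result is correct even when dst and src overlap, and the
   overlapping first/last accesses avoid a branch on the exact size.  */
static void move_vec_to_2vec(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    unsigned char first[VEC];
    unsigned char last[VEC];

    memcpy(first, src, VEC);
    memcpy(last, src + n - VEC, VEC);
    memcpy(dst, first, VEC);
    memcpy(dst + n - VEC, last, VEC);
}
```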
    
    	[BZ #19776]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    	memmove-avx512-unaligned-erms.
    	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
    	(__libc_ifunc_impl_list): Test
    	__memmove_chk_avx512_unaligned_2,
    	__memmove_chk_avx512_unaligned_erms,
    	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
    	__memmove_chk_sse2_unaligned_2,
    	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    	__memmove_avx512_unaligned_erms, __memmove_erms,
    	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    	__memcpy_chk_avx512_unaligned_2,
    	__memcpy_chk_avx512_unaligned_erms,
    	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    	__mempcpy_chk_avx512_unaligned_erms,
    	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
    	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
    	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
    	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
    	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
    	__mempcpy_erms.
    	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
    	file.
    	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
    	Likewise.
    
    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support
    
    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.
    
    	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
    	(index_cpu_ERMS): Likewise.
    	(reg_ERMS): Likewise.
    
    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias
    
    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.
    
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	memcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
    	to ...
    	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
    	(MEMCPY): Don't define.
    	(MEMCPY_CHK): Likewise.
    	(MEMPCPY): Likewise.
    	(MEMPCPY_CHK): Likewise.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMPCPY_CHK): Renamed to ...
    	(__mempcpy_chk_avx512_no_vzeroupper): This.
    	(MEMCPY_CHK): Renamed to ...
    	(__memmove_chk_avx512_no_vzeroupper): This.
    	(MEMCPY): Renamed to ...
    	(__memmove_avx512_no_vzeroupper): This.
    	(__memcpy_avx512_no_vzeroupper): New alias.
    	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
    
    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy
    
    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.
    
    	[BZ #18858]
    	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
    	and mempcpy-avx512-no-vzeroupper.
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
    	New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
    	(MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
    	(MEMPCPY): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
    	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
    	Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
    
    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy
    
    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
    	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
    	processors.  Set Fast_Copy_Backward for AMD Excavator
    	processors.
    	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
    	New.
    	(index_arch_Fast_Unaligned_Copy): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
    
    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"
    
    	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    	Don't set %rcx twice before "rep movsb".
    
    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
    
    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
    
    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.
    
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.
    
    	[BZ #19583]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
    	inline.  Check family before setting family, model and
    	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
    	bits here.
    	(init_cpu_features): Replace HAS_CPU_FEATURE and
    	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
    	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
    	for Intel processors with usable AVX2.  Call get_common_indeces
    	for other processors with family == NULL.
    	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    	(CPU_FEATURES_ARCH_P): Likewise.
    	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
    
    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs
    
    AMD CPUs use an encoding scheme for extended family and model similar
    to that of Intel CPUs, as shown in:
    
    http://support.amd.com/TechDocs/25481.pdf
    
    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.
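
The decoding described above can be sketched as follows, based on the CPUID leaf 1 EAX layout (model in bits 7:4, family in bits 11:8, extended model in bits 19:16, extended family in bits 27:20).  The struct and function names are hypothetical and this is not the body of get_common_indeces; any vendor-specific handling of other families is left out.

```c
/* Hypothetical container for the decoded fields.  */
struct cpu_id {
    unsigned int family;
    unsigned int model;
    unsigned int extended_model;  /* already shifted into bits 7:4 */
};

/* Decode family and model from CPUID leaf 1 EAX.  The extended fields are
   folded in only when the base family is 0x0f, as the commit describes.  */
static struct cpu_id decode_family_model(unsigned int eax)
{
    struct cpu_id id;

    id.family = (eax >> 8) & 0x0f;
    id.model = (eax >> 4) & 0x0f;
    id.extended_model = (eax >> 12) & 0xf0;

    if (id.family == 0x0f) {
        id.family += (eax >> 20) & 0xff;
        id.model += id.extended_model;
    }
    return id;
}
```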
    
    	[BZ #19214]
    	* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
    	argument to return extended model.  Update family and model
    	with extended family and model when family == 0x0f.
    	(init_cpu_features): Updated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h
    
    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to mistakenly use bits and indices
    of the cpuid array on the feature array, especially in assembly code.
    For example, sysdeps/i386/i686/multiarch/bcopy.S has
    
    	HAS_CPU_FEATURE (Fast_Rep_String)
    
    which should be
    
    	HAS_ARCH_FEATURE (Fast_Rep_String)
    
    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.
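
A simplified C model of why the rename catches the mistake at build time.  The struct layout, the bit values and the extra cf macro parameter are assumptions made for illustration and differ from the real definitions in sysdeps/x86/cpu-features.h.

```c
/* Simplified stand-in for struct cpu_features (assumed layout).  */
struct cpu_features {
    unsigned int cpuid[4];    /* raw CPUID bits */
    unsigned int feature[1];  /* derived "arch" preference bits */
};

/* Illustrative values only.  */
#define index_cpu_ERMS              0
#define bit_cpu_ERMS                (1u << 9)
#define index_arch_Fast_Rep_String  0
#define bit_arch_Fast_Rep_String    (1u << 0)

/* Each macro pastes its own prefix, so a name defined only for the other
   array expands to an undefined identifier.  */
#define HAS_CPU_FEATURE(cf, name) \
    (((cf)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
    (((cf)->feature[index_arch_##name] & bit_arch_##name) != 0)
```

With these definitions, HAS_CPU_FEATURE (cf, Fast_Rep_String) fails to compile because index_cpu_Fast_Rep_String does not exist, whereas with unprefixed names it would have silently tested the wrong array.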
    
    	[BZ #19762]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    	* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    	(bit_arch_*): This for feature array.
    	(bit_*): Renamed to ...
    	(bit_cpu_*): This for cpu array.
    	(index_*): Renamed to ...
    	(index_arch_*): This for feature array.
    	(index_*): Renamed to ...
    	(index_cpu_*): This for cpu array.
    	[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    	[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    	[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    	[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    	bit_##name with index_cpu_##name and bit_cpu_##name.
    	[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    	bit_##name with index_arch_##name and bit_arch_##name.
    
    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    
    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.
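
The difference between assigning and OR-ing the bit can be shown in a few lines of C.  The bit position and the function names below are hypothetical; only the name bit_Prefer_MAP_32BIT_EXEC comes from the commit.

```c
/* Hypothetical bit position, for illustration only.  */
#define BIT_PREFER_MAP_32BIT_EXEC (1u << 5)

/* Plain assignment: every other bit in the feature word is lost.  */
static unsigned int enable_by_assignment(unsigned int word)
{
    (void)word;
    return BIT_PREFER_MAP_32BIT_EXEC;
}

/* OR: the bit is turned on and the other bits are preserved.  */
static unsigned int enable_by_or(unsigned int word)
{
    return word | BIT_PREFER_MAP_32BIT_EXEC;
}
```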
    
    	[BZ #19758]
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    	(EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section
    
    	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    	Replace .text with .text.avx512.
    	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    	Likewise.
    
    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection
    
    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is replaced with the following:
    
    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3
    
    	[BZ #18880]
    	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    	instead of Slow_BSF, and also check for Fast_Copy_Backward to
    	enable __memcpy_ssse3_back.
    
    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
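
The five-step selection order listed in the commit above can be written as a chain of tests. This is a C sketch of the ordering only: the real selector is assembly in sysdeps/x86_64/multiarch/memcpy.S, and the `cpu_flags` structure and `select_memcpy` function are hypothetical names that return the implementation name as a string instead of a function pointer.

```c
#include <stdbool.h>

/* Hypothetical flag set standing in for the bits named in the commit.  */
struct cpu_flags
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the variant that the listed order would pick.
   Earlier tests take priority over later ones.  */
static const char *
select_memcpy (const struct cpu_flags *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Because the tests run in order, a CPU with both Fast_Unaligned_Load and Fast_Copy_Backward set gets the SSE2 unaligned variant, not __memcpy_ssse3_back.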

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.
    
    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over AVX versions on KNL
    hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).
    
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    
    It shows an improvement of up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
    
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT
    
    According to the Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT maps into the lower 2GB, not
    the lower 4GB, of the address space.  Prefer_MAP_32BIT_EXEC reduces the
    bits available for address space layout randomization (ASLR), so it is
    always disabled for SUID programs and can only be enabled by setting
    the environment variable LD_PREFER_MAP_32BIT_EXEC.
    
    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
    
    	[BZ #19367]
    	* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    	* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    	* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    	(index_Prefer_MAP_32BIT_EXEC): Likewise.
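
The "try MAP_32BIT first" behaviour described in the commit above can be sketched as a wrapper that falls back to a normal mapping when the low-address attempt fails. This is an assumption-laden illustration, not the code from the new mmap.c files: the function name `mmap_prefer_32bit_exec` and its `prefer_32bit` parameter are hypothetical, it handles anonymous mappings only, and the MAP_FIXED check is a choice made for this sketch.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE  /* for MAP_ANONYMOUS and MAP_32BIT */
#endif
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map LEN bytes anonymously.  When PREFER_32BIT is set and the mapping
   is executable, try MAP_32BIT first and retry without it on failure.  */
static void *
mmap_prefer_32bit_exec (size_t len, int prot, int flags, bool prefer_32bit)
{
#ifdef MAP_32BIT
  if (prefer_32bit && (prot & PROT_EXEC) != 0 && (flags & MAP_FIXED) == 0)
    {
      void *p = mmap (NULL, len, prot, flags | MAP_32BIT, -1, 0);
      if (p != MAP_FAILED)
        return p;
      /* The low address range may be exhausted; fall through.  */
    }
#else
  (void) prefer_32bit;  /* MAP_32BIT is not defined on this target.  */
#endif
  return mmap (NULL, len, prot, flags, -1, 0);
}
```

The fallback matters because the region reachable with MAP_32BIT is small, so a failure there should not make the whole mapping request fail.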

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing
    
    The Knights Landing processor is based on Silvermont.  This patch
    enables Silvermont optimizations for Knights Landing.
    
    	* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    	Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------