x86-64 memset can be made smaller and can support a 64-byte vector register size.
Created attachment 9140: bench-memset data on various Intel and AMD processors
This is an automated email from the git hooks/post-receive script, generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch hjl/erms/master has been created at d745a49ca39f47a701352b593151d8839dcba554 (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d745a49ca39f47a701352b593151d8839dcba554

commit d745a49ca39f47a701352b593151d8839dcba554
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 13 00:26:57 2016 -0800

    Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a16432df8b46ce4db3f1b051ae61eda285833b42

commit a16432df8b46ce4db3f1b051ae61eda285833b42
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 25 08:20:17 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb

    Implement x86-64 memset with unaligned store and rep stosb, supporting
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides two implementations of memset, one with rep stosb and one
    without.  They share the same code when the size is between twice the
    vector register size and REP_STOSB_THRESHOLD, which is 1KB for the
    16-byte vector register size and is scaled up for larger vector
    register sizes.

    Key features:
    1. Use overlapping stores to avoid branches.
    2. For sizes <= 4 times the vector register size, fully unroll the
       loop.
    3. For sizes > 4 times the vector register size, store 4 times the
       vector register size at a time.

    [BZ #19881]
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
    memset-avx512-unaligned-erms.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
    __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
    __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
    __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
    __memset_sse2_unaligned_erms, __memset_erms,
    __memset_avx2_unaligned, __memset_avx2_unaligned_erms,
    __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
    * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
    * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
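A note for readers of this log: the "overlapping store" trick in feature 1 is easy to see in C. The sketch below is illustrative only; the helper name memset_16_to_32 is hypothetical and the real glibc code is assembly in memset-vec-unaligned-erms.S, but the idea is the same for a 16-byte vector size.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Any length in [16, 32] is handled by exactly two unaligned 16-byte
   stores: one at the start of the buffer and one ending at its last
   byte.  The two stores may overlap, so no branch on the exact length
   is needed.  */
void memset_16_to_32 (void *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  _mm_storeu_si128 ((__m128i *) dst, v);
  _mm_storeu_si128 ((__m128i *) ((char *) dst + n - 16), v);
}

int main (void)
{
  char buf[32];
  for (size_t n = 16; n <= 32; n++)
    {
      memset (buf, 0, sizeof buf);
      memset_16_to_32 (buf, 0xAA, n);
      for (size_t i = 0; i < n; i++)
        if ((unsigned char) buf[i] != 0xAA)
          printf ("mismatch at n=%zu i=%zu\n", n, i);
    }
  puts ("overlapping-store sketch OK");
  return 0;
}

The same pattern scales to lengths in [VEC_SIZE, 2 * VEC_SIZE] for 32- and 64-byte vectors, which is why one source file can serve all three register sizes.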
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=071b7236f96e0649e0728a83150f9fa52487563a

commit 071b7236f96e0649e0728a83150f9fa52487563a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 18 12:36:03 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb

    Implement x86-64 memmove with unaligned load/store and rep movsb,
    supporting 16-byte, 32-byte and 64-byte vector register sizes.  When
    the size is at most 8 times the vector register size, there is no
    check for address overlap between source and destination.  Since the
    overhead of the overlap check is small when the size exceeds 8 times
    the vector register size, memcpy is an alias of memmove.

    A single file provides two implementations of memmove, one with rep
    movsb and one without.  They share the same code when the size is
    between twice the vector register size and REP_MOVSB_THRESHOLD, which
    is 2KB for the 16-byte vector register size and is scaled up for
    larger vector register sizes.

    Key features:
    1. Use overlapping loads and stores to avoid branches.
    2. For sizes <= 8 times the vector register size, load all sources
       into registers and store them together.
    3. If there is no address overlap between source and destination,
       copy from both ends, 4 times the vector register size at a time.
    4. If the destination address > the source address, copy backward
       8 times the vector register size at a time.
    5. Otherwise, copy forward 8 times the vector register size at a
       time.
    6. Use rep movsb only for forward copying.  Avoid slow backward rep
       movsb by falling back to backward copying 8 times the vector
       register size at a time.
    7. Skip when the destination address == the source address.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
    memmove-avx512-unaligned-erms.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2,
    __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2,
    __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2,
    __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
    __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
    __memmove_avx512_unaligned_erms, __memmove_erms,
    __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
    __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms,
    __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
    __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
    __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
    __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
    __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
    __memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
    __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2,
    __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2,
    __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2,
    __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2,
    __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2,
    __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
    * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
    * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
-----------------------------------------------------------------------
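The size/direction dispatch in features 4 through 7 can be sketched in C as well. The code below is schematic, not glibc's: the real implementation moves 4x/8x vector-register chunks in assembly, memmove_sketch uses plain byte loops as stand-ins, and REP_MOVSB_THRESHOLD is simply the 2KB figure quoted above.

#include <stddef.h>

#define REP_MOVSB_THRESHOLD 2048   /* 2KB for a 16-byte vector size */

void *memmove_sketch (void *dst, const void *src, size_t n)
{
  char *d = dst;
  const char *s = src;

  if (d == s)                      /* feature 7: nothing to do */
    return dst;

  if (d < s || d >= s + n)         /* a forward copy is safe */
    {
      /* Features 5 and 6: copy forward; only this path may use rep
         movsb, and only for n >= REP_MOVSB_THRESHOLD.  */
      for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    }
  else
    {
      /* Feature 4: dst > src and the ranges overlap, so copy backward.
         Feature 6: never use backward rep movsb, which is slow.  */
      while (n-- > 0)
        d[n] = s[n];
    }
  return dst;
}

Feature 3, copying from both ends in 4x-vector chunks when the ranges do not overlap, is a further refinement of the forward path that the byte loop above does not attempt to show.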
This is an automated email from the git hooks/post-receive script. The branch hjl/erms/master has been re-created at 68d440000cbbdc2480943db27bf539ba712d1607 (commit), carrying updated versions of the same commits:

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68d440000cbbdc2480943db27bf539ba712d1607
commit 68d440000cbbdc2480943db27bf539ba712d1607
    Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=42aacde6879da25be932fabd05f05fa612d7c69e
commit 42aacde6879da25be932fabd05f05fa612d7c69e
    Add x86-64 memset with unaligned store and rep stosb
    (Same log as above, except that REP_STOSB_THRESHOLD now defaults to
    2KB instead of being 1KB for the 16-byte vector register size.)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=071b7236f96e0649e0728a83150f9fa52487563a
commit 071b7236f96e0649e0728a83150f9fa52487563a
    Add x86-64 memmove with unaligned load/store and rep movsb
    (Same log as above.)
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. The branch hjl/erms/master has been re-created once more, at 0db56470f1bee39a252daf2728d818296b179a9e (commit), with the same commit logs as the previous push:

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0db56470f1bee39a252daf2728d818296b179a9e
commit 0db56470f1bee39a252daf2728d818296b179a9e
    Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7df7c6a195d6bc6ffdd90db0786d5de9c67d037a
commit 7df7c6a195d6bc6ffdd90db0786d5de9c67d037a
    Add x86-64 memset with unaligned store and rep stosb
    (Same log as above.)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d1f2de07cb44abfb9e78f825e3edf2490cf1057c
commit d1f2de07cb44abfb9e78f825e3edf2490cf1057c
    Add x86-64 memmove with unaligned load/store and rep movsb
    (Same log as above.)
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script, generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch master has been updated via 830566307f038387ca0af3fd327706a8d1a2f595 (commit) from 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb (commit). Revisions new to this repository are listed in full below.

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=830566307f038387ca0af3fd327706a8d1a2f595

commit 830566307f038387ca0af3fd327706a8d1a2f595
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb

    (Same log as the final hjl/erms/master version of this commit above,
    with REP_STOSB_THRESHOLD defaulting to 2KB.  [BZ #19881])
-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                               |  23 ++
 sysdeps/x86_64/multiarch/Makefile                       |   5 +-
 sysdeps/x86_64/multiarch/ifunc-impl-list.c              |  33 +++
 sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S   |  14 +
 sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S |  17 ++
 sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S   |  16 ++
 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S    | 251 ++++++++++++++++++++
 7 files changed, 358 insertions(+), 1 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
 create mode 100644 sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
 create mode 100644 sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S
 create mode 100644 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
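For readers unfamiliar with ERMS: on processors that report Enhanced REP MOVSB/STOSB, the microcoded rep stosb instruction is the fast path for large fills. Below is a hedged C sketch of the threshold dispatch the log describes; the wrapper names are hypothetical, the library memset stands in for the unaligned vector path, and it assumes x86-64 with GCC or Clang inline assembly.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define REP_STOSB_THRESHOLD 2048   /* the 2KB default from the log above */

/* rep stosb stores AL to [RDI], RCX times, advancing RDI itself; the
   constraints pin dst to RDI, n to RCX and c to RAX (AL).  */
static void rep_stosb (void *dst, int c, size_t n)
{
  __asm__ volatile ("rep stosb"
                    : "+D" (dst), "+c" (n)
                    : "a" (c)
                    : "memory");
}

/* Hypothetical dispatcher mirroring the selection described above:
   vector stores below the threshold, rep stosb at or above it.  */
static void memset_sketch (void *dst, int c, size_t n)
{
  if (n >= REP_STOSB_THRESHOLD)
    rep_stosb (dst, c, n);
  else
    memset (dst, c, n);   /* stand-in for the unaligned vector path */
}

int main (void)
{
  char buf[4096];
  memset_sketch (buf, 0x55, sizeof buf);
  printf ("buf[0]=%#x buf[4095]=%#x\n",
          (unsigned char) buf[0], (unsigned char) buf[4095]);
  return 0;
}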
This is an automated email from the git hooks/post-receive script. The branch hjl/erms/ifunc has been created at e129704237c4ffb5f284dd6d1e3ed638cac3bf02 (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e129704237c4ffb5f284dd6d1e3ed638cac3bf02
commit e129704237c4ffb5f284dd6d1e3ed638cac3bf02
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 18:35:16 2016 -0700

    memmove.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ec426a6026c1f96f9db56d577e64b75c3e95f4f
commit 4ec426a6026c1f96f9db56d577e64b75c3e95f4f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 18:33:51 2016 -0700

    Remove old GPR, SSE2 and AVX2 memcpy/memmove

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54a3f917fc4763257ac9d085ce6e1a2b618ca0d3
commit 54a3f917fc4763257ac9d085ce6e1a2b618ca0d3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 16:09:24 2016 -0700

    Use SSE2 memmove by default

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=22ba9213b9e74721847309b71b08d0f524f353e7
commit 22ba9213b9e74721847309b71b08d0f524f353e7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    No AVX memmove in rtld

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=36e5b7a18c109faa172e0d78cd061d1d2824060c
commit 36e5b7a18c109faa172e0d78cd061d1d2824060c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets

    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  There is no change in IFUNC selection if the SSE2 and AVX2
    memsets weren't used before.  If the SSE2 or AVX2 memset was used,
    the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be
    used on processors with ERMS.

    [BZ #19881]
    * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
    into ...
    * sysdeps/x86_64/memset.S: This.
    (__bzero): Removed.
    (__memset_tail): Likewise.
    (__memset_chk): Likewise.
    (memset): Likewise.
    (MEMSET_CHK_SYMBOL): New.  Define only if MEMSET_SYMBOL isn't
    defined.
    (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
    * sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
    (__memset_zero_constant_len_parameter): Check SHARED instead of
    PIC.
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    memset-avx2 and memset-sse2-unaligned-erms.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Remove __memset_chk_sse2,
    __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
    * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if
    not in libc.
    * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
    Likewise.
    * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
    (MEMSET_CHK_SYMBOL): New.  Define if not defined.
    (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
    Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
    symbols.  Properly check USE_MULTIARCH on __memset symbols.
    * sysdeps/x86_64/multiarch/memset.S (memset): Replace
    __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
    and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
    or __memset_avx2_unaligned_erms if the processor has ERMS.
    (memset): Removed.
    (__memset_chk): Likewise.
    (MEMSET_SYMBOL): New.
    (libc_hidden_builtin_def): Replace __memset_sse2 with
    __memset_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
    __memset_chk_sse2 and __memset_chk_avx2 with
    __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
    Use __memset_chk_sse2_unaligned_erms or
    __memset_chk_avx2_unaligned_erms if the processor has ERMS.
-----------------------------------------------------------------------
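The "if the processor has ERMS" selections above come down to a single CPUID feature flag: leaf 7, subleaf 0, EBX bit 9 (Enhanced REP MOVSB/STOSB). A minimal standalone check, assuming GCC/Clang's <cpuid.h>:

#include <cpuid.h>
#include <stdio.h>

int main (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* CPUID.(EAX=07H,ECX=0H):EBX bit 9 is the ERMS feature flag.  */
  if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx)
      && (ebx & (1u << 9)))
    puts ("ERMS present: rep movsb/stosb variants would be selected");
  else
    puts ("no ERMS: plain unaligned vector variants would be selected");
  return 0;
}

glibc determines this once at startup in its cpu-features code rather than executing cpuid on every call.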
This is an automated email from the git hooks/post-receive script. The branch hjl/erms/ifunc has been re-created at 0662f9c7caa5dccd62a7eea29e89c4f6aec6a5fc (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0662f9c7caa5dccd62a7eea29e89c4f6aec6a5fc
commit 0662f9c7caa5dccd62a7eea29e89c4f6aec6a5fc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 05:56:55 2016 -0700

    Static memmove.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=992cd8d071561786869010708d1a8a8bc68baabf
commit 992cd8d071561786869010708d1a8a8bc68baabf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 05:49:10 2016 -0700

    memmove-vec-unaligned-erms.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8723e4536ce58080c89d0a38d84d57b80f18669f
commit 8723e4536ce58080c89d0a38d84d57b80f18669f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous
    ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace
    them with the new ones.  There is no change in IFUNC selection if the
    SSE2 and AVX2 memcpy/memmove weren't used before.  If the SSE2 or
    AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove
    optimized with Enhanced REP MOVSB will be used on processors with
    ERMS.  The new AVX512 memcpy/memmove will be used on processors with
    AVX512 which prefer vzeroupper.

    Since the new SSE2 memcpy/memmove are also faster than the previous
    default memcpy/memmove used in libc.a and ld.so, we remove the
    previous default memcpy/memmove and make the new ones the default.

    * sysdeps/x86_64/memcpy.S: Make it dummy.
    * sysdeps/x86_64/mempcpy.S: Likewise.
    * sysdeps/x86_64/memmove.S: New file.
    * sysdeps/x86_64/memmove_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    * sysdeps/x86_64/memmove.c: Removed.
    * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.c: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned
    and memmove-sse2-unaligned-erms.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2
    with __memmove_chk_avx512_unaligned.  Remove
    __memmove_chk_avx_unaligned_2.  Replace
    __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned.
    Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2.  Replace
    __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned.
    Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    with __memcpy_chk_avx512_unaligned.  Remove
    __memcpy_chk_avx_unaligned_2.  Replace __memcpy_chk_sse2_unaligned_2
    with __memcpy_chk_sse2_unaligned.  Remove __memcpy_chk_sse2.  Remove
    __memcpy_avx_unaligned_2.  Replace __memcpy_avx512_unaligned_2 with
    __memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2 and
    __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2 with
    __mempcpy_chk_avx512_unaligned.  Remove
    __mempcpy_chk_avx_unaligned_2.  Replace
    __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned.
    Remove __mempcpy_chk_sse2.  Replace __mempcpy_avx512_unaligned_2
    with __mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
    Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned.
    Remove __mempcpy_sse2.
    * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.  Use
    __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if the
    processor has ERMS.  Default to __memcpy_sse2_unaligned.
    (ENTRY): Removed.
    (END): Likewise.
    (ENTRY_CHK): Likewise.
    (libc_hidden_builtin_def): Likewise.
    Don't include ../memcpy.S.
    (memcpy@GLIBC_2_2_5): Make it an alias of __new_memcpy.
    * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned.
    Use __memcpy_chk_avx_unaligned_erms and
    __memcpy_chk_sse2_unaligned_erms if the processor has ERMS.  Default
    to __memcpy_chk_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
    not in libc.
    * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
    Likewise.
    * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
    (MEMCPY_SYMBOL): New.
    (MEMPCPY_SYMBOL): Likewise.
    (MEMMOVE_CHK_SYMBOL): Likewise.
    Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
    symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy
    symbols.  Change the function suffix from unaligned_2 to unaligned.
    Provide an alias of memcpy in libc.a and ld.so.
    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.  Use
    __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if
    the processor has ERMS.  Default to __mempcpy_sse2_unaligned.
    (ENTRY): Removed.
    (END): Likewise.
    (ENTRY_CHK): Likewise.
    (libc_hidden_builtin_def): Likewise.
    Don't include ../mempcpy.S.
    (mempcpy): New.  Add a weak alias.
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    __mempcpy_chk_avx512_unaligned_erms and
    __mempcpy_chk_avx512_unaligned.  Use __mempcpy_chk_avx_unaligned_erms
    and __mempcpy_chk_sse2_unaligned_erms if the processor has ERMS.
    Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1dcd048e097820aeb423c346fa957bbbf2d0ae84
commit 1dcd048e097820aeb423c346fa957bbbf2d0ae84
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets

    (Same log as commit 36e5b7a18c10 above, with AVX512 additions: the
    new AVX512 memset will be used on processors with AVX512 which
    prefer vzeroupper, and the memset.S and memset_chk.S entries now
    also support __memset_avx512_unaligned_erms,
    __memset_avx512_unaligned, __memset_chk_avx512_unaligned_erms and
    __memset_chk_avx512_unaligned.)
-----------------------------------------------------------------------
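The IFUNC selection these ChangeLog entries keep revising works roughly as follows. This is a self-contained sketch under stated assumptions, not glibc's implementation: the my_memcpy* names are invented, library memcpy stands in for the real assembly variants, and the real selector also weighs AVX/AVX512 support and other CPU features. It assumes an ELF target (e.g. Linux) with GCC or Clang.

#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

/* Two stand-in variants; in glibc these are separate assembly
   implementations such as __memcpy_sse2_unaligned{,_erms}.  */
static void *my_memcpy_sse2_unaligned (void *d, const void *s, size_t n)
{ return memcpy (d, s, n); }

static void *my_memcpy_sse2_unaligned_erms (void *d, const void *s, size_t n)
{ return memcpy (d, s, n); }

static int have_erms (void)
{
  unsigned int a, b, c, d;
  return __get_cpuid_count (7, 0, &a, &b, &c, &d) && (b & (1u << 9));
}

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* The resolver runs once during relocation; the dynamic linker patches
   its result into the GOT/PLT slot, so later calls pay no dispatch
   cost.  */
static memcpy_fn my_memcpy_resolver (void)
{
  return have_erms () ? my_memcpy_sse2_unaligned_erms
                      : my_memcpy_sse2_unaligned;
}

void *my_memcpy (void *, const void *, size_t)
     __attribute__ ((ifunc ("my_memcpy_resolver")));

int main (void)
{
  char src[4] = "abc", dst[4];
  my_memcpy (dst, src, sizeof src);
  puts (dst);
  return 0;
}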
This is an automated email from the git hooks/post-receive script. The branch hjl/erms/ifunc has been re-created at 3de6f56572ebf6c421c2f0e783280c2c0cc5c29d (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3de6f56572ebf6c421c2f0e783280c2c0cc5c29d
commit 3de6f56572ebf6c421c2f0e783280c2c0cc5c29d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

    (Same log as commit 8723e4536ce5 above, except that the
    memmove-vec-unaligned-erms.S entry now also provides an alias for
    __memcpy_chk in libc.a and the memcpy@GLIBC_2_2_5 entry is dropped.)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=503990f436e51cbd8f9479017fc0976687dcc90e
commit 503990f436e51cbd8f9479017fc0976687dcc90e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets

    (Same log as commit 1dcd048e0978 above.)
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. The branch hjl/erms/ifunc has been re-created at 8e58621eb52cc2d7b89a55b3ce7bf0f918a79c3b (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8e58621eb52cc2d7b89a55b3ce7bf0f918a79c3b
commit 8e58621eb52cc2d7b89a55b3ce7bf0f918a79c3b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

    (Same log as commit 3de6f56572eb above, with one addition in the
    memmove-vec-unaligned-erms.S entry: "(__mempcpy_erms,
    __memmove_erms): Moved before __mempcpy_chk with unaligned_erms.")

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=503990f436e51cbd8f9479017fc0976687dcc90e
commit 503990f436e51cbd8f9479017fc0976687dcc90e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets

    (Same log as above.)
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. The branch hjl/erms/ifunc has been re-created at 320daec84c5d2c4a6a17a0043ca5f8c8fe30734e (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=320daec84c5d2c4a6a17a0043ca5f8c8fe30734e
commit 320daec84c5d2c4a6a17a0043ca5f8c8fe30734e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

    (Same log as commit 8e58621eb52c above, now tagged [BZ #19776] and
    noting in the body that it also fixes the placement of
    __mempcpy_erms and __memmove_erms.)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=503990f436e51cbd8f9479017fc0976687dcc90e
commit 503990f436e51cbd8f9479017fc0976687dcc90e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets

    (Same log as above.)
-----------------------------------------------------------------------
-----------------------------------------------------------------------

This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created at 6449801fe6f0d733f6fda77f057bd60d9091ebba (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6449801fe6f0d733f6fda77f057bd60d9091ebba

commit 6449801fe6f0d733f6fda77f057bd60d9091ebba
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. It also fixes the placement of __mempcpy_erms and __memmove_erms.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=503990f436e51cbd8f9479017fc0976687dcc90e

commit 503990f436e51cbd8f9479017fc0976687dcc90e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created at e499a177b666cb39041e4aa70582742f7844b685 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e499a177b666cb39041e4aa70582742f7844b685

commit e499a177b666cb39041e4aa70582742f7844b685
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e01f71b4161a485bdca91ba95b7f3d976739291a

commit e01f71b4161a485bdca91ba95b7f3d976739291a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created at e3233fab32b028ca9630d28afff6f9c97a8f0a51 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e3233fab32b028ca9630d28afff6f9c97a8f0a51

commit e3233fab32b028ca9630d28afff6f9c97a8f0a51
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=740dd54823c1514e339186aab3868a69af2836a9

commit 740dd54823c1514e339186aab3868a69af2836a9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=003dea41324380333f1c3eca867ad012f6ca549f

commit 003dea41324380333f1c3eca867ad012f6ca549f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created at 4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe

commit 4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=997e6c0db2c351f4a7b688c3134c1f77a0aa49de

commit 997e6c0db2c351f4a7b688c3134c1f77a0aa49de
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ff8c6a7b53c5bb28ac3d3e0ae8da8099491b16c

commit 0ff8c6a7b53c5bb28ac3d3e0ae8da8099491b16c
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 15:08:48 2016 -0700

Remove Fast_Copy_Backward from Intel Core processors

Intel Core i3, i5 and i7 processors have fast unaligned copy, and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion.

* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors.

(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:05:51 2016 -0700

Add x86-64 memset with unaligned store and rep stosb

Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB.

Key features:
1. Use overlapping store to avoid branch.
2. For size <= 4 times the vector register size, fully unroll the loop.
3. For size > 4 times the vector register size, store 4 times the vector register size at a time.

[BZ #19881]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.

(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
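The "overlapping store" in key feature 1 of the commit above is what lets the small-size paths avoid branching on the exact length: one store anchored at the start and one anchored at the end of the buffer together cover every size in a whole range. A sketch of the idea under the assumption VEC_SIZE == 16, using memset/memcpy of a stack block in place of the vector-register stores of the real memset-vec-unaligned-erms.S (memset_vec_to_2vec is a hypothetical helper name):

    #include <string.h>
    #include <stddef.h>

    #define VEC 16  /* stands in for the vector register size */

    /* Handle VEC <= n <= 2 * VEC with exactly two stores and no branch
       on the exact length.  */
    static void
    memset_vec_to_2vec (unsigned char *p, int c, size_t n)
    {
      unsigned char v[VEC];
      memset (v, c, VEC);           /* build the byte pattern once */
      memcpy (p, v, VEC);           /* store at the start */
      memcpy (p + n - VEC, v, VEC); /* store ending at the last byte;
                                       overlaps the first store whenever
                                       n < 2 * VEC, which is harmless */
    }

Per the commit message, the two entry points in the single source file differ only past this shared small/medium code: the _erms variant can hand large lengths to a single rep stosb instruction once the size passes REP_STOSB_THRESHOLD, while the plain variant continues with the 4-vectors-at-a-time loop.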
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:04:26 2016 -0700

Add x86-64 memmove with unaligned load/store and rep movsb

Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes.

Key features:
1. Use overlapping load and store to avoid branch.
2. For size <= 8 times the vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time.
4. If address of destination > address of source, backward copy 8 times the vector register size at a time.
5. Otherwise, forward copy 8 times the vector register size at a time.
6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time.
7. Skip when address of destination == address of source.

[BZ #19776]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

(cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 19:22:59 2016 -0700

Initial Enhanced REP MOVSB/STOSB (ERMS) support

Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features.

* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise.

(cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
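The feature bit this commit wires into cpu-features corresponds to CPUID.(EAX=7,ECX=0):EBX bit 9. Outside glibc the same bit can be tested with GCC's <cpuid.h> helper; a small stand-alone check (independent of glibc internals):

    #include <cpuid.h>
    #include <stdio.h>

    int
    main (void)
    {
      unsigned int eax, ebx, ecx, edx;
      /* Leaf 7, subleaf 0; ERMS is bit 9 of EBX.  */
      if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx)
          && (ebx & (1u << 9)))
        puts ("ERMS: enhanced REP MOVSB/STOSB supported");
      else
        puts ("ERMS: not supported");
      return 0;
    }

It is this bit that the _erms IFUNC variants throughout this series key off.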
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:15:59 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise.

(cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:13:36 2016 -0700

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.

(cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)
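Folding mempcpy into memcpy works because the two functions differ only in their return value: mempcpy returns the first byte past the copied region rather than the destination. A one-line C model of the relationship (my_mempcpy is a hypothetical name, not glibc's):

    #include <string.h>
    #include <stddef.h>

    /* Same copy as memcpy; only the return value differs.  */
    static void *
    my_mempcpy (void *dst, const void *src, size_t n)
    {
      return (char *) memcpy (dst, src, n) + n;
    }

In the assembly implementations this is a second entry point that saves dst + n as the return value before falling into the shared copy body.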
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229

commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 04:39:48 2016 -0700

[x86] Add a feature bit: Fast_Unaligned_Copy

On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load.

[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load.

(cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6

commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date: Fri Mar 25 11:11:42 2016 +0100

tst-audit10: Fix compilation on compilers without bit_AVX512F

[BZ #19860]
* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit.

(cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa

commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 08:36:16 2016 -0700

Don't set %rcx twice before "rep movsb"

* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb".

(cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483

commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 07:46:56 2016 -0700

Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors

Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c.

[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.

(cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4

commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 10 05:26:46 2016 -0800

Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h

index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use the bits and indices of the cpuid array on the feature array by mistake, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time.

[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name.

(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
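The build-time protection this rename buys comes from token pasting: each accessor macro prepends its own prefix, so naming an arch feature in the cpu accessor produces an undefined identifier instead of silently indexing the wrong array. A reduced model of the mechanism (the bit values and single-word arrays are illustrative, not glibc's real layout):

    static unsigned int cpu_word;   /* stands in for one cpuid-array word */
    static unsigned int arch_word;  /* stands in for one feature-array word */

    #define bit_cpu_ERMS             (1u << 9)  /* illustrative values */
    #define bit_arch_Fast_Rep_String (1u << 0)

    #define HAS_CPU_FEATURE(name)  ((cpu_word)  & bit_cpu_##name)
    #define HAS_ARCH_FEATURE(name) ((arch_word) & bit_arch_##name)

    /* HAS_CPU_FEATURE (ERMS) compiles; HAS_CPU_FEATURE (Fast_Rep_String)
       expands to the undefined identifier bit_cpu_Fast_Rep_String and is
       rejected at build time instead of silently reading the wrong
       array, which is exactly the bcopy.S mistake cited above.  */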
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9

commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date: Tue Mar 8 12:31:13 2016 -0800

Fix tst-audit10 build when -mavx512f is not supported.

(cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c

commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date: Mon Mar 7 16:00:25 2016 +0100

tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately

[BZ #19269]
This ensures that GCC will not use unsupported instructions before the run-time check to ensure support.

(cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e

commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:48:11 2016 -0800

Group AVX512 functions in .text.avx512 section

* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512.
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise.

(cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29

commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 4 08:37:40 2016 -0800

x86-64: Fix memcpy IFUNC selection

Check Fast_Unaligned_Load instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following:

1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_sse2 if SSSE3 isn't available.
4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __memcpy_ssse3

[BZ #18880]
* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back.

(cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
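Rendered as C, the five-step order above reads as follows. This is a sketch of the selection logic only; the real selector is assembly in sysdeps/x86_64/multiarch/memcpy.S, and the boolean parameters are hypothetical stand-ins for glibc's feature bits:

    #include <stddef.h>

    typedef void *(*memcpy_fn) (void *, const void *, size_t);

    extern void *__memcpy_avx_unaligned (void *, const void *, size_t);
    extern void *__memcpy_sse2_unaligned (void *, const void *, size_t);
    extern void *__memcpy_sse2 (void *, const void *, size_t);
    extern void *__memcpy_ssse3_back (void *, const void *, size_t);
    extern void *__memcpy_ssse3 (void *, const void *, size_t);

    static memcpy_fn
    select_memcpy (int avx_fast_unaligned_load, int fast_unaligned_load,
                   int has_ssse3, int fast_copy_backward)
    {
      if (avx_fast_unaligned_load)
        return __memcpy_avx_unaligned;   /* 1 */
      if (fast_unaligned_load)
        return __memcpy_sse2_unaligned;  /* 2 */
      if (!has_ssse3)
        return __memcpy_sse2;            /* 3 */
      if (fast_copy_backward)
        return __memcpy_ssse3_back;      /* 4 */
      return __memcpy_ssse3;             /* 5 */
    }

-----------------------------------------------------------------------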
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 7962f7b04a6374b36d1df15c0c7c8f5747e2e85f (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7962f7b04a6374b36d1df15c0c7c8f5747e2e85f commit 7962f7b04a6374b36d1df15c0c7c8f5747e2e85f Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=40d52d834531b7a4315b68155ee3daec3cdceb46 commit 40d52d834531b7a4315b68155ee3daec3cdceb46 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. 
Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a61bbdcc906231982398239ec38f193a7522af5b commit a61bbdcc906231982398239ec38f193a7522af5b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. 
(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, so copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
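The "overlapping load and store" trick in key feature 1 of the memmove commit above is what removes the branchy size dispatch of the older implementations. A minimal C sketch of that idea, plus the direction test behind key features 4 and 5, written with 16-byte SSE2 intrinsics; the function names here are illustrative, not the glibc entry points:

#include <emmintrin.h>
#include <stddef.h>

/* A 17..32 byte copy is two possibly-overlapping unaligned 16-byte
   loads and stores, with no further size-dispatch branches. Both
   loads are issued before either store, so overlapping buffers are
   handled for free. */
static void copy_17_to_32 (char *dst, const char *src, size_t n)
{
  __m128i head = _mm_loadu_si128 ((const __m128i *) src);
  __m128i tail = _mm_loadu_si128 ((const __m128i *) (src + n - 16));
  _mm_storeu_si128 ((__m128i *) dst, head);
  _mm_storeu_si128 ((__m128i *) (dst + n - 16), tail);
}

/* Direction of the bulk loop: backward copy is only required when
   the destination starts inside the source region. */
static int need_backward_copy (const char *dst, const char *src, size_t n)
{
  return dst > src && dst < src + n;
}

Because one branch-free pattern serves a whole size range, the dispatch tree stays shallow, which is where much of the size and latency win comes from.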
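The ERMS bit this commit exposes as bit_cpu_ERMS lives in CPUID leaf 7, subleaf 0, EBX bit 9. A user-space sketch of the same check, assuming GCC's <cpuid.h>; glibc itself reads the cached copy in struct cpu_features rather than re-executing CPUID:

#include <cpuid.h>
#include <stdio.h>

/* ERMS: CPUID.(EAX=7,ECX=0):EBX bit 9. */
static int has_erms (void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;   /* CPU does not report leaf 7 */
  return (ebx >> 9) & 1;
}

int main (void)
{
  printf ("ERMS: %s\n", has_erms () ? "yes" : "no");
  return 0;
}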
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here.
(init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model to Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE)): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE)): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.
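The decoding described in the "Update family and model detection for AMD CPUs" commit above can be sketched directly from CPUID leaf 1: the extended family and model fields in EAX are folded in only when the base family is 0x0f, which is exactly the case that commit extends to AMD. This is a simplified stand-in for get_common_indeces, assuming GCC's <cpuid.h>:

#include <cpuid.h>
#include <stdio.h>

/* Decode family/model from CPUID leaf 1 EAX. Field layout per the
   Intel SDM and AMD APM; the family == 0x0f rule is the one the
   commit applies to both vendors. */
static void get_family_model (unsigned int *family, unsigned int *model)
{
  unsigned int eax, ebx, ecx, edx;
  __cpuid (1, eax, ebx, ecx, edx);
  *family = (eax >> 8) & 0x0f;
  *model = (eax >> 4) & 0x0f;
  if (*family == 0x0f)
    {
      *family += (eax >> 20) & 0xff;        /* extended family */
      *model += ((eax >> 16) & 0x0f) << 4;  /* extended model */
    }
}

int main (void)
{
  unsigned int family, model;
  get_family_model (&family, &model);
  printf ("family 0x%x model 0x%x\n", family, model);
  return 0;
}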
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows an improvement of up to 28% over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing.
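The five-step order in the "Fix memcpy IFUNC selection" commit above maps directly onto a chain of feature checks. A runnable sketch with hypothetical feature flags; glibc reads the real bits from struct cpu_features and returns function pointers from an IFUNC resolver rather than strings:

#include <stdio.h>

/* Hypothetical stand-ins for the cpu-features bits. */
struct features
{
  int avx_fast_unaligned_load;
  int fast_unaligned_load;
  int ssse3;
  int fast_copy_backward;
};

/* The documented selection order, expressed as a fall-through chain. */
static const char *select_memcpy (const struct features *f)
{
  if (f->avx_fast_unaligned_load) return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)     return "__memcpy_sse2_unaligned";
  if (!f->ssse3)                  return "__memcpy_sse2";
  if (f->fast_copy_backward)      return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}

int main (void)
{
  struct features f = { 0, 1, 1, 0 };   /* e.g. Fast_Unaligned_Load only */
  puts (select_memcpy (&f));            /* -> __memcpy_sse2_unaligned */
  return 0;
}

The order matters: the unaligned-load variants are preferred whenever the hardware reports fast unaligned loads, and the SSSE3 variants are only reached on older cores.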
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to the Silvermont software optimization guide, for 64-bit applications branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT maps into the lower 2GB, not the lower 4GB, of the address space. Prefer_MAP_32BIT_EXEC reduces the bits available for address space layout randomization (ASLR); it is therefore always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing The Knights Landing processor is based on Silvermont. This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
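A sketch of the mmap policy the Prefer_MAP_32BIT_EXEC commit describes: try MAP_32BIT first for executable mappings and fall back to a normal mapping if the low 2GB is exhausted. This is a hedged illustration for x86-64 Linux, not the glibc mmap wrapper itself:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>
#include <stdio.h>

/* Map len bytes of anonymous executable memory, preferring the low
   2GB so branch targets stay close, then falling back to any address. */
static void *mmap_exec_low (size_t len)
{
  void *p = mmap (NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
  if (p == MAP_FAILED)   /* low 2GB full (or MAP_32BIT unsupported) */
    p = mmap (NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p;
}

int main (void)
{
  void *p = mmap_exec_low (4096);
  printf ("mapped at %p\n", p);
  return 0;
}

The trade-off noted in the commit is visible here: every MAP_32BIT mapping narrows the range ASLR can use, which is why the feature is opt-in via LD_PREFER_MAP_32BIT_EXEC.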
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 8073bd8f850c1b7b04921a4c921d26bbbe5fbcae (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8073bd8f850c1b7b04921a4c921d26bbbe5fbcae commit 8073bd8f850c1b7b04921a4c921d26bbbe5fbcae Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3608e9eb3d04e22a8341ac6c397e52f531330ac1 commit 3608e9eb3d04e22a8341ac6c397e52f531330ac1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83823b091e18fec9151752a0429a78a0d81d6317 commit 83823b091e18fec9151752a0429a78a0d81d6317 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
(__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e39197c4403a082b1c607210211cb643830b8d9c commit e39197c4403a082b1c607210211cb643830b8d9c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show a regression on Haswell machines. Using non-temporal stores in memmove can improve performance significantly for large data. This patch adds a threshold for using non-temporal stores, set to 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments.
Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold. -----------------------------------------------------------------------
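A C sketch of the non-temporal path described in the commit above, using SSE2 streaming stores. The threshold value is a hypothetical placeholder for __x86_shared_non_temporal_threshold, and the real code peels misaligned head and tail bytes before the loop:

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: 4 x an assumed 8MB shared cache. glibc computes the
   real value from the detected cache size in init_cacheinfo. */
static size_t non_temporal_threshold = 4 * (8u << 20);

/* Forward copy with streaming stores; requires 16-byte-aligned dst
   and n a multiple of 64 (the peeling the real code does is omitted). */
static void copy_nt_64 (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i += 64)
    {
      __m128i a = _mm_loadu_si128 ((const __m128i *) (src + i));
      __m128i b = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
      __m128i c = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
      __m128i d = _mm_loadu_si128 ((const __m128i *) (src + i + 48));
      _mm_stream_si128 ((__m128i *) (dst + i), a);
      _mm_stream_si128 ((__m128i *) (dst + i + 16), b);
      _mm_stream_si128 ((__m128i *) (dst + i + 32), c);
      _mm_stream_si128 ((__m128i *) (dst + i + 48), d);
    }
  _mm_sfence ();   /* order the weakly-ordered streaming stores */
}

static void copy_large (char *dst, const char *src, size_t n)
{
  if (n >= non_temporal_threshold
      && n % 64 == 0 && ((uintptr_t) dst & 15) == 0)
    copy_nt_64 (dst, src, n);   /* bypass caches: data won't be reused */
  else
    memcpy (dst, src, n);       /* below threshold: normal cached stores */
}

Streaming stores avoid polluting the cache with data that is about to be evicted anyway, which is why the win only appears well above the shared cache size.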
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.23 has been created at 9910c54c2e97b6c36f8593097e53d5e09f837a69 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9910c54c2e97b6c36f8593097e53d5e09f837a69 commit 9910c54c2e97b6c36f8593097e53d5e09f837a69 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3429d9dd330a5c140cb37e77e7c388a71fdb44f1 commit 3429d9dd330a5c140cb37e77e7c388a71fdb44f1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c36cac64f6855f1f4ff007beaca3cb766e694ec commit 7c36cac64f6855f1f4ff007beaca3cb766e694ec Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
(__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=69b122e1149e158c382c2b0bdd4591a4a19cb505 commit 69b122e1149e158c382c2b0bdd4591a4a19cb505 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show a regression on Haswell machines. Using non-temporal stores in memmove can improve performance significantly for large data. This patch adds a threshold for using non-temporal stores, set to 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments.
Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb commit 9a93bdbaff81edf67c5486c84f8098055e355abb Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483 commit 5118e532600549ad0f56cb9b1a179b8eab70c483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292 commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc commit a96379797a7eecc1b709cad7b68981eb698783dc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9 commit cfb059c79729b26284863334c9aa04f0a3b967b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, so copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f commit 30c389be1af67c4d0716d207b6780c6169d1355f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
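The memset key features above reduce to two small patterns: overlapping unaligned stores for mid-sized fills, and a 4-vector loop with an overlapping tail for large ones. A 16-byte-vector C sketch; the function names and the n >= 64 precondition belong to this sketch, not to the glibc assembly:

#include <emmintrin.h>
#include <stddef.h>

/* Key feature 1: any 16..32 byte fill is two possibly-overlapping
   unaligned 16-byte stores, with no further size dispatch. */
static void memset_16_to_32 (char *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  _mm_storeu_si128 ((__m128i *) dst, v);
  _mm_storeu_si128 ((__m128i *) (dst + n - 16), v);
}

/* Key feature 3: for large sizes, store 4 vectors (64 bytes) per
   iteration, then finish with 4 overlapping stores anchored at the
   end. Precondition of this sketch: n >= 64. */
static void memset_large (char *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  char *end = dst + n;
  while (end - dst > 64)
    {
      _mm_storeu_si128 ((__m128i *) dst, v);
      _mm_storeu_si128 ((__m128i *) (dst + 16), v);
      _mm_storeu_si128 ((__m128i *) (dst + 32), v);
      _mm_storeu_si128 ((__m128i *) (dst + 48), v);
      dst += 64;
    }
  /* 1..64 bytes remain; overlapping stores keep the tail branch-free. */
  _mm_storeu_si128 ((__m128i *) (end - 64), v);
  _mm_storeu_si128 ((__m128i *) (end - 48), v);
  _mm_storeu_si128 ((__m128i *) (end - 32), v);
  _mm_storeu_si128 ((__m128i *) (end - 16), v);
}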
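Above REP_MOVSB_THRESHOLD, the ERMS variant of the memmove commit above simply issues rep movsb for forward copies. A minimal inline-assembly sketch for x86-64 GCC/Clang; the threshold constant is illustrative, since glibc scales it with the vector register size:

#include <stddef.h>

/* Hypothetical: 2KB, the documented value for 16-byte vectors. */
enum { REP_MOVSB_THRESHOLD = 2048 };

/* Forward copy via rep movsb. Only used for forward copies; the real
   code falls back to the 8-vector backward loop because backward
   rep movsb is slow (key feature 6 above). */
static void *copy_rep_movsb (void *dst, const void *src, size_t n)
{
  void *ret = dst;
  __asm__ volatile ("rep movsb"
                    : "+D" (dst), "+S" (src), "+c" (n)
                    :
                    : "memory");
  return ret;
}

On ERMS hardware a single rep movsb beats the vector loop for medium-to-large forward copies, which is why the threshold dispatch exists at all.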
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise.
(cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE)): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE)): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported.
(cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
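The five-step selection order above is wired up through GNU indirect functions (IFUNCs): a resolver runs once, when the dynamic linker relocates the symbol, and returns the implementation to bind. The real resolver is the assembly in sysdeps/x86_64/multiarch/memcpy.S; the C sketch below only illustrates the mechanism. The my_memcpy_* variants are hypothetical stand-ins, and __builtin_cpu_supports is used merely to keep the sketch self-contained where glibc instead tests its precomputed cpu_features bits (AVX_Fast_Unaligned_Load, Fast_Unaligned_Load, SSSE3, Fast_Copy_Backward).

#include <stddef.h>

/* Hypothetical stand-ins for the tuned assembly variants named in
   the log above; byte loops keep the sketch self-contained.  */
static void *
my_memcpy_generic (void *dst, const void *src, size_t n)
{
  char *d = dst;
  const char *s = src;
  for (size_t i = 0; i < n; i++)
    d[i] = s[i];
  return dst;
}

static void *
my_memcpy_ssse3 (void *dst, const void *src, size_t n)
{
  return my_memcpy_generic (dst, src, n);  /* pretend: SSSE3-tuned */
}

static void *
my_memcpy_avx (void *dst, const void *src, size_t n)
{
  return my_memcpy_generic (dst, src, n);  /* pretend: AVX-tuned */
}

/* The resolver runs once at relocation time; the dynamic linker
   binds my_memcpy to whatever pointer it returns.  */
static void *(*resolve_my_memcpy (void)) (void *, const void *, size_t)
{
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx"))
    return my_memcpy_avx;             /* step 1 in the list above */
  if (!__builtin_cpu_supports ("ssse3"))
    return my_memcpy_generic;         /* step 3 */
  return my_memcpy_ssse3;             /* steps 4-5, collapsed here */
}

void *my_memcpy (void *, const void *, size_t)
  __attribute__ ((ifunc ("resolve_my_memcpy")));

Because the choice is made once per process rather than per call, the dispatch itself costs nothing on the hot path.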
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64 commit f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d5af940569c5c48835acdf6c8c47451e1e92c817 commit d5af940569c5c48835acdf6c8c47451e1e92c817 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3ad9dce564d95ac817f86cad1bb4f0bc29c58f5f commit 3ad9dce564d95ac817f86cad1bb4f0bc29c58f5f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
(__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e4908bee33dc0aed48835c1884387b5e942963 commit e1e4908bee33dc0aed48835c1884387b5e942963 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show that there is a regression with large data on Haswell machines. A non-temporal store in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 4 times the shared cache size. When the size is above the threshold, non-temporal stores will be used. For sizes below 8 times the vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments.
Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and to use non-temporal stores if the size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise.
(cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, so copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when the size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping stores to avoid branches (a C sketch of this trick follows this log digest). 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when the size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branches. 2.
For size <= 8 times the vector register size, load all of the source into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If the address of the destination > the address of the source, copy backward 8 times the vector register size at a time. 5. Otherwise, copy forward 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time. 7. Skip when the address of the destination == the address of the source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J.
Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model to Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to mistakenly use the bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name.
(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware.
It shows an improvement of up to 28% over the AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to the Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to a lower 2GB, not lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing The Knights Landing processor is based on Silvermont. This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
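As referenced in the memset log above, the "overlapping stores" trick computes nothing size-specific at runtime: one store is aimed at the start of the buffer and one at its last bytes, so a single pair of unaligned stores handles every length in a whole range without a branch ladder. Below is a minimal C sketch with SSE2 intrinsics, assuming 16 <= n <= 32; the memset_16_to_32 helper name is hypothetical and illustrative only, since the real code is assembly and covers many size classes and vector widths.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Fill n bytes, 16 <= n <= 32, with exactly two unaligned 16-byte
   stores.  When n < 32 the second store overlaps the first, which
   is harmless for memset and removes the branch on the exact size.  */
static void
memset_16_to_32 (void *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  _mm_storeu_si128 ((__m128i *) dst, v);
  _mm_storeu_si128 ((__m128i *) ((uint8_t *) dst + n - 16), v);
}

The same idea scales to 32-byte and 64-byte vectors, which is why the implementations above can all be generated from the single memset-vec-unaligned-erms.S template.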
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 7d3414159ba17db4224b675cf4086741210544b1 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7d3414159ba17db4224b675cf4086741210544b1 commit 7d3414159ba17db4224b675cf4086741210544b1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=efd380007f75c4157a823ee14d658c0ced3ba4a8 commit efd380007f75c4157a823ee14d658c0ced3ba4a8 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8f1593ebaee38ddedcabee5fe3553abdb0f08bfd commit 8f1593ebaee38ddedcabee5fe3553abdb0f08bfd Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=545cd24ea6b85661abfa9ac1e49d56dd7cc19cc9 commit 545cd24ea6b85661abfa9ac1e49d56dd7cc19cc9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:49:27 2016 -0700 Use PREFETCH_ONE_SET_X https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af07dbdaa999d0172dd840f3dbe6963901c3496f commit af07dbdaa999d0172dd840f3dbe6963901c3496f Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show that there is a regression with large data on Haswell machines. A non-temporal store in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 4 times the shared cache size. When the size is above the threshold, non-temporal stores will be used (a C sketch of the streaming-store loop follows this log digest). For sizes below 8 times the vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH_SIZE): New. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and to use non-temporal stores if the size is above the threshold. -----------------------------------------------------------------------
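As referenced above, once the size crosses __x86_shared_non_temporal_threshold the copy switches to streaming stores that bypass the cache. Here is a simplified C sketch of the forward non-temporal loop with SSE2 intrinsics, under assumptions the real assembly does not need (dst 16-byte aligned, n a multiple of 64, no overlap); the copy_nt_forward name is hypothetical.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Forward copy with non-temporal (streaming) stores, moving 64
   bytes per iteration like the 4-vector loops described above.
   Assumes dst is 16-byte aligned (required by _mm_stream_si128)
   and n is a multiple of 64.  */
static void
copy_nt_forward (uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i += 64)
    {
      __m128i v0 = _mm_loadu_si128 ((const __m128i *) (src + i));
      __m128i v1 = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
      __m128i v2 = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
      __m128i v3 = _mm_loadu_si128 ((const __m128i *) (src + i + 48));
      _mm_stream_si128 ((__m128i *) (dst + i), v0);
      _mm_stream_si128 ((__m128i *) (dst + i + 16), v1);
      _mm_stream_si128 ((__m128i *) (dst + i + 32), v2);
      _mm_stream_si128 ((__m128i *) (dst + i + 48), v3);
    }
  _mm_sfence ();  /* make the streaming stores globally visible
                     before any subsequent ordinary stores */
}

Bypassing the cache avoids evicting several caches' worth of useful lines on a copy larger than the shared cache, which is the regression the Haswell benchmarks above exposed.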
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 8624d88eb694d12da34edd5c6fd10d19fe7e3400 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8624d88eb694d12da34edd5c6fd10d19fe7e3400 commit 8624d88eb694d12da34edd5c6fd10d19fe7e3400 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=50c50aad0d7f9fedbc72953875fc1ef73bb2fa8e commit 50c50aad0d7f9fedbc72953875fc1ef73bb2fa8e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=967e0813a7740c8d6cc82247beaf69a5dee491a9 commit 967e0813a7740c8d6cc82247beaf69a5dee491a9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
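The dispatch above keys several choices on "if processor has ERMS". That capability is reported in CPUID leaf 7 (subleaf 0), EBX bit 9; glibc caches it in cpu_features as bit_cpu_ERMS rather than re-executing CPUID on every lookup. As a sketch of what the cached bit corresponds to, a standalone check using GCC's <cpuid.h> could look like this:

#include <cpuid.h>

/* Return nonzero if the CPU advertises Enhanced REP MOVSB/STOSB:
   CPUID.(EAX=7,ECX=0):EBX bit 9.  */
static int
cpu_has_erms (void)
{
  unsigned int eax, ebx, ecx, edx;
  /* Leaf 0 reports the highest supported leaf; leaf 7 must exist.  */
  if (__get_cpuid (0, &eax, &ebx, &ecx, &edx) == 0 || eax < 7)
    return 0;
  __cpuid_count (7, 0, eax, ebx, ecx, edx);
  return (ebx >> 9) & 1;
}

On ERMS hardware, rep movsb and rep stosb are microcoded to move whole cache lines internally, which is why the *_erms variants above prefer them between the threshold and the non-temporal cutover.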
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.23 has been created at c51eab61e17e7575265f1e36bd0293e224500f52 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c51eab61e17e7575265f1e36bd0293e224500f52 commit c51eab61e17e7575265f1e36bd0293e224500f52 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f9478e6530ab0ede00f705e456445aeff283560 commit 7f9478e6530ab0ede00f705e456445aeff283560 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68248ecc51b4725e794236c495effde76d4be61c commit 68248ecc51b4725e794236c495effde76d4be61c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=095d851c67b7ea5edb536ead965c73fce34b2edd commit 095d851c67b7ea5edb536ead965c73fce34b2edd Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show that there is a regression with large data on Haswell machines. A non-temporal store in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code size. (VMOVA): Changed to movaps for smaller code size. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH_SIZE): New. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold.
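The threshold policy above is easy to sketch in C with SSE2 streaming stores. In this illustration non_temporal_threshold stands in for __x86_shared_non_temporal_threshold, the destination is assumed 16-byte aligned (movntdq requires it), and head/tail handling is omitted; the real loop is the assembly in memmove-vec-unaligned-erms.S.

    #include <emmintrin.h>   /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>

    /* Stand-in for __x86_shared_non_temporal_threshold, set at startup
       to 4 times the shared cache size.  */
    extern size_t non_temporal_threshold;

    static void
    copy_forward_nt (char *dst, const char *src, size_t n)
    {
      if (n < non_temporal_threshold)
        return;                        /* the regular vector loop runs instead */
      /* Assumes dst is 16-byte aligned; moves 4 vectors (64 bytes) per
         iteration, mirroring the unrolled assembly loop.  */
      for (size_t i = 0; i + 64 <= n; i += 64)
        for (int j = 0; j < 4; j++)
          _mm_stream_si128 ((__m128i *) (dst + i + 16 * j),
                            _mm_loadu_si128 ((const __m128i *) (src + i + 16 * j)));
      _mm_sfence ();                   /* order the non-temporal stores */
      /* The n % 64 tail bytes would be finished with ordinary stores.  */
    }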
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d commit 0932dd8b56db46dd421a4855fb5dee9de092538d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d commit da2da79262814ba4ead3ee487549949096d8ad2d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb commit 9a93bdbaff81edf67c5486c84f8098055e355abb Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid a long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483 commit 5118e532600549ad0f56cb9b1a179b8eab70c483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292 commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put the SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc commit a96379797a7eecc1b709cad7b68981eb698783dc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove: it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first.
(cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9 commit cfb059c79729b26284863334c9aa04f0a3b967b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f commit 30c389be1af67c4d0716d207b6780c6169d1355f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
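Key feature 1 of the memset commit above, the overlapping store, is easiest to see in C. A minimal sketch for the 16-byte vector case, assuming 16 <= n <= 32 (the real code is the assembly in memset-vec-unaligned-erms.S):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Fill [s, s+n) with c using two unaligned 16-byte stores that may
       overlap in the middle, so no branchy byte loop is needed.  */
    static void
    memset_16_to_32 (void *s, int c, size_t n)
    {
      __m128i v = _mm_set1_epi8 ((char) c);
      _mm_storeu_si128 ((__m128i *) s, v);                      /* bytes [0, 16) */
      _mm_storeu_si128 ((__m128i *) ((char *) s + n - 16), v);  /* bytes [n-16, n) */
    }

The same trick scales to the 32- and 64-byte vector sizes, which is what lets one source file serve all three vector register widths.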
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650 commit 980d639b4ae58209843f09a29d86b0a8303b6650 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead for the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy of 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
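The direction rules in key features 4 through 7 of the memmove commit above reduce to the following scalar sketch; byte loops stand in for the 8-vector-per-iteration assembly loops, and the rep-movsb path is omitted:

    #include <stddef.h>

    static void
    move_bytes (char *dst, const char *src, size_t n)
    {
      if (dst == src)                  /* feature 7: nothing to do */
        return;
      if (dst < src || dst >= src + n)
        {
          /* No overlap a forward pass could clobber: forward copy
             (features 5 and 6; forward is also the only direction in
             which the assembly may use rep movsb).  */
          for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
        }
      else
        {
          /* dst overlaps the tail of src: copy backward (feature 4).  */
          for (size_t i = n; i > 0; i--)
            dst[i - 1] = src[i - 1];
        }
    }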
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
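The ERMS bit referred to here is reported in CPUID.(EAX=7,ECX=0):EBX bit 9. A self-contained probe using the GCC/Clang cpuid.h helper, for illustration only (glibc reads the same bit through its cpu-features machinery rather than a standalone function):

    #include <cpuid.h>

    static int
    cpu_has_erms (void)
    {
      unsigned int eax, ebx, ecx, edx;
      /* Leaf 7, subleaf 0; bail out if the leaf is unsupported.  */
      if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
      return (ebx >> 9) & 1;           /* EBX bit 9 = ERMS */
    }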
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)
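The CPU_FEATURES_CPU_P/CPU_FEATURES_ARCH_P layering added above can be sketched as token-pasting macros. The struct layout and index/bit values below are simplified stand-ins, not the exact cpu-features.h definitions; the renaming commit that follows builds on the same token-pasting idea.

    /* Simplified model of struct cpu_features: one array of raw cpuid
       register dumps, one array of derived arch-feature words.  */
    struct cpu_features
    {
      struct { unsigned int eax, ebx, ecx, edx; } cpuid[2];
      unsigned int feature[1];
    };
    extern const struct cpu_features *cpu_features_ptr;

    /* Illustrative values only.  */
    #define index_cpu_ERMS 1
    #define reg_ERMS ebx
    #define bit_cpu_ERMS (1u << 9)
    #define index_arch_Fast_Unaligned_Load 0
    #define bit_arch_Fast_Unaligned_Load (1u << 1)

    #define CPU_FEATURES_CPU_P(ptr, name) \
      (((ptr)->cpuid[index_cpu_##name].reg_##name & bit_cpu_##name) != 0)
    #define CPU_FEATURES_ARCH_P(ptr, name) \
      (((ptr)->feature[index_arch_##name] & bit_arch_##name) != 0)

    #define HAS_CPU_FEATURE(name) CPU_FEATURES_CPU_P (cpu_features_ptr, name)
    #define HAS_ARCH_FEATURE(name) CPU_FEATURES_ARCH_P (cpu_features_ptr, name)

    /* HAS_CPU_FEATURE (ERMS) expands to a read of the right word and bit,
       while HAS_CPU_FEATURE (Fast_Unaligned_Load) fails to compile because
       no index_cpu_Fast_Unaligned_Load exists -- misuse of a cpu name with
       the arch accessor (or vice versa) is caught at build time.  */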
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported. (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
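The five-step selection order quoted at the end of the log above maps onto a straightforward cascade. A C sketch for illustration; the has_*() probes are hypothetical stand-ins for the cpu-features bits, and the real resolver is the assembly in sysdeps/x86_64/multiarch/memcpy.S:

    #include <stddef.h>

    extern void *__memcpy_avx_unaligned (void *, const void *, size_t);
    extern void *__memcpy_sse2_unaligned (void *, const void *, size_t);
    extern void *__memcpy_sse2 (void *, const void *, size_t);
    extern void *__memcpy_ssse3_back (void *, const void *, size_t);
    extern void *__memcpy_ssse3 (void *, const void *, size_t);

    /* Hypothetical probes for the feature bits named in the commit.  */
    extern int has_avx_fast_unaligned_load (void);
    extern int has_fast_unaligned_load (void);
    extern int has_ssse3 (void);
    extern int has_fast_copy_backward (void);

    typedef void *(*memcpy_fn) (void *, const void *, size_t);

    static memcpy_fn
    select_memcpy (void)
    {
      if (has_avx_fast_unaligned_load ())   /* 1 */
        return __memcpy_avx_unaligned;
      if (has_fast_unaligned_load ())       /* 2 */
        return __memcpy_sse2_unaligned;
      if (!has_ssse3 ())                    /* 3 */
        return __memcpy_sse2;
      if (has_fast_copy_backward ())        /* 4 */
        return __memcpy_ssse3_back;
      return __memcpy_ssse3;                /* 5 */
    }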
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 34f2cbf8ca6ee99f36229315fb03c27e3acd805d (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=34f2cbf8ca6ee99f36229315fb03c27e3acd805d commit 34f2cbf8ca6ee99f36229315fb03c27e3acd805d Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=912d6a93556739773b511766c2ca95fb293f5566 commit 912d6a93556739773b511766c2ca95fb293f5566 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7dfa91a07593740cf3ad71060300b1cc38ac2910 commit 7dfa91a07593740cf3ad71060300b1cc38ac2910 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5b3e44eeb5ae74fb4a1c353db7e8a5ee18ccdb10 commit 5b3e44eeb5ae74fb4a1c353db7e8a5ee18ccdb10 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show that there is a regression with large data on Haswell machines. A non-temporal store in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code size. (VMOVA): Changed to movaps for smaller code size. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH_SIZE): New. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid a long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put the SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove: it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms.
(__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead for the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy of 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
(cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load.
A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model as Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated.
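The extended family/model update described here follows the CPUID leaf 1 EAX encoding. A sketch of the decoding for the family == 0x0f case this commit covers (glibc's get_common_indeces also applies the extended model for Intel family 0x06, which is omitted here):

    #include <cpuid.h>

    static void
    decode_family_model (unsigned int *family, unsigned int *model)
    {
      unsigned int eax, ebx, ecx, edx;
      __cpuid (1, eax, ebx, ecx, edx);
      *family = (eax >> 8) & 0x0f;
      *model = (eax >> 4) & 0x0f;
      if (*family == 0x0f)
        {
          /* Extended fields kick in, on both Intel and AMD.  */
          *family += (eax >> 20) & 0xff;          /* extended family */
          *model += ((eax >> 16) & 0x0f) << 4;    /* extended model */
        }
    }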
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk.
It shows an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows up to 28% improvement over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to the Silvermont software optimization guide, for 64-bit applications branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to the lower 2GB, not the lower 4GB, address space. Prefer_MAP_32BIT_EXEC reduces the bits available for address space layout randomization (ASLR); it is always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing The Knights Landing processor is based on Silvermont. This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
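The family/model decoding described in the "Update family and model detection for AMD CPUs" commit above can be sketched in C. This is a minimal illustration, not the glibc code: decode_signature and the sample EAX value are made up for the example, and only the family == 0x0f case named in the commit is folded in (Intel additionally applies the extended model when family == 0x06).

    #include <stdio.h>

    /* Decode family and model from the CPUID leaf 1 EAX value.  When the
       base family is 0x0f, fold in the extended family and model fields,
       the case the commit above handles for both Intel and AMD.  */
    static void
    decode_signature (unsigned int eax, unsigned int *family,
                      unsigned int *model)
    {
      unsigned int ext_family = (eax >> 20) & 0xff;
      unsigned int ext_model = (eax >> 16) & 0x0f;
      *family = (eax >> 8) & 0x0f;
      *model = (eax >> 4) & 0x0f;
      if (*family == 0x0f)
        {
          *family += ext_family;
          *model += ext_model << 4;
        }
    }

    int
    main (void)
    {
      unsigned int family, model;
      /* 0x00100f43 is a sample AMD Family 10h signature.  */
      decode_signature (0x00100f43, &family, &model);
      printf ("family 0x%x, model 0x%x\n", family, model);  /* 0x10, 0x4 */
      return 0;
    }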
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 25d87576122689b22db9929271bdd7cb403aec1c (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=25d87576122689b22db9929271bdd7cb403aec1c commit 25d87576122689b22db9929271bdd7cb403aec1c Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a5d79693ef1c91aed0e12662ec84d5d9b597f283 commit a5d79693ef1c91aed0e12662ec84d5d9b597f283 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 commit 6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. 
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b80cc56d7d3a54c16d33d147438b67c715906675 commit b80cc56d7d3a54c16d33d147438b67c715906675 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Using non-temporal stores in memcpy can improve performance significantly for large data. This patch adds a threshold for using non-temporal stores, set to 6 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 times the vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code size. (VMOVA): Changed to movaps for smaller code size. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold. -----------------------------------------------------------------------
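The non-temporal threshold added in the "X86-64: Use non-temporal store in memcpy on large data" commit above rests on one idea: once a copy is several times larger than the shared cache, the destination will not be reused from cache, so it is better streamed to memory than allowed to evict useful cache lines. A hedged sketch of the streaming inner loop with SSE2 intrinsics, the 16-byte analogue of the VMOVNT macro; copy_fwd_nontemporal is an illustrative name, and the alignment and head/tail handling of the real assembly is omitted:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Copy n bytes forward with non-temporal (streaming) stores.
       Assumes dst is 16-byte aligned and n is a multiple of 64; a real
       memcpy handles the unaligned head and tail with ordinary stores.  */
    static void
    copy_fwd_nontemporal (char *dst, const char *src, size_t n)
    {
      for (size_t i = 0; i < n; i += 64)
        {
          __m128i v0 = _mm_loadu_si128 ((const __m128i *) (src + i));
          __m128i v1 = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
          __m128i v2 = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
          __m128i v3 = _mm_loadu_si128 ((const __m128i *) (src + i + 48));
          _mm_stream_si128 ((__m128i *) (dst + i), v0);
          _mm_stream_si128 ((__m128i *) (dst + i + 16), v1);
          _mm_stream_si128 ((__m128i *) (dst + i + 32), v2);
          _mm_stream_si128 ((__m128i *) (dst + i + 48), v3);
        }
      /* Order the streaming stores before any later loads and stores.  */
      _mm_sfence ();
    }

A caller would take this path only when n exceeds __x86_shared_non_temporal_threshold (6 times the shared cache size per the commit) and fall back to the ordinary 4-vector loop otherwise.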
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at e02f2640644024281bf02537606bbad201aa20d7 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e02f2640644024281bf02537606bbad201aa20d7 commit e02f2640644024281bf02537606bbad201aa20d7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=82069d63c103c6d9ff88718657ac8dd83715919f commit 82069d63c103c6d9ff88718657ac8dd83715919f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. 
Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 commit 6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
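Both removal commits in this digest lean on Enhanced REP STOSB: on processors whose CPUID advertises ERMS, a bare rep stosb is competitive with vector loops for mid-sized buffers, which is why the _erms memset variants exist at all. A hedged sketch of that core operation as GCC inline assembly; memset_erms here is an illustrative name, and the real variants are written directly in the assembly files listed above:

    #include <stddef.h>

    /* memset via "rep stosb": rdi = destination, al = byte value,
       rcx = count.  The "+D" and "+c" constraints let the instruction
       advance rdi and count rcx down to zero; the "memory" clobber
       tells the compiler the buffer is written.  */
    static void *
    memset_erms (void *dst, int c, size_t n)
    {
      void *ret = dst;
      __asm__ __volatile__ ("rep stosb"
                            : "+D" (dst), "+c" (n)
                            : "a" (c)
                            : "memory");
      return ret;
    }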
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at c0da8faf69fa56c249ac5ec40836b76fe5ab0233 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c0da8faf69fa56c249ac5ec40836b76fe5ab0233 commit c0da8faf69fa56c249ac5ec40836b76fe5ab0233 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9871e1739bcb231314a4d2ee03c8c757aa139332 commit 9871e1739bcb231314a4d2ee03c8c757aa139332 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. 
Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c888072ea1da909c3453e8873ab8ec6a6c7b7b2 commit 3c888072ea1da909c3453e8873ab8ec6a6c7b7b2 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
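The IFUNC selection these commits describe ("Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS") boils down to reading one CPUID bit in the resolver. A simplified sketch with the optimized variants stubbed out; select_memset and the stubs are illustrative, while the bit position, CPUID.(EAX=7,ECX=0):EBX bit 9, is the architectural ERMS flag:

    #include <cpuid.h>
    #include <stddef.h>
    #include <string.h>

    /* Stubs standing in for the optimized assembly implementations.  */
    static void *memset_sse2_unaligned (void *d, int c, size_t n)
    { return memset (d, c, n); }
    static void *memset_sse2_unaligned_erms (void *d, int c, size_t n)
    { return memset (d, c, n); }

    typedef void *(*memset_fn) (void *, int, size_t);

    /* ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9.  */
    static int
    cpu_has_erms (void)
    {
      unsigned int eax, ebx, ecx, edx;
      if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
      return (ebx >> 9) & 1;
    }

    /* Resolver: prefer the rep-stosb flavor when ERMS is present.  */
    static memset_fn
    select_memset (void)
    {
      return cpu_has_erms () ? memset_sse2_unaligned_erms
                             : memset_sse2_unaligned;
    }

glibc does the equivalent with its HAS_CPU_FEATURE/HAS_ARCH_FEATURE macros in the multiarch memset.S dispatcher rather than in C.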
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.23 has been created at 2a1cca399be415d6c5a556af2018e5fb726d9a37 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2a1cca399be415d6c5a556af2018e5fb726d9a37 commit 2a1cca399be415d6c5a556af2018e5fb726d9a37 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b361e72f264a06e856d97cbbf1cedbf2f7dd73bf commit b361e72f264a06e856d97cbbf1cedbf2f7dd73bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c97c370612496379176be8e33c19dc4f80b7f01c commit c97c370612496379176be8e33c19dc4f80b7f01c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. 
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=121270b79236d7c5802e8d9af2d27952cb9efae9 commit 121270b79236d7c5802e8d9af2d27952cb9efae9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Using non-temporal stores in memcpy can improve performance significantly for large data. This patch adds a threshold for using non-temporal stores, set to 6 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 times the vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code size. (VMOVA): Changed to movaps for smaller code size. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d commit 0932dd8b56db46dd421a4855fb5dee9de092538d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. 
(MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d commit da2da79262814ba4ead3ee487549949096d8ad2d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb commit 9a93bdbaff81edf67c5486c84f8098055e355abb Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483 commit 5118e532600549ad0f56cb9b1a179b8eab70c483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292 commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc commit a96379797a7eecc1b709cad7b68981eb698783dc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. 
(cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9 commit cfb059c79729b26284863334c9aa04f0a3b967b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f commit 30c389be1af67c4d0716d207b6780c6169d1355f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650 commit 980d639b4ae58209843f09a29d86b0a8303b6650 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up by larger vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. 
For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. 
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY): Renamed to ... (__mempcpy_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa 
Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported. (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
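
The five-step selection order in the last commit of that email maps naturally onto a resolved-once dispatch, which is all an IFUNC resolver is. A hedged C sketch of the control flow: the feature queries and impl_* functions are stand-ins (glibc reads its cpu_features data and jumps to assembly entry points), but the ordering matches the one documented above.

    #include <stddef.h>
    #include <stdbool.h>
    #include <string.h>
    #include <stdio.h>

    typedef void *(*memcpy_fn) (void *, const void *, size_t);

    /* Stand-in feature queries; the values below are arbitrary.  */
    static bool avx_fast_unaligned_load (void) { return false; }
    static bool fast_unaligned_load (void) { return true; }
    static bool ssse3 (void) { return true; }
    static bool fast_copy_backward (void) { return false; }

    /* Stand-ins for the optimized entry points.  */
    static void *impl_avx_unaligned (void *d, const void *s, size_t n)
    { return memcpy (d, s, n); }
    static void *impl_sse2_unaligned (void *d, const void *s, size_t n)
    { return memcpy (d, s, n); }
    static void *impl_sse2 (void *d, const void *s, size_t n)
    { return memcpy (d, s, n); }
    static void *impl_ssse3_back (void *d, const void *s, size_t n)
    { return memcpy (d, s, n); }
    static void *impl_ssse3 (void *d, const void *s, size_t n)
    { return memcpy (d, s, n); }

    /* The documented order: AVX unaligned first, then SSE2 unaligned,
       then plain SSE2 when SSSE3 is absent, then the SSSE3 variants.  */
    static memcpy_fn
    select_memcpy (void)
    {
      if (avx_fast_unaligned_load ()) return impl_avx_unaligned;
      if (fast_unaligned_load ())     return impl_sse2_unaligned;
      if (!ssse3 ())                  return impl_sse2;
      if (fast_copy_backward ())      return impl_ssse3_back;
      return impl_ssse3;
    }

    int main (void)
    {
      char dst[16];
      memcpy_fn f = select_memcpy ();  /* Resolved once, like an IFUNC.  */
      f (dst, "hello", 6);
      puts (dst);
      return 0;
    }
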
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 8b65cadefc53cc42e1970e0817336fe96a7aa396 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8b65cadefc53cc42e1970e0817336fe96a7aa396 commit 8b65cadefc53cc42e1970e0817336fe96a7aa396 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c4b1dec2c115ba19192fdb143f25cfc1ac76c94a commit c4b1dec2c115ba19192fdb143f25cfc1ac76c94a Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. 
Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d9b78b7787695ab0fddbaabeb3ef07c730e94a4 commit 1d9b78b7787695ab0fddbaabeb3ef07c730e94a4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2e9ef7960d01bc9bb36f2f3e7c9c567f11e56da9 commit 2e9ef7960d01bc9bb36f2f3e7c9c567f11e56da9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on a Haswell machine. A non-temporal store in memcpy can improve performance significantly on large data. This patch adds a threshold for using non-temporal stores, set to six times the shared cache size. When size is above the threshold, non-temporal stores will be used. For size below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold.
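
To make the non-temporal idea concrete, here is a hedged C sketch using SSE2 intrinsics. The threshold constant is hard-coded purely for illustration; glibc computes __x86_shared_non_temporal_threshold at startup from the detected cache size, and its real copy loops are assembly that also prefetch and handle tails and alignment.

    #include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128.  */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative stand-in for 6 * shared cache size (here 6 * 8MB).  */
    static const size_t non_temporal_threshold = 6u * (8u << 20);

    /* Forward copy with streaming stores.  For brevity this assumes a
       16-byte aligned dst and n a multiple of 64.  */
    static void
    copy_fwd_nt (uint8_t *dst, const uint8_t *src, size_t n)
    {
      for (size_t i = 0; i < n; i += 64)
        {
          /* Four 16-byte loads...  */
          __m128i a = _mm_loadu_si128 ((const __m128i *) (src + i));
          __m128i b = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
          __m128i c = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
          __m128i d = _mm_loadu_si128 ((const __m128i *) (src + i + 48));
          /* ...and four streaming stores that bypass the cache, so a
             huge copy does not evict the working set.  */
          _mm_stream_si128 ((__m128i *) (dst + i), a);
          _mm_stream_si128 ((__m128i *) (dst + i + 16), b);
          _mm_stream_si128 ((__m128i *) (dst + i + 32), c);
          _mm_stream_si128 ((__m128i *) (dst + i + 48), d);
        }
      _mm_sfence ();  /* Order streaming stores before later accesses.  */
    }

    void *
    big_copy (void *dst, const void *src, size_t n)
    {
      if (n >= non_temporal_threshold)
        copy_fwd_nt ((uint8_t *) dst, (const uint8_t *) src, n);
      else
        memcpy (dst, src, n);  /* Below threshold: ordinary cached copy.  */
      return dst;
    }

    int main (void)
    {
      static uint8_t a[1 << 16], b[1 << 16];
      big_copy (b, a, sizeof b);  /* Small: takes the cached path.  */
      return 0;
    }
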
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first.
(cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
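
Key feature 1 of that memset commit (overlapping stores) is easy to show in miniature. A hedged C sketch with SSE2 intrinsics: for any length between 16 and 32 bytes, one 16-byte store at the start and one at the end cover the whole buffer with no length-dependent branching; the overlap is harmless because both stores write the same byte value. The helper name is ours.

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Handle 16 <= n <= 32 with exactly two unaligned 16-byte stores.
       The stores overlap whenever n < 32.  */
    static void
    memset_16_to_32 (uint8_t *p, int c, size_t n)
    {
      __m128i v = _mm_set1_epi8 ((char) c);
      _mm_storeu_si128 ((__m128i *) p, v);            /* Bytes 0..15.  */
      _mm_storeu_si128 ((__m128i *) (p + n - 16), v); /* Bytes n-16..n-1.  */
    }

    int main (void)
    {
      uint8_t buf[32] = { 0 };
      memset_16_to_32 (buf, 0xab, 20);  /* The two stores overlap on 4..15.  */
      return (buf[0] == 0xab && buf[19] == 0xab && buf[20] == 0) ? 0 : 1;
    }

The same trick, scaled by the vector register size, is what keeps the 2x- to 4x-VEC_SIZE paths branch-light; sizes past REP_STOSB_THRESHOLD switch to rep stosb instead.
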
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up by larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
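
Key features 4-7 of that memmove commit reduce to a direction decision. The byte-wise C sketch below shows only that control flow (hedged: the real code moves 4-8 vector registers per iteration, copies from both ends for mid-size buffers, and uses rep movsb only on the forward path; none of that is reproduced here).

    #include <stddef.h>
    #include <stdint.h>

    void *
    my_memmove (void *dst, const void *src, size_t n)
    {
      uint8_t *d = dst;
      const uint8_t *s = src;

      if (d == s || n == 0)
        return dst;                      /* Feature 7: nothing to do.  */

      if (d < s || d >= s + n)
        {
          /* No overlap hazard for a forward pass: copy low to high.
             This is the only case where rep movsb would be used.  */
          for (size_t i = 0; i < n; i++)
            d[i] = s[i];
        }
      else
        {
          /* dst > src and the ranges overlap: copy high to low so each
             source byte is read before it is overwritten.  */
          for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
        }
      return dst;
    }

    int main (void)
    {
      char buf[] = "abcdef";
      my_memmove (buf + 1, buf, 5);  /* Overlapping, dst > src: backward.  */
      return buf[5] == 'e' ? 0 : 1;
    }
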
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY): Renamed to ... (__mempcpy_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load.
A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model to Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated.
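
The extended family/model arithmetic referenced in that commit is the standard CPUID leaf-1 encoding: when the base family field is 0x0f, the extended family is added and the extended model is prepended as the high nibble. A small self-contained sketch (GCC/Clang on x86, using <cpuid.h>; hedged in that glibc's get_common_indeces has a different interface):

    #include <cpuid.h>
    #include <stdio.h>

    int main (void)
    {
      unsigned int eax, ebx, ecx, edx;
      if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
        return 1;

      unsigned int family = (eax >> 8) & 0x0f;
      unsigned int model = (eax >> 4) & 0x0f;
      if (family == 0x0f)
        {
          /* Extended encoding, used the same way by Intel and AMD:
             add the extended family, prepend the extended model.  */
          family += (eax >> 20) & 0xff;
          model += ((eax >> 16) & 0x0f) << 4;
        }
      printf ("family 0x%x model 0x%x\n", family, model);
      return 0;
    }
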
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk.
It shows an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows an improvement of up to 28% over the AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER bit for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to the Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT maps into the lower 2GB, not the lower 4GB, of the address space. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is why it is always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing The Knights Landing processor is based on Silvermont. This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
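
The Prefer_MAP_32BIT_EXEC commit in the email above boils down to a simple mmap pattern. A hedged Linux/x86-64 sketch (the helper name is ours; glibc wires this into its mmap wrapper behind the feature bit, which this sketch does not model): try MAP_32BIT first for executable mappings, and fall back to an unconstrained mapping if the low 2GB is exhausted.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Map one executable region, preferring a branch-predictor-friendly
       low address.  MAP_32BIT allocates in the lower 2GB (not 4GB) of
       the address space and is x86-64 Linux specific.  */
    static void *
    map_exec (size_t len)
    {
      int prot = PROT_READ | PROT_WRITE | PROT_EXEC;
      void *p = mmap (NULL, len, prot,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
      if (p == MAP_FAILED)
        /* Low 2GB exhausted: take any address instead of failing.  */
        p = mmap (NULL, len, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return p;
    }

    int main (void)
    {
      void *p = map_exec (4096);
      printf ("mapped at %p\n", p);
      return p == MAP_FAILED;
    }
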
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 3fe7f2277e4c557147c8cf93452c2f43a62bdffa (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3fe7f2277e4c557147c8cf93452c2f43a62bdffa commit 3fe7f2277e4c557147c8cf93452c2f43a62bdffa Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f81588aa14ead50fb38ce07f0e77e03c33d54f6e commit f81588aa14ead50fb38ce07f0e77e03c33d54f6e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. 
Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9052a15d8ebb371e7cc5f8547fedb7539f4a25fe commit 9052a15d8ebb371e7cc5f8547fedb7539f4a25fe Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
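
All of the *_erms selections in the emails above hinge on one CPUID bit: Enhanced REP MOVSB/STOSB is reported in CPUID.(EAX=7,ECX=0):EBX bit 9. For reference, a minimal detection sketch using GCC/Clang's <cpuid.h> (on compilers too old to provide __get_cpuid_count, the same can be done with __cpuid_count after checking the maximum supported leaf):

    #include <cpuid.h>
    #include <stdio.h>

    /* Enhanced REP MOVSB/STOSB: leaf 7, subleaf 0, EBX bit 9.  */
    static int
    has_erms (void)
    {
      unsigned int eax, ebx, ecx, edx;
      if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
      return (ebx >> 9) & 1;
    }

    int main (void)
    {
      printf ("ERMS: %s\n", has_erms () ? "yes" : "no");
      return 0;
    }

On a processor where this reports yes, the IFUNC resolvers described above pick the rep stosb/rep movsb variants past the REP_STOSB_THRESHOLD/REP_MOVSB_THRESHOLD cutoffs.
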
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 09104b0b6fc150112f5e282c096f739a2f49fb6e (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=09104b0b6fc150112f5e282c096f739a2f49fb6e commit 09104b0b6fc150112f5e282c096f739a2f49fb6e Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ec598c844faca4fc87e8c1ec067c94109ba58402 commit ec598c844faca4fc87e8c1ec067c94109ba58402 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. 
Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ff9d413f34efc46e4160ee4a3b30ddc04fb37518 commit ff9d413f34efc46e4160ee4a3b30ddc04fb37518 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ff9d413f34efc46e4160ee4a3b30ddc04fb37518 commit ff9d413f34efc46e4160ee4a3b30ddc04fb37518 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=76945cf3a33403b5dff551d48cb68a6729848740 commit 76945cf3a33403b5dff551d48cb68a6729848740 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Using non-temporal stores in memcpy for large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, set to 6 times the shared cache size. When the size is above the threshold, non-temporal stores will be used, but they are avoided if there is overlap between destination and source, since the destination may be in cache when the source is loaded. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if the size is above the threshold and there is no overlap between destination and source.
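The threshold policy above reduces to a small predicate. A hedged C sketch, assuming a shared_cache_size variable that stands in for the value init_cacheinfo computes; the names are illustrative, not glibc's internal symbols:

    #include <stddef.h>
    #include <stdint.h>

    static size_t shared_cache_size;  /* assumed to be set at startup */

    /* Use non-temporal stores only above 6 times the shared cache size,
       and only when source and destination cannot overlap, since the
       destination may be in cache when the source is loaded.  */
    static int
    use_non_temporal (const void *dst, const void *src, size_t n)
    {
      if (n <= shared_cache_size * 6)
        return 0;
      uintptr_t d = (uintptr_t) dst, s = (uintptr_t) src;
      return d + n <= s || s + n <= d;
    }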
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first, since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J.
Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, so copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when the size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
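Feature 1 above, the overlapping store, is worth a concrete picture. A minimal C sketch with SSE2 intrinsics standing in for the hand-written assembly: for any length from 16 to 32 bytes, one unaligned 16-byte store at each end covers the whole buffer, so no branch on the exact size is needed.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Fill dst[0..n) with byte c, assuming 16 <= n <= 32.  The two
       stores overlap whenever n < 32; that overlap is what removes
       the size-dependent branch.  */
    static void
    memset_16_to_32 (void *dst, int c, size_t n)
    {
      __m128i v = _mm_set1_epi8 ((char) c);
      _mm_storeu_si128 ((__m128i *) dst, v);
      _mm_storeu_si128 ((__m128i *) ((char *) dst + n - 16), v);
    }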
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since overhead for the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when the size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
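Items 3 through 7 above form a direction-selection policy that is easier to read as a sketch. This is a hedged illustration; the copy_* helpers are placeholders for the unrolled vector loops, not glibc functions:

    #include <stddef.h>
    #include <stdint.h>

    void copy_from_both_ends (char *, const char *, size_t); /* item 3 */
    void copy_backward (char *, const char *, size_t);       /* item 4 */
    void copy_forward (char *, const char *, size_t);        /* items 5 and 6 */

    void *
    sketch_memmove (char *dst, const char *src, size_t n)
    {
      if (dst == src || n == 0)
        return dst;                        /* item 7: nothing to copy */
      uintptr_t d = (uintptr_t) dst, s = (uintptr_t) src;
      if (d + n <= s || s + n <= d)
        copy_from_both_ends (dst, src, n); /* no overlap */
      else if (d > s)
        copy_backward (dst, src, n);       /* overlap, dst above src */
      else
        copy_forward (dst, src, n);        /* overlap, dst below src;
                                              forward rep movsb is safe here */
      return dst;
    }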
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors.
Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model to Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated.
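The family/model update above follows the documented CPUID leaf 1 signature layout. A short sketch of that decoding; it mirrors the published encoding, not glibc's exact get_common_indeces code:

    /* Decode family and model from CPUID leaf 1 EAX; when the base
       family is 0x0f, fold in the extended family and model fields.  */
    static void
    decode_family_model (unsigned int eax,
                         unsigned int *family, unsigned int *model)
    {
      *family = (eax >> 8) & 0x0f;
      *model = (eax >> 4) & 0x0f;
      if (*family == 0x0f)
        {
          *family += (eax >> 20) & 0xff;        /* extended family */
          *model += ((eax >> 16) & 0x0f) << 4;  /* extended model */
        }
    }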
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to mistakenly use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk.
It shows an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows up to 28% improvement over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to the Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to an address in the lower 2GB, not the lower 4GB. Prefer_MAP_32BIT_EXEC reduces the bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing The Knights Landing processor is based on Silvermont. This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
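One behavior in the log above that a sketch makes concrete is Prefer_MAP_32BIT_EXEC: attempt an executable mapping with MAP_32BIT first and fall back to an unconstrained mapping. This is an illustration for x86-64 Linux, not glibc's mmap wrapper:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Try to place an executable mapping in the low 2GB (MAP_32BIT),
       falling back to a normal mapping if that fails.  */
    static void *
    map_exec (size_t len)
    {
      void *p = mmap (NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
      if (p != MAP_FAILED)
        return p;
      return mmap (NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }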
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 157c57198e893b4882d1feb98de2b0721ee408fc (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc commit 157c57198e893b4882d1feb98de2b0721ee408fc Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b commit f817b9d36215ab60d58cc744d22773b4961a2c9b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. 
Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0 commit 122600f4b380b00ce0f682039fe59af4bd0edbc0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. 
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3 commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machine. non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 6 times of shared cache size. When size is above the threshold, non temporal store will be used, but avoid non-temporal store if there is overlap between destination and source since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold and there is no overlap between destination and source. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. 
Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled fro now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core proessors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same codes when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap bewteen source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same codes when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap bewteen source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by fallbacking to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likwise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than emcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. 
Set Fast_Copy_Backward for AMD Excavator processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
(cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d
commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 08:36:16 2016 -0700

Don't set %rcx twice before "rep movsb"

* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb".
(cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5
commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 07:46:56 2016 -0700

Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors

Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c.

[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
(cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538
commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Nov 30 08:53:37 2015 -0800

Update family and model detection for AMD CPUs

AMD CPUs use a similar encoding scheme for extended family and model to Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get the family and model for both Intel and AMD CPUs when family == 0x0f.

[BZ #19214]
* sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return the extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460
commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 10 05:26:46 2016 -0800

Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h

index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array by mistake, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that such errors can be caught at build time.

[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name.
(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d
commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 3 14:51:40 2016 -0800

Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS

We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits.

[BZ #19758]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:48:11 2016 -0800

Group AVX512 functions in .text.avx512 section

* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512.
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise.
(cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9
commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 4 08:37:40 2016 -0800

x86-64: Fix memcpy IFUNC selection

Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following: 1. __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set. 5. __memcpy_ssse3

[BZ #18880]
* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back.
(cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44
commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date: Sat Jan 16 00:49:45 2016 +0300

Added memcpy/memmove family optimized with AVX512 for KNL hardware.

Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. They show an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date: Sat Dec 19 02:47:28 2015 +0300

Added memset optimized with AVX512 for KNL hardware.

It shows an improvement of up to 28% over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).

* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
* sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New.
* sysdeps/x86/cpu-features.c (init_cpu_features): Set Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a
commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Oct 21 14:44:23 2015 -0700

Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT

According to the Silvermont software optimization guide, for 64-bit applications branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to the lower 2GB, not the lower 4GB, of the address space. Prefer_MAP_32BIT_EXEC reduces the bits available for address space layout randomization (ASLR); it is always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont.

[BZ #19367]
* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf
commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Dec 15 11:46:54 2015 -0800

Enable Silvermont optimizations for Knights Landing

The Knights Landing processor is based on Silvermont. This patch enables the Silvermont optimizations for Knights Landing.

* sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing.
-----------------------------------------------------------------------
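As background for the family/model commit above (ca9c5ede), the extended CPUID encoding can be decoded as in the following sketch. It uses GCC's <cpuid.h>; the decoding shown is the family == 0x0f extension described in the commit (Intel additionally extends the model for family 0x06, which is omitted here), and it is an illustration, not glibc's actual get_common_indeces:

#include <cpuid.h>
#include <stdio.h>

int
main (void)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return 1;

  unsigned int family = (eax >> 8) & 0x0f;
  unsigned int model = (eax >> 4) & 0x0f;

  /* Family 0x0f means the real values continue in the extended
     family/model fields of EAX; this is the case the commit handles
     for both Intel and AMD.  */
  if (family == 0x0f)
    {
      family += (eax >> 20) & 0xff;          /* extended family */
      model += ((eax >> 16) & 0x0f) << 4;    /* extended model */
    }

  printf ("family 0x%x, model 0x%x\n", family, model);
  return 0;
}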
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created at fe38127f6d289dd6eaa6425acb108b7b384ddc4b (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe38127f6d289dd6eaa6425acb108b7b384ddc4b
commit fe38127f6d289dd6eaa6425acb108b7b384ddc4b
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use the generic memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2c5fc8567a694ae6115b25db787673fb8dc140a5
commit 2c5fc8567a694ae6115b25db787673fb8dc140a5
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. There is no change in IFUNC selection if the SSE2 and AVX2 memcpy/memmove weren't used before. If the SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are also faster than the previous default memcpy/memmove used in libc.a and ld.so, we remove the previous default memcpy/memmove as well and make the new ones the default, except that the non-temporal store isn't used in ld.so. Together, this reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed37fe74cfe0d9f68a8023b7f73a5805f4a5a206
commit ed37fe74cfe0d9f68a8023b7f73a5805f4a5a206
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. There is no change in IFUNC selection if the SSE2 and AVX2 memsets weren't used before. If the SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if the processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if the processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96b5fbcbc09df10b093221d6b55eaa5e7e8c044f
commit 96b5fbcbc09df10b093221d6b55eaa5e7e8c044f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Apr 3 17:21:45 2016 -0700

X86-64: Use non-temporal store in memcpy on large data

The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Using non-temporal stores in memcpy on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, set to 6 times the shared cache size. When the size is above the threshold, non-temporal stores are used, but they are avoided if there is overlap between the destination and the source, since the destination may be in cache when the source is loaded.

For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop.

[BZ #19928]
* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code size. (VMOVA): Changed to movaps for smaller code size.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses, and use non-temporal stores if the size is above the threshold and there is no overlap between destination and source.
-----------------------------------------------------------------------
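The threshold logic of commit 96b5fbcb above can be sketched in C as follows. All names here (my_memcpy, copy_non_temporal, non_temporal_threshold) and the 8 MiB cache-size stand-in are assumptions for illustration; glibc computes __x86_shared_non_temporal_threshold from the detected shared cache size and implements the copy loops in assembly:

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for __x86_shared_non_temporal_threshold; glibc sets it to
   6 times the shared cache size at startup.  8 MiB is assumed here.  */
static size_t non_temporal_threshold = 6 * (8 * 1024 * 1024);

static void
copy_non_temporal (char *dst, const char *src, size_t n)
{
  /* Head: plain stores until dst is 16-byte aligned.  */
  size_t head = (16 - ((uintptr_t) dst & 15)) & 15;
  memcpy (dst, src, head);
  dst += head; src += head; n -= head;

  /* Body: movntdq streaming stores bypass the cache.  The destination
     is now aligned; the loads may be unaligned.  */
  while (n >= 16)
    {
      _mm_stream_si128 ((__m128i *) dst,
                        _mm_loadu_si128 ((const __m128i *) src));
      dst += 16; src += 16; n -= 16;
    }
  _mm_sfence ();          /* order the streaming stores */
  memcpy (dst, src, n);   /* tail */
}

void *
my_memcpy (void *dstp, const void *srcp, size_t n)
{
  char *dst = dstp;
  const char *src = srcp;

  /* Use non-temporal stores only above the threshold and only when the
     ranges do not overlap: an overlapping destination may already be in
     cache when the source is loaded.  */
  if (n >= non_temporal_threshold
      && (dst + n <= src || src + n <= dst))
    copy_non_temporal (dst, src, n);
  else
    memmove (dst, src, n);
  return dstp;
}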
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created at 9e1ddc1180ca0619d12b620b227726233a48b9bc (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9e1ddc1180ca0619d12b620b227726233a48b9bc
commit 9e1ddc1180ca0619d12b620b227726233a48b9bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use the generic memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3443d7810db1092ac70a0fde7b85732a2e00cdc3
commit 3443d7810db1092ac70a0fde7b85732a2e00cdc3
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. There is no change in IFUNC selection if the SSE2 and AVX2 memcpy/memmove weren't used before. If the SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are also faster than the previous default memcpy/memmove used in libc.a and ld.so, we remove the previous default memcpy/memmove as well and make the new ones the default, except that the non-temporal store isn't used in ld.so. Together, this reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d2a372d44dc05201242d0fd5551df9c3174806c
commit 1d2a372d44dc05201242d0fd5551df9c3174806c
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. There is no change in IFUNC selection if the SSE2 and AVX2 memsets weren't used before. If the SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if the processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if the processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fa066d5f5ff996990869bbbad08435f02d18bb3
commit 9fa066d5f5ff996990869bbbad08435f02d18bb3
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Apr 3 17:21:45 2016 -0700

X86-64: Use non-temporal store in memcpy on large data

The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Using non-temporal stores in memcpy on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, set to 6 times the shared cache size. When the size is above the threshold, non-temporal stores are used, but they are avoided if there is overlap between the destination and the source, since the destination may be in cache when the source is loaded.

For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop.

[BZ #19928]
* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code size. (VMOVA): Changed to movaps for smaller code size.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses, and use non-temporal stores if the size is above the threshold and there is no overlap between destination and source.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d
commit 0932dd8b56db46dd421a4855fb5dee9de092538d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Apr 6 10:19:16 2016 -0700

X86-64: Prepare memmove-vec-unaligned-erms.S

Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove.

* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise.
Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
(cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d
commit da2da79262814ba4ead3ee487549949096d8ad2d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Apr 6 09:10:18 2016 -0700

X86-64: Prepare memset-vec-unaligned-erms.S

Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset.

* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
(cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb
commit 9a93bdbaff81edf67c5486c84f8098055e355abb
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Apr 5 05:21:07 2016 -0700

Force 32-bit displacement in memset-vec-unaligned-erms.S

* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force a 32-bit displacement to avoid a long nop between instructions.
(cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483
commit 5118e532600549ad0f56cb9b1a179b8eab70c483
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Apr 5 05:19:05 2016 -0700

Add a comment in memset-sse2-unaligned-erms.S

* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA.
(cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292
commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Apr 3 14:32:20 2016 -0700

Don't put SSE2/AVX/AVX512 memmove/memset in ld.so

Since memmove and memset in ld.so don't use IFUNC, don't put the SSE2, AVX and AVX512 memmove and memset in ld.so.

* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
(cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc
commit a96379797a7eecc1b709cad7b68981eb698783dc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Apr 3 12:38:25 2016 -0700

Fix memmove-vec-unaligned-erms.S

__mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first, since that case is less common.

* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first.
(cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9
commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 15:08:48 2016 -0700

Remove Fast_Copy_Backward from Intel Core processors

Intel Core i3, i5 and i7 processors have fast unaligned copy, so copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion.

* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors.
(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f
commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:05:51 2016 -0700

Add x86-64 memset with unaligned store and rep stosb

Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when the size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB.

Key features: 1. Use overlapping stores to avoid branches. 2. For sizes <= 4 times the vector register size, fully unroll the loop. 3. For sizes > 4 times the vector register size, store 4 times the vector register size at a time.

[BZ #19881]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650
commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:04:26 2016 -0700

Add x86-64 memmove with unaligned load/store and rep movsb

Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When the size is <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when the size is > 8 times the vector register size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when the size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes.

Key features: 1. Use overlapping loads and stores to avoid branches. 2. For sizes <= 8 times the vector register size, load all of the source into registers and store it together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If the destination address > the source address, copy backward with 8 times the vector register size at a time. 5. Otherwise, copy forward with 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy with 8 times the vector register size at a time. 7. Skip the copy when the destination address == the source address.

[BZ #19776]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
(cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 19:22:59 2016 -0700

Initial Enhanced REP MOVSB/STOSB (ERMS) support

Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to the x86 cpu-features.

* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise.
(cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc
commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:15:59 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise.
(cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:13:36 2016 -0700

Implement x86-64 multiarch mempcpy in memcpy

Implement the x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
(cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229
commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 04:39:48 2016 -0700

[x86] Add a feature bit: Fast_Unaligned_Copy

On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load.

[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
(cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6
Author: Florian Weimer <fweimer@redhat.com>
Date: Fri Mar 25 11:11:42 2016 +0100

tst-audit10: Fix compilation on compilers without bit_AVX512F

[BZ #19860]
* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit.
(cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 08:36:16 2016 -0700

Don't set %rcx twice before "rep movsb"

* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb".
(cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483
commit c273f613b0cc779ee33cc33d20941d271316e483
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 07:46:56 2016 -0700

Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors

Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c.

[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
(cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4
commit c858d10a4e7fd682f2e7083836e4feacc2d580f4
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 10 05:26:46 2016 -0800

Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h

index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array by mistake, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that such errors can be caught at build time.

[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name.
(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9
Author: Roland McGrath <roland@hack.frob.com>
Date: Tue Mar 8 12:31:13 2016 -0800

Fix tst-audit10 build when -mavx512f is not supported.
(cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c
commit ba80f6ceea3a6b6f711038646f419125fe3ad39c
Author: Florian Weimer <fweimer@redhat.com>
Date: Mon Mar 7 16:00:25 2016 +0100

tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately

[BZ #19269]
This ensures that GCC will not use unsupported instructions before the run-time check that ensures support.
(cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e
commit b8fe596e7f750d4ee2fca14d6a3999364c02662e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:48:11 2016 -0800

Group AVX512 functions in .text.avx512 section

* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512.
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise.
(cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29
commit e455d17680cfaebb12692547422f95ba1ed30e29
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 4 08:37:40 2016 -0800

x86-64: Fix memcpy IFUNC selection

Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following: 1. __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set. 5. __memcpy_ssse3

[BZ #18880]
* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back.
(cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
-----------------------------------------------------------------------
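The "overlapping store" trick listed as key feature 1 of the new memset (commit 30c389be above) can be illustrated with SSE2 intrinsics. The function name and the [16, 32] byte range are assumptions for the sketch; the real implementation applies the same idea in assembly:

#include <emmintrin.h>
#include <stddef.h>

/* set16to32 is an illustrative name; it requires 16 <= n <= 32.  */
static void
set16to32 (char *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);

  /* One store at the start and one at the end.  For n < 32 the two
     stores overlap; together they cover every length in [16, 32]
     without branching on the exact size.  */
  _mm_storeu_si128 ((__m128i *) dst, v);
  _mm_storeu_si128 ((__m128i *) (dst + n - 16), v);
}

The same two-store pattern at 32- and 64-byte vector widths is what lets one source file, memset-vec-unaligned-erms.S, generate the SSE2, AVX2 and AVX512 versions.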
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created at bce4ef1859db80554f4f8b20e8597984f06b760e (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bce4ef1859db80554f4f8b20e8597984f06b760e
commit bce4ef1859db80554f4f8b20e8597984f06b760e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use the generic memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=46eecce30a070ca39d1b3c04dcdf68b378913b74
commit 46eecce30a070ca39d1b3c04dcdf68b378913b74
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. There is no change in IFUNC selection if the SSE2 and AVX2 memcpy/memmove weren't used before. If the SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are also faster than the previous default memcpy/memmove used in libc.a and ld.so, we remove the previous default memcpy/memmove as well and make the new ones the default, except that the non-temporal store isn't used in ld.so. Together, this reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6d234de7bccb082f25fea6641a0fbe414e13dfb4
commit 6d234de7bccb082f25fea6641a0fbe414e13dfb4
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. There is no change in IFUNC selection if the SSE2 and AVX2 memsets weren't used before. If the SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if the processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if the processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
-----------------------------------------------------------------------
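The ERMS selection that these logs keep referring to rests on a single CPUID bit: Enhanced REP MOVSB/STOSB is reported in CPUID leaf 7, subleaf 0, EBX bit 9. A hedged detection sketch follows; has_erms is an illustrative name, not a glibc interface:

#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

static bool
has_erms (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* CPUID.(EAX=07H, ECX=0):EBX bit 9 is the ERMS feature flag.  */
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return (ebx >> 9) & 1;
}

int
main (void)
{
  printf ("ERMS: %d\n", has_erms ());
  return 0;
}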
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/cacheline/ifunc has been created at 52b083d35ef77d2c6164c66d2d87e87760b0b1e2 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=52b083d35ef77d2c6164c66d2d87e87760b0b1e2 commit 52b083d35ef77d2c6164c66d2d87e87760b0b1e2 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=864950e5e1ba3f4a9aae7380d481fc029c362a3b commit 864950e5e1ba3f4a9aae7380d481fc029c362a3b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. 
Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d95a13a81078acecd67b827ca684fd65d3e3d266 commit d95a13a81078acecd67b827ca684fd65d3e3d266 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. 
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
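To make the IFUNC selection these commits describe concrete, here is a minimal C sketch of the resolver logic. The feature flags are hypothetical stand-ins for glibc's internal CPU feature bits (not the real HAS_ARCH_FEATURE machinery); only the selection order is taken from the commit text.

  #include <stdio.h>

  /* Hypothetical feature flags, as if filled in from CPUID.  */
  struct cpu { int avx2, avx512, erms, prefer_vzeroupper; };

  /* Widest usable vector first, with the _erms variant when Enhanced
     REP STOSB is available; mirrors the order in the commit messages.  */
  static const char *pick_memset (const struct cpu *c)
  {
    if (c->avx512 && c->prefer_vzeroupper)
      return c->erms ? "__memset_avx512_unaligned_erms"
                     : "__memset_avx512_unaligned";
    if (c->avx2)
      return c->erms ? "__memset_avx2_unaligned_erms"
                     : "__memset_avx2_unaligned";
    return c->erms ? "__memset_sse2_unaligned_erms"
                   : "__memset_sse2_unaligned";
  }

  int main (void)
  {
    struct cpu c = { .avx2 = 1, .erms = 1 };
    puts (pick_memset (&c));   /* prints __memset_avx2_unaligned_erms */
    return 0;
  }

The memcpy/memmove resolver has the same shape, with __memcpy_sse2_unaligned as the final default.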
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/cacheline/ifunc has been created at 38d75b36f9c32733c4c3987d67f05436a452ee24 (commit) - Log ----------------------------------------------------------------- The branch carries the same three commits as above under new hashes (38d75b36f9c32733c4c3987d67f05436a452ee24, a3fbfb073c7f6e448b51bfde70c78eebc04af096, 0c9188718247d9f426d307c337cc86219d056c61), plus: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=343b5e49525c4c936643418300ea16437256b1e0 commit 343b5e49525c4c936643418300ea16437256b1e0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 24 10:53:25 2016 -0700 Align to cacheline -----------------------------------------------------------------------
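The "Align to cacheline" commit has no log body, but the idea it names pairs naturally with the overlapping stores described earlier: do an unaligned head store, round the pointer up to a 64-byte boundary, and run the main loop on whole cache lines. A sketch under those assumptions (SSE2 intrinsics, 64-byte lines, n >= 64; this is illustrative, not the branch's actual code):

  #include <emmintrin.h>
  #include <stdint.h>
  #include <stddef.h>

  void sketch_memset_aligned (char *dst, int c, size_t n)
  {
    __m128i v = _mm_set1_epi8 ((char) c);
    char *end = dst + n;
    /* Head: fill the first 64 bytes unconditionally.  */
    _mm_storeu_si128 ((__m128i *) dst, v);
    _mm_storeu_si128 ((__m128i *) (dst + 16), v);
    _mm_storeu_si128 ((__m128i *) (dst + 32), v);
    _mm_storeu_si128 ((__m128i *) (dst + 48), v);
    /* Round up to the next 64-byte boundary; the head store already
       covered every byte we skip, so the overlap costs nothing.  */
    char *p = (char *) (((uintptr_t) dst + 64) & ~(uintptr_t) 63);
    /* Aligned body: each iteration touches exactly one cache line.  */
    while (p + 64 <= end)
      {
        _mm_store_si128 ((__m128i *) p, v);
        _mm_store_si128 ((__m128i *) (p + 16), v);
        _mm_store_si128 ((__m128i *) (p + 32), v);
        _mm_store_si128 ((__m128i *) (p + 48), v);
        p += 64;
      }
    /* Tail: last 64 bytes, overlapping whatever the loop wrote.  */
    _mm_storeu_si128 ((__m128i *) (end - 64), v);
    _mm_storeu_si128 ((__m128i *) (end - 48), v);
    _mm_storeu_si128 ((__m128i *) (end - 32), v);
    _mm_storeu_si128 ((__m128i *) (end - 16), v);
  }

The head, body and tail stores may overwrite one another, which is exactly the point: no size-dependent branch is needed at either edge.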
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 68bb72a044d516058d00352d5600073d0d5d27e4 (commit) - Log ----------------------------------------------------------------- The branch carries the same three commits as above under new hashes (68bb72a044d516058d00352d5600073d0d5d27e4, 06da031549e4765f5481bf34b502529bef43e76c, a84e1c9bc4654a30b68c3612cfd395a4f85f1812). -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at ebb33e72c79c9068a81079001db9a24dc6550fa2 (commit) - Log ----------------------------------------------------------------- The branch again carries the same three commits under new hashes (ebb33e72c79c9068a81079001db9a24dc6550fa2, 8590f10f1929588364725660082d01c3990037d9, bb7291b391e4969ad668a5dd349e246f660e402e), plus: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=81a26fd73fb6b19ff2868e51d894012f192422fa commit 81a26fd73fb6b19ff2868e51d894012f192422fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun May 8 07:52:08 2016 -0700 Initialize __x86_shared_non_temporal_threshold only if zero Support setting a processor-specific __x86_shared_non_temporal_threshold value in init_cpu_features. * sysdeps/x86/cacheinfo.c (__x86_shared_non_temporal_threshold): Initialize only if it is zero. -----------------------------------------------------------------------
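The "only if zero" commit above is a small initialization-order pattern: init_cpu_features runs first and may install a processor-specific threshold, so the generic computation in init_cacheinfo must not clobber it. A sketch of the pattern; the variable name is real, but the function and the scale factor are illustrative (the factor is the 6x-shared-cache value from the BZ #19928 commit further below):

  long int __x86_shared_non_temporal_threshold;  /* 0 = not yet chosen */

  static void init_cacheinfo_sketch (long int shared_cache_size)
  {
    /* Respect a processor-specific value installed earlier by
       init_cpu_features; compute the generic default otherwise.  */
    if (__x86_shared_non_temporal_threshold == 0)
      __x86_shared_non_temporal_threshold = 6 * shared_cache_size;
  }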
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 85702d388d909d38f06c510c4504561df94d99bd (commit) - Log ----------------------------------------------------------------- The branch carries the same three commits once more under new hashes (85702d388d909d38f06c510c4504561df94d99bd, 550bdb5f6b24eb2422f0b0c3a4c81c4665f132ff, 1d96e81d73e27b2d243eab126433e24f4a4da2ef). -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at b60dda5f2385aaca873069f9fb28645b82a1b711 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b60dda5f2385aaca873069f9fb28645b82a1b711 commit b60dda5f2385aaca873069f9fb28645b82a1b711 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 27 15:16:22 2016 -0700 Count number of logical processors sharing L2 cache For Intel processors, when there are both L2 and L3 caches, the SMT level type should be used to count the number of available logical processors sharing the L2 cache. If there is only an L2 cache, the core level type should be used instead. This count should be used for non-inclusive L2 and L3 caches. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of available logical processors with SMT level type sharing L2 cache for Intel processors. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed46697862f2b0c2db726cc4c772e6003914bd72 commit ed46697862f2b0c2db726cc4c772e6003914bd72 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 20 14:41:14 2016 -0700 Remove special L2 cache case for Knights Landing L2 cache is shared by 2 cores on Knights Landing, which has 4 threads per core: https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing So the L2 cache is shared by 8 threads on Knights Landing, as reported by CPUID, and the special L2 cache case for Knights Landing should be removed. [BZ #18185] * sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads sharing L2 cache to 2 for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=07f943915311f6f92e5a031911d32c5e7458bfd5 commit 07f943915311f6f92e5a031911d32c5e7458bfd5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu May 19 10:02:36 2016 -0700 Correct Intel processor level type mask from CPUID Intel CPUID with EAX == 11 returns: ECX Bits 07 - 00: Level number. Same value in ECX input. Bits 15 - 08: Level type. Bits 31 - 16: Reserved. The level type is in bits 15-08, so the Intel processor level type mask should be 0xff00, not 0xff0. [BZ #20119] * sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel processor level type mask for CPUID with EAX == 11.
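A minimal, standalone illustration of the corrected mask, assuming a GCC/clang toolchain that provides <cpuid.h>; the enumeration loop is illustrative, not the actual init_cacheinfo code:

  #include <cpuid.h>
  #include <stdio.h>

  int main (void)
  {
    unsigned int eax, ebx, ecx, edx;
    /* Walk the sub-leaves of CPUID leaf 11 (extended topology).  */
    for (unsigned int sub = 0; ; sub++)
      {
        __cpuid_count (11, sub, eax, ebx, ecx, edx);
        /* Level type is ECX bits 15-08, hence mask 0xff00, not 0xff0.  */
        unsigned int type = (ecx & 0xff00) >> 8;
        if (type == 0)   /* invalid level: enumeration is finished */
          break;
        printf ("sub-leaf %u: level type %u, logical processors %u\n",
                sub, type, ebx & 0xffff);
      }
    return 0;
  }

With the old 0xff0 mask the test mixed level-number bits into the level type, so SMT and core levels could be misclassified.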
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=201aebf739482fbb730d10eb7cf8335629bb4de4 commit 201aebf739482fbb730d10eb7cf8335629bb4de4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu May 19 09:09:00 2016 -0700 Check the HTT bit before counting logical threads Skip counting logical threads for Intel processors if the HTT bit is 0, which indicates there is only a single logical processor. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting logical threads if the HTT bit is 0. * sysdeps/x86/cpu-features.h (bit_cpu_HTT): New. (index_cpu_HTT): Likewise. (reg_HTT): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=dff8bcdab5968ac53e52ef06cabe8d921b429d22 commit dff8bcdab5968ac53e52ef06cabe8d921b429d22 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu May 19 08:49:45 2016 -0700 Remove alignments on jump targets in memset X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which increases code size but does not necessarily improve performance. As the memset benchtest data comparing aligned and unaligned jump targets on various Intel and AMD processors https://sourceware.org/bugzilla/attachment.cgi?id=9277 shows, aligning jump targets isn't necessary. [BZ #20115] * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset): Remove alignments on jump targets. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aba9d000bf8441d77f0557af360e3aea7525d03e commit aba9d000bf8441d77f0557af360e3aea7525d03e Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 13 08:29:22 2016 -0700 Call init_cpu_features only if SHARED is defined In a static executable, since init_cpu_features is called early from __libc_start_main, there is no need to call it again in dl_platform_init. [BZ #20072] * sysdeps/i386/dl-machine.h (dl_platform_init): Call init_cpu_features only if SHARED is defined. * sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6118b2d23016ec790b99b9331c3d7a45d588134e commit 6118b2d23016ec790b99b9331c3d7a45d588134e Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 13 07:18:25 2016 -0700 Support non-inclusive caches on Intel processors * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and support non-inclusive caches on Intel processors. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8642c9a553d8ce8a3a0496ed11fed5a575d338c5 commit 8642c9a553d8ce8a3a0496ed11fed5a575d338c5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed May 11 05:49:09 2016 -0700 Remove x86 ifunc-defines.sym and rtld-global-offsets.sym Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on i686 and x86-64. * sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers): Remove ifunc-defines.sym. * sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers): Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed. * sysdeps/x86/rtld-global-offsets.sym: Likewise. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise. * sysdeps/x86/Makefile (gen-as-const-headers): Remove rtld-global-offsets.sym. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ... * sysdeps/x86/cpu-features-offsets.sym: This. * sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h> instead of <ifunc-defines.h> and <rtld-global-offsets.h>. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3038902f233a5e0028a6424685b410f6c201040f commit 3038902f233a5e0028a6424685b410f6c201040f Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun May 8 08:49:02 2016 -0700 Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86 Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86. No code changes on x86 and x86_64. * sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c> instead of <sysdeps/x86_64/cacheinfo.c>. * sysdeps/x86_64/cacheinfo.c: Moved to ... * sysdeps/x86/cacheinfo.c: Here. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=df2b390bba18903d62c8910e808bfb0dce7f033c commit df2b390bba18903d62c8910e808bfb0dce7f033c Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 15 05:22:53 2016 -0700 Detect Intel Goldmont and Airmont processors Updated from the model numbers of Goldmont and Airmont processors in the Intel64 and IA-32 Processor Architectures Software Developer's Manual Volume 3 Revision 058. * sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel Goldmont and Airmont processors. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc commit 157c57198e893b4882d1feb98de2b0721ee408fc Author: H.J.
Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c (the same patch as on the branches above) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b commit f817b9d36215ab60d58cc744d22773b4961a2c9b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove (the same patch as above) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0 commit 122600f4b380b00ce0f682039fe59af4bd0edbc0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets (the same patch as above)
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3 commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on a Haswell machine. Non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold for using non-temporal store, set to 6 times the shared cache size. When size is above the threshold, non-temporal store is used, but not if there is overlap between destination and source, since the destination may be in cache when the source is loaded. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal store if size is above the threshold and there is no overlap between destination and source. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)
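The dispatch just described fits in a few lines of C. A sketch under stated assumptions: the threshold and direction logic follow the commit text, while the loop bodies are plain byte loops standing in for the 4-vector and MOVNTDQ loops:

  #include <stdint.h>
  #include <stddef.h>

  /* Illustrative threshold; glibc computes the real one in
     init_cacheinfo as 6 times the shared cache size.  */
  static size_t non_temporal_threshold = 6u * 8u * 1024u * 1024u;

  /* Two one-sided distance checks via unsigned wraparound.  */
  static int overlaps (uintptr_t dst, uintptr_t src, size_t n)
  {
    return dst - src < n || src - dst < n;
  }

  void *sketch_memmove (void *dv, const void *sv, size_t n)
  {
    unsigned char *d = dv;
    const unsigned char *s = sv;
    if (n >= non_temporal_threshold
        && !overlaps ((uintptr_t) dv, (uintptr_t) sv, n))
      {
        /* No overlap and far bigger than the cache: the real code
           switches to streaming (non-temporal) stores here so the
           copy does not evict everything else.  A plain loop stands in.  */
        for (size_t i = 0; i < n; i++)
          d[i] = s[i];
      }
    else if ((uintptr_t) dv - (uintptr_t) sv >= n)
      for (size_t i = 0; i < n; i++)        /* forward: dst not inside src */
        d[i] = s[i];
    else
      for (size_t i = n; i > 0; i--)        /* backward: dst inside src */
        d[i - 1] = s[i - 1];
    return dv;
  }

The unsigned subtraction makes each overlap test a single compare: dst - src wraps to a huge value when dst < src, so "dst - src < n" is true exactly when the destination lands inside the source range.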
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first, since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
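Feature 1, the overlapping store, is the part that most repays a concrete example. For any size between one and two vector widths, two unaligned stores pinned to the two ends of the buffer cover every byte with no branch on the exact length. A sketch with SSE2 intrinsics (the function name and the 16-to-32-byte range are illustrative):

  #include <emmintrin.h>
  #include <stddef.h>

  /* Handles every n with 16 <= n <= 32 using exactly two stores.  */
  static void memset_16_to_32 (void *dst, int c, size_t n)
  {
    __m128i v = _mm_set1_epi8 ((char) c);
    _mm_storeu_si128 ((__m128i *) dst, v);                      /* first 16 */
    _mm_storeu_si128 ((__m128i *) ((char *) dst + n - 16), v);  /* last 16 */
  }

When n < 32 the two stores overlap in the middle, which is harmless for memset and replaces a tail loop or a size switch.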
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:04:26 2016 -0700

Add x86-64 memmove with unaligned load/store and rep movsb

Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When the size is <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when the size is > 8 times the vector register size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb and the other without. They share the same code when the size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for a 16-byte vector register size and scaled up for larger vector register sizes.

Key features:
1. Use overlapping loads and stores to avoid branches.
2. For sizes <= 8 times the vector register size, load all source data into registers and store it together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time.
4. If the address of the destination is greater than the address of the source, copy backward 8 times the vector register size at a time.
5. Otherwise, copy forward 8 times the vector register size at a time.
6. Use rep movsb only for forward copies. Avoid slow backward rep movsb by falling back to a backward copy of 8 times the vector register size at a time.
7. Skip when the address of the destination equals the address of the source.

[BZ #19776]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

(cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 19:22:59 2016 -0700

Initial Enhanced REP MOVSB/STOSB (ERMS) support

Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to the x86 cpu-features.

* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
(index_cpu_ERMS): Likewise.
(reg_ERMS): Likewise.

(cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
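Points 1-7 of the memmove commit above boil down to the following control flow, sketched in C. The per-chunk memmove calls stand in for the unrolled 8 * VEC_SIZE vector loads/stores and for the forward rep movsb path; they also keep the sketch free of undefined behavior when the full ranges overlap. VEC_SIZE is illustrative.

#include <stddef.h>
#include <string.h>

enum { VEC_SIZE = 32 };            /* e.g. the AVX variant */
enum { CHUNK = 8 * VEC_SIZE };

void *
memmove_sketch (void *dst, const void *src, size_t n)
{
  char *d = dst;
  const char *s = src;

  if (d == s)
    return dst;                    /* point 7: skip when equal */

  if (n <= CHUNK)
    {
      /* Point 2: load all source data first, then store, so no
         overlap check is needed; plain memmove models that here.  */
      memmove (d, s, n);
      return dst;
    }

  if (d < s || d >= s + n)
    {
      /* Points 3/5/6: forward copy, CHUNK bytes at a time (the only
         direction where rep movsb is used in the real code).  */
      size_t i = 0;
      for (; i + CHUNK <= n; i += CHUNK)
        memmove (d + i, s + i, CHUNK);
      memmove (d + i, s + i, n - i);   /* tail */
    }
  else
    {
      /* Points 4/6: destination overlaps and sits above the source;
         copy backward from the end to avoid slow backward rep movsb.  */
      size_t i = n;
      while (i >= CHUNK)
        {
          i -= CHUNK;
          memmove (d + i, s + i, CHUNK);
        }
      memmove (d, s, i);               /* head */
    }
  return dst;
}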
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:15:59 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY): Renamed to ...
(__mempcpy_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

(cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:13:36 2016 -0700

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of the code. This reduces the code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.

(cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 04:39:48 2016 -0700

[x86] Add a feature bit: Fast_Unaligned_Copy

On AMD processors, memcpy optimized with unaligned SSE loads is slower than memcpy optimized with aligned SSSE3, while the other string functions are faster with unaligned SSE loads. A feature bit, Fast_Unaligned_Copy, is added to select the memcpy optimized with unaligned SSE loads.

[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New.
(index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load.

(cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)
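The mempcpy-in-memcpy sharing above works because mempcpy differs from memcpy only in its return value, the end of the destination rather than its start. At the assembly level the two entry points fall into one shared body; in C the relationship is just the following one-liner (a model, not glibc's actual entry-point layout):

#include <string.h>

/* C model of the shared-body idea: same copy, different return value.  */
void *
mempcpy_sketch (void *dst, const void *src, size_t n)
{
  return (char *) memcpy (dst, src, n) + n;
}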
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 08:36:16 2016 -0700

Don't set %rcx twice before "rep movsb"

* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb".

(cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 07:46:56 2016 -0700

Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors

Since only Intel processors with AVX2 have fast unaligned loads, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for the other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c.

[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here.
(init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
(CPU_FEATURES_ARCH_P): Likewise.
(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.

(cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Nov 30 08:53:37 2015 -0800

Update family and model detection for AMD CPUs

AMD CPUs use a similar encoding scheme for extended family and model to Intel CPUs, as shown in:

http://support.amd.com/TechDocs/25481.pdf

This patch updates get_common_indeces to get the family and model for both Intel and AMD CPUs when family == 0x0f.

[BZ #19214]
* sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return the extended model. Update family and model with extended family and model when family == 0x0f.
(init_cpu_features): Updated.
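The family/model decoding that this commit unifies for Intel and AMD can be sketched with GCC's <cpuid.h>: CPUID leaf 1 packs stepping, model and family into EAX together with extended fields, and the extended fields are folded in when the base family is 0x0f. This is a standalone sketch of that decoding, not glibc's get_common_indeces itself:

#include <cpuid.h>
#include <stdio.h>

int
main (void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return 1;

  unsigned int family = (eax >> 8) & 0x0f;
  unsigned int model = (eax >> 4) & 0x0f;
  unsigned int ext_family = (eax >> 20) & 0xff;
  unsigned int ext_model = (eax >> 16) & 0x0f;

  if (family == 0x0f)
    {
      /* Both Intel and AMD encode large families/models this way.  */
      family += ext_family;
      model += ext_model << 4;
    }
  printf ("family 0x%x, model 0x%x\n", family, model);
  return 0;
}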
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 10 05:26:46 2016 -0800

Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h

The index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to mistakenly use the bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has

HAS_CPU_FEATURE (Fast_Rep_String)

which should be

HAS_ARCH_FEATURE (Fast_Rep_String)

We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that such errors are caught at build time.

[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
(bit_arch_*): This for the feature array.
(bit_*): Renamed to ...
(bit_cpu_*): This for the cpu array.
(index_*): Renamed to ...
(index_arch_*): This for the feature array.
(index_*): Renamed to ...
(index_cpu_*): This for the cpu array.
[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
[__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
[__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name.
[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name.

(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 3 14:51:40 2016 -0800

Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS

We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding the other bits.

[BZ #19758]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:48:11 2016 -0800

Group AVX512 functions in .text.avx512 section

* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512.
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise.

(cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 4 08:37:40 2016 -0800

x86-64: Fix memcpy IFUNC selection

Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following:

1. __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set.
3. __memcpy_sse2 if SSSE3 isn't available.
4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set.
5. __memcpy_ssse3

[BZ #18880]
* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back.

(cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date: Sat Jan 16 00:49:45 2016 +0300

Added memcpy/memmove family optimized with AVX512 for KNL hardware.

Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk and memmove_chk. They show an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
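The five-step order in the memcpy IFUNC commit above maps onto a simple chain of checks. In this sketch the booleans stand in for glibc's HAS_ARCH_FEATURE/HAS_CPU_FEATURE tests and the stubs for the assembly implementations; the real selector lives in sysdeps/x86_64/multiarch/memcpy.S:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stubs standing in for __memcpy_avx_unaligned and friends.  */
static void *avx_unaligned (void *d, const void *s, size_t n) { return memcpy (d, s, n); }
static void *sse2_unaligned (void *d, const void *s, size_t n) { return memcpy (d, s, n); }
static void *sse2 (void *d, const void *s, size_t n) { return memcpy (d, s, n); }
static void *ssse3_back (void *d, const void *s, size_t n) { return memcpy (d, s, n); }
static void *ssse3 (void *d, const void *s, size_t n) { return memcpy (d, s, n); }

static memcpy_fn
select_memcpy (bool avx_fast_unaligned_load, bool fast_unaligned_load,
               bool has_ssse3, bool fast_copy_backward)
{
  if (avx_fast_unaligned_load)
    return avx_unaligned;      /* 1. __memcpy_avx_unaligned */
  if (fast_unaligned_load)
    return sse2_unaligned;     /* 2. __memcpy_sse2_unaligned */
  if (!has_ssse3)
    return sse2;               /* 3. __memcpy_sse2 */
  if (fast_copy_backward)
    return ssse3_back;         /* 4. __memcpy_ssse3_back */
  return ssse3;                /* 5. __memcpy_ssse3 */
}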
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date: Sat Dec 19 02:47:28 2015 +0300

Added memset optimized with AVX512 for KNL hardware.

It shows an improvement of up to 28% over the AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).

* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
* sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New.
* sysdeps/x86/cpu-features.c (init_cpu_features): Set Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Oct 21 14:44:23 2015 -0700

Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT

According to the Silvermont software optimization guide, for 64-bit applications branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT maps into the lower 2GB, not the lower 4GB, of the address space.

Prefer_MAP_32BIT_EXEC reduces the bits available for address space layout randomization (ASLR); it is always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC.

On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont.

[BZ #19367]
* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
(index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Dec 15 11:46:54 2015 -0800

Enable Silvermont optimizations for Knights Landing

The Knights Landing processor is based on Silvermont. This patch enables the Silvermont optimizations for Knights Landing.

* sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
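The Prefer_MAP_32BIT_EXEC behavior described above amounts to trying MAP_32BIT first for executable mappings and retrying without it once the lower 2GB is exhausted. A standalone Linux sketch of that idea, not glibc's actual mmap wrapper in sysdeps/unix/sysv/linux/x86_64/64/mmap.c:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

static void *
mmap_exec_sketch (size_t len)
{
  /* First try the lower 2GB so branch targets stay within 4GB.  */
  void *p = mmap (NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
  if (p == MAP_FAILED)
    /* Lower 2GB exhausted (or MAP_32BIT unavailable): fall back to an
       unrestricted mapping.  */
    p = mmap (NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p;
}

int
main (void)
{
  void *p = mmap_exec_sketch (1 << 20);
  printf ("mapped at %p\n", p);
  return p == MAP_FAILED;
}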
Created attachment 9328 [details] Performance data with graphics
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
via 5e8c5bb1ac83aa2577d64d82467a653fa413f7ce (commit)
from 5188b973250523d3e9c80ea3ab4001f696e6fa1a (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5e8c5bb1ac83aa2577d64d82467a653fa413f7ce

commit 5e8c5bb1ac83aa2577d64d82467a653fa413f7ce
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Jun 8 13:55:45 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes.

There is no change in IFUNC selection if the SSE2 and AVX2 memsets weren't used before. If the SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used on processors with ERMS. The new AVX512 memset will be used on processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This.
(__bzero): Removed.
(__memset_tail): Likewise.
(__memset_chk): Likewise.
(memset): Likewise.
(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined.
(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
(__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if the processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
(memset): Removed.
(__memset_chk): Likewise.
(MEMSET_SYMBOL): New.
(libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if the processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                          |  41 +++++
 sysdeps/x86_64/memset.S                            | 123 +++-------------
 sysdeps/x86_64/multiarch/Makefile                  |   3 +-
 sysdeps/x86_64/multiarch/ifunc-impl-list.c         |   9 -
 sysdeps/x86_64/multiarch/memset-avx2.S             | 168 --------------------
 .../x86_64/multiarch/memset-sse2-unaligned-erms.S  |  20 ---
 .../x86_64/multiarch/memset-vec-unaligned-erms.S   |   2 +-
 sysdeps/x86_64/multiarch/memset.S                  |  34 +++--
 sysdeps/x86_64/multiarch/memset_chk.S              |  20 ++-
 9 files changed, 101 insertions(+), 319 deletions(-)
 delete mode 100644 sysdeps/x86_64/multiarch/memset-avx2.S
 delete mode 100644 sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S
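After this change the memset IFUNC selection reduces to the sketch below; the booleans stand in for glibc's usable-feature checks (the real selector also consults bits such as Prefer_No_VZEROUPPER) and the stubs for the assembly variants:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memset_fn) (void *, int, size_t);

/* Stubs standing in for __memset_sse2_unaligned{,_erms} and friends.  */
static void *sse2_u (void *d, int c, size_t n) { return memset (d, c, n); }
static void *sse2_u_erms (void *d, int c, size_t n) { return memset (d, c, n); }
static void *avx2_u (void *d, int c, size_t n) { return memset (d, c, n); }
static void *avx2_u_erms (void *d, int c, size_t n) { return memset (d, c, n); }
static void *avx512_u (void *d, int c, size_t n) { return memset (d, c, n); }
static void *avx512_u_erms (void *d, int c, size_t n) { return memset (d, c, n); }

static memset_fn
select_memset (bool has_avx512, bool has_avx2, bool has_erms)
{
  /* Pick the widest-vector variant the CPU supports, and its _erms
     flavor (rep stosb above the threshold) when the CPU has ERMS.  */
  if (has_avx512)
    return has_erms ? avx512_u_erms : avx512_u;
  if (has_avx2)
    return has_erms ? avx2_u_erms : avx2_u;
  return has_erms ? sse2_u_erms : sse2_u;
}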
Fixed for 2.24.