sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S has

L(overlapping):
	cmpq	%rsi, %rdi
	jae	.L3
...
.L3:
	leaq	-1(%rdx), %rax
	.p2align 4,,10
	.p2align 4
.L11:
	movzbl	(%rsi,%rax), %edx
	movb	%dl, (%rdi,%rax)
	subq	$1, %rax
	jmp	.L11

I don't believe .L3 is reachable. Can it be removed?
There are

ENTRY(__memcpy_sse2_unaligned)
	movq	%rsi, %rax
	leaq	(%rdx,%rdx), %rcx
	subq	%rdi, %rax
	subq	%rdx, %rax
	cmpq	%rcx, %rax
	jb	L(overlapping)

When that branch is taken, `cmpq %rsi, %rdi; jae .L3` will never be taken.
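The dead-code argument can be checked mechanically. Below is a small model of the entry sequence (my own sketch, not glibc code), scaled down to 16-bit unsigned arithmetic so an exhaustive search over non-wrapping buffers stays cheap:

```python
# Model of the __memcpy_sse2_unaligned entry check, reduced to 16-bit
# unsigned arithmetic so an exhaustive search is cheap.
MASK = (1 << 16) - 1

def jb_overlapping(dst, src, n):
    """True iff the `jb L(overlapping)` branch would be taken."""
    # movq %rsi, %rax; subq %rdi, %rax; subq %rdx, %rax -> rax = src - dst - n
    rax = (src - dst - n) & MASK
    # leaq (%rdx,%rdx), %rcx -> rcx = 2 * n
    rcx = (2 * n) & MASK
    return rax < rcx  # jb: unsigned below

# Whenever the branch is taken for buffers that do not wrap around the
# address space, dst < src holds, so the subsequent
# `cmpq %rsi, %rdi; jae .L3` (dst >= src) can never branch: .L3 is dead.
for dst in range(128):
    for src in range(128):
        for n in range(1, 32):
            if jb_overlapping(dst, src, n):
                assert dst < src
```

The branch is taken only when `src - dst - n` does not wrap and is below `2*n`, i.e. `dst + n <= src < dst + 3*n`, which forces `dst < src`.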
Created attachment 9069 [details]
A revised implementation
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing the
project "GNU C Library master sources".

The branch, hjl/pr19776/master has been created
        at  8db09f10f2ad8cfd6d6fc158b336deba82132764 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8db09f10f2ad8cfd6d6fc158b336deba82132764
commit 8db09f10f2ad8cfd6d6fc158b336deba82132764
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:47:26 2016 -0800

    Use __memcpy_chk_sse2_unaligned

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af7cf21daf0b7c6492c0441c546f9290d3e03f9a
commit af7cf21daf0b7c6492c0441c546f9290d3e03f9a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:44:58 2016 -0800

    Use __mempcpy_chk_sse2_unaligned

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bc2ec4cb418341a97056a3fd503b00c449c59d6e
commit bc2ec4cb418341a97056a3fd503b00c449c59d6e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:42:46 2016 -0800

    Use __mempcpy_sse2_unaligned

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9452ba101b95ec3a20074b1cc8fd2d84b6e0e859
commit 9452ba101b95ec3a20074b1cc8fd2d84b6e0e859
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 17:06:41 2016 -0800

    Add __mempcpy_sse2_unaligned and _chk functions

    Add __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and
    __memcpy_chk_sse2_unaligned.

    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and
    __mempcpy_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned): New.
    (__mempcpy_sse2_unaligned): Likewise.
    (__memcpy_chk_sse2_unaligned): Likewise.
    (L(start)): New label.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b812ca4864339d011edde57a86b27b80748bbfda
commit b812ca4864339d011edde57a86b27b80748bbfda
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:52:53 2016 -0800

    Remove L(overlapping) from memcpy-sse2-unaligned.S

    Since memcpy doesn't need to check overlapping source and destination,
    we can remove L(overlapping).

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (L(overlapping)):
    Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b74a495c76c05e9685482322a6a30d12a98a8a44
commit b74a495c76c05e9685482322a6a30d12a98a8a44
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:46:54 2016 -0800

    Don't use RAX as scratch register

    To prepare for sharing code with mempcpy, don't use RAX as a scratch
    register, so that RAX can be set to the return value at entry.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use RAX as
    scratch register.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ab3ae148820f5023254a4979d6eef60537f0bb03
commit ab3ae148820f5023254a4979d6eef60537f0bb03
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 14:16:32 2016 -0800

    Remove dead code from memcpy-sse2-unaligned.S

    There are

    ENTRY(__memcpy_sse2_unaligned)
    	movq	%rsi, %rax
    	leaq	(%rdx,%rdx), %rcx
    	subq	%rdi, %rax
    	subq	%rdx, %rax
    	cmpq	%rcx, %rax
    	jb	L(overlapping)

    When that branch is taken, `cmpq %rsi, %rdi; jae .L3` will never be
    taken.  We can remove the dead code.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.
-----------------------------------------------------------------------
(In reply to H.J. Lu from comment #1)
> There are
>
> ENTRY(__memcpy_sse2_unaligned)
> 	movq	%rsi, %rax
> 	leaq	(%rdx,%rdx), %rcx
> 	subq	%rdi, %rax
> 	subq	%rdx, %rax
> 	cmpq	%rcx, %rax
> 	jb	L(overlapping)

Since memcpy doesn't need to check overlapping source and destination, we can
remove L(overlapping).
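For context: ISO C makes overlapping source and destination undefined behavior for memcpy (callers must use memmove), so a fallback byte loop for the overlapping case is wasted code. A toy illustration (my own, in Python) of why a plain forward copy cannot substitute for memmove semantics on overlap:

```python
def forward_copy(buf, dst, src, n):
    # Copies low-to-high, as a simple memcpy implementation might.
    for i in range(n):
        buf[dst + i] = buf[src + i]

def overlap_safe_copy(buf, dst, src, n):
    # memmove semantics: copy high-to-low when the destination
    # overlaps the source from above.
    order = range(n - 1, -1, -1) if src < dst else range(n)
    for i in order:
        buf[dst + i] = buf[src + i]

a = bytearray(b"abcdef")
forward_copy(a, 1, 0, 5)        # overlapping regions: source bytes clobbered
b = bytearray(b"abcdef")
overlap_safe_copy(b, 1, 0, 5)   # memmove-style result preserves the data
```

The forward copy smears the first byte through the overlap region, while the backward copy yields the shifted result callers expect from memmove; memcpy simply declines to handle this case at all.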
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing the
project "GNU C Library master sources".

The branch, hjl/pr19776/master has been created
        at  94a97fa99c9fc6bb1a579aabbde4509b9cb746b8 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=94a97fa99c9fc6bb1a579aabbde4509b9cb746b8
commit 94a97fa99c9fc6bb1a579aabbde4509b9cb746b8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:47:26 2016 -0800

    Enable __memcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned.

    * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Check
    Fast_Unaligned_Load to enable __memcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=164b82bb51a39bdfb21478b4ca28909fb265bdd1
commit 164b82bb51a39bdfb21478b4ca28909fb265bdd1
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:44:58 2016 -0800

    Enable __mempcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned.

    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=681c975ae1db45f1f184e52172bb5288f84a184d
commit 681c975ae1db45f1f184e52172bb5288f84a184d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:42:46 2016 -0800

    Enable __mempcpy_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned.

    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
    Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9452ba101b95ec3a20074b1cc8fd2d84b6e0e859
commit 9452ba101b95ec3a20074b1cc8fd2d84b6e0e859
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 17:06:41 2016 -0800

    Add __mempcpy_sse2_unaligned and _chk functions

    Add __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and
    __memcpy_chk_sse2_unaligned.

    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and
    __mempcpy_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned): New.
    (__mempcpy_sse2_unaligned): Likewise.
    (__memcpy_chk_sse2_unaligned): Likewise.
    (L(start)): New label.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b812ca4864339d011edde57a86b27b80748bbfda
commit b812ca4864339d011edde57a86b27b80748bbfda
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:52:53 2016 -0800

    Remove L(overlapping) from memcpy-sse2-unaligned.S

    Since memcpy doesn't need to check overlapping source and destination,
    we can remove L(overlapping).

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (L(overlapping)):
    Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b74a495c76c05e9685482322a6a30d12a98a8a44
commit b74a495c76c05e9685482322a6a30d12a98a8a44
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:46:54 2016 -0800

    Don't use RAX as scratch register

    To prepare for sharing code with mempcpy, don't use RAX as a scratch
    register, so that RAX can be set to the return value at entry.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use RAX as
    scratch register.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ab3ae148820f5023254a4979d6eef60537f0bb03
commit ab3ae148820f5023254a4979d6eef60537f0bb03
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 14:16:32 2016 -0800

    Remove dead code from memcpy-sse2-unaligned.S

    There are

    ENTRY(__memcpy_sse2_unaligned)
    	movq	%rsi, %rax
    	leaq	(%rdx,%rdx), %rcx
    	subq	%rdi, %rax
    	subq	%rdx, %rax
    	cmpq	%rcx, %rax
    	jb	L(overlapping)

    When that branch is taken, `cmpq %rsi, %rdi; jae .L3` will never be
    taken.  We can remove the dead code.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.
-----------------------------------------------------------------------
We can also add entry points for __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.
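These extra entry points can share one copy body because of the earlier "don't use RAX as scratch register" change: each entry point computes its return value up front (memcpy returns the destination, mempcpy returns one past the last byte written), then falls through to a common loop that leaves RAX untouched. A minimal Python model of that structure (function and variable names are mine, not glibc's):

```python
def _copy_body(ret, dst_buf, dst, src_buf, src, n):
    # Shared tail, analogous to L(start): performs the copy and returns
    # whatever the entry point preloaded as the return value (RAX).
    dst_buf[dst:dst + n] = src_buf[src:src + n]
    return ret

def memcpy(dst_buf, dst, src_buf, src, n):
    # memcpy returns the destination pointer.
    return _copy_body(dst, dst_buf, dst, src_buf, src, n)

def mempcpy(dst_buf, dst, src_buf, src, n):
    # mempcpy returns the destination pointer plus n.
    return _copy_body(dst + n, dst_buf, dst, src_buf, src, n)
```

The `_chk` variants would add a bounds check against the destination object size before jumping to the same shared body.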
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing the
project "GNU C Library master sources".

The branch, hjl/pr19776/master has been created
        at  886d917c20e937a48d5ce538063cd40168eadd57 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=886d917c20e937a48d5ce538063cd40168eadd57
commit 886d917c20e937a48d5ce538063cd40168eadd57
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:47:26 2016 -0800

    Enable __memcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Check
    Fast_Unaligned_Load to enable __memcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b03a8d07233bd436bc3c36c5fe4ce50e719e0100
commit b03a8d07233bd436bc3c36c5fe4ce50e719e0100
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:44:58 2016 -0800

    Enable __mempcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=192ab2971343d8f8132239442ab76c4cc3a29bed
commit 192ab2971343d8f8132239442ab76c4cc3a29bed
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:42:46 2016 -0800

    Enable __mempcpy_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
    Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e8aae678daf592b1d8e17d13d6a6a07d4b04bfa0
commit e8aae678daf592b1d8e17d13d6a6a07d4b04bfa0
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 17:06:41 2016 -0800

    Add entry points for __mempcpy_sse2_unaligned and _chk functions

    Add entry points for __mempcpy_chk_sse2_unaligned,
    __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and
    __mempcpy_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned): New.
    (__mempcpy_sse2_unaligned): Likewise.
    (__memcpy_chk_sse2_unaligned): Likewise.
    (L(start)): New label.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=465818e53eb2f5ccd71def83774f21f9053a2951
commit 465818e53eb2f5ccd71def83774f21f9053a2951
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:52:53 2016 -0800

    Remove L(overlapping) from memcpy-sse2-unaligned.S

    Since memcpy doesn't need to check overlapping source and destination,
    we can remove L(overlapping).

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (L(overlapping)):
    Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2b6078cf267ff798a4df7547e7b1a5969b2036af
commit 2b6078cf267ff798a4df7547e7b1a5969b2036af
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:46:54 2016 -0800

    Don't use RAX as scratch register

    To prepare for sharing code with mempcpy, don't use RAX as a scratch
    register, so that RAX can be set to the return value at entry.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use RAX as
    scratch register.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cadcb611f7dff79400ddf616135ecb739ab3a67f
commit cadcb611f7dff79400ddf616135ecb739ab3a67f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 14:16:32 2016 -0800

    Remove dead code from memcpy-sse2-unaligned.S

    There are

    ENTRY(__memcpy_sse2_unaligned)
    	movq	%rsi, %rax
    	leaq	(%rdx,%rdx), %rcx
    	subq	%rdi, %rax
    	subq	%rdx, %rax
    	cmpq	%rcx, %rax
    	jb	L(overlapping)

    When that branch is taken, `cmpq %rsi, %rdi; jae .L3` will never be
    taken.  We can remove the dead code.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing the
project "GNU C Library master sources".

The branch, hjl/pr19776/master has been created
        at  5497a1a41de327211f6072f5695175ea98a5055d (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5497a1a41de327211f6072f5695175ea98a5055d
commit 5497a1a41de327211f6072f5695175ea98a5055d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:47:26 2016 -0800

    Enable __memcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new
    selection order is:

    1. __memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_chk_sse2 if SSSE3 isn't available.
    4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c39c8cfb1bbb26371746e5b9456d1f3d2c28839d
commit c39c8cfb1bbb26371746e5b9456d1f3d2c28839d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:44:58 2016 -0800

    Enable __mempcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new
    selection order is:

    1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_chk_sse2 if SSSE3 isn't available.
    4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=eb5f220e827ca4f9a8e6ace8045251f16f6aeac2
commit eb5f220e827ca4f9a8e6ace8045251f16f6aeac2
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:42:46 2016 -0800

    Enable __mempcpy_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned. The new
    selection order is:

    1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_sse2 if SSSE3 isn't available.
    4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
    Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e8aae678daf592b1d8e17d13d6a6a07d4b04bfa0
commit e8aae678daf592b1d8e17d13d6a6a07d4b04bfa0
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 17:06:41 2016 -0800

    Add entry points for __mempcpy_sse2_unaligned and _chk functions

    Add entry points for __mempcpy_chk_sse2_unaligned,
    __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and
    __mempcpy_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned): New.
    (__mempcpy_sse2_unaligned): Likewise.
    (__memcpy_chk_sse2_unaligned): Likewise.
    (L(start)): New label.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=465818e53eb2f5ccd71def83774f21f9053a2951
commit 465818e53eb2f5ccd71def83774f21f9053a2951
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:52:53 2016 -0800

    Remove L(overlapping) from memcpy-sse2-unaligned.S

    Since memcpy doesn't need to check overlapping source and destination,
    we can remove L(overlapping).

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (L(overlapping)):
    Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2b6078cf267ff798a4df7547e7b1a5969b2036af
commit 2b6078cf267ff798a4df7547e7b1a5969b2036af
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:46:54 2016 -0800

    Don't use RAX as scratch register

    To prepare for sharing code with mempcpy, don't use RAX as a scratch
    register, so that RAX can be set to the return value at entry.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use RAX as
    scratch register.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cadcb611f7dff79400ddf616135ecb739ab3a67f
commit cadcb611f7dff79400ddf616135ecb739ab3a67f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 14:16:32 2016 -0800

    Remove dead code from memcpy-sse2-unaligned.S

    There are

    ENTRY(__memcpy_sse2_unaligned)
    	movq	%rsi, %rax
    	leaq	(%rdx,%rdx), %rcx
    	subq	%rdi, %rax
    	subq	%rdx, %rax
    	cmpq	%rcx, %rax
    	jb	L(overlapping)

    When that branch is taken, `cmpq %rsi, %rdi; jae .L3` will never be
    taken.  We can remove the dead code.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.
-----------------------------------------------------------------------
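The selection order listed in these commits reads as a straightforward feature-bit dispatch. A sketch (my own Python pseudocode; the feature-bit names follow the commit messages, the function is not a real glibc API) of how the mempcpy ifunc selector orders its checks:

```python
def select_mempcpy(cpu):
    # cpu: set of CPU feature/tuning bits, checked in the order given in
    # the commit message for __mempcpy.
    if "AVX_Fast_Unaligned_Load" in cpu:
        return "__mempcpy_avx_unaligned"
    if "Fast_Unaligned_Load" in cpu:
        return "__mempcpy_sse2_unaligned"
    if "SSSE3" not in cpu:
        return "__mempcpy_sse2"
    if "Fast_Copy_Backward" in cpu:
        return "__mempcpy_ssse3_back"
    return "__mempcpy_ssse3"
```

Note that the order matters: a CPU with both AVX_Fast_Unaligned_Load and SSSE3 still gets the AVX variant, and the plain SSE2 fallback is reached only when SSSE3 is absent.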
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing the
project "GNU C Library master sources".

The branch, hjl/erms/hybrid has been created
        at  0debc67b0128dab2a524b44387ff848bc01480bb (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0debc67b0128dab2a524b44387ff848bc01480bb
commit 0debc67b0128dab2a524b44387ff848bc01480bb
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 09:22:56 2016 -0700

    Add __memset_sse2_erms and __memset_chk_sse2_erms

    * sysdeps/x86_64/memset.S (__memset_chk_sse2_erms): New function.
    (__memset_sse2_erms): Likewise.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memset_chk_sse2_erms and __memset_sse2_erms.
    * sysdeps/x86_64/sysdep.h (REP_STOSB_THRESHOLD): New.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f47ec52f2506c944b514c7da14b54d6aee485350
commit f47ec52f2506c944b514c7da14b54d6aee485350
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 08:32:05 2016 -0700

    Add sse2_unaligned_erms versions of memcpy/mempcpy

    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned_erms, __memcpy_sse2_unaligned_erms,
    __mempcpy_chk_sse2_unaligned_erms and __mempcpy_sse2_unaligned_erms.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned_erms): New function.
    (__mempcpy_sse2_unaligned_erms): Likewise.
    (__memcpy_chk_sse2_unaligned_erms): Likewise.
    (__memcpy_sse2_unaligned_erms): Likewise.
    * sysdeps/x86_64/sysdep.h (REP_MOVSB_THRESHOLD): New.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=53363d1b76a45543f1ac9c1854a17d0a90bb3cba
commit 53363d1b76a45543f1ac9c1854a17d0a90bb3cba
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:47:26 2016 -0800

    Enable __memcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new
    selection order is:

    1. __memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_chk_sse2 if SSSE3 isn't available.
    4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9615edc8835cb28b09d2941df19919ac7da29a38
commit 9615edc8835cb28b09d2941df19919ac7da29a38
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:44:58 2016 -0800

    Enable __mempcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new
    selection order is:

    1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_chk_sse2 if SSSE3 isn't available.
    4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e497a68bd68d08a2fffadbd9a5ed0c082cdc62e9
commit e497a68bd68d08a2fffadbd9a5ed0c082cdc62e9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:42:46 2016 -0800

    Enable __mempcpy_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned. The new
    selection order is:

    1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_sse2 if SSSE3 isn't available.
    4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
    Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f3397461af3990d1b147c7110b3fba6449b36400
commit f3397461af3990d1b147c7110b3fba6449b36400
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 17:06:41 2016 -0800

    Add entry points for __mempcpy_sse2_unaligned and _chk functions

    Add entry points for __mempcpy_chk_sse2_unaligned,
    __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and
    __mempcpy_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned): New.
    (__mempcpy_sse2_unaligned): Likewise.
    (__memcpy_chk_sse2_unaligned): Likewise.
    (L(start)): New label.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e923b259dfd17705954d098370b399c24dcef2cf
commit e923b259dfd17705954d098370b399c24dcef2cf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:52:53 2016 -0800

    Remove L(overlapping) from memcpy-sse2-unaligned.S

    Since memcpy doesn't need to check overlapping source and destination,
    we can remove L(overlapping).

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (L(overlapping)):
    Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af82e9a0269e61b459aaf071d3f93b35ecb11e9e
commit af82e9a0269e61b459aaf071d3f93b35ecb11e9e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:46:54 2016 -0800

    Don't use RAX as scratch register

    To prepare for sharing code with mempcpy, don't use RAX as a scratch
    register, so that RAX can be set to the return value at entry.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use RAX as
    scratch register.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d700853c5270817df932f319917467931f433c41
commit d700853c5270817df932f319917467931f433c41
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 14:16:32 2016 -0800

    Remove dead code from memcpy-sse2-unaligned.S

    There are

    ENTRY(__memcpy_sse2_unaligned)
    	movq	%rsi, %rax
    	leaq	(%rdx,%rdx), %rcx
    	subq	%rdi, %rax
    	subq	%rdx, %rax
    	cmpq	%rcx, %rax
    	jb	L(overlapping)

    When that branch is taken, `cmpq %rsi, %rdi; jae .L3` will never be
    taken.  We can remove the dead code.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.
-----------------------------------------------------------------------
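The `_erms` variants added on this branch switch to the microcoded `rep movsb` / `rep stosb` string instructions only above a size threshold (the new REP_MOVSB_THRESHOLD / REP_STOSB_THRESHOLD in sysdep.h), because the string instructions have startup overhead that only amortizes over large blocks. A schematic of that dispatch (my own sketch; the threshold value and the stand-in copy functions are illustrative, not glibc's):

```python
REP_MOVSB_THRESHOLD = 2048  # illustrative value; glibc's is tuned separately

def rep_movsb(dst, src, n):
    # Stand-in for the `rep movsb` path: one bulk byte move.
    dst[:n] = src[:n]
    return "erms"

def vector_copy(dst, src, n):
    # Stand-in for the unaligned SSE2/AVX vector-loop path.
    dst[:n] = src[:n]
    return "vector"

def memcpy_sse2_unaligned_erms(dst, src, n):
    # Small copies stay on the vector path; large copies use rep movsb,
    # whose microcode startup cost pays off only for big blocks.
    if n >= REP_MOVSB_THRESHOLD:
        return rep_movsb(dst, src, n)
    return vector_copy(dst, src, n)
```

The memset `_erms` variants are the same shape, with `rep stosb` and REP_STOSB_THRESHOLD in place of the move path.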
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing the
project "GNU C Library master sources".

The branch, hjl/erms/hybrid has been created
        at  6ce43b9abbfa089e80ae8fa3508b7ecdfb56c265 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6ce43b9abbfa089e80ae8fa3508b7ecdfb56c265
commit 6ce43b9abbfa089e80ae8fa3508b7ecdfb56c265
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 09:19:06 2016 -0700

    Use Hybrid_ERMS in mempcpy.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=64b4df537063bb843ad07495ec4de0670a8a15fb
commit 64b4df537063bb843ad07495ec4de0670a8a15fb
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Mar 22 09:02:01 2016 -0700

    Use Hybrid_ERMS in memcpy.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6680feac496d18c433f4355b81c0f789848965ff
commit 6680feac496d18c433f4355b81c0f789848965ff
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

    Add Hybrid_ERMS and use it in memset.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe2e2e385789f2ff297bfbc73cb11af8b43b8345
commit fe2e2e385789f2ff297bfbc73cb11af8b43b8345
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 10:34:07 2016 -0700

    Add __memset_avx2_erms and __memset_chk_avx2_erms

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5540fbe17391c4d495ac23559c0f33b08394173d
commit 5540fbe17391c4d495ac23559c0f33b08394173d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 10:27:58 2016 -0700

    Add avx_unaligned_erms versions of memcpy/mempcpy

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=21a60509c4d164f7c2ac21add165e9868da0aab0
commit 21a60509c4d164f7c2ac21add165e9868da0aab0
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 10:07:48 2016 -0700

    Remove mempcpy-*.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=dce73065103d970a518ea7a869e218864c946198
commit dce73065103d970a518ea7a869e218864c946198
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

    Merge memcpy with mempcpy

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ad023b590b32dcde6d1b641d41758e531290b6fd
commit ad023b590b32dcde6d1b641d41758e531290b6fd
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 09:22:56 2016 -0700

    Add __memset_sse2_erms and __memset_chk_sse2_erms

    * sysdeps/x86_64/memset.S (__memset_chk_sse2_erms): New function.
    (__memset_sse2_erms): Likewise.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memset_chk_sse2_erms and __memset_sse2_erms.
    * sysdeps/x86_64/sysdep.h (REP_STOSB_THRESHOLD): New.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f0bc5d68ff40abe2fe3ea1f2515811f9636f8ac9
commit f0bc5d68ff40abe2fe3ea1f2515811f9636f8ac9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 08:32:05 2016 -0700

    Add sse2_unaligned_erms versions of memcpy/mempcpy

    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned_erms, __memcpy_sse2_unaligned_erms,
    __mempcpy_chk_sse2_unaligned_erms and __mempcpy_sse2_unaligned_erms.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned_erms): New function.
    (__mempcpy_sse2_unaligned_erms): Likewise.
    (__memcpy_chk_sse2_unaligned_erms): Likewise.
    (__memcpy_sse2_unaligned_erms): Likewise.
    * sysdeps/x86_64/sysdep.h (REP_MOVSB_THRESHOLD): New.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=438ed4b59ada763de874ba3be10bb18ea52d5643
commit 438ed4b59ada763de874ba3be10bb18ea52d5643
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:47:26 2016 -0800

    Enable __memcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new
    selection order is:

    1. __memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_chk_sse2 if SSSE3 isn't available.
    4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe0ea5b0874b4ae4255c6ad3b2e0f1aaf94a97ff
commit fe0ea5b0874b4ae4255c6ad3b2e0f1aaf94a97ff
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:44:58 2016 -0800

    Enable __mempcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new
    selection order is:

    1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_chk_sse2 if SSSE3 isn't available.
    4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=175e5a4dcf431f9820dd483b0ba93d165d8652b2
commit 175e5a4dcf431f9820dd483b0ba93d165d8652b2
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:42:46 2016 -0800

    Enable __mempcpy_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned. The new
    selection order is:

    1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_sse2 if SSSE3 isn't available.
    4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
    Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=32920004ade2b9fd8ac3b3a0e177c0687b536a04
commit 32920004ade2b9fd8ac3b3a0e177c0687b536a04
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 17:06:41 2016 -0800

    Add entry points for __mempcpy_sse2_unaligned and _chk functions

    Add entry points for __mempcpy_chk_sse2_unaligned,
    __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and
    __mempcpy_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned): New.
    (__mempcpy_sse2_unaligned): Likewise.
    (__memcpy_chk_sse2_unaligned): Likewise.
    (L(start)): New label.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=12cf609fa5ffb50af921f634c6e548b50960948d
commit 12cf609fa5ffb50af921f634c6e548b50960948d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:52:53 2016 -0800

    Remove L(overlapping) from memcpy-sse2-unaligned.S

    Since memcpy doesn't need to check overlapping source and destination,
    we can remove L(overlapping).

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (L(overlapping)):
    Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=849e297b6803c989f33c05f5be8f93dbb3274bb1
commit 849e297b6803c989f33c05f5be8f93dbb3274bb1
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:46:54 2016 -0800

    Don't use RAX as scratch register

    To prepare for sharing code with mempcpy, don't use RAX as a scratch
    register, so that RAX can be set to the return value at entry.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use RAX as
    scratch register.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4891daae057485dfaa319971afd1d4fe52fbc929
commit 4891daae057485dfaa319971afd1d4fe52fbc929
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 14:16:32 2016 -0800

    Remove dead code from memcpy-sse2-unaligned.S

    There are

    ENTRY(__memcpy_sse2_unaligned)
    	movq	%rsi, %rax
    	leaq	(%rdx,%rdx), %rcx
    	subq	%rdi, %rax
    	subq	%rdx, %rax
    	cmpq	%rcx, %rax
    	jb	L(overlapping)

    When that branch is taken, `cmpq %rsi, %rdi; jae .L3` will never be
    taken.  We can remove the dead code.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=735a17465774fc651dfe26cf27e3cf6cdf1883f6
commit 735a17465774fc651dfe26cf27e3cf6cdf1883f6
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:51:16 2014 -0700

    Test 32-bit ERMS memcpy/memset

    * sysdeps/i386/i686/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms,
    __memmove_chk_erms, __memmove_erms, __memset_chk_erms,
    __memset_erms, __memcpy_chk_erms, __memcpy_erms,
    __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=665d19ea1eaa11d1a37d2e8fe679f92075583bf8
commit 665d19ea1eaa11d1a37d2e8fe679f92075583bf8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:25:17 2014 -0700

    Test 64-bit ERMS memcpy/memset

    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Add __memmove_chk_erms, __memmove_erms,
    __memset_erms, __memset_chk_erms, __memcpy_chk_erms, __memcpy_erms,
    __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ef0a94fe0d1f51db1532b7f530841bf5ad2e2ba5
commit ef0a94fe0d1f51db1532b7f530841bf5ad2e2ba5
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 21 15:21:28 2011 -0700

    Add 32-bit ERMS memcpy/memset

    * sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add
    bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms
    and memset-erms.
    * sysdeps/i386/i686/multiarch/bcopy-erms.S: New file.
    * sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise.
    * sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise.
    * sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise.
    * sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise.
    * sysdeps/i386/i686/multiarch/memset-erms.S: Likewise.
    * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add
    COMMON_CPUID_INDEX_7.
    * sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization
    for Fast_ERMS.
    * sysdeps/i386/i686/multiarch/bzero.S: Likewise.
    * sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
    * sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
    * sysdeps/i386/i686/multiarch/memmove.S: Likewise.
    * sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
    * sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
    * sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
    * sysdeps/i386/i686/multiarch/memset.S: Likewise.
    * sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ef81a3a93a95dd0017fcefe4207306d6875c3794
commit ef81a3a93a95dd0017fcefe4207306d6875c3794
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 16:16:10 2011 -0700

    Add 64-bit ERMS memcpy and memset

    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
    memcpy-erms, mempcpy-erms, memmove-erms and memset-erms.
    * sysdeps/x86_64/multiarch/memcpy-erms.S: New.
    * sysdeps/x86_64/multiarch/memmove-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memset-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memcpy.S: Enable ERMS optimization
    for Fast_ERMS.
    * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.c: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memset.S: Likewise.
    * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cd8fa6c74a5f122a242c412b5ea526a31f13855a
commit cd8fa6c74a5f122a242c412b5ea526a31f13855a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

    Initial ERMS support

    * sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New.
    (bit_cpu_ERMS): Likewise.
    (index_cpu_ERMS): Likewise.
    (index_arch_Fast_ERMS): Likewise.
    (reg_ERMS): Likewise.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/hybrid has been created at 942d5a67c652603257c4edcf9ee5d05951a454cb (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=942d5a67c652603257c4edcf9ee5d05951a454cb commit 942d5a67c652603257c4edcf9ee5d05951a454cb Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 09:19:06 2016 -0700 Use Hybrid_ERMS in mempcpy.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=64b4df537063bb843ad07495ec4de0670a8a15fb commit 64b4df537063bb843ad07495ec4de0670a8a15fb Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 09:02:01 2016 -0700 Use Hybrid_ERMS in memcpy.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6680feac496d18c433f4355b81c0f789848965ff commit 6680feac496d18c433f4355b81c0f789848965ff Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 12:36:03 2016 -0700 Add Hybrid_ERMS and use it in memset.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe2e2e385789f2ff297bfbc73cb11af8b43b8345 commit fe2e2e385789f2ff297bfbc73cb11af8b43b8345 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 10:34:07 2016 -0700 Add __memset_avx2_erms and __memset_chk_avx2_erms https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5540fbe17391c4d495ac23559c0f33b08394173d commit 5540fbe17391c4d495ac23559c0f33b08394173d Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 10:27:58 2016 -0700 Add avx_unaligned_erms versions of memcpy/mempcpy https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8523b446015dfcbd9c976194f8bed75534472243 commit 8523b446015dfcbd9c976194f8bed75534472243 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Fri Mar 18 10:07:48 2016 -0700 Remove mempcpy-*.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=35aeba6cd4e06022d08f5e37ecbf94a37c9880f4 commit 35aeba6cd4e06022d08f5e37ecbf94a37c9880f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 13:37:31 2016 -0800 Merge memcpy with mempcpy https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c69e56fd19790341bc1cdf43adb14a7d033b9e16 commit c69e56fd19790341bc1cdf43adb14a7d033b9e16 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 09:22:56 2016 -0700 Add __memset_sse2_erms and __memset_chk_sse2_erms * sysdeps/x86_64/memset.S (__memset_chk_sse2_erms): New function. (__memset_sse2_erms): Likewise. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_erms and __memset_sse2_erms. * sysdeps/x86_64/sysdep.h (REP_STOSB_THRESHOLD): New. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1e4466336442ac6b5f4537bcd3f641ab8899d47e commit 1e4466336442ac6b5f4537bcd3f641ab8899d47e Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 08:32:05 2016 -0700 Add sse2_unaligned_erms versions of memcpy/mempcpy * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned_erms, __memcpy_sse2_unaligned_erms, __mempcpy_chk_sse2_unaligned_erms and __mempcpy_sse2_unaligned_erms. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (__mempcpy_chk_sse2_unaligned_erms): New function. (__mempcpy_sse2_unaligned_erms): Likewise. (__memcpy_chk_sse2_unaligned_erms): Likewise. (__memcpy_sse2_unaligned_erms): Likewise. * sysdeps/x86_64/sysdep.h (REP_MOVSB_THRESHOLD): New. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a6358872399eb796c326572b15a37e504173888b commit a6358872399eb796c326572b15a37e504173888b Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 7 05:47:26 2016 -0800 Enable __memcpy_chk_sse2_unaligned Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new selection order is: 1. 
__memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_chk_sse2 if SSSE3 isn't available.
4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
5. __memcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=acae6b0a49cc462a67bccb7e11a74ac720d98427

commit acae6b0a49cc462a67bccb7e11a74ac720d98427
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:44:58 2016 -0800

    Enable __mempcpy_chk_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new
    selection order is:
    1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_chk_sse2 if SSSE3 isn't available.
    4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_chk_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
    Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cf9b1255b58424eaa9b36a8c6d173fa1dba030c7

commit cf9b1255b58424eaa9b36a8c6d173fa1dba030c7
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 7 05:42:46 2016 -0800

    Enable __mempcpy_sse2_unaligned

    Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned. The new
    selection order is:
    1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __mempcpy_sse2 if SSSE3 isn't available.
    4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __mempcpy_ssse3

    [BZ #19776]
    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
    Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5751e670acd3102f74b0fa5e5537e32d7e0b59be

commit 5751e670acd3102f74b0fa5e5537e32d7e0b59be
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 17:06:41 2016 -0800

    Add entry points for __mempcpy_sse2_unaligned and _chk functions

    Add entry points for __mempcpy_chk_sse2_unaligned,
    __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned,
    __mempcpy_chk_sse2_unaligned and __mempcpy_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
    (__mempcpy_chk_sse2_unaligned): New.
    (__mempcpy_sse2_unaligned): Likewise.
    (__memcpy_chk_sse2_unaligned): Likewise.
    (L(start)): New label.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=63564757b905fc0f874c387fb905aa3757bb3605

commit 63564757b905fc0f874c387fb905aa3757bb3605
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 16:52:53 2016 -0800

    Remove L(overlapping) from memcpy-sse2-unaligned.S

    Since memcpy doesn't need to check overlapping source and destination,
    we can remove L(overlapping).

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (L(overlapping)):
    Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30e33b2eb15afcbd7039fbfa3202166c0f8af1d1

commit 30e33b2eb15afcbd7039fbfa3202166c0f8af1d1
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:46:54 2016 -0800

    Don't use RAX as scratch register

    To prepare sharing code with mempcpy, don't use RAX as scratch
    register so that RAX can be set to the return value at entrance.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use RAX
    as scratch register.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b4f5ff76b6b7974a8c8baea8e70088a1458c96a2

commit b4f5ff76b6b7974a8c8baea8e70088a1458c96a2
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 14:16:32 2016 -0800

    Remove dead code from memcpy-sse2-unaligned.S

    There are

    ENTRY(__memcpy_sse2_unaligned)
        movq %rsi, %rax
        leaq (%rdx,%rdx), %rcx
        subq %rdi, %rax
        subq %rdx, %rax
        cmpq %rcx, %rax
        jb L(overlapping)

    When the branch is taken,

        cmpq %rsi, %rdi
        jae .L3

    will never be taken. We can remove the dead code.

    [BZ #19776]
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=67160949c9f01fd3f258c04f2eb5cee67739c503

commit 67160949c9f01fd3f258c04f2eb5cee67739c503
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:51:16 2014 -0700

    Test 32-bit ERMS memcpy/memset

    * sysdeps/i386/i686/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms,
    __memmove_chk_erms, __memmove_erms, __memset_chk_erms,
    __memset_erms, __memcpy_chk_erms, __memcpy_erms,
    __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a4d669225600cf717543df1627bb47fa1c08fbe8

commit a4d669225600cf717543df1627bb47fa1c08fbe8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:25:17 2014 -0700

    Test 64-bit ERMS memcpy/memset

    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Add __memmove_chk_erms,
    __memmove_erms, __memset_erms, __memset_chk_erms,
    __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms
    and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=74aea35c82d147f0270bebefae89a66dfb191b1f

commit 74aea35c82d147f0270bebefae89a66dfb191b1f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 21 15:21:28 2011 -0700

    Add 32-bit ERMS memcpy/memset

    * sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add
    bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms
    and memset-erms.
    * sysdeps/i386/i686/multiarch/bcopy-erms.S: New file.
    * sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise.
    * sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise.
    * sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memset-erms.S: Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7. * sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS. * sysdeps/i386/i686/multiarch/bzero.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise. * sysdeps/i386/i686/multiarch/memmove.S: Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise. * sysdeps/i386/i686/multiarch/memset.S: Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a81f1a5e58f9360afd079093ea6770bb9e570a2a commit a81f1a5e58f9360afd079093ea6770bb9e570a2a Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Sep 15 16:16:10 2011 -0700 Add 64-bit ERMS memcpy and memset * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcpy-erms, mempcpy-erms, memmove-erms and memset-erms. * sysdeps/x86_64/multiarch/memcpy-erms.S: New. * sysdeps/x86_64/multiarch/memmove-erms.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-erms.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Enable ERMS optimization for Fast_ERMS. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=487dc028667e43ed55f407fda3c07fc31ecd1554 commit 487dc028667e43ed55f407fda3c07fc31ecd1554 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu Sep 15 15:47:01 2011 -0700 Initial ERMS support * sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New. (bit_cpu_ERMS): Likewise. (index_cpu_ERMS): Likewise. (index_arch_Fast_ERMS): Likewise. (reg_ERMS): Likewise. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/hybrid has been created at cbb91f949f7e1560c299e7ad4727c6dfe976bb0c (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cbb91f949f7e1560c299e7ad4727c6dfe976bb0c commit cbb91f949f7e1560c299e7ad4727c6dfe976bb0c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 15:26:59 2016 -0700 Add memcpy-avx512-unaligned-erms.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=85b5d5a77c7ecd88daffa902a0e6535222f373bc commit 85b5d5a77c7ecd88daffa902a0e6535222f373bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 15:23:33 2016 -0700 Extend it to VEC_SIZE == 64 https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b12a3dc8134a8632e803213a169b4f0d9fa765a0 commit b12a3dc8134a8632e803213a169b4f0d9fa765a0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 14:16:32 2016 -0800 Add entry points for __mempcpy_sse2_unaligned and _chk functions Add entry points for __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned. Add sse2_unaligned_erms versions of memcpy/mempcpy [BZ #19776] * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and __mempcpy_sse2_unaligned. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned_erms, __memcpy_sse2_unaligned_erms, __mempcpy_chk_sse2_unaligned_erms and __mempcpy_sse2_unaligned_erms. * sysdeps/x86_64/sysdep.h (REP_MOVSB_THRESHOLD): New. Enable __mempcpy_sse2_unaligned Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned. The new selection order is: 1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. 
__mempcpy_sse2 if SSSE3 isn't available. 4. __mempcpy_ssse3_back if Fast_Copy_Backward bit it set. 5. __mempcpy_ssse3 [BZ #19776] * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned. Enable __mempcpy_chk_sse2_unaligned Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new selection order is: 1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __mempcpy_chk_sse2 if SSSE3 isn't available. 4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit it set. 5. __mempcpy_chk_ssse3 [BZ #19776] * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned. Enable __memcpy_chk_sse2_unaligned Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new selection order is: 1. __memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_chk_sse2 if SSSE3 isn't available. 4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit it set. 5. __memcpy_chk_ssse3 [BZ #19776] * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned. Use Hybrid_ERMS in memcpy.S Use Hybrid_ERMS in mempcpy.S Add memcpy-sse2-unaligned-erms.S Add initial memcpy-vec-unaligned-erms.S Add memcpy-avx-unaligned-erms.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=649521e48efafd5fccac8f405c2166749b6d59f4 commit 649521e48efafd5fccac8f405c2166749b6d59f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 12:36:03 2016 -0700 Add Hybrid_ERMS and use it in memset.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8c36a7002d4a2956e7b3701e4094382148b22eed commit 8c36a7002d4a2956e7b3701e4094382148b22eed Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Fri Mar 18 10:34:07 2016 -0700 Add __memset_avx2_erms and __memset_chk_avx2_erms https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cb894908a84c04d7fe3fbf164613fb5b5bac0737 commit cb894908a84c04d7fe3fbf164613fb5b5bac0737 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 10:07:48 2016 -0700 Remove mempcpy-*.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=10e38ad556275d8d8a4bb2ea1423f1e160f259ee commit 10e38ad556275d8d8a4bb2ea1423f1e160f259ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 13:37:31 2016 -0800 Merge memcpy with mempcpy https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8870d243dd0139ef9954cad6e399327c2406e1 commit 0c8870d243dd0139ef9954cad6e399327c2406e1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 09:22:56 2016 -0700 Add __memset_sse2_erms and __memset_chk_sse2_erms * sysdeps/x86_64/memset.S (__memset_chk_sse2_erms): New function. (__memset_sse2_erms): Likewise. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_erms and __memset_sse2_erms. * sysdeps/x86_64/sysdep.h (REP_STOSB_THRESHOLD): New. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=84e04147d1e70e8ba00c8f46923228cebbf0a64b commit 84e04147d1e70e8ba00c8f46923228cebbf0a64b Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 11 08:51:16 2014 -0700 Test 32-bit ERMS memcpy/memset * sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f77aa0315f706aff7516af601a04568390422dab commit f77aa0315f706aff7516af601a04568390422dab Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Fri Apr 11 08:25:17 2014 -0700 Test 64-bit ERMS memcpy/memset * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __memmove_chk_erms, __memmove_erms, __memset_erms, __memset_chk_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=eade49cad4007c094dacce2967a7f13166a46dae commit eade49cad4007c094dacce2967a7f13166a46dae Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Sep 21 15:21:28 2011 -0700 Add 32it ERMS memcpy/memset * sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms. * sysdeps/i386/i686/multiarch/bcopy-erms.S: New file. * sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memset-erms.S: Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7. * sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS. * sysdeps/i386/i686/multiarch/bzero.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise. * sysdeps/i386/i686/multiarch/memmove.S: Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise. * sysdeps/i386/i686/multiarch/memset.S: Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=312879aedef40f9e1eb30c517d516c0a29cd030c commit 312879aedef40f9e1eb30c517d516c0a29cd030c Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu Sep 15 16:16:10 2011 -0700 Add 64-bit ERMS memcpy and memset * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcpy-erms, mempcpy-erms, memmove-erms and memset-erms. * sysdeps/x86_64/multiarch/memcpy-erms.S: New. * sysdeps/x86_64/multiarch/memmove-erms.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-erms.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Enable ERMS optimization for Fast_ERMS. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=487dc028667e43ed55f407fda3c07fc31ecd1554 commit 487dc028667e43ed55f407fda3c07fc31ecd1554 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Sep 15 15:47:01 2011 -0700 Initial ERMS support * sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New. (bit_cpu_ERMS): Likewise. (index_cpu_ERMS): Likewise. (index_arch_Fast_ERMS): Likewise. (reg_ERMS): Likewise. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/hybrid has been created at d9063d53743a92bfefb598543a79874dbe371f03 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d9063d53743a92bfefb598543a79874dbe371f03 commit d9063d53743a92bfefb598543a79874dbe371f03 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 17:45:47 2016 -0700 Rename to memmove-vec-unaligned-erms.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8925b29440a4a9f127d1d38354c04562129fe672 commit 8925b29440a4a9f127d1d38354c04562129fe672 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 17:32:26 2016 -0700 Fix REP_STOSB_THRESHOLD in ../memset.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b1ade313d46b80fef7ebfc2e4ac007211c1947e0 commit b1ade313d46b80fef7ebfc2e4ac007211c1947e0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 17:32:09 2016 -0700 Fix REP_MOVSB_THRESHOLD in memcpy-vec-unaligned-erms.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=85f4c3310b649d654f2bbc5de2a89c1049322cd5 commit 85f4c3310b649d654f2bbc5de2a89c1049322cd5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 17:26:50 2016 -0700 Move REP_STOSB_THRESHOLD to memset-avx2.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1b6bf5e757fcbbcd49850b4c982aa67ae2049b12 commit 1b6bf5e757fcbbcd49850b4c982aa67ae2049b12 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 17:26:16 2016 -0700 Move REP_MOVSB_THRESHOLD to memcpy-vec-unaligned-erms.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54183108f9966bd087af62c781c2a9e6a1c25587 commit 54183108f9966bd087af62c781c2a9e6a1c25587 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu Mar 24 17:24:36 2016 -0700 Add SECTION https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cbb91f949f7e1560c299e7ad4727c6dfe976bb0c commit cbb91f949f7e1560c299e7ad4727c6dfe976bb0c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 15:26:59 2016 -0700 Add memcpy-avx512-unaligned-erms.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=85b5d5a77c7ecd88daffa902a0e6535222f373bc commit 85b5d5a77c7ecd88daffa902a0e6535222f373bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 24 15:23:33 2016 -0700 Extend it to VEC_SIZE == 64 https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b12a3dc8134a8632e803213a169b4f0d9fa765a0 commit b12a3dc8134a8632e803213a169b4f0d9fa765a0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 14:16:32 2016 -0800 Add entry points for __mempcpy_sse2_unaligned and _chk functions Add entry points for __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned. Add sse2_unaligned_erms versions of memcpy/mempcpy [BZ #19776] * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and __mempcpy_sse2_unaligned. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned_erms, __memcpy_sse2_unaligned_erms, __mempcpy_chk_sse2_unaligned_erms and __mempcpy_sse2_unaligned_erms. * sysdeps/x86_64/sysdep.h (REP_MOVSB_THRESHOLD): New. Enable __mempcpy_sse2_unaligned Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned. The new selection order is: 1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __mempcpy_sse2 if SSSE3 isn't available. 4. __mempcpy_ssse3_back if Fast_Copy_Backward bit it set. 5. __mempcpy_ssse3 [BZ #19776] * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned. 
Enable __mempcpy_chk_sse2_unaligned Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new selection order is: 1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __mempcpy_chk_sse2 if SSSE3 isn't available. 4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit it set. 5. __mempcpy_chk_ssse3 [BZ #19776] * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned. Enable __memcpy_chk_sse2_unaligned Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new selection order is: 1. __memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_chk_sse2 if SSSE3 isn't available. 4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit it set. 5. __memcpy_chk_ssse3 [BZ #19776] * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned. Use Hybrid_ERMS in memcpy.S Use Hybrid_ERMS in mempcpy.S Add memcpy-sse2-unaligned-erms.S Add initial memcpy-vec-unaligned-erms.S Add memcpy-avx-unaligned-erms.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=649521e48efafd5fccac8f405c2166749b6d59f4 commit 649521e48efafd5fccac8f405c2166749b6d59f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 12:36:03 2016 -0700 Add Hybrid_ERMS and use it in memset.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8c36a7002d4a2956e7b3701e4094382148b22eed commit 8c36a7002d4a2956e7b3701e4094382148b22eed Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 10:34:07 2016 -0700 Add __memset_avx2_erms and __memset_chk_avx2_erms https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cb894908a84c04d7fe3fbf164613fb5b5bac0737 commit cb894908a84c04d7fe3fbf164613fb5b5bac0737 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Fri Mar 18 10:07:48 2016 -0700 Remove mempcpy-*.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=10e38ad556275d8d8a4bb2ea1423f1e160f259ee commit 10e38ad556275d8d8a4bb2ea1423f1e160f259ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 13:37:31 2016 -0800 Merge memcpy with mempcpy https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8870d243dd0139ef9954cad6e399327c2406e1 commit 0c8870d243dd0139ef9954cad6e399327c2406e1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 09:22:56 2016 -0700 Add __memset_sse2_erms and __memset_chk_sse2_erms * sysdeps/x86_64/memset.S (__memset_chk_sse2_erms): New function. (__memset_sse2_erms): Likewise. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_erms and __memset_sse2_erms. * sysdeps/x86_64/sysdep.h (REP_STOSB_THRESHOLD): New. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=84e04147d1e70e8ba00c8f46923228cebbf0a64b commit 84e04147d1e70e8ba00c8f46923228cebbf0a64b Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 11 08:51:16 2014 -0700 Test 32-bit ERMS memcpy/memset * sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f77aa0315f706aff7516af601a04568390422dab commit f77aa0315f706aff7516af601a04568390422dab Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 11 08:25:17 2014 -0700 Test 64-bit ERMS memcpy/memset * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __memmove_chk_erms, __memmove_erms, __memset_erms, __memset_chk_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=eade49cad4007c094dacce2967a7f13166a46dae commit eade49cad4007c094dacce2967a7f13166a46dae Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Wed Sep 21 15:21:28 2011 -0700 Add 32it ERMS memcpy/memset * sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms. * sysdeps/i386/i686/multiarch/bcopy-erms.S: New file. * sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memset-erms.S: Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7. * sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS. * sysdeps/i386/i686/multiarch/bzero.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise. * sysdeps/i386/i686/multiarch/memmove.S: Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise. * sysdeps/i386/i686/multiarch/memset.S: Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=312879aedef40f9e1eb30c517d516c0a29cd030c commit 312879aedef40f9e1eb30c517d516c0a29cd030c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Sep 15 16:16:10 2011 -0700 Add 64-bit ERMS memcpy and memset * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcpy-erms, mempcpy-erms, memmove-erms and memset-erms. * sysdeps/x86_64/multiarch/memcpy-erms.S: New. * sysdeps/x86_64/multiarch/memmove-erms.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-erms.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Enable ERMS optimization for Fast_ERMS. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. 
* sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=487dc028667e43ed55f407fda3c07fc31ecd1554 commit 487dc028667e43ed55f407fda3c07fc31ecd1554 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Sep 15 15:47:01 2011 -0700 Initial ERMS support * sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New. (bit_cpu_ERMS): Likewise. (index_cpu_ERMS): Likewise. (index_arch_Fast_ERMS): Likewise. (reg_ERMS): Likewise. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/hybrid has been created at 4e9c954f3d3ed00e71cdab8411fcdc27992af7f9 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4e9c954f3d3ed00e71cdab8411fcdc27992af7f9
commit 4e9c954f3d3ed00e71cdab8411fcdc27992af7f9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add Hybrid_ERMS and use it in memset.S

Add entry points for __mempcpy_sse2_unaligned and _chk functions: add entry points for __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.

Add sse2_unaligned_erms versions of memcpy/mempcpy.

[BZ #19776]
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned and __mempcpy_sse2_unaligned.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned_erms, __memcpy_sse2_unaligned_erms, __mempcpy_chk_sse2_unaligned_erms and __mempcpy_sse2_unaligned_erms.
* sysdeps/x86_64/sysdep.h (REP_MOVSB_THRESHOLD): New.

Enable __mempcpy_sse2_unaligned: check Fast_Unaligned_Load for __mempcpy_sse2_unaligned. The new selection order is:
1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_ssse3

[BZ #19776]
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.

Enable __mempcpy_chk_sse2_unaligned: check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new selection order is:
1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_chk_sse2 if SSSE3 isn't available.
4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_chk_ssse3

[BZ #19776]
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.

Enable __memcpy_chk_sse2_unaligned: check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new selection order is:
1. __memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_chk_sse2 if SSSE3 isn't available.
4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
5. __memcpy_chk_ssse3

[BZ #19776]
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Check Fast_Unaligned_Load to enable __memcpy_chk_sse2_unaligned.

Use Hybrid_ERMS in memcpy.S. Use Hybrid_ERMS in mempcpy.S. Add memcpy-sse2-unaligned-erms.S. Add initial memcpy-vec-unaligned-erms.S. Add memcpy-avx-unaligned-erms.S.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d3d2d8ea0e81400836ea1e2d45cea10685602aa4
commit d3d2d8ea0e81400836ea1e2d45cea10685602aa4
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 10:34:07 2016 -0700

Add __memset_avx2_erms and __memset_chk_avx2_erms

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca72d26d16610aba2a31a4e1254cfe305b7f8c04
commit ca72d26d16610aba2a31a4e1254cfe305b7f8c04
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 09:22:56 2016 -0700

Add __memset_sse2_erms and __memset_chk_sse2_erms

* sysdeps/x86_64/memset.S (__memset_chk_sse2_erms): New function.
(__memset_sse2_erms): Likewise.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_erms and __memset_sse2_erms.
* sysdeps/x86_64/sysdep.h (REP_STOSB_THRESHOLD): New.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d459d3a2cc8970b326b92466e42ed25a79bf63b1
commit d459d3a2cc8970b326b92466e42ed25a79bf63b1
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:25:17 2014 -0700

Test 64-bit ERMS memcpy/memset

* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __memmove_chk_erms, __memmove_erms, __memset_erms, __memset_chk_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6030adff9f77fcdb8b4cd83f2bdadf80d34dc5f2
commit 6030adff9f77fcdb8b4cd83f2bdadf80d34dc5f2
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 16:16:10 2011 -0700

Add 64-bit ERMS memcpy and memset

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcpy-erms, mempcpy-erms, memmove-erms and memset-erms.
* sysdeps/x86_64/multiarch/memcpy-erms.S: New.
* sysdeps/x86_64/multiarch/memmove-erms.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy.S: Enable ERMS optimization for Fast_ERMS.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memset.S: Likewise.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=231b08122143d3f1973b0cb1e3dbc7de1aa6f5ba
commit 231b08122143d3f1973b0cb1e3dbc7de1aa6f5ba
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:51:16 2014 -0700

Test 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b668540d1108ae2ec2605577e3a310b211c73757
commit b668540d1108ae2ec2605577e3a310b211c73757
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 21 15:21:28 2011 -0700

Add 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms.
* sysdeps/i386/i686/multiarch/bcopy-erms.S: New file.
* sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memset-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7.
* sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS.
* sysdeps/i386/i686/multiarch/bzero.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memset.S: Likewise.
* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f1053879ad4c4bc61d81d86f9a219fd33c1efc60
commit f1053879ad4c4bc61d81d86f9a219fd33c1efc60
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial ERMS support

* sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New.
(bit_cpu_ERMS): Likewise.
(index_cpu_ERMS): Likewise.
(index_arch_Fast_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5c94fdcb46c4116cd8bb57868f8f4ccf4a09990b
commit 5c94fdcb46c4116cd8bb57868f8f4ccf4a09990b
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e0676f43f820875a620c9070521e50608040e091
commit e0676f43f820875a620c9070521e50608040e091
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It will reduce code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.

-----------------------------------------------------------------------
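The selection orders quoted in the log above can be sketched as a plain dispatch function. The feature-flag names mirror glibc's cpu-features bits, but the function itself is a hypothetical illustration of the ordering, not glibc's ifunc resolver:

```c
#include <assert.h>
#include <string.h>

/* Sketch of the __mempcpy ifunc selection order described in the log.
   Returns the name of the variant that would be chosen. */
static const char *
select_mempcpy (int avx_fast_unaligned_load, int fast_unaligned_load,
                int has_ssse3, int fast_copy_backward)
{
  if (avx_fast_unaligned_load)
    return "__mempcpy_avx_unaligned";   /* 1. AVX_Fast_Unaligned_Load */
  if (fast_unaligned_load)
    return "__mempcpy_sse2_unaligned";  /* 2. Fast_Unaligned_Load */
  if (!has_ssse3)
    return "__mempcpy_sse2";            /* 3. no SSSE3 available */
  if (fast_copy_backward)
    return "__mempcpy_ssse3_back";      /* 4. Fast_Copy_Backward */
  return "__mempcpy_ssse3";             /* 5. default SSSE3 */
}
```

The __mempcpy_chk and __memcpy_chk selectors in the same commits follow the identical order with the _chk variants substituted.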
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/hybrid has been created at c2f9d014849d725458b546a351a9b0a93212d613 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2f9d014849d725458b546a351a9b0a93212d613
commit c2f9d014849d725458b546a351a9b0a93212d613
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add memset-vec-unaligned-erms.S

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned_erms and __memset_avx512_unaligned_erms.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset.S (memset): Support Hybrid_ERMS.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9dab693aead878c81359db1bf0e40af04139d33e
commit 9dab693aead878c81359db1bf0e40af04139d33e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_ssse3

[BZ #19776]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_arch_Hybrid_ERMS if ERMS is supported.
* sysdeps/x86/cpu-features.h (bit_arch_Hybrid_ERMS): New.
(index_arch_Hybrid_ERMS): Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memcpy (__new_memcpy): Support Hybrid_ERMS.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c821d24e8dfc115875aa9d71cab5f0985fbdf533
commit c821d24e8dfc115875aa9d71cab5f0985fbdf533
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:51:16 2014 -0700

Test 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6261a13bbe8324b83cc33626174e4f7fa65c2290
commit 6261a13bbe8324b83cc33626174e4f7fa65c2290
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 21 15:21:28 2011 -0700

Add 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms.
* sysdeps/i386/i686/multiarch/bcopy-erms.S: New file.
* sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memset-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7.
* sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS.
* sysdeps/i386/i686/multiarch/bzero.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memset.S: Likewise.
* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=74b015bd5ddf46896c16a2398f3491014f5f4ba8
commit 74b015bd5ddf46896c16a2398f3491014f5f4ba8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial ERMS support

* sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New.
(bit_cpu_ERMS): Likewise.
(index_cpu_ERMS): Likewise.
(index_arch_Fast_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
commit 47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e0676f43f820875a620c9070521e50608040e091
commit e0676f43f820875a620c9070521e50608040e091
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It will reduce code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.

-----------------------------------------------------------------------
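The *_unaligned_erms variants added above are "hybrid": small copies stay on the vector (unaligned load/store) path, and only sizes at or above a threshold switch to `rep movsb`. REP_MOVSB_THRESHOLD is the knob added to sysdeps/x86_64/sysdep.h; the value and helper names below are made up for illustration, with plain memcpy standing in for both code paths:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REP_MOVSB_THRESHOLD 2048   /* hypothetical value */

/* Stand-in for the SSE2/AVX unaligned-load loop.  */
static void *vector_copy (void *dst, const void *src, size_t n)
{ return memcpy (dst, src, n); }

/* Stand-in for the "rep movsb" path.  */
static void *erms_copy (void *dst, const void *src, size_t n)
{ return memcpy (dst, src, n); }

/* Sketch of the hybrid dispatch inside a *_unaligned_erms routine.  */
static void *
hybrid_memcpy (void *dst, const void *src, size_t n)
{
  if (n >= REP_MOVSB_THRESHOLD)
    return erms_copy (dst, src, n);    /* large: rep movsb wins */
  return vector_copy (dst, src, n);    /* small: vector loop wins */
}
```

The point of the threshold is that `rep movsb` has fixed startup cost, so it only pays off once the copy is large enough to amortize it.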
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/hybrid has been created at d512907b70ce4e27211433298cb90b99fe6ce32c (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d512907b70ce4e27211433298cb90b99fe6ce32c
commit d512907b70ce4e27211433298cb90b99fe6ce32c
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 19:29:41 2016 -0700

Avoid slow backward REP MOVSB

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2f9d014849d725458b546a351a9b0a93212d613
commit c2f9d014849d725458b546a351a9b0a93212d613
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add memset-vec-unaligned-erms.S

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned_erms and __memset_avx512_unaligned_erms.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset.S (memset): Support Hybrid_ERMS.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9dab693aead878c81359db1bf0e40af04139d33e
commit 9dab693aead878c81359db1bf0e40af04139d33e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_ssse3

[BZ #19776]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_arch_Hybrid_ERMS if ERMS is supported.
* sysdeps/x86/cpu-features.h (bit_arch_Hybrid_ERMS): New.
(index_arch_Hybrid_ERMS): Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memcpy (__new_memcpy): Support Hybrid_ERMS.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c821d24e8dfc115875aa9d71cab5f0985fbdf533
commit c821d24e8dfc115875aa9d71cab5f0985fbdf533
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:51:16 2014 -0700

Test 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6261a13bbe8324b83cc33626174e4f7fa65c2290
commit 6261a13bbe8324b83cc33626174e4f7fa65c2290
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 21 15:21:28 2011 -0700

Add 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms.
* sysdeps/i386/i686/multiarch/bcopy-erms.S: New file.
* sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memset-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7.
* sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS.
* sysdeps/i386/i686/multiarch/bzero.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memset.S: Likewise.
* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=74b015bd5ddf46896c16a2398f3491014f5f4ba8
commit 74b015bd5ddf46896c16a2398f3491014f5f4ba8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial ERMS support

* sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New.
(bit_cpu_ERMS): Likewise.
(index_cpu_ERMS): Likewise.
(index_arch_Fast_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
commit 47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e0676f43f820875a620c9070521e50608040e091
commit e0676f43f820875a620c9070521e50608040e091
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It will reduce code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.

-----------------------------------------------------------------------
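The "Avoid slow backward REP MOVSB" commit above reflects that `rep movsb` with the direction flag set (copying downward) is slow even on ERMS hardware, so overlapping moves with dst above src take an explicit backward loop instead. A byte-level sketch of that backward path (the real memmove-vec-unaligned-erms.S does this with vector loads; the function name here is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Backward copy used when dst overlaps src with dst > src: copying
   from the tail means no source byte is overwritten before it is read. */
static void *
move_backward (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n--)
    d[n] = s[n];
  return dst;
}
```

Forward `rep movsb` (direction flag clear) remains safe and fast for the dst < src overlap case, which is why only the backward direction needs the special path.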
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/hybrid has been created at a52f4b20cd0a5dd6179e6c7b28973d0105314a4f (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a52f4b20cd0a5dd6179e6c7b28973d0105314a4f commit a52f4b20cd0a5dd6179e6c7b28973d0105314a4f Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 25 08:20:17 2016 -0700 Add memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset.S (memset): Support Hybrid_ERMS. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4d7fddd84a5993fec253c5b677688fa836a5a806 commit 4d7fddd84a5993fec253c5b677688fa836a5a806 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 12:36:03 2016 -0700 Add x86-64 memmove with vector unaliged loads and rep movsb 1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __mempcpy_sse2 if SSSE3 isn't available. 4. 
__mempcpy_ssse3_back if Fast_Copy_Backward bit it set. 5. __mempcpy_ssse3 [BZ #19776] * sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_arch_Hybrid_ERMS if ERMS is supported. * sysdeps/x86/cpu-features.h (bit_arch_Hybrid_ERMS): New. (index_arch_Hybrid_ERMS): Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memcpy (__new_memcpy): Support Hybrid_ERMS. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likwise. 
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likwise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c821d24e8dfc115875aa9d71cab5f0985fbdf533 commit c821d24e8dfc115875aa9d71cab5f0985fbdf533 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 11 08:51:16 2014 -0700 Test 32-bit ERMS memcpy/memset * sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6261a13bbe8324b83cc33626174e4f7fa65c2290 commit 6261a13bbe8324b83cc33626174e4f7fa65c2290 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Sep 21 15:21:28 2011 -0700 Add 32it ERMS memcpy/memset * sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms. * sysdeps/i386/i686/multiarch/bcopy-erms.S: New file. * sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise. * sysdeps/i386/i686/multiarch/memset-erms.S: Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7. * sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS. * sysdeps/i386/i686/multiarch/bzero.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy.S: Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise. * sysdeps/i386/i686/multiarch/memmove.S: Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S: Likewise. 
* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memset.S: Likewise.
* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=74b015bd5ddf46896c16a2398f3491014f5f4ba8
commit 74b015bd5ddf46896c16a2398f3491014f5f4ba8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial ERMS support

* sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New.
(bit_cpu_ERMS): Likewise.
(index_cpu_ERMS): Likewise.
(index_arch_Fast_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
commit 47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e0676f43f820875a620c9070521e50608040e091
commit e0676f43f820875a620c9070521e50608040e091
Author: H.J.
Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It will reduce code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/hybrid has been created at 0738cd7f51dfb72dc33cfdf84072ed463e7d69d1 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0738cd7f51dfb72dc33cfdf84072ed463e7d69d1
commit 0738cd7f51dfb72dc33cfdf84072ed463e7d69d1
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add memset-vec-unaligned-erms.S

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset.S (memset): Support Hybrid_ERMS.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4d7fddd84a5993fec253c5b677688fa836a5a806
commit 4d7fddd84a5993fec253c5b677688fa836a5a806
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4.
__mempcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __mempcpy_ssse3

[BZ #19776]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_arch_Hybrid_ERMS if ERMS is supported.
* sysdeps/x86/cpu-features.h (bit_arch_Hybrid_ERMS): New.
(index_arch_Hybrid_ERMS): Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support Hybrid_ERMS.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c821d24e8dfc115875aa9d71cab5f0985fbdf533
commit c821d24e8dfc115875aa9d71cab5f0985fbdf533
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:51:16 2014 -0700

Test 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6261a13bbe8324b83cc33626174e4f7fa65c2290
commit 6261a13bbe8324b83cc33626174e4f7fa65c2290
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 21 15:21:28 2011 -0700

Add 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms.
* sysdeps/i386/i686/multiarch/bcopy-erms.S: New file.
* sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memset-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7.
* sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS.
* sysdeps/i386/i686/multiarch/bzero.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memset.S: Likewise.
* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=74b015bd5ddf46896c16a2398f3491014f5f4ba8
commit 74b015bd5ddf46896c16a2398f3491014f5f4ba8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial ERMS support

* sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New.
(bit_cpu_ERMS): Likewise.
(index_cpu_ERMS): Likewise.
(index_arch_Fast_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
commit 47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e0676f43f820875a620c9070521e50608040e091
commit e0676f43f820875a620c9070521e50608040e091
Author: H.J.
Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It will reduce code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/hybrid has been created at c1cf88aac1a30ece7daf9b299d0a1f5ee0e9129a (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c1cf88aac1a30ece7daf9b299d0a1f5ee0e9129a
commit c1cf88aac1a30ece7daf9b299d0a1f5ee0e9129a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add memset-vec-unaligned-erms.S

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset.S (memset): Support Hybrid_ERMS.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0f99b35129a93af08c4be10c37e1d75cfbd642d7
commit 0f99b35129a93af08c4be10c37e1d75cfbd642d7
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4.
__mempcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __mempcpy_ssse3

[BZ #19776]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_arch_Hybrid_ERMS if ERMS is supported.
* sysdeps/x86/cpu-features.h (bit_arch_Hybrid_ERMS): New.
(index_arch_Hybrid_ERMS): Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support Hybrid_ERMS.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c821d24e8dfc115875aa9d71cab5f0985fbdf533
commit c821d24e8dfc115875aa9d71cab5f0985fbdf533
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 11 08:51:16 2014 -0700

Test 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __bcopy_erms, __bzero_erms, __memmove_chk_erms, __memmove_erms, __memset_chk_erms, __memset_erms, __memcpy_chk_erms, __memcpy_erms, __mempcpy_chk_erms and __mempcpy_erms.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6261a13bbe8324b83cc33626174e4f7fa65c2290
commit 6261a13bbe8324b83cc33626174e4f7fa65c2290
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 21 15:21:28 2011 -0700

Add 32-bit ERMS memcpy/memset

* sysdeps/i386/i686/multiarch/Makefile (sysdep_routines): Add bcopy-erms, memcpy-erms, memmove-erms, mempcpy-erms, bzero-erms and memset-erms.
* sysdeps/i386/i686/multiarch/bcopy-erms.S: New file.
* sysdeps/i386/i686/multiarch/bzero-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/memset-erms.S: Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Add COMMON_CPUID_INDEX_7.
* sysdeps/i386/i686/multiarch/bcopy.S: Enable ERMS optimization for Fast_ERMS.
* sysdeps/i386/i686/multiarch/bzero.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove.S: Likewise.
* sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/i386/i686/multiarch/memset.S: Likewise.
* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=74b015bd5ddf46896c16a2398f3491014f5f4ba8
commit 74b015bd5ddf46896c16a2398f3491014f5f4ba8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial ERMS support

* sysdeps/x86/cpu-features.h (bit_arch_Fast_ERMS): New.
(bit_cpu_ERMS): Likewise.
(index_cpu_ERMS): Likewise.
(index_arch_Fast_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
commit 47e27ea9dad4e316ed4fe1cabe5774b72e4d9abf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e0676f43f820875a620c9070521e50608040e091
commit e0676f43f820875a620c9070521e50608040e091
Author: H.J.
Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It will reduce code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/master has been created at 8474e0e809c4be11eea35f7c10bb26cb36a69b98 (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8474e0e809c4be11eea35f7c10bb26cb36a69b98
commit 8474e0e809c4be11eea35f7c10bb26cb36a69b98
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add memset-vec-unaligned-erms.S

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset.S (memset): Support ERMS.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ff16c1a2afa9863255e0b3e3edf4361e0896f319
commit ff16c1a2afa9863255e0b3e3edf4361e0896f319
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 04:36:03 2016 -0700

Add comments to memmove-vec-unaligned-erms.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6ebcdb9629a9f458fe5e15328a629d86f909cc12
commit 6ebcdb9629a9f458fe5e15328a629d86f909cc12
Author: H.J.
Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_ssse3

[BZ #19776]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_arch_Hybrid_ERMS if ERMS is supported.
* sysdeps/x86/cpu-features.h (bit_arch_Hybrid_ERMS): New.
(index_arch_Hybrid_ERMS): Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support ERMS.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=81641f1f3aafc33f33f20e02f88e696480578195
commit 81641f1f3aafc33f33f20e02f88e696480578195
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial Enhanced REP MOVSB/STOSB (ERMS) support

The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features.

* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
(index_cpu_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c6a93f91c66e4de17d1e1e25a87ac8fc460a02b0
commit c6a93f91c66e4de17d1e1e25a87ac8fc460a02b0
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=85f40c56ef1e0a1716a85943bfd2824663b976c0
commit 85f40c56ef1e0a1716a85943bfd2824663b976c0
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It will reduce code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
(MEMPCPY): Likewise.
* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/master has been created at 0f7985d681251d1a2c4a0e8a1bee97092a7dfccf (commit)

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0f7985d681251d1a2c4a0e8a1bee97092a7dfccf
commit 0f7985d681251d1a2c4a0e8a1bee97092a7dfccf
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add x86-64 memset with vector unaligned stores and rep stosb

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset.S (memset): Support ERMS.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=70579c5ae8056ff4e0c30a063ace064b7b45f50a
commit 70579c5ae8056ff4e0c30a063ace064b7b45f50a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3.
__mempcpy_sse2 if SSSE3 isn't available.
4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_ssse3

[BZ #19776]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_arch_Hybrid_ERMS if ERMS is supported.
* sysdeps/x86/cpu-features.h (bit_arch_Hybrid_ERMS): New.
(index_arch_Hybrid_ERMS): Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support ERMS.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=81641f1f3aafc33f33f20e02f88e696480578195
commit 81641f1f3aafc33f33f20e02f88e696480578195
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial Enhanced REP MOVSB/STOSB (ERMS) support

The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features.

* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
(index_cpu_ERMS): Likewise.
(reg_ERMS): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c6a93f91c66e4de17d1e1e25a87ac8fc460a02b0
commit c6a93f91c66e4de17d1e1e25a87ac8fc460a02b0
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 06:32:44 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, we can make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
(MEMCPY): Don't define.
(MEMCPY_CHK): Likewise.
(MEMPCPY): Likewise.
(MEMPCPY_CHK): Likewise.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMPCPY_CHK): Renamed to ...
(__mempcpy_chk_avx512_no_vzeroupper): This.
(MEMCPY_CHK): Renamed to ...
(__memmove_chk_avx512_no_vzeroupper): This.
(MEMCPY): Renamed to ...
(__memmove_avx512_no_vzeroupper): This.
(__memcpy_avx512_no_vzeroupper): New alias.
(__memcpy_chk_avx512_no_vzeroupper): Likewise.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=85f40c56ef1e0a1716a85943bfd2824663b976c0

commit 85f40c56ef1e0a1716a85943bfd2824663b976c0
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 6 13:37:31 2016 -0800

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It will reduce the code size of libc.so.

	[BZ #18858]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
-----------------------------------------------------------------------
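The trick that lets mempcpy live inside memcpy can be sketched in C: the two functions share the entire copy body and differ only in the return value. This is an illustrative sketch, not the glibc assembly; `my_mempcpy` is a made-up name.

```c
#include <string.h>

/* Illustrative only: mempcpy differs from memcpy solely in its return
   value, so one shared copy body can serve both entry points.  */
static void *my_mempcpy (void *dst, const void *src, size_t n)
{
  memcpy (dst, src, n);        /* shared copy path */
  return (char *) dst + n;     /* mempcpy returns one past the end */
}
```

In the assembly version the mempcpy entry computes the return value up front and then falls through into the common memcpy body, so the copy loop exists only once.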
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/master has been created at 9a4aba90edef8b8635712299e78a525380420dff (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a4aba90edef8b8635712299e78a525380420dff

commit 9a4aba90edef8b8635712299e78a525380420dff
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 13 00:26:57 2016 -0800

Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2240bc05c7efea04397719be2c1ccd8f8e8b745

commit c2240bc05c7efea04397719be2c1ccd8f8e8b745
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add x86-64 memset with vector unaligned stores and rep stosb

Implement x86-64 memset with vector unaligned stores and rep stosb. Key features:

1. Use overlapping register store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector register size at a time.
4. If size > REP_STOSB_THRESHOLD, use rep stosb.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ae9731e6d4c3b97bdd152fb57110ddcecdb210f8

commit ae9731e6d4c3b97bdd152fb57110ddcecdb210f8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

Implement x86-64 memmove with vector unaligned loads and rep movsb. Key features:

1. Use overlapping register load/store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. If size > REP_MOVSB_THRESHOLD, use rep movsb for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time.

Also provide an alias for memcpy.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

Fix sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30b534f494028b891e15b74e7fd232429d685295

commit 30b534f494028b891e15b74e7fd232429d685295
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Sep 15 15:47:01 2011 -0700

Initial Enhanced REP MOVSB/STOSB (ERMS) support

The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID.
This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. -----------------------------------------------------------------------
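For reference, the ERMS feature bit that the commit above adds to cpu-features is reported in CPUID leaf 7 (subleaf 0), EBX bit 9. A minimal user-space check using GCC's `<cpuid.h>` helper (a sketch, not the glibc-internal `bit_cpu_ERMS` machinery) could look like:

```c
#include <cpuid.h>

/* Return 1 if the CPU advertises Enhanced REP MOVSB/STOSB (ERMS),
   i.e. CPUID.(EAX=7,ECX=0):EBX bit 9 is set; 0 otherwise.  */
static int has_erms (void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;                 /* leaf 7 not supported */
  return (ebx >> 9) & 1;
}
```

glibc performs the equivalent check once at startup in init_cpu_features and records the result, so the ifunc resolvers never re-execute CPUID.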
The branch, hjl/erms/master has been created at ae2460f45588f301579553cd108e29477488a4b9 (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ae2460f45588f301579553cd108e29477488a4b9

commit ae2460f45588f301579553cd108e29477488a4b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 13 00:26:57 2016 -0800

Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e068a5e5a7db76310931b5dc0244bd572cb7fe7

commit 6e068a5e5a7db76310931b5dc0244bd572cb7fe7
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add x86-64 memset with vector unaligned stores and rep stosb

Implement x86-64 memset with vector unaligned stores and rep stosb. Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector register size at a time.
4. If size > REP_STOSB_THRESHOLD, use rep stosb.

A single file provides 2 implementations of memset, one with rep stosb and one without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=395b59341eac10502cffe2e6de218af054b8415b

commit 395b59341eac10502cffe2e6de218af054b8415b
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

Implement x86-64 memmove with vector unaligned loads and rep movsb. Key features:

1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. If size > REP_MOVSB_THRESHOLD, use rep movsb for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time.

When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for the overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove.
A single file provides 2 implementations of memmove, one with rep movsb and one without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30b534f494028b891e15b74e7fd232429d685295 commit 30b534f494028b891e15b74e7fd232429d685295 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Sep 15 15:47:01 2011 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. -----------------------------------------------------------------------
The branch, hjl/erms/master has been created at 40c3dbef25441f79d0600dd14f1a3018159ba80c (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=40c3dbef25441f79d0600dd14f1a3018159ba80c

commit 40c3dbef25441f79d0600dd14f1a3018159ba80c
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 13 00:26:57 2016 -0800

Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a8190f265bedcc8754a74fa498c2cf2d0e829d8a

commit a8190f265bedcc8754a74fa498c2cf2d0e829d8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add x86-64 memset with vector unaligned stores and rep stosb

Implement x86-64 memset with vector unaligned stores and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD, which is 1KB for 16-byte vector register size and scaled up by larger vector register size. Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector register size at a time.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4e3afbf1b9f1f7a08e4040f0b9fd4000dd588e54

commit 4e3afbf1b9f1f7a08e4040f0b9fd4000dd588e54
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with vector unaligned loads and rep movsb

Implement x86-64 memmove with vector unaligned loads and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for the overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up by larger vector register size. Key features:

1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time.
4.
If address of destination > address of source, backward copy 8 times of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
-----------------------------------------------------------------------
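The "overlapping load and store" idea in feature 1 above can be sketched in C for one size class (8 to 16 bytes). Both loads happen before either store, so the copy is safe even when source and destination overlap, and one code path covers every size in the range with no inner branch. The function name is illustrative, not the glibc entry point.

```c
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16: load the first 8 and the last 8 bytes
   of the source before storing either; for n < 16 the two stores
   overlap in the middle, which is harmless and avoids a size branch.  */
static void copy_8_to_16 (void *dst, const void *src, size_t n)
{
  uint64_t head, tail;
  memcpy (&head, src, 8);                          /* first 8 bytes */
  memcpy (&tail, (const char *) src + n - 8, 8);   /* last 8 bytes  */
  memcpy (dst, &head, 8);
  memcpy ((char *) dst + n - 8, &tail, 8);
}
```

The assembly versions apply the same pattern with 16-, 32- or 64-byte vector registers, which is why a single source file can be built for each vector size.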
Created attachment 9138 [details] bench-memcpy data on various Intel and AMD processors
Created attachment 9139 [details] bench-memmove data on various Intel and AMD processors
The branch, hjl/erms/master has been created at 69e33f275a3280e4cd9c35bc5da91812cdb9e418 (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=69e33f275a3280e4cd9c35bc5da91812cdb9e418

commit 69e33f275a3280e4cd9c35bc5da91812cdb9e418
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 13 00:26:57 2016 -0800

Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ff672e01d44a76b1bbf86bb47bcfa86ce7c7c419

commit ff672e01d44a76b1bbf86bb47bcfa86ce7c7c419
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add x86-64 memset with unaligned store and rep stosb

Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD, which is 1KB for 16-byte vector register size and scaled up by larger vector register size. Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector register size at a time.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da344d24cc0ac98454105a94ebaf22997379c2fe

commit da344d24cc0ac98454105a94ebaf22997379c2fe
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with unaligned load/store and rep movsb

Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for the overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up by larger vector register size. Key features:

1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time.
4.
If address of destination > address of source, backward copy 8 times of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
-----------------------------------------------------------------------
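The rep movsb forward copy in feature 6 reduces, on x86-64, to a single three-register instruction. A minimal inline-assembly sketch (illustrative only; `copy_rep_movsb` is a made-up name, not the glibc entry point) could look like:

```c
#include <stddef.h>

/* Forward copy with "rep movsb": RDI = destination, RSI = source,
   RCX = byte count.  On ERMS hardware this is competitive with vector
   loops above REP_MOVSB_THRESHOLD.  It must only be used for forward
   copies; a backward-overlapping copy needs a different path.  */
static void *copy_rep_movsb (void *dst, const void *src, size_t n)
{
  void *ret = dst;
  __asm__ volatile ("rep movsb"
                    : "+D" (dst), "+S" (src), "+c" (n)
                    : : "memory");
  return ret;
}
```

This is why the commit falls back to the vector backward loop instead of a backward (direction-flag) rep movsb: ERMS only accelerates the forward form.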
Asked for clarification of performance loss for Ivybridge, Penryn, and Bulldozer: https://www.sourceware.org/ml/libc-alpha/2016-03/msg00701.html
(In reply to Carlos O'Donell from comment #28)
> Asked for clarification of performance loss for Ivybridge, Penryn, and
> Bulldozer:
> https://www.sourceware.org/ml/libc-alpha/2016-03/msg00701.html

The new one will replace the old one. That is, the new SSE2/AVX replaces the old SSE2/AVX. It won't change the choice of SSE2, SSSE3 or AVX for a given processor. Performance on Penryn and Bulldozer won't change. Ivy Bridge will see a performance gain with the new __memcpy_sse2_unaligned_erms.
The branch, hjl/erms/master has been created at d745a49ca39f47a701352b593151d8839dcba554 (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d745a49ca39f47a701352b593151d8839dcba554

commit d745a49ca39f47a701352b593151d8839dcba554
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 13 00:26:57 2016 -0800

Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a16432df8b46ce4db3f1b051ae61eda285833b42

commit a16432df8b46ce4db3f1b051ae61eda285833b42
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add x86-64 memset with unaligned store and rep stosb

Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD, which is 1KB for 16-byte vector register size and scaled up by larger vector register size. Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector register size at a time.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=071b7236f96e0649e0728a83150f9fa52487563a

commit 071b7236f96e0649e0728a83150f9fa52487563a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 18 12:36:03 2016 -0700

Add x86-64 memmove with unaligned load/store and rep movsb

Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for the overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up by larger vector register size. Key features:

1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time.
4.
If address of destination > address of source, backward copy 8 times of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time.
7. Skip when address of destination == address of source.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
-----------------------------------------------------------------------
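The same overlapping-store idea drives the memset commits above (feature 1): for 8 to 16 bytes, broadcast the byte into a 64-bit word and store it at both ends of the buffer, so one branch-free path covers the whole size range. A sketch in C, with a made-up function name:

```c
#include <stdint.h>
#include <string.h>

/* Set n bytes, 8 <= n <= 16, to the byte c: one 8-byte store at each
   end of the buffer; the two stores overlap for n < 16, so no size
   branch is needed inside the range.  */
static void set_8_to_16 (void *dst, int c, size_t n)
{
  /* Replicate the byte into all 8 lanes of a 64-bit word.  */
  uint64_t v = (unsigned char) c * UINT64_C (0x0101010101010101);
  memcpy (dst, &v, 8);                      /* first 8 bytes */
  memcpy ((char *) dst + n - 8, &v, 8);     /* last 8 bytes  */
}
```

The vector implementations do the same with 16-, 32- or 64-byte registers, and switch to rep stosb only above REP_STOSB_THRESHOLD.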
The branch, hjl/erms/master has been created at 68d440000cbbdc2480943db27bf539ba712d1607 (commit).

- Log -----------------------------------------------------------------

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68d440000cbbdc2480943db27bf539ba712d1607

commit 68d440000cbbdc2480943db27bf539ba712d1607
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Mar 13 00:26:57 2016 -0800

Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=42aacde6879da25be932fabd05f05fa612d7c69e

commit 42aacde6879da25be932fabd05f05fa612d7c69e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 25 08:20:17 2016 -0700

Add x86-64 memset with unaligned store and rep stosb

Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector register size at a time.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=071b7236f96e0649e0728a83150f9fa52487563a commit 071b7236f96e0649e0728a83150f9fa52487563a Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 12:36:03 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4.
If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. -----------------------------------------------------------------------
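The copy-direction decision described in steps 4, 5 and 7 of the commit message above can be sketched in C. This is a simplified illustration with byte copies (the real code moves 8 vector registers per iteration and may switch to rep movsb for large forward copies); `move_bytes` is a hypothetical name, not a glibc symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified memmove direction logic:
   - dst == src: nothing to do (step 7);
   - dst inside [src, src + n): forward copy would clobber unread source
     bytes, so copy backward (step 4);
   - otherwise a forward copy is safe (step 5).  */
static void move_bytes (char *dst, const char *src, size_t n)
{
  if (dst == src || n == 0)
    return;
  if (dst > src && dst < src + n)
    while (n--)                      /* backward: high addresses first */
      dst[n] = src[n];
  else
    for (size_t i = 0; i < n; i++)   /* forward: low addresses first */
      dst[i] = src[i];
}
```

Step 6 of the commit message reflects a hardware constraint: rep movsb only moves bytes in ascending address order, so it can stand in for the forward loop but never for the backward one, which is why an overlapping backward copy must fall back to the vector loop.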
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/master has been created at 0db56470f1bee39a252daf2728d818296b179a9e (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0db56470f1bee39a252daf2728d818296b179a9e commit 0db56470f1bee39a252daf2728d818296b179a9e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 13 00:26:57 2016 -0800 Add memmove/memset-avx512-unaligned-erms-no-vzeroupper.S https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7df7c6a195d6bc6ffdd90db0786d5de9c67d037a commit 7df7c6a195d6bc6ffdd90db0786d5de9c67d037a Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 25 08:20:17 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d1f2de07cb44abfb9e78f825e3edf2490cf1057c commit d1f2de07cb44abfb9e78f825e3edf2490cf1057c Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 18 12:36:03 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4.
If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. -----------------------------------------------------------------------
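The memset commit in this push uses the same branch-avoidance idea as memmove, but with only a store side ("1. Use overlapping store to avoid branch"). A minimal C sketch, with 8-byte scalar stores standing in for vector stores and a hypothetical helper name `set_8_to_16`: for 8 <= n <= 16, a store at the start and a store ending at the last byte cover the buffer exactly, overlapping in the middle, with no branch on n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Branchless fill for 8 <= n <= 16 using two possibly-overlapping
   8-byte stores.  Multiplying the byte by 0x0101...01 replicates it
   into all 8 lanes of the pattern word.  */
static void set_8_to_16 (void *dst, int c, size_t n)
{
  uint64_t pattern = 0x0101010101010101ULL * (unsigned char) c;
  memcpy (dst, &pattern, 8);                  /* bytes [0, 8)     */
  memcpy ((char *) dst + n - 8, &pattern, 8); /* bytes [n-8, n)   */
}
```

As with the copy case, the covered range grows with the store width, so the vector implementations handle one size band per pair of overlapping stores.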
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, master has been updated via 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb (commit) from 5cdd1989d1d2f135d02e66250f37ba8e767f9772 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6.
Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.
-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                          |  40 ++
 sysdeps/x86_64/multiarch/Makefile                  |   5 +-
 sysdeps/x86_64/multiarch/ifunc-impl-list.c         |  99 +++++
 .../x86_64/multiarch/memmove-avx-unaligned-erms.S  |   9 +
 .../multiarch/memmove-avx512-unaligned-erms.S      |  11 +
 .../x86_64/multiarch/memmove-sse2-unaligned-erms.S |   9 +
 .../x86_64/multiarch/memmove-vec-unaligned-erms.S  | 462 ++++++++++++++++++++
 7 files changed, 634 insertions(+), 1 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
 create mode 100644 sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
 create mode 100644 sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
 create mode 100644 sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 320daec84c5d2c4a6a17a0043ca5f8c8fe30734e (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=320daec84c5d2c4a6a17a0043ca5f8c8fe30734e commit 320daec84c5d2c4a6a17a0043ca5f8c8fe30734e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. It also fixes the placement of __mempcpy_erms and __memmove_erms. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. 
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. 
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=503990f436e51cbd8f9479017fc0976687dcc90e commit 503990f436e51cbd8f9479017fc0976687dcc90e Author: H.J.
Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. 
Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
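The selection logic described in these commits (use the `_erms` variant only when the processor reports Enhanced REP MOVSB/STOSB) can be sketched as an IFUNC-style resolver in C. This is an illustration, not the glibc resolver: `cpu_has_erms`, `memset_unaligned` and `memset_unaligned_erms` are hypothetical names, and the feature check is stubbed out so the example is self-contained (the real code reads the CPUID feature bit).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memset_fn) (void *, int, size_t);

/* Stub for the CPUID-based ERMS feature check; always "no" here.  */
static int cpu_has_erms (void) { return 0; }

/* Stand-ins for the vector and rep-stosb-based implementations;
   both delegate to libc memset so the sketch runs anywhere.  */
static void *memset_unaligned (void *s, int c, size_t n)
{ return memset (s, c, n); }
static void *memset_unaligned_erms (void *s, int c, size_t n)
{ return memset (s, c, n); }

/* Resolver: picked once at startup, like an IFUNC resolver.  */
static memset_fn select_memset (void)
{
  return cpu_has_erms () ? memset_unaligned_erms : memset_unaligned;
}
```

Either way the caller sees one `memset` symbol; only the implementation chosen at load time differs, which is why swapping the old SSE2/AVX2 variants for the new ones needs no caller-side changes.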
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 6449801fe6f0d733f6fda77f057bd60d9091ebba (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6449801fe6f0d733f6fda77f057bd60d9091ebba commit 6449801fe6f0d733f6fda77f057bd60d9091ebba Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. It also fixes the placement of __mempcpy_erms and __memmove_erms. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. 
* sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. 
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=503990f436e51cbd8f9479017fc0976687dcc90e commit 503990f436e51cbd8f9479017fc0976687dcc90e Author: H.J.
Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. 
Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created at e499a177b666cb39041e4aa70582742f7844b685 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e499a177b666cb39041e4aa70582742f7844b685

commit e499a177b666cb39041e4aa70582742f7844b685
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones.

No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e01f71b4161a485bdca91ba95b7f3d976739291a

commit e01f71b4161a485bdca91ba95b7f3d976739291a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes.

No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created at e3233fab32b028ca9630d28afff6f9c97a8f0a51 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e3233fab32b028ca9630d28afff6f9c97a8f0a51

commit e3233fab32b028ca9630d28afff6f9c97a8f0a51
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=740dd54823c1514e339186aab3868a69af2836a9

commit 740dd54823c1514e339186aab3868a69af2836a9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones.

No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=003dea41324380333f1c3eca867ad012f6ca549f

commit 003dea41324380333f1c3eca867ad012f6ca549f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes.

No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, hjl/erms/2.23 has been created at 4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe

commit 4e339b9dc65217fb9b9be6cdc0e991f4ae64ccfe
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=997e6c0db2c351f4a7b688c3134c1f77a0aa49de

commit 997e6c0db2c351f4a7b688c3134c1f77a0aa49de
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones.

No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms.

[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ff8c6a7b53c5bb28ac3d3e0ae8da8099491b16c

commit 0ff8c6a7b53c5bb28ac3d3e0ae8da8099491b16c
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets

Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes.

No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9

commit cfb059c79729b26284863334c9aa04f0a3b967b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 15:08:48 2016 -0700

Remove Fast_Copy_Backward from Intel Core processors

Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion.

* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors.

(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f

commit 30c389be1af67c4d0716d207b6780c6169d1355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:05:51 2016 -0700

Add x86-64 memset with unaligned store and rep stosb

Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes.
A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB.

Key features:
1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector register size at a time.

[BZ #19881]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.

(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650

commit 980d639b4ae58209843f09a29d86b0a8303b6650
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:04:26 2016 -0700

Add x86-64 memmove with unaligned load/store and rep movsb

Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination.
Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up by large vector register size.

Key features:
1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time.
7. Skip when address of destination == address of source.

[BZ #19776]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

(cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc

commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 19:22:59 2016 -0700

Initial Enhanced REP MOVSB/STOSB (ERMS) support

The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features.

* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise.

(cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc

commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:15:59 2016 -0700

Make __memcpy_avx512_no_vzeroupper an alias

Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so.

* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ...
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise.

(cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d

commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Mar 28 13:13:36 2016 -0700

Implement x86-64 multiarch mempcpy in memcpy

Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so.

[BZ #18858]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
(cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P.
Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name.
(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported. (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following: 1. __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 otherwise. [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
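The five-way selection order in the "x86-64: Fix memcpy IFUNC selection" commit above can be sketched in C. This is an illustrative sketch only: the `features` struct and the `select_memcpy` helper are hypothetical stand-ins for the real bit tests on `struct cpu_features` performed in the assembly IFUNC resolver.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the CPU feature bits involved;
   the real resolver reads them from struct cpu_features.  */
struct features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the memcpy implementation the updated
   selection order would pick, checking the bits in the order the
   commit message describes.  */
static const char *
select_memcpy (const struct features *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Note how the ordering makes Fast_Unaligned_Load take priority over the SSSE3 variants, which is exactly the change the commit made relative to the old Slow_BSF check.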
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 7962f7b04a6374b36d1df15c0c7c8f5747e2e85f (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7962f7b04a6374b36d1df15c0c7c8f5747e2e85f commit 7962f7b04a6374b36d1df15c0c7c8f5747e2e85f Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=40d52d834531b7a4315b68155ee3daec3cdceb46 commit 40d52d834531b7a4315b68155ee3daec3cdceb46 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. It also fixes the placement of __mempcpy_erms and __memmove_erms. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned.
Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a61bbdcc906231982398239ec38f193a7522af5b commit a61bbdcc906231982398239ec38f193a7522af5b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes.
A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination.
Since overhead for the overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up by larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID.
This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. 
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J.
Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J.
Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model to Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Chek Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. Existing selection order is updated with following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit it set. 5. 
__memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows an improvement of up to 28% over the AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. 
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to lower 2GB, not lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting environment variable, LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing Knights Landing processor is based on Silvermont. 
This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
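The family/model decoding that the AMD commit above extends can be sketched in C. This is an illustration of the CPUID leaf-1 EAX layout, not the actual get_common_indeces code; the function name is made up, and it covers only the family == 0x0f rule the commit describes (Intel additionally extends the model when the base family is 0x06, which this sketch omits):

```c
/* Sketch of the family/model decoding described above.  CPUID
   leaf 1 returns in EAX: stepping [3:0], model [7:4], family [11:8],
   extended model [19:16] and extended family [27:20].  When the base
   family is 0x0f, both Intel and AMD form the display family and
   model from the extended fields (see AMD publication 25481).  */
static void
decode_family_model (unsigned int eax, unsigned int *family,
                     unsigned int *model)
{
  *family = (eax >> 8) & 0x0f;
  *model = (eax >> 4) & 0x0f;
  if (*family == 0x0f)
    {
      *family += (eax >> 20) & 0xff;        /* Extended family.  */
      *model += ((eax >> 16) & 0x0f) << 4;  /* Extended model.  */
    }
}
```

For example, EAX = 0x00600F12 (an AMD family 15h part) decodes to display family 0x15, model 0x01.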
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 8073bd8f850c1b7b04921a4c921d26bbbe5fbcae (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8073bd8f850c1b7b04921a4c921d26bbbe5fbcae commit 8073bd8f850c1b7b04921a4c921d26bbbe5fbcae Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3608e9eb3d04e22a8341ac6c397e52f531330ac1 commit 3608e9eb3d04e22a8341ac6c397e52f531330ac1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83823b091e18fec9151752a0429a78a0d81d6317 commit 83823b091e18fec9151752a0429a78a0d81d6317 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. 
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e39197c4403a082b1c607210211cb643830b8d9c commit e39197c4403a082b1c607210211cb643830b8d9c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show a regression on Haswell machines. A non-temporal store in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, set to 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. 
* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold. -----------------------------------------------------------------------
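The non-temporal store heuristic described in the last commit above reduces to a simple threshold test. The names here are illustrative (the real value is __x86_shared_non_temporal_threshold, set in init_cacheinfo); this is a sketch of the decision, not the glibc code:

```c
#include <stddef.h>

/* Sketch of the heuristic from the commit above: the non-temporal
   threshold is 4 times the shared (last-level) cache size, and only
   copies larger than the threshold use non-temporal stores, since
   data that large cannot stay resident in the cache anyway.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return 4 * shared_cache_size;
}

static int
use_non_temporal_store (size_t n, size_t shared_cache_size)
{
  return n > non_temporal_threshold (shared_cache_size);
}
```

With an 8 MB shared cache, for example, only copies larger than 32 MB would take the non-temporal path.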
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.23 has been created at 9910c54c2e97b6c36f8593097e53d5e09f837a69 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9910c54c2e97b6c36f8593097e53d5e09f837a69 commit 9910c54c2e97b6c36f8593097e53d5e09f837a69 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3429d9dd330a5c140cb37e77e7c388a71fdb44f1 commit 3429d9dd330a5c140cb37e77e7c388a71fdb44f1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c36cac64f6855f1f4ff007beaca3cb766e694ec commit 7c36cac64f6855f1f4ff007beaca3cb766e694ec Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. 
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=69b122e1149e158c382c2b0bdd4591a4a19cb505 commit 69b122e1149e158c382c2b0bdd4591a4a19cb505 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show a regression on Haswell machines. A non-temporal store in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, set to 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. 
* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb commit 9a93bdbaff81edf67c5486c84f8098055e355abb Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483 commit 5118e532600549ad0f56cb9b1a179b8eab70c483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292 commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc commit a96379797a7eecc1b709cad7b68981eb698783dc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9 commit cfb059c79729b26284863334c9aa04f0a3b967b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. 
(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f commit 30c389be1af67c4d0716d207b6780c6169d1355f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. 
(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650 commit 980d639b4ae58209843f09a29d86b0a8303b6650 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. 
This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. 
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. 
(cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. 
Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String) We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. 
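The build-time check described in the _arch_/_cpu_ rename commit above works through token pasting. Here is a hedged sketch (index and bit values are illustrative, not glibc's): with distinct prefixes, writing HAS_CPU_FEATURE (Fast_Rep_String) would paste the nonexistent token index_cpu_Fast_Rep_String and fail to compile, which is exactly the class of error the rename catches.

```c
#include <assert.h>

/* Illustrative values only; glibc's real indices and bits differ.
   CPUID-derived bits get a cpu_ prefix, glibc-internal feature bits
   an arch_ prefix, so using a cpu bit with the arch accessor (or vice
   versa) produces an undefined identifier at compile time.  */
#define index_cpu_ERMS             0
#define bit_cpu_ERMS               (1u << 9)
#define index_arch_Fast_Rep_String 0
#define bit_arch_Fast_Rep_String   (1u << 0)

static unsigned int cpuid_words[1];   /* stand-in for the cpuid array    */
static unsigned int feature_words[1]; /* stand-in for the feature array  */

#define HAS_CPU_FEATURE(name) \
  ((cpuid_words[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(name) \
  ((feature_words[index_arch_##name] & bit_arch_##name) != 0)
```

Under this scheme the bcopy.S bug quoted above, HAS_CPU_FEATURE (Fast_Rep_String), is a compile error rather than a silent wrong-array lookup.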
(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported. (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. 
__memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3. [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
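The five-step memcpy selection order from commit e455d176 above can be written out as plain C for clarity. This is a sketch only: the real selector is the assembly IFUNC resolver in sysdeps/x86_64/multiarch/memcpy.S, and the boolean parameters here are illustrative stand-ins for glibc's internal feature bits, not its actual API.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the documented selection order, first match wins:
   1. AVX_Fast_Unaligned_Load -> __memcpy_avx_unaligned
   2. Fast_Unaligned_Load     -> __memcpy_sse2_unaligned
   3. no SSSE3                -> __memcpy_sse2
   4. Fast_Copy_Backward      -> __memcpy_ssse3_back
   5. otherwise               -> __memcpy_ssse3            */
static const char *
select_memcpy (int avx_fast_unaligned_load, int fast_unaligned_load,
               int has_ssse3, int fast_copy_backward)
{
  if (avx_fast_unaligned_load) return "__memcpy_avx_unaligned";
  if (fast_unaligned_load)     return "__memcpy_sse2_unaligned";
  if (!has_ssse3)              return "__memcpy_sse2";
  if (fast_copy_backward)      return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Note that the ordering matters: a CPU with both AVX_Fast_Unaligned_Load and Fast_Copy_Backward set still gets the AVX variant, because earlier checks take precedence.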
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64 commit f0a3ab52c05e0813348e0e5460aaf1dc5d1e7a64 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d5af940569c5c48835acdf6c8c47451e1e92c817 commit d5af940569c5c48835acdf6c8c47451e1e92c817 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Change function suffix from unaligned_2 to unaligned. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if the processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3ad9dce564d95ac817f86cad1bb4f0bc29c58f5f commit 3ad9dce564d95ac817f86cad1bb4f0bc29c58f5f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. 
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if the processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if the processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e4908bee33dc0aed48835c1884387b5e942963 commit e1e4908bee33dc0aed48835c1884387b5e942963 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show that there is a regression with large data on Haswell machines. Non-temporal stores in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For size below 8 times the vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. 
* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. 
* sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping stores to avoid a branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. 
[BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead for the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid a branch. 2. 
For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, 
__mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. 
(MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. 
[BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. 
Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model as Intel CPUs as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to mistakenly use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. 
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows an average improvement of more than 30% over AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows improvement up to 28% over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to lower 2GB, not lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting environment variable, LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. 
* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing Knights Landing processor is based on Silvermont. This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 7d3414159ba17db4224b675cf4086741210544b1 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7d3414159ba17db4224b675cf4086741210544b1 commit 7d3414159ba17db4224b675cf4086741210544b1 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=efd380007f75c4157a823ee14d658c0ced3ba4a8 commit efd380007f75c4157a823ee14d658c0ced3ba4a8 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8f1593ebaee38ddedcabee5fe3553abdb0f08bfd commit 8f1593ebaee38ddedcabee5fe3553abdb0f08bfd Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. 
This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. 
Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=545cd24ea6b85661abfa9ac1e49d56dd7cc19cc9 commit 545cd24ea6b85661abfa9ac1e49d56dd7cc19cc9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:49:27 2016 -0700 Use PREFETCH_ONE_SET_X https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af07dbdaa999d0172dd840f3dbe6963901c3496f commit af07dbdaa999d0172dd840f3dbe6963901c3496f Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show that there is a regression with large data on a Haswell machine. Non-temporal stores in memmove on large data can improve performance significantly. This patch adds a threshold to use non-temporal stores, which is 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. 
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH_SIZE): New. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 8624d88eb694d12da34edd5c6fd10d19fe7e3400 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8624d88eb694d12da34edd5c6fd10d19fe7e3400 commit 8624d88eb694d12da34edd5c6fd10d19fe7e3400 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=50c50aad0d7f9fedbc72953875fc1ef73bb2fa8e commit 50c50aad0d7f9fedbc72953875fc1ef73bb2fa8e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=967e0813a7740c8d6cc82247beaf69a5dee491a9 commit 967e0813a7740c8d6cc82247beaf69a5dee491a9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. 
This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. 
Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.23 has been created at c51eab61e17e7575265f1e36bd0293e224500f52 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c51eab61e17e7575265f1e36bd0293e224500f52 commit c51eab61e17e7575265f1e36bd0293e224500f52 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f9478e6530ab0ede00f705e456445aeff283560 commit 7f9478e6530ab0ede00f705e456445aeff283560 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68248ecc51b4725e794236c495effde76d4be61c commit 68248ecc51b4725e794236c495effde76d4be61c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. 
This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. 
Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=095d851c67b7ea5edb536ead965c73fce34b2edd commit 095d851c67b7ea5edb536ead965c73fce34b2edd Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show that there is a regression with large data on Haswell machines. Non-temporal store in memmove on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 4 times of shared cache size. When size is above the threshold, non temporal store will be used. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. 
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH_SIZE): New. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d commit 0932dd8b56db46dd421a4855fb5dee9de092538d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d commit da2da79262814ba4ead3ee487549949096d8ad2d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled fro now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. 
(cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb commit 9a93bdbaff81edf67c5486c84f8098055e355abb Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483 commit 5118e532600549ad0f56cb9b1a179b8eab70c483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292 commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc commit a96379797a7eecc1b709cad7b68981eb698783dc Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove because it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9 commit cfb059c79729b26284863334c9aa04f0a3b967b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f commit 30c389be1af67c4d0716d207b6780c6169d1355f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same codes when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. 
For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650 commit 980d639b4ae58209843f09a29d86b0a8303b6650 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same codes when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. 
Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, 
__mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... 
(__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. 
A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. 
Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of cpuid array on feature array, especially in assembly codes. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. 
(index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported. (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. 
(cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. Existing selection order is updated with following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 34f2cbf8ca6ee99f36229315fb03c27e3acd805d (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=34f2cbf8ca6ee99f36229315fb03c27e3acd805d commit 34f2cbf8ca6ee99f36229315fb03c27e3acd805d Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=912d6a93556739773b511766c2ca95fb293f5566 commit 912d6a93556739773b511766c2ca95fb293f5566 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7dfa91a07593740cf3ad71060300b1cc38ac2910 commit 7dfa91a07593740cf3ad71060300b1cc38ac2910 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. 
This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. 
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5b3e44eeb5ae74fb4a1c353db7e8a5ee18ccdb10 commit 5b3e44eeb5ae74fb4a1c353db7e8a5ee18ccdb10 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memmove on large data memcpy/memmove benchmarks with large data show a regression on a Haswell machine. Using non-temporal stores in memmove on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 4 times the shared cache size. When size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 4 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (PREFETCHNT): New. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (PREFETCHNT): Likewise. (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (PREFETCHNT): Likewise. 
(VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH_SIZE): New. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. 
(cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; doing so breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. 
Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. 
(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends, 4 times the vector register size at a time. 4. If address of destination > address of source, copy backward 8 times the vector register size at a time. 5. Otherwise, copy forward 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. 
This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. 
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model as Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. 
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. 
__memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows an average improvement of more than 30% over the AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows an improvement of up to 28% over the AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. 
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to the Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to the lower 2GB, not the lower 4GB, of the address space. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing The Knights Landing processor is based on Silvermont. 
This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 25d87576122689b22db9929271bdd7cb403aec1c (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=25d87576122689b22db9929271bdd7cb403aec1c commit 25d87576122689b22db9929271bdd7cb403aec1c Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a5d79693ef1c91aed0e12662ec84d5d9b597f283 commit a5d79693ef1c91aed0e12662ec84d5d9b597f283 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 commit 6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. 
This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. 
Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b80cc56d7d3a54c16d33d147438b67c715906675 commit b80cc56d7d3a54c16d33d147438b67c715906675 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on a Haswell machine. Using a non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, set to 6 times the shared cache size. When the size is above the threshold, non-temporal stores will be used. For sizes below 8 vector register widths, we load all the data into registers and store them together. Otherwise, only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. 
(PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses, and to use a non-temporal store if the size is above the threshold. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at e02f2640644024281bf02537606bbad201aa20d7 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e02f2640644024281bf02537606bbad201aa20d7 commit e02f2640644024281bf02537606bbad201aa20d7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=82069d63c103c6d9ff88718657ac8dd83715919f commit 82069d63c103c6d9ff88718657ac8dd83715919f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 commit 6e7b4bf19131e1fb304c5d5429ceb5d4ef17a6b9 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. 
(libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at c0da8faf69fa56c249ac5ec40836b76fe5ab0233 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c0da8faf69fa56c249ac5ec40836b76fe5ab0233 commit c0da8faf69fa56c249ac5ec40836b76fe5ab0233 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9871e1739bcb231314a4d2ee03c8c757aa139332 commit 9871e1739bcb231314a4d2ee03c8c757aa139332 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c888072ea1da909c3453e8873ab8ec6a6c7b7b2 commit 3c888072ea1da909c3453e8873ab8ec6a6c7b7b2 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. 
(libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.23 has been created at 2a1cca399be415d6c5a556af2018e5fb726d9a37 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2a1cca399be415d6c5a556af2018e5fb726d9a37 commit 2a1cca399be415d6c5a556af2018e5fb726d9a37 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b361e72f264a06e856d97cbbf1cedbf2f7dd73bf commit b361e72f264a06e856d97cbbf1cedbf2f7dd73bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c97c370612496379176be8e33c19dc4f80b7f01c commit c97c370612496379176be8e33c19dc4f80b7f01c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. 
This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. 
Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=121270b79236d7c5802e8d9af2d27952cb9efae9 commit 121270b79236d7c5802e8d9af2d27952cb9efae9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 6 times of shared cache size. When size is above the threshold, non temporal store will be used. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. 
(PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d commit 0932dd8b56db46dd421a4855fb5dee9de092538d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d commit da2da79262814ba4ead3ee487549949096d8ad2d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb commit 9a93bdbaff81edf67c5486c84f8098055e355abb Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483 commit 5118e532600549ad0f56cb9b1a179b8eab70c483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292 commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc commit a96379797a7eecc1b709cad7b68981eb698783dc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove, since it breaks __memmove_chk. Don't check source == destination first since it is less common. 
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9 commit cfb059c79729b26284863334c9aa04f0a3b967b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f commit 30c389be1af67c4d0716d207b6780c6169d1355f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same codes when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650 commit 980d639b4ae58209843f09a29d86b0a8303b6650 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same codes when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. 
If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. 
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. 
(__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. 
Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. 
Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of cpuid array on feature array, especially in assembly codes. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String) We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE)): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE)): Pass arch to HAS_FEATURE. 
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported. (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. 
Existing selection order is updated with following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 8b65cadefc53cc42e1970e0817336fe96a7aa396 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8b65cadefc53cc42e1970e0817336fe96a7aa396 commit 8b65cadefc53cc42e1970e0817336fe96a7aa396 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c4b1dec2c115ba19192fdb143f25cfc1ac76c94a commit c4b1dec2c115ba19192fdb143f25cfc1ac76c94a Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. 
* sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. 
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d9b78b7787695ab0fddbaabeb3ef07c730e94a4 commit 1d9b78b7787695ab0fddbaabeb3ef07c730e94a4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. 
This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. 
Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2e9ef7960d01bc9bb36f2f3e7c9c567f11e56da9 commit 2e9ef7960d01bc9bb36f2f3e7c9c567f11e56da9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 6 times of shared cache size. When size is above the threshold, non temporal store will be used. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. 
(PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; it breaks __memmove_chk. Don't check source == destination first since it is less common. 
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. 
(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. 
(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. 
This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. 
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model as Intel CPUs as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly codes. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String) We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. 
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. Existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. 
__memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows average improvement more than 30% over AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows improvement up to 28% over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. 
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to lower 2GB, not lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting environment variable, LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing Knights Landing processor is based on Silvermont. 
This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 3fe7f2277e4c557147c8cf93452c2f43a62bdffa (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3fe7f2277e4c557147c8cf93452c2f43a62bdffa commit 3fe7f2277e4c557147c8cf93452c2f43a62bdffa Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f81588aa14ead50fb38ce07f0e77e03c33d54f6e commit f81588aa14ead50fb38ce07f0e77e03c33d54f6e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2a1bbef9600bbbf32d20b5aaaa77ccaedce28f84 commit 2a1bbef9600bbbf32d20b5aaaa77ccaedce28f84 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. 
(libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 8c1be321be69b89a454fa0e5d112a4377eec3200 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8c1be321be69b89a454fa0e5d112a4377eec3200 commit 8c1be321be69b89a454fa0e5d112a4377eec3200 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9bd27fc5f25dd26dcd2402194e2c21fed74218ca commit 9bd27fc5f25dd26dcd2402194e2c21fed74218ca Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9052a15d8ebb371e7cc5f8547fedb7539f4a25fe commit 9052a15d8ebb371e7cc5f8547fedb7539f4a25fe Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. 
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 09104b0b6fc150112f5e282c096f739a2f49fb6e (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=09104b0b6fc150112f5e282c096f739a2f49fb6e commit 09104b0b6fc150112f5e282c096f739a2f49fb6e Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ec598c844faca4fc87e8c1ec067c94109ba58402 commit ec598c844faca4fc87e8c1ec067c94109ba58402 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ff9d413f34efc46e4160ee4a3b30ddc04fb37518 commit ff9d413f34efc46e4160ee4a3b30ddc04fb37518 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. 
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=76945cf3a33403b5dff551d48cb68a6729848740 commit 76945cf3a33403b5dff551d48cb68a6729848740 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on a Haswell machine. Using non-temporal stores in memcpy on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, set to 6 times the shared cache size. When size is above the threshold, a non-temporal store will be used, but avoid non-temporal store if there is overlap between destination and source since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. 
(PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal store if size is above the threshold and there is no overlap between destination and source. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; it breaks __memmove_chk. Don't check source == destination first since it is less common. 
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. 
(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. 
(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support Newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. 
This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. 
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model as Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String) We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. 
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. 
__memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows average improvement more than 30% over AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows improvement up to 28% over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. 
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to lower 2GB, not lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting environment variable, LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing Knights Landing processor is based on Silvermont. 
This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at 157c57198e893b4882d1feb98de2b0721ee408fc (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc commit 157c57198e893b4882d1feb98de2b0721ee408fc Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b commit f817b9d36215ab60d58cc744d22773b4961a2c9b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0 commit 122600f4b380b00ce0f682039fe59af4bd0edbc0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. 
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3 commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Non-temporal stores in memcpy on large data can improve performance significantly. This patch adds a threshold to use non-temporal stores, which is 6 times the shared cache size. When size is above the threshold, non-temporal stores will be used, but avoid non-temporal stores if there is overlap between destination and source since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times the shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. 
(PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal stores if size is above the threshold and there is no overlap between destination and source. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove since it breaks __memmove_chk. Don't check source == destination first since it is less common. 
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. 
(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. 
(cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. 
This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. 
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d commit a65b3d13e1754d568782e64a762c2c7fab45a55d Author: H.J.
Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5 commit f4b6d20366aac66070f1cf50552cf2951991a1e5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538 commit ca9c5edeea52dc18f42ebbe29b1af352f5555538 Author: H.J.
Lu <hjl.tools@gmail.com> Date: Mon Nov 30 08:53:37 2015 -0800 Update family and model detection for AMD CPUs AMD CPUs use a similar encoding scheme for extended family and model as Intel CPUs, as shown in: http://support.amd.com/TechDocs/25481.pdf This patch updates get_common_indeces to get family and model for both Intel and AMD CPUs when family == 0x0f. [BZ #19214] * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument to return extended model. Update family and model with extended family and model when family == 0x0f. (init_cpu_features): Updated. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460 commit c23cdbac4ea473effbef5c50b1217f95595b3460 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String) which should be HAS_ARCH_FEATURE (Fast_Rep_String) We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d commit 4a49c82956f5a42a2cce22c2e97360de1b32301d Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 3 14:51:40 2016 -0800 Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9 commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5.
__memcpy_ssse3 [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44 commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Jan 16 00:49:45 2016 +0300 Added memcpy/memmove family optimized with AVX512 for KNL hardware. Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows average improvement more than 30% over AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2 Author: Andrew Senkevich <andrew.senkevich@intel.com> Date: Sat Dec 19 02:47:28 2015 +0300 Added memset optimized with AVX512 for KNL hardware. It shows improvement up to 28% over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. 
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a commit d530cd5463701a59ed923d53a97d3b534fdfea8a Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Oct 21 14:44:23 2015 -0700 Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT According to Silvermont software optimization guide, for 64-bit applications, branch prediction performance can be negatively impacted when the target of a branch is more than 4GB away from the branch. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable pages with MAP_32BIT first. NB: MAP_32BIT will map to lower 2GB, not lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by setting environment variable, LD_PREFER_MAP_32BIT_EXEC. On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont. [BZ #19367] * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file. * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New. (index_Prefer_MAP_32BIT_EXEC): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf commit fe24aedc3530037d7bb614b84d309e6b816686bf Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Dec 15 11:46:54 2015 -0800 Enable Silvermont optimizations for Knights Landing Knights Landing processor is based on Silvermont. 
This patch enables Silvermont optimizations for Knights Landing. * sysdeps/x86/cpu-features.c (init_cpu_features): Enable Silvermont optimizations for Knights Landing. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at fe38127f6d289dd6eaa6425acb108b7b384ddc4b (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe38127f6d289dd6eaa6425acb108b7b384ddc4b commit fe38127f6d289dd6eaa6425acb108b7b384ddc4b Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2c5fc8567a694ae6115b25db787673fb8dc140a5 commit 2c5fc8567a694ae6115b25db787673fb8dc140a5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed37fe74cfe0d9f68a8023b7f73a5805f4a5a206 commit ed37fe74cfe0d9f68a8023b7f73a5805f4a5a206 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes.
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96b5fbcbc09df10b093221d6b55eaa5e7e8c044f commit 96b5fbcbc09df10b093221d6b55eaa5e7e8c044f Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on a Haswell machine. A non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 6 times the shared cache size. When size is above the threshold, a non-temporal store will be used, but non-temporal stores are avoided if there is overlap between destination and source, since the destination may be in cache when the source is loaded. For size below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise.
(PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold and there is no overlap between destination and source. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.23 has been created at 9e1ddc1180ca0619d12b620b227726233a48b9bc (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9e1ddc1180ca0619d12b620b227726233a48b9bc commit 9e1ddc1180ca0619d12b620b227726233a48b9bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3443d7810db1092ac70a0fde7b85732a2e00cdc3 commit 3443d7810db1092ac70a0fde7b85732a2e00cdc3 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d2a372d44dc05201242d0fd5551df9c3174806c commit 1d2a372d44dc05201242d0fd5551df9c3174806c Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes.
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. 
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fa066d5f5ff996990869bbbad08435f02d18bb3 commit 9fa066d5f5ff996990869bbbad08435f02d18bb3 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on a Haswell machine. A non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold for using non-temporal stores, which is 6 times the shared cache size. When size is above the threshold, a non-temporal store will be used, but non-temporal stores are avoided if there is overlap between destination and source, since the destination may be in cache when the source is loaded. For size below 8 vector register widths, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register widths of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register widths of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise.
(PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non-temporal store if size is above the threshold and there is no overlap between destination and source. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0932dd8b56db46dd421a4855fb5dee9de092538d commit 0932dd8b56db46dd421a4855fb5dee9de092538d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=da2da79262814ba4ead3ee487549949096d8ad2d commit da2da79262814ba4ead3ee487549949096d8ad2d Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9a93bdbaff81edf67c5486c84f8098055e355abb commit 9a93bdbaff81edf67c5486c84f8098055e355abb Author: H.J.
Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5118e532600549ad0f56cb9b1a179b8eab70c483 commit 5118e532600549ad0f56cb9b1a179b8eab70c483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06c6d4ae6ee7e5b83fd5868bef494def01f59292 commit 06c6d4ae6ee7e5b83fd5868bef494def01f59292 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a96379797a7eecc1b709cad7b68981eb698783dc commit a96379797a7eecc1b709cad7b68981eb698783dc Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; it breaks __memmove_chk. Don't check source == destination first since it is less common. 
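The non-temporal store policy described in the first commit above (9fa066d5) can be sketched in portable C. The function names here are illustrative stand-ins, not glibc's actual internals; glibc keeps the threshold in __x86_shared_non_temporal_threshold, set by init_cacheinfo:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for glibc's __x86_shared_non_temporal_threshold,
   which init_cacheinfo sets to 6 times the shared cache size.  */
static uint64_t
non_temporal_threshold (uint64_t shared_cache_size)
{
  return 6 * shared_cache_size;
}

/* Non-temporal stores are used only when the copy is above the threshold
   and source and destination do not overlap, since the destination may
   be in cache when the source is loaded.  */
static bool
use_non_temporal_store (const char *dst, const char *src, uint64_t n,
                        uint64_t threshold)
{
  bool overlap = dst < src + n && src < dst + n;
  return n > threshold && !overlap;
}
```

Below the threshold, or when the ranges overlap, the copy falls back to the forward/backward loops that move 4 vector registers at a time.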
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cfb059c79729b26284863334c9aa04f0a3b967b9 commit cfb059c79729b26284863334c9aa04f0a3b967b9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy, and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=30c389be1af67c4d0716d207b6780c6169d1355f commit 30c389be1af67c4d0716d207b6780c6169d1355f Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when the size is between 2 times the vector register size and REP_STOSB_THRESHOLD, which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times the vector register size, fully unroll the loop. 3. For size > 4 times the vector register size, store 4 times the vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980d639b4ae58209843f09a29d86b0a8303b6650 commit 980d639b4ae58209843f09a29d86b0a8303b6650 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times the vector register size, there is no check for address overlap between source and destination. Since the overhead of the overlap check is small when size > 8 times the vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when the size is between 2 times the vector register size and REP_MOVSB_THRESHOLD, which is 2KB for the 16-byte vector register size and scaled up for larger vector register sizes. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times the vector register size, load all sources into registers and store them together. 3. 
If there is no address overlap between source and destination, copy from both ends with 4 times the vector register size at a time. 4. If address of destination > address of source, backward copy 8 times the vector register size at a time. 5. Otherwise, forward copy 8 times the vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy of 8 times the vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. 
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc commit bf2bc5e5c9d7aa8af28b299ec26b8a37352730cc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS), which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7c244283ff12329b3bca9878b8edac3b3fe5c7bc commit 7c244283ff12329b3bca9878b8edac3b3fe5c7bc Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce the code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. 
(__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d commit a9a14991fb2d3e69f80d25e9bbf2f6b0bcf11c3d Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of the code. It reduces the code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4fc09dabecee1b7cafdbca26ee7c63f68e53c229 commit 4fc09dabecee1b7cafdbca26ee7c63f68e53c229 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. 
Set Fast_Copy_Backward for AMD Excavator processors. * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load. (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 commit 75f2d47e459a6bf5656a938e5c63f8b581eb3ee6 Author: Florian Weimer <fweimer@redhat.com> Date: Fri Mar 25 11:11:42 2016 +0100 tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860] * sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit. (cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa commit 96c7375cb8b6f1875d9865f2ae92ecacf5f5e6fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 08:36:16 2016 -0700 Don't set %rcx twice before "rep movsb" * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c273f613b0cc779ee33cc33d20941d271316e483 commit c273f613b0cc779ee33cc33d20941d271316e483 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 22 07:46:56 2016 -0700 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c. [BZ #19583] * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. 
Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL. * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P. (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c858d10a4e7fd682f2e7083836e4feacc2d580f4 commit c858d10a4e7fd682f2e7083836e4feacc2d580f4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 10 05:26:46 2016 -0800 Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h index_* and bit_* macros are used to access cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has HAS_CPU_FEATURE (Fast_Rep_String), which should be HAS_ARCH_FEATURE (Fast_Rep_String). We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time. [BZ #19762] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*. * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise. * sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. 
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name. (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 commit 7a90b56b0c3f8e55df44957cf6de7d3c9c04cbb9 Author: Roland McGrath <roland@hack.frob.com> Date: Tue Mar 8 12:31:13 2016 -0800 Fix tst-audit10 build when -mavx512f is not supported. (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ba80f6ceea3a6b6f711038646f419125fe3ad39c commit ba80f6ceea3a6b6f711038646f419125fe3ad39c Author: Florian Weimer <fweimer@redhat.com> Date: Mon Mar 7 16:00:25 2016 +0100 tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b8fe596e7f750d4ee2fca14d6a3999364c02662e commit b8fe596e7f750d4ee2fca14d6a3999364c02662e Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 16:48:11 2016 -0800 Group AVX512 functions in .text.avx512 section * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e455d17680cfaebb12692547422f95ba1ed30e29 commit e455d17680cfaebb12692547422f95ba1ed30e29 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 4 08:37:40 2016 -0800 x86-64: Fix memcpy IFUNC selection Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. 
The existing selection order is updated with the following selection order: 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set. 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set. 3. __memcpy_sse2 if SSSE3 isn't available. 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set. 5. __memcpy_ssse3. [BZ #18880] * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8) -----------------------------------------------------------------------
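The five-step selection order in the last commit above maps onto a simple chain of feature checks. A minimal C sketch, using a hypothetical feature-flag struct in place of glibc's real struct cpu_features bits:

```c
#include <stdbool.h>

/* Hypothetical flags standing in for glibc's CPU feature bits;
   the real selector reads them from struct cpu_features.  */
struct features
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Sketch of the memcpy IFUNC selection order listed above (BZ #18880).  */
static const char *
select_memcpy (const struct features *f)
{
  if (f->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (f->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!f->ssse3)
    return "__memcpy_sse2";
  if (f->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

The real resolver in sysdeps/x86_64/multiarch/memcpy.S does this with conditional branches at load time, returning the address of the chosen implementation.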
The current memcpy performance data: https://sourceware.org/bugzilla/attachment.cgi?id=9184 The current memmove performance data: https://sourceware.org/bugzilla/attachment.cgi?id=9185
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at bce4ef1859db80554f4f8b20e8597984f06b760e (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bce4ef1859db80554f4f8b20e8597984f06b760e commit bce4ef1859db80554f4f8b20e8597984f06b760e Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use the generic memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=46eecce30a070ca39d1b3c04dcdf68b378913b74 commit 46eecce30a070ca39d1b3c04dcdf68b378913b74 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6d234de7bccb082f25fea6641a0fbe414e13dfb4 commit 6d234de7bccb082f25fea6641a0fbe414e13dfb4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. 
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
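The "overlapping store to avoid branch" idea behind these memsets can be illustrated in portable C, with 8-byte scalar stores standing in for the 16/32/64-byte vector registers. The function name memset8to16 is made up for this sketch, not a glibc symbol:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* For 8 <= n <= 16, one 8-byte store at the start and one at the end
   cover every byte, with no size-dependent branch; the two stores may
   overlap in the middle.  The real code does the same with vector
   stores, so a single code path handles a whole range of sizes.  */
static void
memset8to16 (unsigned char *dst, int c, size_t n)
{
  uint64_t v;
  memset (&v, c, sizeof v);      /* splat the fill byte into 8 bytes */
  memcpy (dst, &v, 8);           /* first 8 bytes */
  memcpy (dst + n - 8, &v, 8);   /* last 8 bytes; may overlap the first */
}
```

Because unaligned stores are cheap on the targeted processors, the overlap costs nothing, while the branchless dispatch avoids a mispredicted size check on the hot path.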
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/cacheline/ifunc has been created at 52b083d35ef77d2c6164c66d2d87e87760b0b1e2 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=52b083d35ef77d2c6164c66d2d87e87760b0b1e2 commit 52b083d35ef77d2c6164c66d2d87e87760b0b1e2 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=864950e5e1ba3f4a9aae7380d481fc029c362a3b commit 864950e5e1ba3f4a9aae7380d481fc029c362a3b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d95a13a81078acecd67b827ca684fd65d3e3d266 commit d95a13a81078acecd67b827ca684fd65d3e3d266 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. 
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper.

[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ...
* sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
(__memset_zero_constant_len_parameter): Check SHARED instead of PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/cacheline/ifunc has been created at 38d75b36f9c32733c4c3987d67f05436a452ee24 (commit)
- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=38d75b36f9c32733c4c3987d67f05436a452ee24 commit 38d75b36f9c32733c4c3987d67f05436a452ee24
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700

X86-64: Add dummy memcopy.h and wordcopy.c

Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB.

* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a3fbfb073c7f6e448b51bfde70c78eebc04af096 commit a3fbfb073c7f6e448b51bfde70c78eebc04af096
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700

X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c9188718247d9f426d307c337cc86219d056c61 commit 0c9188718247d9f426d307c337cc86219d056c61
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700

X86-64: Remove the previous SSE2/AVX2 memsets
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=343b5e49525c4c936643418300ea16437256b1e0 commit 343b5e49525c4c936643418300ea16437256b1e0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 24 10:53:25 2016 -0700 Align to cacheline -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 68bb72a044d516058d00352d5600073d0d5d27e4 (commit)
- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68bb72a044d516058d00352d5600073d0d5d27e4 commit 68bb72a044d516058d00352d5600073d0d5d27e4
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700
X86-64: Add dummy memcopy.h and wordcopy.c

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=06da031549e4765f5481bf34b502529bef43e76c commit 06da031549e4765f5481bf34b502529bef43e76c
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700
X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a84e1c9bc4654a30b68c3612cfd395a4f85f1812 commit a84e1c9bc4654a30b68c3612cfd395a4f85f1812
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700
X86-64: Remove the previous SSE2/AVX2 memsets
-----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at ebb33e72c79c9068a81079001db9a24dc6550fa2 (commit)
- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ebb33e72c79c9068a81079001db9a24dc6550fa2 commit ebb33e72c79c9068a81079001db9a24dc6550fa2
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700
X86-64: Add dummy memcopy.h and wordcopy.c

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8590f10f1929588364725660082d01c3990037d9 commit 8590f10f1929588364725660082d01c3990037d9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700
X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bb7291b391e4969ad668a5dd349e246f660e402e commit bb7291b391e4969ad668a5dd349e246f660e402e
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700
X86-64: Remove the previous SSE2/AVX2 memsets
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=81a26fd73fb6b19ff2868e51d894012f192422fa commit 81a26fd73fb6b19ff2868e51d894012f192422fa Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun May 8 07:52:08 2016 -0700 Initialize __x86_shared_non_temporal_threshold only if zero Support setting processor-specific __x86_shared_non_temporal_threshold value in init_cpu_features. * sysdeps/x86/cacheinfo.c (__x86_shared_non_temporal_threshold): Initialize only if it is zero. -----------------------------------------------------------------------
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/ifunc has been created at 85702d388d909d38f06c510c4504561df94d99bd (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=85702d388d909d38f06c510c4504561df94d99bd commit 85702d388d909d38f06c510c4504561df94d99bd Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=550bdb5f6b24eb2422f0b0c3a4c81c4665f132ff commit 550bdb5f6b24eb2422f0b0c3a4c81c4665f132ff Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. 
[BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. 
Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1d96e81d73e27b2d243eab126433e24f4a4da2ef commit 1d96e81d73e27b2d243eab126433e24f4a4da2ef Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. 
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. -----------------------------------------------------------------------
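The log above describes the IFUNC selection order for the new memsets: AVX512, then AVX2, then SSE2, with the `_erms` (Enhanced REP STOSB) variant preferred inside each class when the processor has ERMS. A minimal C sketch of that decision tree follows; the `cpu_features` struct and its flag fields here are made-up stand-ins for illustration, since the real dispatch is written in assembly in sysdeps/x86_64/multiarch/memset.S and reads glibc's cpu-features bits.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical feature flags standing in for glibc's x86 cpu-features
   bits; not the real structure used by the loader.  */
struct cpu_features
{
  int has_avx2;
  int has_avx512;
  int has_erms;
};

/* Mirror the selection order described in the log: prefer AVX512, then
   AVX2, then SSE2; within each class pick the _erms variant when the
   processor supports Enhanced REP STOSB.  */
static const char *
select_memset (const struct cpu_features *f)
{
  if (f->has_avx512)
    return f->has_erms ? "__memset_avx512_unaligned_erms"
                       : "__memset_avx512_unaligned";
  if (f->has_avx2)
    return f->has_erms ? "__memset_avx2_unaligned_erms"
                       : "__memset_avx2_unaligned";
  return f->has_erms ? "__memset_sse2_unaligned_erms"
                     : "__memset_sse2_unaligned";
}
```

For example, a processor with neither AVX2 nor AVX512 but with ERMS falls through to `__memset_sse2_unaligned_erms`, which is exactly the "Use __memset_sse2_unaligned_erms ... if processor has ERMS" case in the ChangeLog.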
Created attachment 9320 [details] AMD Bulldozer perf data
Created attachment 9322 [details] HSW perf data
Created attachment 9323 [details] SKL perf data
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, hjl/erms/2.22 has been created at b60dda5f2385aaca873069f9fb28645b82a1b711 (commit) - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b60dda5f2385aaca873069f9fb28645b82a1b711 commit b60dda5f2385aaca873069f9fb28645b82a1b711 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 27 15:16:22 2016 -0700 Count number of logical processors sharing L2 cache For Intel processors, when there are both L2 and L3 caches, SMT level type should be used to count number of available logical processors sharing L2 cache. If there is only L2 cache, core level type should be used to count number of available logical processors sharing L2 cache. Number of available logical processors sharing L2 cache should be used for non-inclusive L2 and L3 caches. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of available logical processors with SMT level type sharing L2 cache for Intel processors. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed46697862f2b0c2db726cc4c772e6003914bd72 commit ed46697862f2b0c2db726cc4c772e6003914bd72 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 20 14:41:14 2016 -0700 Remove special L2 cache case for Knights Landing L2 cache is shared by 2 cores on Knights Landing, which has 4 threads per core: https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing So L2 cache is shared by 8 threads on Knights Landing as reported by CPUID. We should remove the special L2 cache case for Knights Landing. [BZ #18185] * sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads sharing L2 cache to 2 for Knights Landing. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=07f943915311f6f92e5a031911d32c5e7458bfd5 commit 07f943915311f6f92e5a031911d32c5e7458bfd5 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Thu May 19 10:02:36 2016 -0700 Correct Intel processor level type mask from CPUID Intel CPUID with EAX == 11 returns: ECX Bits 07 - 00: Level number. Same value in ECX input. Bits 15 - 08: Level type. ^^^^^^^^^^^^^^^^^^^^^^^^ This is level type. Bits 31 - 16: Reserved. Intel processor level type mask should be 0xff00, not 0xff0. [BZ #20119] * sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel processor level type mask for CPUID with EAX == 11. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=201aebf739482fbb730d10eb7cf8335629bb4de4 commit 201aebf739482fbb730d10eb7cf8335629bb4de4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu May 19 09:09:00 2016 -0700 Check the HTT bit before counting logical threads Skip counting logical threads for Intel processors if the HTT bit is 0, which indicates there is only a single logical processor. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting logical threads if the HTT bit is 0. * sysdeps/x86/cpu-features.h (bit_cpu_HTT): New. (index_cpu_HTT): Likewise. (reg_HTT): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=dff8bcdab5968ac53e52ef06cabe8d921b429d22 commit dff8bcdab5968ac53e52ef06cabe8d921b429d22 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu May 19 08:49:45 2016 -0700 Remove alignments on jump targets in memset X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which increases code size but does not necessarily improve performance. Memset benchtest data of align vs no align on various Intel and AMD processors https://sourceware.org/bugzilla/attachment.cgi?id=9277 shows that aligning jump targets isn't necessary. [BZ #20115] * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset): Remove alignments on jump targets. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aba9d000bf8441d77f0557af360e3aea7525d03e commit aba9d000bf8441d77f0557af360e3aea7525d03e Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Fri May 13 08:29:22 2016 -0700 Call init_cpu_features only if SHARED is defined In static executable, since init_cpu_features is called early from __libc_start_main, there is no need to call it again in dl_platform_init. [BZ #20072] * sysdeps/i386/dl-machine.h (dl_platform_init): Call init_cpu_features only if SHARED is defined. * sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6118b2d23016ec790b99b9331c3d7a45d588134e commit 6118b2d23016ec790b99b9331c3d7a45d588134e Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 13 07:18:25 2016 -0700 Support non-inclusive caches on Intel processors * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and support non-inclusive caches on Intel processors. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8642c9a553d8ce8a3a0496ed11fed5a575d338c5 commit 8642c9a553d8ce8a3a0496ed11fed5a575d338c5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed May 11 05:49:09 2016 -0700 Remove x86 ifunc-defines.sym and rtld-global-offsets.sym Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on i686 and x86-64. * sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers): Remove ifunc-defines.sym. * sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers): Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed. * sysdeps/x86/rtld-global-offsets.sym: Likewise. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise. * sysdeps/x86/Makefile (gen-as-const-headers): Remove rtld-global-offsets.sym. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ... * sysdeps/x86/cpu-features-offsets.sym: This. * sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h> instead of <ifunc-defines.h> and <rtld-global-offsets.h>. 
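The BZ #20119 fix in the log above (correcting the level-type mask from 0xff0 to 0xff00) amounts to extracting bits 15:8 of ECX from CPUID leaf 11. A small C sketch of that extraction follows; the macro names are illustrative, not glibc's, and the ECX values in the usage note are made-up examples rather than real CPUID output.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 11 (EAX == 11) reports the topology level type in ECX
   bits 15:8, so the mask must be 0xff00; the buggy 0xff0 straddled the
   level-number and level-type fields.  Macro names are illustrative.  */
#define CPUID_LEVEL_TYPE_MASK  0xff00u
#define CPUID_LEVEL_TYPE_SHIFT 8

/* Level-type values defined for CPUID leaf 11: 1 = SMT, 2 = Core.  */
#define CPUID_LEVEL_TYPE_SMT   1u
#define CPUID_LEVEL_TYPE_CORE  2u

/* Extract the level type from a leaf-11 ECX value.  */
static unsigned int
level_type (uint32_t ecx)
{
  return (ecx & CPUID_LEVEL_TYPE_MASK) >> CPUID_LEVEL_TYPE_SHIFT;
}
```

For instance, an ECX value of 0x0201 (level number 1, level type 2) decodes to the Core level, the case init_cacheinfo uses when only an L2 cache is present.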
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3038902f233a5e0028a6424685b410f6c201040f commit 3038902f233a5e0028a6424685b410f6c201040f Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun May 8 08:49:02 2016 -0700 Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86 Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86. No code changes on x86 and x86_64. * sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c> instead of <sysdeps/x86_64/cacheinfo.c>. * sysdeps/x86_64/cacheinfo.c: Moved to ... * sysdeps/x86/cacheinfo.c: Here. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=df2b390bba18903d62c8910e808bfb0dce7f033c commit df2b390bba18903d62c8910e808bfb0dce7f033c Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 15 05:22:53 2016 -0700 Detect Intel Goldmont and Airmont processors Updated from the model numbers of Goldmont and Airmont processors in Intel64 And IA-32 Processor Architectures Software Developer's Manual Volume 3 Revision 058. * sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel Goldmont and Airmont processors. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc commit 157c57198e893b4882d1feb98de2b0721ee408fc Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 14:01:24 2016 -0700 X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b commit f817b9d36215ab60d58cc744d22773b4961a2c9b Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 12:46:57 2016 -0700 X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. 
No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make the new ones the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. 
Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. 
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0 commit 122600f4b380b00ce0f682039fe59af4bd0edbc0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:42:30 2016 -0700 X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. 
Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3 commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 17:21:45 2016 -0700 X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machines. Non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non-temporal store, which is 6 times of shared cache size. When size is above the threshold, non-temporal store will be used, but non-temporal store is avoided if there is overlap between destination and source, since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For the forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For the backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. 
[BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold and there is no overlap between destination and source. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067 commit 54667f64fa4074325ee33e487c033c313ce95067 Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Apr 6 10:19:16 2016 -0700 X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so. (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284 commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284 Author: H.J. 
Lu <hjl.tools@gmail.com> Date: Wed Apr 6 09:10:18 2016 -0700 X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols. (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:21:07 2016 -0700 Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Apr 5 05:19:05 2016 -0700 Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 14:32:20 2016 -0700 Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. 
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13 commit e1e57539f5dbdefc96a85021b611863eaa28dd13 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Apr 3 12:38:25 2016 -0700 Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove; it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first. (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231 commit a13ac6b5ced68aadb7c1546102445f9c57f43231 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sun Mar 6 08:23:24 2016 -0800 Use HAS_ARCH_FEATURE with Fast_Rep_String HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. 
(cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417 commit 4ad4d58ed7a444e2d9787113fce132a99b35b417 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 1 15:08:48 2016 -0700 Remove Fast_Copy_Backward from Intel Core processors Intel Core i3, i5 and i7 processors have fast unaligned copy and copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion. * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors. (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7 commit a304f3933c7f8347e49057a7a315cbd571662ff7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:05:51 2016 -0700 Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep stosb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same code when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. 
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Mar 31 10:04:26 2016 -0700 Add x86-64 memmove with unaligned load/store and rep movsb Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. 
If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. 
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise. (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9 commit e1203f48239fbb9832db6ed3a0d2a008e317aff9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 19:22:59 2016 -0700 Initial Enhanced REP MOVSB/STOSB (ERMS) support The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:15:59 2016 -0700 Make __memcpy_avx512_no_vzeroupper an alias Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. 
(__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40 commit 9fbaf0f27a11deb98df79d04adee97aebee78d40 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 13:13:36 2016 -0700 Implement x86-64 multiarch mempcpy in memcpy Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb) https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358 commit 5239cb481eea27650173b9b9af22439afdcbf358 Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Mar 28 04:39:48 2016 -0700 [x86] Add a feature bit: Fast_Unaligned_Copy On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3 while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load. [BZ #19583] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. 
    Set Fast_Copy_Backward for AMD Excavator processors.
    * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New.
    (index_arch_Fast_Unaligned_Copy): Likewise.
    * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
    Fast_Unaligned_Copy instead of Fast_Unaligned_Load.

    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"

    * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
    Don't set %rcx twice before "rep movsb".

    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors

    Since only Intel processors with AVX2 have fast unaligned loads, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel
    processors.  Move AVX, AVX2, AVX512, FMA and FMA4 detection into
    get_common_indeces and call get_common_indeces for other processors.
    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.

    [BZ #19583]
    * sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline.
    Check family before setting family, model and extended_model.  Set
    AVX, AVX2, AVX512, FMA and FMA4 usable bits here.
    (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE
    with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P.  Set
    index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable
    AVX2.  Call get_common_indeces for other processors with family ==
    NULL.
    * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
    (CPU_FEATURES_ARCH_P): Likewise.
    (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
    (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.

    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs

    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs, as shown in:

    http://support.amd.com/TechDocs/25481.pdf

    This patch updates get_common_indeces to get the family and model for
    both Intel and AMD CPUs when family == 0x0f.

    [BZ #19214]
    * sysdeps/x86/cpu-features.c (get_common_indeces): Add an argument
    to return the extended model.  Update family and model with extended
    family and model when family == 0x0f.
    (init_cpu_features): Updated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h

    The index_* and bit_* macros are used to access the cpuid and feature
    arrays of struct cpu_features.  It is very easy to use bits and
    indices of the cpuid array on the feature array, especially in
    assembly code.  For example, sysdeps/i386/i686/multiarch/bcopy.S has

    HAS_CPU_FEATURE (Fast_Rep_String)

    which should be

    HAS_ARCH_FEATURE (Fast_Rep_String)

    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.

    [BZ #19762]
    * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
    * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
    * sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
    (bit_arch_*): This for feature array.
    (bit_*): Renamed to ...
    (bit_cpu_*): This for cpu array.
    (index_*): Renamed to ...
    (index_arch_*): This for feature array.
    (index_*): Renamed to ...
    (index_cpu_*): This for cpu array.
    [__ASSEMBLER__] (HAS_FEATURE): Add and use field.
    [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
    [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
    [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
    bit_##name with index_cpu_##name and bit_cpu_##name.
    [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
    bit_##name with index_arch_##name and bit_arch_##name.

    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS

    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS
    without overriding other bits.

    [BZ #19758]
    * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
    (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section

    * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
    Replace .text with .text.avx512.
    * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
    Likewise.

    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection

    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is updated to the following selection order: 1.
    __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3 otherwise.

    [BZ #18880]
    * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
    instead of Slow_BSF, and also check for Fast_Copy_Backward to
    enable __memcpy_ssse3_back.

    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.

    Added AVX512 implementations of memcpy, mempcpy, memmove,
    memcpy_chk, mempcpy_chk and memmove_chk.  They show an average
    improvement of more than 30% over the AVX versions on KNL hardware
    (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).

    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new
    files.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.c: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.
    It shows an improvement of up to 28% over AVX2 memset (performance
    results attached at
    <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).

    * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new
    file.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
    * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
    index_Prefer_No_VZEROUPPER): New.
    * sysdeps/x86/cpu-features.c (init_cpu_features): Set
    Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT

    According to the Silvermont software optimization guide, for 64-bit
    applications branch prediction performance can be negatively
    impacted when the target of a branch is more than 4GB away from the
    branch.  Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to
    map executable pages with MAP_32BIT first.  NB: MAP_32BIT maps into
    the lower 2GB, not the lower 4GB, of the address space.

    Prefer_MAP_32BIT_EXEC reduces the bits available for address space
    layout randomization (ASLR); it is therefore always disabled for
    SUID programs and can only be enabled by setting the environment
    variable LD_PREFER_MAP_32BIT_EXEC.

    On Fedora 23, this patch speeds up the GCC 5 testsuite by 3% on
    Silvermont.

    [BZ #19367]
    * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
    * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
    * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
    * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
    (index_Prefer_MAP_32BIT_EXEC): Likewise.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing

    The Knights Landing processor is based on Silvermont.  This patch
    enables the Silvermont optimizations for Knights Landing.

    * sysdeps/x86/cpu-features.c (init_cpu_features): Enable
    Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------
Created attachment 9329 [details]
Performance data with graphics
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  c867597bff2562180a18da4b8dba89d24e8b65c4 (commit)
      from  5e8c5bb1ac83aa2577d64d82467a653fa413f7ce (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c867597bff2562180a18da4b8dba89d24e8b65c4

commit c867597bff2562180a18da4b8dba89d24e8b65c4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jun 8 13:57:50 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous
    ones, we can remove the previous SSE2/AVX2 memcpy/memmove and
    replace them with the new ones.

    There is no change in IFUNC selection if the SSE2 and AVX2
    memcpy/memmove weren't used before.  If the SSE2 or AVX2
    memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove
    optimized with Enhanced REP MOVSB will be used for processors with
    ERMS.  The new AVX512 memcpy/memmove will be used for processors
    with AVX512 which prefer vzeroupper.

    Since the new SSE2 memcpy/memmove are also faster than the previous
    default memcpy/memmove used in libc.a and ld.so, we remove the
    previous default memcpy/memmove as well and make the new ones the
    default memcpy/memmove, except that the non-temporal store isn't
    used in ld.so.

    Together, this reduces the size of libc.so by about 6 KB and the
    size of ld.so by about 2 KB.

    [BZ #19776]
    * sysdeps/x86_64/memcpy.S: Make it dummy.
    * sysdeps/x86_64/mempcpy.S: Likewise.
    * sysdeps/x86_64/memmove.S: New file.
    * sysdeps/x86_64/memmove_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
    * sysdeps/x86_64/memmove.c: Removed.
    * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
    * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.c: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
    memcpy-sse2-unaligned, memmove-avx-unaligned,
    memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c
    (__libc_ifunc_impl_list): Replace
    __memmove_chk_avx512_unaligned_2 with
    __memmove_chk_avx512_unaligned.  Remove
    __memmove_chk_avx_unaligned_2.  Replace
    __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned.
    Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2.  Replace
    __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned.
    Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
    Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
    with __memcpy_chk_avx512_unaligned.  Remove
    __memcpy_chk_avx_unaligned_2.  Replace
    __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
    Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
    Replace __memcpy_avx512_unaligned_2 with
    __memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2 and
    __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2 with
    __mempcpy_chk_avx512_unaligned.  Remove
    __mempcpy_chk_avx_unaligned_2.  Replace
    __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned.
    Remove __mempcpy_chk_sse2.  Replace __mempcpy_avx512_unaligned_2
    with __mempcpy_avx512_unaligned.  Remove
    __mempcpy_avx_unaligned_2.  Replace __mempcpy_sse2_unaligned_2
    with __mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
    * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
    __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
    Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
    if the processor has ERMS.
    Default to __memcpy_sse2_unaligned.
    (ENTRY): Removed.
    (END): Likewise.
    (ENTRY_CHK): Likewise.
    (libc_hidden_builtin_def): Likewise.
    Don't include ../memcpy.S.
    * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
    __memcpy_chk_avx512_unaligned_erms and
    __memcpy_chk_avx512_unaligned.  Use
    __memcpy_chk_avx_unaligned_erms and
    __memcpy_chk_sse2_unaligned_erms if the processor has ERMS.
    Default to __memcpy_chk_sse2_unaligned.
    * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change
    the function suffix from unaligned_2 to unaligned.
    * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
    __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
    Use __mempcpy_avx_unaligned_erms and
    __mempcpy_sse2_unaligned_erms if the processor has ERMS.
    Default to __mempcpy_sse2_unaligned.
    (ENTRY): Removed.
    (END): Likewise.
    (ENTRY_CHK): Likewise.
    (libc_hidden_builtin_def): Likewise.
    Don't include ../mempcpy.S.
    (mempcpy): New.  Add a weak alias.
    * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
    __mempcpy_chk_avx512_unaligned_erms and
    __mempcpy_chk_avx512_unaligned.  Use
    __mempcpy_chk_avx_unaligned_erms and
    __mempcpy_chk_sse2_unaligned_erms if the processor has ERMS.
    Default to __mempcpy_chk_sse2_unaligned.
-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                          |  80 +++
 sysdeps/x86_64/memcpy.S                            | 585 +-------------------
 sysdeps/x86_64/memmove.S                           |  71 +++
 sysdeps/x86_64/memmove.c                           |  26 -
 sysdeps/x86_64/memmove_chk.S                       |  33 ++
 sysdeps/x86_64/mempcpy.S                           |   9 +-
 sysdeps/x86_64/multiarch/Makefile                  |   6 +-
 sysdeps/x86_64/multiarch/ifunc-impl-list.c         |  63 +--
 sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S    | 391 -------------
 sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S   | 175 ------
 sysdeps/x86_64/multiarch/memcpy.S                  |  62 +--
 sysdeps/x86_64/multiarch/memcpy_chk.S              |  40 +-
 sysdeps/x86_64/multiarch/memmove-avx-unaligned.S   |  22 -
 .../x86_64/multiarch/memmove-sse2-unaligned-erms.S |  13 -
 .../x86_64/multiarch/memmove-vec-unaligned-erms.S  |  24 +-
 sysdeps/x86_64/multiarch/memmove.S                 |  98 ++++
 sysdeps/x86_64/multiarch/memmove.c                 |  73 ---
 sysdeps/x86_64/multiarch/memmove_chk.S             |  71 +++
 sysdeps/x86_64/multiarch/memmove_chk.c             |  46 --
 sysdeps/x86_64/multiarch/mempcpy.S                 |  74 +-
 sysdeps/x86_64/multiarch/mempcpy_chk.S             |  38 +-
 21 files changed, 492 insertions(+), 1508 deletions(-)
 create mode 100644 sysdeps/x86_64/memmove.S
 delete mode 100644 sysdeps/x86_64/memmove.c
 create mode 100644 sysdeps/x86_64/memmove_chk.S
 delete mode 100644 sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
 delete mode 100644 sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
 delete mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned.S
 delete mode 100644 sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
 create mode 100644 sysdeps/x86_64/multiarch/memmove.S
 delete mode 100644 sysdeps/x86_64/multiarch/memmove.c
 create mode 100644 sysdeps/x86_64/multiarch/memmove_chk.S
 delete mode 100644 sysdeps/x86_64/multiarch/memmove_chk.c
Fixed by c867597bff2562180a18da4b8dba89d24e8b65c4.