This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug string/19881] Improve x86-64 memset
- From: "cvs-commit at gcc dot gnu.org" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Wed, 06 Apr 2016 18:00:45 +0000
- Subject: [Bug string/19881] Improve x86-64 memset
- Auto-submitted: auto-generated
- References: <bug-19881-131 at http dot sourceware dot org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=19881
--- Comment #19 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".
The branch, hjl/erms/ifunc has been created
at 7d3414159ba17db4224b675cf4086741210544b1 (commit)
- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7d3414159ba17db4224b675cf4086741210544b1
commit 7d3414159ba17db4224b675cf4086741210544b1
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Apr 1 14:01:24 2016 -0700
X86-64: Add dummy memcopy.h and wordcopy.c
Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
wordcopy.c to reduce code size. It reduces the size of libc.so by about
1 KB.
* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=efd380007f75c4157a823ee14d658c0ced3ba4a8
commit efd380007f75c4157a823ee14d658c0ced3ba4a8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 12:46:57 2016 -0700
X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
the new ones.
No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
memcpy/memmove optimized with Enhanced REP MOVSB will be used for
processors with ERMS. The new AVX512 memcpy/memmove will be used for
processors with AVX512 which prefer vzeroupper.
Since the new SSE2 memcpy/memmove are faster than the previous default
memcpy/memmove used in libc.a and ld.so, we also remove the previous
default memcpy/memmove and make them the default memcpy/memmove.
Together, it reduces the size of libc.so by about 6 KB and the size of
ld.so by about 2 KB.
[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
memcpy-sse2-unaligned, memmove-avx-unaligned,
memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Replace
__memmove_chk_avx512_unaligned_2 with
__memmove_chk_avx512_unaligned. Remove
__memmove_chk_avx_unaligned_2. Replace
__memmove_chk_sse2_unaligned_2 with
__memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and
__memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2
with __memmove_avx512_unaligned. Replace
__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2
with __memcpy_chk_avx512_unaligned. Remove
__memcpy_chk_avx_unaligned_2. Replace
__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2.
Replace __memcpy_avx512_unaligned_2 with
__memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2
and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2
with __mempcpy_chk_avx512_unaligned. Remove
__mempcpy_chk_avx_unaligned_2. Replace
__mempcpy_chk_sse2_unaligned_2 with
__mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2.
Replace __mempcpy_avx512_unaligned_2 with
__mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2.
Replace __mempcpy_sse2_unaligned_2 with
__mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
if processor has ERMS. Default to __memcpy_sse2_unaligned.
(ENTRY): Removed.
(END): Likewise.
(ENTRY_CHK): Likewise.
(libc_hidden_builtin_def): Likewise.
Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
__memcpy_chk_avx512_unaligned_erms and
__memcpy_chk_avx512_unaligned. Use
__memcpy_chk_avx_unaligned_erms and
__memcpy_chk_sse2_unaligned_erms if if processor has ERMS.
Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
Change function suffix from unaligned_2 to unaligned.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
if processor has ERMS. Default to __mempcpy_sse2_unaligned.
(ENTRY): Removed.
(END): Likewise.
(ENTRY_CHK): Likewise.
(libc_hidden_builtin_def): Likewise.
Don't include ../mempcpy.S.
(mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
__mempcpy_chk_avx512_unaligned_erms and
__mempcpy_chk_avx512_unaligned. Use
__mempcpy_chk_avx_unaligned_erms and
__mempcpy_chk_sse2_unaligned_erms if if processor has ERMS.
Default to __mempcpy_chk_sse2_unaligned.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8f1593ebaee38ddedcabee5fe3553abdb0f08bfd
commit 8f1593ebaee38ddedcabee5fe3553abdb0f08bfd
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Mar 31 10:42:30 2016 -0700
X86-64: Remove the previous SSE2/AVX2 memsets
Since the new SSE2/AVX2 memsets are faster than the previous ones, we
can remove the previous SSE2/AVX2 memsets and replace them with the
new ones. This reduces the size of libc.so by about 900 bytes.
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
optimized with Enhanced REP STOSB will be used for processors with
ERMS. The new AVX512 memset will be used for processors with AVX512
which prefer vzeroupper.
[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
into ...
* sysdeps/x86_64/memset.S: This.
(__bzero): Removed.
(__memset_tail): Likewise.
(__memset_chk): Likewise.
(memset): Likewise.
(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
defined.
(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
(__memset_zero_constant_len_parameter): Check SHARED instead of
PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
if not in libc.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
(__bzero): Enabled.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace
__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms
or __memset_avx2_unaligned_erms if processor has ERMS. Support
__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
(memset): Removed.
(__memset_chk): Likewise.
(MEMSET_SYMBOL): New.
(libc_hidden_builtin_def): Replace __memset_sse2 with
__memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
__memset_chk_sse2 and __memset_chk_avx2 with
__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
Use __memset_chk_sse2_unaligned_erms or
__memset_chk_avx2_unaligned_erms if processor has ERMS. Support
__memset_chk_avx512_unaligned_erms and
__memset_chk_avx512_unaligned.
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=545cd24ea6b85661abfa9ac1e49d56dd7cc19cc9
commit 545cd24ea6b85661abfa9ac1e49d56dd7cc19cc9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Apr 6 10:49:27 2016 -0700
Use PREFETCH_ONE_SET_X
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af07dbdaa999d0172dd840f3dbe6963901c3496f
commit af07dbdaa999d0172dd840f3dbe6963901c3496f
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sun Apr 3 17:21:45 2016 -0700
X86-64: Use non-temporal store in memmove on large data
memcpy/memmove benchmarks with large data shows that there is a
regression with large data on Haswell machine. non-temporal store
in memmove on large data can improve performance significantly. This
patch adds a threshold to use non temporal store which is 4 times of
shared cache size. When size is above the threshold, non temporal
store will be used.
For size below 8 vector register width, we load all data into registers
and store them together. Only forward and backward loops, which move 4
vector registers at a time, are used to support overlapping addresses.
For forward loop, we load the last 4 vector register width of data and
the first vector register width of data into vector registers before the
loop and store them after the loop. For backward loop, we load the first
4 vector register width of data and the last vector register width of
data into vector registers before the loop and store them after the loop.
* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
New.
(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
4 times of shared cache size.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
(PREFETCHNT): New.
(VMOVNT): Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
(PREFETCHNT): Likewise.
(VMOVNT): Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
(PREFETCHNT): Likewise.
(VMOVNT): Likewise.
(VMOVU): Changed to movups for smaller code sizes.
(VMOVA): Changed to movaps for smaller code sizes.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
comments.
(PREFETCH_SIZE): New.
(PREFETCHED_LOAD_SIZE): Likewise.
(PREFETCH_ONE_SET): Likewise.
Rewrite to use forward and backward loops, which move 4 vector
registers at a time, to support overlapping addresses and use
non temporal store if size is above the threshold.
-----------------------------------------------------------------------
--
You are receiving this mail because:
You are on the CC list for the bug.