Bug 17801 - memcpy is slower on amd64 than on i686 with a Sandy Bridge CPU
Summary: memcpy is slower on amd64 than on i686 with a Sandy Bridge CPU
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: libc
Version: 2.20
Importance: P2 minor
Target Milestone: 2.21
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-05 22:23 UTC by bugs
Modified: 2016-03-02 19:24 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
Flags: fweimer: security-


Attachments

Description bugs 2015-01-05 22:23:18 UTC
On amd64, memcpy actually dispatches to __memcpy_avx_unaligned, while on i686 it dispatches to __memcpy_ssse3_rep; on a Sandy Bridge CPU the AVX variant is slower than the SSSE3 one, despite being newer.

I tested by disabling the AVX implementation and got nearly the same speed on amd64 as with the i686 version of the same program.

Other functions (like memmove or strncmp) may also be affected, but I haven’t checked them.

Other CPUs may be affected as well: I’ve heard that Ivy Bridge also has a slower AVX implementation, and maybe some AMD ones too.
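
For reference, here is a minimal benchmark sketch of the kind of comparison described above (this is not the reporter's actual test program). Building the same source with -m64 and with -m32 exercises the amd64 and i686 memcpy variants respectively:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (void)
{
  const size_t size = 1 << 20;        /* 1 MiB per copy */
  const int iterations = 10000;
  int i;
  char *src = malloc (size);
  char *dst = malloc (size);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 'x', size);

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (i = 0; i < iterations; i++)
    memcpy (dst, src, size);
  clock_gettime (CLOCK_MONOTONIC, &end);

  double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
  /* Read back a byte so the copies cannot be optimized away.  */
  printf ("%.2f GiB/s (last byte: %c)\n",
          (double) size * iterations / seconds / (1 << 30),
          dst[size - 1]);
  free (src);
  free (dst);
  return 0;
}

Compile with, for example, "gcc -O2 -m64 bench.c" and "gcc -O2 -m32 bench.c" (older glibc may also need -lrt for clock_gettime) and compare the reported throughput.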
Comment 1 Ondrej Bilka 2015-01-06 13:12:19 UTC
On Mon, Jan 05, 2015 at 10:23:18PM +0000, bugs at linkmauve dot fr wrote:
> On amd64 memcpy is actually calling __memcpy_avx_unaligned, and on i686 it’s
> calling __memcpy_ssse3_rep, and with a Sandy Bridge CPU, AVX is slower than
> SSSE3, despite being newer.
> 
> I tested by disabling the AVX implementation and got nearly the same speed on
> amd64 than with the i686 version of the same program.
> 
> Other functions (like memmove or strncmp) may also be affected, but I haven’t
> checked them.
> 
> Other CPUs as well, I’ve heard that Ivy Bridge also has a slower AVX
> implementation, maybe some AMD ones too.
> 
No, that is a typo; that implementation was aimed at AVX2 only, especially
Haswell, where it is fast.
Comment 2 Sourceware Commits 2015-01-30 14:57:54 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/pr17711 has been updated
       via  56d25c11b64a97255a115901d136d753c86de24e (commit)
      from  a29c4064115e59bcf8c001c0b3dedfa8d49d3653 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=56d25c11b64a97255a115901d136d753c86de24e

commit 56d25c11b64a97255a115901d136d753c86de24e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 06:50:20 2015 -0800

    Use AVX unaligned memcpy only if AVX2 is available
    
    memcpy with unaligned 256-bit AVX register loads/stores is slow on older
    processors like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
    and sets it only when AVX2 is available.
    
    	[BZ #17801]
    	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
    	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
    	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
    	New.
    	(index_AVX_Fast_Unaligned_Load): Likewise.
    	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
    	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
    	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
    	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                              |   18 ++++++++++++++++++
 sysdeps/x86_64/multiarch/init-arch.c   |    9 +++++++--
 sysdeps/x86_64/multiarch/init-arch.h   |    4 ++++
 sysdeps/x86_64/multiarch/memcpy.S      |    2 +-
 sysdeps/x86_64/multiarch/memcpy_chk.S  |    2 +-
 sysdeps/x86_64/multiarch/memmove.c     |    2 +-
 sysdeps/x86_64/multiarch/memmove_chk.c |    2 +-
 sysdeps/x86_64/multiarch/mempcpy.S     |    2 +-
 sysdeps/x86_64/multiarch/mempcpy_chk.S |    2 +-
 9 files changed, 35 insertions(+), 8 deletions(-)
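
For illustration, here is a simplified, hypothetical sketch of the selection policy this commit implements. It is not the actual glibc init-arch/IFUNC code: the feature bits, helper names, and use of GCC's __builtin_cpu_supports are stand-ins for the real CPUID-based machinery.

#include <stdio.h>
#include <string.h>

#define BIT_AVX_USABLE              (1u << 0)
#define BIT_AVX_FAST_UNALIGNED_LOAD (1u << 1)

static unsigned int cpu_features;

static void
init_cpu_features (void)
{
  __builtin_cpu_init ();
  /* __builtin_cpu_supports is a GCC builtin backed by CPUID.  */
  if (__builtin_cpu_supports ("avx"))
    cpu_features |= BIT_AVX_USABLE;
  /* Only AVX2-capable CPUs (Haswell and later) get the
     "fast unaligned load" bit.  */
  if (__builtin_cpu_supports ("avx2"))
    cpu_features |= BIT_AVX_FAST_UNALIGNED_LOAD;
}

/* Stand-ins for __memcpy_avx_unaligned / __memcpy_ssse3; the real
   variants are hand-written assembly inside glibc.  */
static void *
fake_memcpy_avx_unaligned (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *
fake_memcpy_ssse3 (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

typedef void *(*memcpy_fn) (void *, const void *, size_t);

static memcpy_fn
select_memcpy (void)
{
  /* Before the fix the check was effectively "AVX usable", which also
     matched Sandy Bridge; after the fix it is the narrower bit.  */
  if (cpu_features & BIT_AVX_FAST_UNALIGNED_LOAD)
    return fake_memcpy_avx_unaligned;
  return fake_memcpy_ssse3;
}

int
main (void)
{
  init_cpu_features ();
  memcpy_fn chosen = select_memcpy ();
  puts (chosen == fake_memcpy_avx_unaligned
        ? "would use the AVX unaligned variant"
        : "would use the SSSE3 variant");
  return 0;
}

The key point is that "AVX is usable" and "unaligned 256-bit loads are fast" are tracked as separate feature bits, and only CPUs reporting AVX2 get the second one, so Sandy Bridge and Ivy Bridge fall back to the SSSE3 variant.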
Comment 3 Sourceware Commits 2015-01-30 23:39:44 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  5f3d0b78e011d2a72f9e88b0e9ef5bc081d18f97 (commit)
      from  b658fdd82b4524cf6a39881d092caa23f63d93ac (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5f3d0b78e011d2a72f9e88b0e9ef5bc081d18f97

commit 5f3d0b78e011d2a72f9e88b0e9ef5bc081d18f97
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 06:50:20 2015 -0800

    Use AVX unaligned memcpy only if AVX2 is available
    
    memcpy with unaligned 256-bit AVX register loads/stores is slow on older
    processors like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
    and sets it only when AVX2 is available.
    
    	[BZ #17801]
    	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
    	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
    	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
    	New.
    	(index_AVX_Fast_Unaligned_Load): Likewise.
    	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
    	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
    	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
    	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                              |   18 ++++++++++++++++++
 NEWS                                   |    4 ++--
 sysdeps/x86_64/multiarch/init-arch.c   |    9 +++++++--
 sysdeps/x86_64/multiarch/init-arch.h   |    4 ++++
 sysdeps/x86_64/multiarch/memcpy.S      |    2 +-
 sysdeps/x86_64/multiarch/memcpy_chk.S  |    2 +-
 sysdeps/x86_64/multiarch/memmove.c     |    2 +-
 sysdeps/x86_64/multiarch/memmove_chk.c |    2 +-
 sysdeps/x86_64/multiarch/mempcpy.S     |    2 +-
 sysdeps/x86_64/multiarch/mempcpy_chk.S |    2 +-
 10 files changed, 37 insertions(+), 10 deletions(-)
Comment 4 H.J. Lu 2015-01-30 23:40:27 UTC
Fixed for 2.21.
Comment 5 Sourceware Commits 2015-02-17 07:25:21 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, release/2.20/master has been updated
       via  4d54424420c6300efbf57a7b9aa8635a8b8c1942 (commit)
       via  1bf9d48aec087062e2a14b77cb5ee1fa81be334c (commit)
       via  f9e0f439b72e0b2fb035be1bc60aaceeed7f6ed0 (commit)
       via  b0694b9e98ee64cb25490de0921ce307f3872749 (commit)
      from  f80af76648ed97a76745fad6caa3315a79cb1c7c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4d54424420c6300efbf57a7b9aa8635a8b8c1942

commit 4d54424420c6300efbf57a7b9aa8635a8b8c1942
Author: Paul Pluzhnikov <ppluzhnikov@google.com>
Date:   Fri Feb 6 00:30:42 2015 -0500

    CVE-2015-1472: wscanf allocates too little memory
    
    BZ #16618
    
    Under certain conditions wscanf can allocate too little memory for the
    to-be-scanned arguments and overflow the allocated buffer.  The
    implementation now correctly computes the required buffer size when
    using malloc.
    
    A regression test was added to tst-sscanf.
    
    (cherry picked from commit 5bd80bfe9ca0d955bfbbc002781bc7b01b6bcb06)
    
    Conflicts:
    	ChangeLog
    	NEWS
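
As a toy illustration of the general class of bug described above (sizing a wide-character buffer by element count instead of byte count), and not the actual vfscanf.c code:

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int
main (void)
{
  const wchar_t *input = L"example";
  size_t len = wcslen (input);
  wchar_t *buf;

  /* Wrong: this would allocate len + 1 *bytes*, while the copy below
     writes (len + 1) * sizeof (wchar_t) bytes -- a heap overflow:
       buf = malloc (len + 1);                                        */

  /* Right: account for the element size when sizing the buffer.  */
  buf = malloc ((len + 1) * sizeof (wchar_t));
  if (buf == NULL)
    return 1;
  wmemcpy (buf, input, len + 1);
  printf ("%ls\n", buf);
  free (buf);
  return 0;
}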

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1bf9d48aec087062e2a14b77cb5ee1fa81be334c

commit 1bf9d48aec087062e2a14b77cb5ee1fa81be334c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 06:50:20 2015 -0800

    Use AVX unaligned memcpy only if AVX2 is available
    
    memcpy with unaligned 256-bit AVX register loads/stores is slow on older
    processors like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
    and sets it only when AVX2 is available.
    
    	[BZ #17801]
    	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
    	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
    	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
    	New.
    	(index_AVX_Fast_Unaligned_Load): Likewise.
    	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
    	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
    	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
    	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.
    
    (cherry picked from commit 5f3d0b78e011d2a72f9e88b0e9ef5bc081d18f97)
    
    Conflicts:
    	ChangeLog
    	NEWS

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f9e0f439b72e0b2fb035be1bc60aaceeed7f6ed0

commit f9e0f439b72e0b2fb035be1bc60aaceeed7f6ed0
Author: Leonhard Holz <leonhard.holz@web.de>
Date:   Tue Jan 13 11:33:56 2015 +0530

    Fix memory handling in strxfrm_l [BZ #16009]
    
    [Modified from the original email by Siddhesh Poyarekar]
    
    This patch solves bug #16009 by implementing an additional path in
    strxfrm that does not depend on caching the weight and rule indices.
    
    In detail the following changed:
    
    * The old main loop was factored out of strxfrm_l into the function
    do_xfrm_cached to be able to alternatively use the non-caching version
    do_xfrm.
    
    * strxfrm_l allocates a fixed-size array on the stack. If this is not
    sufficient to store the weight and rule indices, the non-caching path is
    taken. As the cache size does not depend on the input, there can be no
    problems with integer overflows or stack allocations greater than
    __MAX_ALLOCA_CUTOFF. Note that malloc-ing is not possible because the
    definition of strxfrm does not allow OOM error handling.
    
    * The uncached path determines the weight and rule index again for every
    character on every pass.
    
    * Passing all the locale data arrays individually resulted in very long
    parameter lists, so I introduced a structure that holds them.
    
    * The check for a zero-length src string has been moved up a bit; it now
    happens before the locale data initialization.
    
    * To verify that the non-caching path works correctly, I added a test run
    to localedata/sort-test.sh & localedata/xfrm-test.c where all strings
    are padded with spaces so that they are too large for the caching path.
    
    (cherry picked from commit 0f9e585480edcdf1e30dc3d79e24b84aeee516fa)
    
    Conflicts:
    	ChangeLog
    	NEWS

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b0694b9e98ee64cb25490de0921ce307f3872749

commit b0694b9e98ee64cb25490de0921ce307f3872749
Author: Roland McGrath <roland@hack.frob.com>
Date:   Thu Sep 11 16:02:17 2014 -0700

    Move findidx nested functions to top-level.
    
    Needed in order to backport strxfrm_l security fix cleanly.
    
    (cherry picked from commit 8c0ab919f63dc03a420751172602a52d2bea59a8)
    
    Conflicts:
    	ChangeLog

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                              |   77 +++++
 NEWS                                   |    8 +-
 locale/weight.h                        |   13 +-
 locale/weightwc.h                      |   13 +-
 localedata/sort-test.sh                |    7 +
 localedata/xfrm-test.c                 |   52 +++-
 posix/fnmatch.c                        |    8 +
 posix/fnmatch_loop.c                   |   17 +-
 posix/regcomp.c                        |   10 +-
 posix/regex_internal.h                 |    7 +-
 posix/regexec.c                        |    8 +-
 stdio-common/tst-sscanf.c              |   33 +++
 stdio-common/vfscanf.c                 |   12 +-
 string/strcoll_l.c                     |    9 +-
 string/strxfrm_l.c                     |  491 +++++++++++++++++++++++++-------
 sysdeps/x86_64/multiarch/init-arch.c   |    9 +-
 sysdeps/x86_64/multiarch/init-arch.h   |    4 +
 sysdeps/x86_64/multiarch/memcpy.S      |    2 +-
 sysdeps/x86_64/multiarch/memcpy_chk.S  |    2 +-
 sysdeps/x86_64/multiarch/memmove.c     |    2 +-
 sysdeps/x86_64/multiarch/memmove_chk.c |    2 +-
 sysdeps/x86_64/multiarch/mempcpy.S     |    2 +-
 sysdeps/x86_64/multiarch/mempcpy_chk.S |    2 +-
 23 files changed, 642 insertions(+), 148 deletions(-)
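
As an aside, the strxfrm_l commit in this batch describes a "fixed stack cache with a slower no-allocation fallback" pattern. A toy sketch of that pattern (not the actual glibc code, and with made-up names) might look like this:

#include <stdio.h>
#include <stddef.h>

enum { CACHE_ENTRIES = 256 };   /* fixed size, independent of the input */

/* Stand-in for the expensive per-character weight lookup.  */
static int
lookup_weight (unsigned char c)
{
  return (c * 31 + 7) & 0xff;
}

/* Toy two-pass transform.  If the input fits into the fixed stack cache,
   each weight is computed once and reused in later passes; otherwise it
   is recomputed for every character on every pass.  No malloc is used,
   so the function can never fail with out-of-memory, matching the
   constraint that strxfrm has no way to report an allocation failure.  */
static long
toy_xfrm (const unsigned char *s, size_t n)
{
  long sum = 0;

  if (n <= CACHE_ENTRIES)
    {
      int cache[CACHE_ENTRIES];          /* bounded stack usage */
      size_t i;
      int pass;

      for (i = 0; i < n; i++)
        cache[i] = lookup_weight (s[i]);
      for (pass = 0; pass < 2; pass++)
        for (i = 0; i < n; i++)
          sum += cache[i] * (pass + 1);
    }
  else
    {
      /* Uncached fallback: slower, but needs no extra memory at all.  */
      size_t i;
      int pass;

      for (pass = 0; pass < 2; pass++)
        for (i = 0; i < n; i++)
          sum += lookup_weight (s[i]) * (pass + 1);
    }
  return sum;
}

int
main (void)
{
  static const unsigned char small[] = "hello";
  printf ("%ld\n", toy_xfrm (small, sizeof small - 1));
  return 0;
}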
Comment 6 Sourceware Commits 2015-04-01 00:11:31 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/release/2.20/master has been created
        at  328fc20e5e334a642f0152d9662474789381a897 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=328fc20e5e334a642f0152d9662474789381a897

commit 328fc20e5e334a642f0152d9662474789381a897
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 06:50:20 2015 -0800

    Use AVX unaligned memcpy only if AVX2 is available
    
    memcpy with unaligned 256-bit AVX register loads/stores is slow on older
    processors like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
    and sets it only when AVX2 is available.
    
    	[BZ #17801]
    	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
    	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
    	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
    	New.
    	(index_AVX_Fast_Unaligned_Load): Likewise.
    	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
    	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
    	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
    	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.
    
    [cherry picked from commit 56d25c11b64a97255a115901d136d753c86de24e]

-----------------------------------------------------------------------
Comment 7 Sourceware Commits 2016-03-02 19:24:12 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/memcpy/dpdk/master has been created
        at  1bc1103620e8f6c7e01cb54a8ed04ee1c3eb5a1a (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1bc1103620e8f6c7e01cb54a8ed04ee1c3eb5a1a

commit 1bc1103620e8f6c7e01cb54a8ed04ee1c3eb5a1a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 11:07:13 2015 -0800

    Add memcpy-rte-ssse3.c

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f63a6815da4c72626b14b456a6902cc8d3671729

commit f63a6815da4c72626b14b456a6902cc8d3671729
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 08:44:30 2015 -0800

    Add memcpy-rte-avx.c
    
    Don't inline rte_memcpy.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d2ca99bf141c78bd8d9c1f314ce8a1f12c439d4b

commit d2ca99bf141c78bd8d9c1f314ce8a1f12c439d4b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 08:51:45 2015 -0800

    Import rte_memcpy.h
    
    rte_memcpy.h is a memcpy implementation from DPDK:
    
    http://dpdk.org/
    
    optimized for Sandy Bridge and Haswell. See
    
    http://dpdk.org/ml/archives/dev/2014-November/008158.html
    
    The original code is at
    
    https://gist.github.com/lukego/efc82a15bde5ec83cb1b

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=56d25c11b64a97255a115901d136d753c86de24e

commit 56d25c11b64a97255a115901d136d753c86de24e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 30 06:50:20 2015 -0800

    Use AVX unaligned memcpy only if AVX2 is available
    
    memcpy with unaligned 256-bit AVX register loads/stores is slow on older
    processors like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
    and sets it only when AVX2 is available.
    
    	[BZ #17801]
    	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
    	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
    	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
    	New.
    	(index_AVX_Fast_Unaligned_Load): Likewise.
    	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
    	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
    	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
    	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
    	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
    	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
    	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
    	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a29c4064115e59bcf8c001c0b3dedfa8d49d3653

commit a29c4064115e59bcf8c001c0b3dedfa8d49d3653
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jan 14 06:29:04 2015 -0800

    Support compilers defaulting to PIE
    
    If PIE is the default, we need to build programs as PIE.
    
    	* Makeconfig (+link): Set to $(+link-pie) if default to PIE.
    	(+link-tests): Set to $(+link-pie-tests) if default to PIE.
    	* config.make.in (build-pie-default): New.
    	* configure.ac (libc_cv_pie_default): New.  Set to yes if -fPIE
    	is default.  AC_SUBST.
    	* configure: Regenerated.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f0b03bc24b54927677af56778309b6d58aac5eb4

commit f0b03bc24b54927677af56778309b6d58aac5eb4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Jan 13 06:19:44 2015 -0800

    Compile gcrt1.o with -fPIC
    
    We compile gcrt1.o with -fPIC to support both "gcc -pg" and "gcc -pie -pg".
    
    	[BZ #17836]
    	* csu/Makefile (extra-objs): Add gmon-start.o if not building
    	shared library.  Add gmon-start.os otherwise.
    	($(objpfx)g$(start-installed-name)): Use $(objpfx)S%
    	$(objpfx)gmon-start.os if building shared library.
    	($(objpfx)g$(static-start-installed-name)): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ccf880ba92fe1ef7f29f17062ba6aa2aa7b52f50

commit ccf880ba92fe1ef7f29f17062ba6aa2aa7b52f50
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Dec 19 06:30:31 2014 -0800

    Compile vismain with -fPIC and link with -pie

-----------------------------------------------------------------------