Hi, looks like code using strstr() and libc-2.27 is running 4 times slower than was with libc-2.23. Having a look at profiled code executed in various moments show different versions of strstr : - libc-2.27.so: strstr.c, str-two-way.h - this is current version ubuntu 18.04.1 Same code, profiled March 13, 2018 : - libc-2.23.so: strstr-sse2-unaligned.S - this version was 4 times faster Was the sse2 optimized code removed from strstr() in 2.27 ? Any plans to reintroduce it ? -Paul
glibc 2.27 on Fedora 28 x86_64 selects __strstr_sse2. What's your architecture, and which CPU do you use?
ubuntu 18.04.1 LTS, Intel Core i7-4710HQ
How did you compile it?
Just using the standard libc installed on ubuntu 18.04
You need to report that to ubuntu then.
already submitted but no answer. Thanks for clarifying that strstr() _should_ have sse 2 optimizations.
(In reply to paul.borile from comment #2) > ubuntu 18.04.1 LTS, Intel Core i7-4710HQ i386 or amd64?
amd64
Okay, then Ubuntu must have a peculiar build. You need to report it to them. As I said, on a Core i7-4810MQ, other builds of glibc 2.27 select an SSE2-optimized version of strstr.
I looked at the Ubuntu patches, but couldn't find anything obvious.
Created attachment 11315 [details] strstr.c
I made a test compiling for amd64 the attached program and running it on different platforms. The test run in 0.5 seconds on a sse2 optimized strstr() while 8 times slower on a non sse2 optimized strstr(). Example of a fast run : $ docker run -i ubuntu:16.04 /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'ubuntu:16.04' locally 16.04: Pulling from library/ubuntu 3b37166ec614: Pull complete 504facff238f: Pull complete ebbcacd28e10: Pull complete c7fb3351ecad: Pull complete 2e3debadcbf7: Pull complete Digest: sha256:45ddfa61744947b0b8f7f20b8de70cbcdd441a6a0532f791fd4c09f5e491a8eb Status: Downloaded newer image for ubuntu:16.04 real 0m0.475s user 0m0.474s sys 0m0.000s matches = 3000000 Example of a slow run : docker run -i ubuntu:18.04 /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'ubuntu:18.04' locally 18.04: Pulling from library/ubuntu Digest: sha256:de774a3145f7ca4f0bd144c7d4ffb2931e06634f11529653b23eba85aef8e378 Status: Downloaded newer image for ubuntu:18.04 matches = 3000000 real 0m4.796s user 0m4.794s sys 0m0.000s debian buster is showing the same problem : docker run -i debian:buster /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'debian:buster' locally buster: Pulling from library/debian 1064a561889d: Pull complete Digest: sha256:b17f0ca6c2c0c5ad88b3a292e50ad657c87fc165ef1bde4dba57d9616651a9b9 Status: Downloaded newer image for debian:buster matches = 3000000 real 0m4.871s user 0m4.871s sys 0m0.000s So it is not an ubuntu 18 only problem.
debian buster (libc-2.27) has the same problem, debian 9 (libc-2.24) does not
fedora 27 and 26 are showing the same problema, fedora 25 is ok : $ docker run -i fedora:27 /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'fedora:27' locally 27: Pulling from library/fedora c8ae8b35783e: Pull complete Digest: sha256:f4fb457f7cd858e0f6efa1a78390622823e83c393981b0bf91a2b6e8da3b6f50 Status: Downloaded newer image for fedora:27 matches = 3000000 real 0m4.939s user 0m4.937s sys 0m0.000s $ docker run -i fedora:26 /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'fedora:26' locally 26: Pulling from library/fedora df75b105d67a: Pull complete Digest: sha256:b05007bacf64401663ab5d9a97d73b143a561be74a36a0f31219e5acb87fc7ec Status: Downloaded newer image for fedora:26 matches = 3000000 real 0m4.426s user 0m4.426s sys 0m0.000s $ docker run -i fedora:25 /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'fedora:25' locally 25: Pulling from library/fedora 8ab017634520: Pull complete Digest: sha256:322cb01bbca26972c98051bacd3ab8555cec059496d64d35ee78b15de9ea0d06 Status: Downloaded newer image for fedora:25 matches = 3000000 real 0m0.468s user 0m0.468s sys 0m0.000s
Cannot reproduce on openSUSE Tumbleweed.
opensuse:42.3 (libc-2.22) does not have the problem. Seems that it was introduced in libc-2.25 (fedora26 has it). $ docker run -i opensuse:42.3 /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'opensuse:42.3' locally 42.3: Pulling from library/opensuse 47aa660240a8: Pull complete Digest: sha256:569e6ee7a622838b9fa1111c3bfa99a50fdb34b7503f945b7d18ce66bb94a369 Status: Downloaded newer image for opensuse:42.3 matches = 3000000 real 0m0.450s user 0m0.450s sys 0m0.000s
(In reply to Florian Weimer from comment #1) > glibc 2.27 on Fedora 28 x86_64 selects __strstr_sse2. > > What's your architecture, and which CPU do you use? fedora28 seems to have the problem : $ docker run -i fedora:28 /bin/bash -c "cat > strstr ; chmod +x strstr ; time ./strstr aa" < ./strstr Unable to find image 'fedora:28' locally 28: Pulling from library/fedora 565884f490d9: Pull complete Digest: sha256:b41cd083421dd7aa46d619e958b75a026a5d5733f08f14ba6d53943d6106ea6d Status: Downloaded newer image for fedora:28 matches = 3000000 real 0m4.510s user 0m4.510s sys 0m0.000s
Created attachment 11318 [details] strstr executable
The issue was introduced by commit 'Disable TSX on some Haswell processors.' (2702856bf45c82cf8e69f2064f5aa15c0ceb6359) which change the some default flags for unknown models. Previously, new models were handled by the default switch option which assumed a Core i3/i5/i7 if AVX is available. After the patch, Haswell models (0x3f, 0x3c, 0x45, 0x46) does not set the flags Fast_Rep_String, Fast_Unaligned_Load, Fast_Unaligned_Copy, and Prefer_PMINUB_for_stringop for newer processors. The most straightforward fix is to not mix the memory access flags set with TSX, rather handle it on its own switch: diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c index 122372862c..ecc82fc6af 100644 --- a/sysdeps/x86/cpu-features.c +++ b/sysdeps/x86/cpu-features.c @@ -297,7 +297,13 @@ init_cpu_features (struct cpu_features *cpu_features) | bit_arch_Fast_Unaligned_Copy | bit_arch_Prefer_PMINUB_for_stringop); break; + } + /* Disable TSX on some Haswell processors to avoid TSX on kernels that + weren't updated with the latest microcode package (which disables + broken feature by default). */ + switch (model) + { case 0x3f: /* Xeon E7 v3 with stepping >= 4 has working TSX. */ if (stepping >= 4) azanella@birita:~/Projects/glibc/glibc-git-work$
So if cpu has the AVX set we loose all the SSE2 optimizations but we keep the avx ones, correct ?
(In reply to paul.borile from comment #21) > So if cpu has the AVX set we loose all the SSE2 optimizations but we keep > the avx ones, correct ? Not really, only Haswell chips in fact will have this behavior [1]: Haswell (Client) GT3E 0 0x6 0x4 0x6 Family 6 Model 70 ULT 0 0x6 0x4 0x5 Family 6 Model 69 S 0 0x6 0x3 0xC Family 6 Model 60 Haswell (Server) E, EP, EX 0 0x6 0x3 0xF Family 6 Model 63 On these chips the internal glibc flags bit_arch_Fast_Rep_String, bit_arch_Fast_Unaligned_Load, bit_arch_Fast_Unaligned_Copy, and bit_arch_Prefer_PMINUB_for_stringop won't be set: The bit_arch_Fast_Rep_String is only used on ifunc selection on i686 (32-bits) and it selects the *ssse3_rep* memcpy, memmove, bcopy, and mempcpy. It it is not set the *ssse3* variant is used instead (not sure which is the performance difference between them). The bit_arch_Fast_Unaligned_Load influences both i686 and x86_64. For x86_64 it influences the selections of the SSE2 unaligned optimization variants for stpncpy, strcpy, strncpy, stpcpy, strncat, strcat, and strstr. For all but strstr an ssse3 or sse2 variant is used instead (not sure either which is the performance difference between them). The bit_arch_Fast_Unaligned_Copy influences mempcpy, memmove, and memcpy. If chip has not SSE3 the bit will select either a RMS or a unaligned variant. For Haswell the *_avx_unaligned_erms variants will be selected, so this bits won't interfere with best selections. The bit_arch_PMINUB_for_stringop is not used on ifunc selection. [1] https://en.wikichip.org/wiki/intel/cpuid
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, master has been updated via 18ad0de6513bf8a8e4ba757c069e6806d07920f8 (commit) via c3d8dc45c9df199b8334599a6cbd98c9950dba62 (commit) from f1034472e21d77b978464b73adbb0f9f1f032c91 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18ad0de6513bf8a8e4ba757c069e6806d07920f8 commit 18ad0de6513bf8a8e4ba757c069e6806d07920f8 Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> Date: Tue Oct 23 14:53:12 2018 -0300 Fix tst-preadvwritev2 build failure on HURD Commit 7a16bdbb9ff41 uses IOV_MAX, which is not defined on hurd. Checked on a build for i686-gnu. * misc/tst-preadvwritev2-common.c (IOV_MAX): Define if not defined. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c3d8dc45c9df199b8334599a6cbd98c9950dba62 commit c3d8dc45c9df199b8334599a6cbd98c9950dba62 Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> Date: Thu Oct 11 15:18:40 2018 -0300 x86: Fix Haswell strong flags (BZ#23709) Th commit 'Disable TSX on some Haswell processors.' (2702856bf4) changed the default flags for Haswell models. Previously, new models were handled by the default switch path, which assumed a Core i3/i5/i7 if AVX is available. After the patch, Haswell models (0x3f, 0x3c, 0x45, 0x46) do not set the flags Fast_Rep_String, Fast_Unaligned_Load, Fast_Unaligned_Copy, and Prefer_PMINUB_for_stringop (only the TSX one). This patch fixes it by disentangle the TSX flag handling from the memory optimization ones. The strstr case cited on patch now selects the __strstr_sse2_unaligned as expected for the Haswell cpu. Checked on x86_64-linux-gnu. [BZ #23709] * sysdeps/x86/cpu-features.c (init_cpu_features): Set TSX bits independently of other flags. ----------------------------------------------------------------------- Summary of changes: ChangeLog | 9 +++++++++ misc/tst-preadvwritev2-common.c | 5 +++++ sysdeps/x86/cpu-features.c | 6 ++++++ 3 files changed, 20 insertions(+), 0 deletions(-)
Fixed on 2.29 (c3d8dc45c9df199b8334599a6cbd98c9950dba62).
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, release/2.28/master has been updated via 65010329f2c596bdd1204c1c9c9baac0193637af (commit) from e1af1df694603c0dcd5118c30eea2cdeb01a1a0b (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=65010329f2c596bdd1204c1c9c9baac0193637af commit 65010329f2c596bdd1204c1c9c9baac0193637af Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> Date: Fri Nov 2 10:56:00 2018 +0100 x86: Fix Haswell CPU string flags (BZ#23709) Th commit 'Disable TSX on some Haswell processors.' (2702856bf4) changed the default flags for Haswell models. Previously, new models were handled by the default switch path, which assumed a Core i3/i5/i7 if AVX is available. After the patch, Haswell models (0x3f, 0x3c, 0x45, 0x46) do not set the flags Fast_Rep_String, Fast_Unaligned_Load, Fast_Unaligned_Copy, and Prefer_PMINUB_for_stringop (only the TSX one). This patch fixes it by disentangle the TSX flag handling from the memory optimization ones. The strstr case cited on patch now selects the __strstr_sse2_unaligned as expected for the Haswell cpu. Checked on x86_64-linux-gnu. [BZ #23709] * sysdeps/x86/cpu-features.c (init_cpu_features): Set TSX bits independently of other flags. (cherry picked from commit c3d8dc45c9df199b8334599a6cbd98c9950dba62) ----------------------------------------------------------------------- Summary of changes: ChangeLog | 6 ++++++ NEWS | 1 + sysdeps/x86/cpu-features.c | 6 ++++++ 3 files changed, 13 insertions(+), 0 deletions(-)
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, release/2.27/master has been updated via d8eee5ef553350364eff1dce9d143fb845c60615 (commit) from 5cd5309d910b0fcdeb5d7e734afb535d0231ba2c (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d8eee5ef553350364eff1dce9d143fb845c60615 commit d8eee5ef553350364eff1dce9d143fb845c60615 Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> Date: Fri Nov 2 11:14:57 2018 +0100 x86: Fix Haswell CPU string flags (BZ#23709) Th commit 'Disable TSX on some Haswell processors.' (2702856bf4) changed the default flags for Haswell models. Previously, new models were handled by the default switch path, which assumed a Core i3/i5/i7 if AVX is available. After the patch, Haswell models (0x3f, 0x3c, 0x45, 0x46) do not set the flags Fast_Rep_String, Fast_Unaligned_Load, Fast_Unaligned_Copy, and Prefer_PMINUB_for_stringop (only the TSX one). This patch fixes it by disentangle the TSX flag handling from the memory optimization ones. The strstr case cited on patch now selects the __strstr_sse2_unaligned as expected for the Haswell cpu. Checked on x86_64-linux-gnu. [BZ #23709] * sysdeps/x86/cpu-features.c (init_cpu_features): Set TSX bits independently of other flags. (cherry picked from commit c3d8dc45c9df199b8334599a6cbd98c9950dba62) ----------------------------------------------------------------------- Summary of changes: ChangeLog | 6 ++++++ NEWS | 1 + sysdeps/x86/cpu-features.c | 6 ++++++ 3 files changed, 13 insertions(+), 0 deletions(-)
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, release/2.26/master has been updated via dc40423dba7208ded2ec293c9a2938269f944ee8 (commit) from 33f5de7a79b27b9dce30a46d6681974653a85004 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=dc40423dba7208ded2ec293c9a2938269f944ee8 commit dc40423dba7208ded2ec293c9a2938269f944ee8 Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> Date: Fri Nov 2 11:14:05 2018 +0100 x86: Fix Haswell CPU string flags (BZ#23709) Th commit 'Disable TSX on some Haswell processors.' (2702856bf4) changed the default flags for Haswell models. Previously, new models were handled by the default switch path, which assumed a Core i3/i5/i7 if AVX is available. After the patch, Haswell models (0x3f, 0x3c, 0x45, 0x46) do not set the flags Fast_Rep_String, Fast_Unaligned_Load, Fast_Unaligned_Copy, and Prefer_PMINUB_for_stringop (only the TSX one). This patch fixes it by disentangle the TSX flag handling from the memory optimization ones. The strstr case cited on patch now selects the __strstr_sse2_unaligned as expected for the Haswell cpu. Checked on x86_64-linux-gnu. [BZ #23709] * sysdeps/x86/cpu-features.c (init_cpu_features): Set TSX bits independently of other flags. (cherry picked from commit c3d8dc45c9df199b8334599a6cbd98c9950dba62) ----------------------------------------------------------------------- Summary of changes: ChangeLog | 6 ++++++ NEWS | 1 + sysdeps/x86/cpu-features.c | 6 ++++++ 3 files changed, 13 insertions(+), 0 deletions(-)