This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: Optimized strlen() for x64 (SSE2 based strlen)
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Trevor Herselman <therselman at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Sat, 12 May 2018 05:54:47 -0700
- Subject: Re: Optimized strlen() for x64 (SSE2 based strlen)
- References: <CAKZKS+YNbqHB8gJ0We4DapFOApL8cMiSoXN6XPNsVWTi58DZ1w@mail.gmail.com>
On Sat, May 12, 2018 at 4:40 AM, Trevor Herselman <therselman@gmail.com> wrote:
> #include <stddef.h>     /* size_t */
> #include <stdint.h>     /* uintptr_t */
> #include <emmintrin.h>  /* SSE2 intrinsics */
>
> size_t strlen_sse2( const char *str )
> {
>     /* Round str down to the previous 16-byte boundary for the first load.  */
>     const __m128i *ptr = (const __m128i *)( (uintptr_t) str & ~(uintptr_t) 0xf );
>     const __m128i zero = _mm_setzero_si128();
>
>     /* Aligned 16-byte load; shift the misalignment bytes off the match mask.  */
>     int mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>     mask >>= ( (uintptr_t) str & 0xf );
>     if ( mask ) return __builtin_ctz( (unsigned int) mask );
>     ptr++;
>
>     /* Main loop: one aligned 16-byte block per iteration.  */
>     for ( ; ; )
>     {
>         mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>         if ( mask ) return (const char *)ptr - str + __builtin_ctz( (unsigned int) mask );
>         ptr++;
>     }
>     __builtin_unreachable();
> }
>
>
> Written by Trevor Herselman
>
>
> Features:
>
>
> *) NO penalty for misaligned addresses (the `mask >>= unaligned bit
> shift` takes care of it)
> *) NO alignment checks / jumps - ALL reads are already pre-aligned to 16 bytes
> *) The cost of the misalignment code is amortized inside the first 16
> byte read for all aligned/unaligned reads
> *) The function uses ONLY aligned reads (_mm_load_si128), NEVER uses a
> single unaligned read, not even for 1 byte.
> *) Plain C, the compiler is free to make optimizations around the
> function when inlined.
> *) Small, tight code: only about 40 bytes (less than 1 CPU cache line)
> more machine code than the default bit-twiddling implementation, for a
> significant speed boost!
> *) Compatible with ALL x64 CPUs, because they are ALL required to
> support at least SSE2
> *) NO misaligned read penalties for older hardware
> *) Faster than ALL other SSE2 ~ SSE4.2 implementations, especially for
> smaller strings
> *) The size/cost/performance benefit starts at a string length of 2
> bytes; meaning that if the average string length is 2 bytes or more,
> this implementation is faster than any of the other methods I benchmarked!
> *) Benchmarked against GLIBC (bit-twiddle), Agner Fog's A_strlen() &
> all other methods I could find.
> *) Doesn't need to do any (4K) page boundary checks and will never read
> beyond the page containing the NULL terminator: every load is 16-byte
> aligned, and an aligned 16-byte load can never straddle a 4K page boundary.
>
> The main benefit is that there are NO alignment checks! Any unaligned
> bits are bit-shifted off the first mask! So there is just a `shr`
> instead of an `if (str & 0xF)` check (`cmp` + conditional `jmp`). If
> the address is already aligned, the bit-shift does nothing, because
> `shr 0` does nothing. But at least we've eliminated the alignment check!
>
> Note: I wasn't able to benchmark it against the current GLIBC AVX2 or
> SSE2 implementations, because I can't compile or test them. Having
> looked at the code, I believe they will be slower in general unless
> you are working with very large strings, because they read 4 XMM
> registers at a time, which is a big cost if you don't need the extra
> reads/data. I believe shorter strings are more common. Also, their
> code size is significantly larger.
It is trivial to add a new implementation to the glibc framework. This patch
adds __strlen_sse2_2 to glibc:
diff --git a/sysdeps/x86_64/multiarch/Makefile
b/sysdeps/x86_64/multiarch/Makefile
index 68257c4017..3bb4fd5ec9 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -24,6 +24,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
strchr-sse2 strchrnul-sse2 strchr-avx2 strchrnul-avx2 \
strrchr-sse2 strrchr-avx2 \
strlen-sse2 strnlen-sse2 strlen-avx2 strnlen-avx2 \
+ strlen-sse2-2 \
strcat-ssse3 strncat-ssse3\
strcpy-sse2 stpcpy-sse2 \
strcpy-ssse3 strncpy-ssse3 stpcpy-ssse3 stpncpy-ssse3 \
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 7afd674b81..e5b0dfd510 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -186,6 +186,7 @@ __libc_ifunc_impl_list (const char *name, struct
libc_ifunc_impl *array,
IFUNC_IMPL_ADD (array, i, strlen,
HAS_ARCH_FEATURE (AVX2_Usable),
__strlen_avx2)
+ IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2_2)
IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2))
/* Support sysdeps/x86_64/multiarch/strnlen.c. */
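The new file referenced by the Makefile can be a thin container for your
code. A minimal sketch of what sysdeps/x86_64/multiarch/strlen-sse2-2.c
could contain (this file body is an assumption for illustration, essentially
the posted function under the __strlen_sse2_2 name, not the exact file used
for the numbers below):

/* Hypothetical strlen-sse2-2.c: the posted SSE2 strlen built under the
   symbol name the IFUNC table expects.  */
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

size_t
__strlen_sse2_2 (const char *str)
{
  /* Same algorithm as the mail above: one aligned 16-byte load, shift
     the match mask by the misalignment, then loop over aligned blocks.  */
  const __m128i *ptr = (const __m128i *) ((uintptr_t) str & ~(uintptr_t) 0xf);
  const __m128i zero = _mm_setzero_si128 ();

  int mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
  mask >>= (uintptr_t) str & 0xf;
  if (mask)
    return __builtin_ctz ((unsigned int) mask);

  for (;;)
    {
      ptr++;
      mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
      if (mask)
        return (const char *) ptr - str + __builtin_ctz ((unsigned int) mask);
    }
}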
Here __strlen_sse2_2 is your implementation. Glibc microbenchmark
shows:
                         simple_STRLEN  builtin_strlen  __strlen_avx2  __strlen_sse2_2  __strlen_sse2
Length 1, alignment 1: 32.875 48.1562 42.8438 40.5 39.9062
Length 1, alignment 0: 30.7812 47.5938 31.75 35.0625 34.9375
Length 2, alignment 2: 40.4062 44.0938 29.9688 29.9062 28.9375
Length 2, alignment 0: 36.6875 42.2812 29.25 32.5938 29.75
Length 3, alignment 3: 50.25 40.375 29.0625 29.0625 29.5312
Length 3, alignment 0: 46.8438 40.4688 29.1875 28.375 31.5625
Length 4, alignment 4: 56.2812 40.3438 28.0625 27.9375 31.1875
Length 4, alignment 0: 48.6875 40.3438 27.9062 27.75 31.625
Length 5, alignment 5: 69.1562 40.2812 33 31.2812 31.875
Length 5, alignment 0: 64.4688 40.1562 31.625 31.0312 28.0312
Length 6, alignment 6: 64.9062 40.6562 27.9062 27.875 31.3125
Length 6, alignment 0: 66.7188 40.4375 27.6875 28.0625 31.5312
Length 7, alignment 7: 66.8438 40.3438 28.0625 27.9688 28.1875
Length 7, alignment 0: 71.5312 40.8438 28.0625 28.25 30.5312
Length 4, alignment 0: 49 40.8438 28.0625 28 27.8125
Length 4, alignment 7: 49.625 40.5938 27.75 27.9062 27.9375
Length 4, alignment 2: 47.375 41 27.8438 30.9688 27.8438
Length 2, alignment 2: 38.3438 40.25 27.9062 27.6875 28.0625
Length 8, alignment 0: 77.2812 40.9688 27.8438 30.25 31.0938
Length 8, alignment 7: 76.1875 40.9375 28.0312 31.5625 29.25
Length 8, alignment 3: 63.2812 40.5 28.0625 28.3125 27.9062
Length 5, alignment 3: 68.7188 40.3438 27.9375 27.9688 30.3125
Length 16, alignment 0: 106.906 40.8125 27.75 51.6875 53.5312
Length 16, alignment 7: 98.3125 43.1562 28.0625 42.8438 53.5
Length 16, alignment 4: 98.3438 40.375 27.9062 44.625 50.9062
Length 10, alignment 4: 76.375 40.3125 28.0938 35 34.25
Length 32, alignment 0: 179.719 51.7812 43.5625 52.875 60.8125
Length 32, alignment 7: 171.094 48.375 40.1562 45.0625 47.0312
Length 32, alignment 5: 167.062 46.5938 38.5625 45.2812 48.0938
Length 21, alignment 5: 123.344 44.7812 29.5625 47.0938 47.5938
Length 64, alignment 0: 346.125 64.75 46.3125 79.5 79.625
Length 64, alignment 7: 338.562 51.2812 43.8438 60.2812 75.1562
Length 64, alignment 6: 338.969 52.5 42.5312 63.0625 74.4062
Length 42, alignment 6: 431 49.8438 36.6875 56.7188 53.5625
Length 128, alignment 0: 845.344 69.75 50.875 79.375 93.4688
Length 128, alignment 7: 853.75 58.5 49.25 75.7812 84.6562
Length 128, alignment 7: 851.344 56.3438 47.1875 75.8125 83.2812
Length 85, alignment 7: 633.25 61.1562 42.2812 67.0625 76.75
Length 256, alignment 0: 1563.38 102.844 91.7188 149.031 106.75
Length 256, alignment 7: 1578.09 88.75 76.625 124.625 101.656
Length 256, alignment 8: 1556.91 88.125 76.9688 133.125 101.531
Length 170, alignment 8: 1075.59 87.6875 71.5312 99.3125 90.5625
Length 512, alignment 0: 2971.5 127.875 120.031 233.094 145
Length 512, alignment 7: 2969.06 112.469 102.062 210.812 137.062
Length 512, alignment 9: 2964.56 109.562 103.594 207.062 132.875
Length 341, alignment 9: 2017 106.125 98.3125 158.969 119.938
Length 1024, alignment 0: 5782.5 179.844 170.375 383.219 237.562
Length 1024, alignment 7: 5785.72 163.406 152.594 382.219 220.688
Length 1024, alignment 10: 5772.62 162.875 152.062 376.812 218.969
Length 682, alignment 10: 3898.88 145.625 125.938 275.5 160.812
Length 2048, alignment 0: 11879.2 247.469 243.094 769.469 374.844
Length 2048, alignment 7: 10140.6 237.531 229.812 768.188 350.094
Length 2048, alignment 11: 10526.2 235.781 227.125 775.5 351.844
Length 1365, alignment 11: 6820.31 190.219 183.156 448.406 266.625
Length 4096, alignment 0: 20157.3 433.469 431.031 1399.66 679.875
Length 4096, alignment 7: 20502.1 428.844 420.438 1379.12 661.688
Length 4096, alignment 12: 12538.4 123.25 120.5 398.719 192.531
Length 2730, alignment 12: 3913.09 90.8125 87.1562 283.5 132.344
Your implementation is good, especially at shorter strings. For strings
longer than 128 bytes, yours is slower.
>
> My name is Trevor Herselman, and I give the developers of GCC, GLIBC
> and anyone else permission to use this algorithm under the MIT
> license. Please credit me where possible!
>
>
>
> Benchmarks: 10,000,000,000 strings, warm cache, Core i5 3rd gen,
> lower is better
>
> The `GCC/GLIBC bit-twiddling` technique listed below is the current
> GCC/GLIBC implementation: a 1-byte pre-alignment test followed by
> 64-bit reads (8-byte long long).
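For reference, a minimal sketch of that word-at-a-time idea (the classic
zero-byte test from the bit-twiddling-hacks family, not glibc's actual code;
names below are illustrative):

#include <stddef.h>
#include <stdint.h>

size_t strlen_bittwiddle( const char *str )
{
    const char *p = str;

    /* 1-byte pre-alignment loop up to an 8-byte boundary.  */
    for ( ; (uintptr_t) p & 7; p++ )
        if ( *p == '\0' ) return (size_t)( p - str );

    /* 8 bytes at a time: a high bit is set in has_zero exactly at the
       positions of zero bytes at or below the first zero byte.  */
    for ( ; ; p += 8 )
    {
        uint64_t v;
        __builtin_memcpy( &v, p, sizeof( v ) );  /* aligned 64-bit load */
        uint64_t has_zero = ( v - 0x0101010101010101ULL ) & ~v
                            & 0x8080808080808080ULL;
        if ( has_zero )
            return (size_t)( p - str ) + ( __builtin_ctzll( has_zero ) >> 3 );
    }
}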
>
> Aligned memory
>
> 0 byte length (empty string, only NULL terminator)
>
> Traditional (naive) 1-byte test: 5.35s
> GCC/GLIBC bit-twiddling: 14s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 1 byte length
>
> Traditional (naive) 1-byte test: 9s
> GCC/GLIBC bit-twiddling: 17s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 2 byte length
>
> Traditional (naive) 1-byte test: 15s
> bit-twiddling: 20s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 3 byte length
>
> Traditional (naive) 1-byte test: 19s
> GCC bit-twiddle: 23s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 7 byte length (+1 byte NULL = 8 bytes total)
>
> Traditional (naive) 1-byte test: 44s
> GCC bit-twiddle: 33.4s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 8 byte length (+1 byte NULL = 9 bytes total)
>
> Traditional (naive) 1-byte test: 46s
> GCC bit-twiddle: 19s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 15 byte length (+1 byte NULL = 16 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 34s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 16 byte length (+1 byte NULL = 17 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 27s
> A_strlen: 24s
> My SSE2 strlen: 13.3s
>
>
> 128 byte length (+1 byte NULL = 129 bytes total)
>
> Traditional (naive) 1-byte test: 718.1s
> GCC bit-twiddle: 171.3s
> A_strlen: 63s
> My SSE2 strlen: 45.5s
>
>
>
> unaligned (1 byte offset)
>
> 0 bytes:
>
> Traditional (naive) 1-byte test: 8s
> GCC bit-twiddle: 9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> NOTE: The GCC/GLIBC implementation is faster here because of its
> pre-alignment test: it reads the first byte and returns immediately
>
>
> 1 byte:
>
> Traditional (naive) 1-byte test: 13.2s
> GCC bit-twiddle: 15.9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 2 bytes:
>
> Traditional (naive) 1-byte test: 17.4s
> GCC bit-twiddle: 19.3s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 3 bytes:
>
> Traditional (naive) 1-byte test: 25.6s
> GCC bit-twiddle: 24.7s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 127 bytes:
>
> Traditional (naive) 1-byte test: 715s
> GCC bit-twiddle: 180s
> A_strlen: 65s
> My SSE2 strlen: 47s
--
H.J.