Optimized strlen() for x64 (SSE2 based strlen)
- From: Trevor Herselman <therselman at gmail dot com>
- To: libc-alpha at sourceware dot org
- Date: Sat, 12 May 2018 13:40:18 +0200
- Subject: Optimized strlen() for x64 (SSE2 based strlen)
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse2( const char *str )
{
    /* Round str down to the previous 16-byte boundary for an aligned load. */
    const __m128i *ptr  = (const __m128i *)( (uintptr_t) str & ~(uintptr_t) 0xf );
    const __m128i  zero = _mm_setzero_si128();

    /* First block: shift off the mask bits that belong to bytes before str. */
    int mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
    mask >>= ( (uintptr_t) str & 0xf );
    if ( mask ) return __builtin_ctz( (unsigned int) mask );
    ptr++;

    /* Remaining blocks are fully aligned; scan 16 bytes per iteration. */
    for ( ; ; )
    {
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
        if ( mask ) return (const char *) ptr - str + __builtin_ctz( (unsigned int) mask );
        ptr++;
    }
    __builtin_unreachable();
}
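Not part of the original post, but a minimal smoke test of the routine might look like the following, assuming strlen_sse2() above is in scope: it places a fixed-length string at every possible misalignment within a 16-byte block and compares the result against the libc strlen().

#include <assert.h>
#include <stdlib.h>
#include <string.h>

int main( void )
{
    char *buf = aligned_alloc( 16, 64 );        /* 16-byte aligned scratch buffer */
    assert( buf );
    memset( buf, 'x', 64 );
    for ( int off = 0; off < 16; off++ )
    {
        buf[off + 20] = '\0';                   /* string of length 20 starting at buf + off */
        assert( strlen_sse2( buf + off ) == strlen( buf + off ) );
        buf[off + 20] = 'x';                    /* restore for the next offset */
    }
    free( buf );
    return 0;
}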
Written by Trevor Herselman
Features:
*) NO penalty for misaligned addresses (the `mask >>=` shift by the unaligned offset takes care of it)
*) NO alignment checks / jumps - ALL reads are already pre-aligned to 16 bytes
*) The cost of the misalignment code is amortized inside the first 16-byte read for all aligned/unaligned reads
*) The function uses ONLY aligned reads (_mm_load_si128), NEVER uses a
single unaligned read, not even for 1 byte.
*) Plain C, the compiler is free to make optimizations around the
function when inlined.
*) Small, tight code size: less than 1 CPU cache line (about 40 bytes) of extra machine code compared to the default/bit-twiddling implementation, for a significant speed boost!
*) Compatible with ALL x64 CPUs, because they are ALL required to support at least SSE2
*) NO misaligned read penalties for older hardware
*) Faster than ALL other SSE2 ~ SSE4.2 implementations, especially for
smaller strings
*) The size/cost/performance benefit starts at a string length of 2 bytes; in other words, if the average string length is 2 bytes or more, this implementation is faster than any other method!
*) Benchmarked against GLIBC (bit-twiddle), Agner Fog's A_strlen() &
all other methods I could find.
*) Doesn't need to do any (4K) page-boundary checks and will never read past the page containing the NULL terminator, because an aligned 16-byte load can never cross a page boundary (see the sketch just below).
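To illustrate that page-boundary point with a small sketch of my own (check_no_page_cross() is a hypothetical helper, not part of the routine): because the load address is rounded down to a multiple of 16 and 4096 is itself a multiple of 16, an aligned 16-byte load always stays within a single 4K page.

#include <assert.h>
#include <stdint.h>

/* Hypothetical sanity check: an aligned 16-byte load never straddles a 4K page. */
static void check_no_page_cross( const void *p )
{
    uintptr_t load = (uintptr_t) p & ~(uintptr_t) 0xf;   /* address actually loaded      */
    assert( ( load & 0xfff ) + 16 <= 4096 );             /* load ends within its 4K page */
}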
The main benefit is that there are NO alignment checks! Any unaligned bits are simply shifted off the first mask, so there is just a `shr` instead of the `cmp`/`jmp` of an `if (str & 0xF)` test. If the address is already aligned, the bit-shift does nothing, because `shr` by 0 does nothing, but at least we've eliminated the alignment check!
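As a concrete walk-through of that shift (a made-up example, not from the post): suppose str starts 7 bytes into its 16-byte block, its NULL terminator sits at str[2] (byte 9 of the block), and byte 3 of the block, before the string, also happens to be zero.

#include <stdio.h>

int main( void )
{
    int mask   = ( 1 << 9 ) | ( 1 << 3 );    /* movemask result for the aligned 16-byte load */
    int offset = 7;                           /* (uintptr_t) str & 0xf                        */

    mask >>= offset;                          /* the stray bit 3 (before str) is shifted off  */
    printf( "strlen = %d\n", __builtin_ctz( (unsigned int) mask ) );   /* prints "strlen = 2" */
    return 0;
}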
Note: I wasn't able to benchmark it against the current GLIBC AVX2 or SSE2 implementations, because I can't compile or test them. Having looked at the code, though, I believe they will be slower in general unless you are working with very large strings, because they read 4 XMM registers at a time, which is a large cost if you don't need the extra reads/data, and I believe shorter strings are more common! Their code size is also significantly larger!
My name is Trevor Herselman, and I give the developers of GCC, GLIBC and anyone else permission to use this algorithm under the MIT license. Please credit me where possible!
Benchmarks: 10,000,000,000 strings, warm cache, Core i5 3rd gen,
lower is better
The `GCC/GLIBC bit-twiddling` technique listed below is the current GCC/GLIBC implementation: a 1-byte pre-alignment test followed by 64-bit reads (8-byte long long).
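For reference, that bit-twiddling technique has roughly the following shape (my own reconstruction for illustration, not glibc's actual code): a 1-byte loop until the pointer is 8-byte aligned, then 64-bit reads using the classic "does this word contain a zero byte" trick.

#include <stddef.h>
#include <stdint.h>

size_t strlen_bittwiddle( const char *str )
{
    const char *p = str;

    for ( ; (uintptr_t) p & 7; p++ )          /* 1-byte pre-alignment loop */
        if ( *p == '\0' )
            return (size_t)( p - str );

    const uint64_t *wp = (const uint64_t *) p;
    for ( ; ; wp++ )
    {
        uint64_t v = *wp;
        /* nonzero iff some byte of v is zero */
        if ( ( v - 0x0101010101010101ULL ) & ~v & 0x8080808080808080ULL )
        {
            p = (const char *) wp;
            while ( *p != '\0' )              /* locate the exact zero byte */
                p++;
            return (size_t)( p - str );
        }
    }
}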
Aligned memory
0 byte length (empty string, only NULL terminator)
Traditional (naive) 1-byte test: 5.35s
GCC/GLIBC bit-twiddling: 14s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
1 byte length
Traditional (naive) 1-byte test: 9s
GCC/GLIBC bit-twiddling: 17s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
2 byte length
Traditional (naive) 1-byte test: 15s
GCC/GLIBC bit-twiddling: 20s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
3 byte length
Traditional (naive) 1-byte test: 19s
GCC bit-twiddle: 23s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
7 byte length (+1 byte NULL = 8 bytes total)
Traditional (naive) 1-byte test: 44s
GCC bit-twiddle: 33.4s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
8 byte length (+1 byte NULL = 9 bytes total)
Traditional (naive) 1-byte test: 46s
GCC bit-twiddle: 19s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
15 byte length (+1 byte NULL = 16 bytes total)
Traditional (naive) 1-byte test: 85s
GCC bit-twiddle: 34s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
16 byte length (+1 byte NULL = 17 bytes total)
Traditional (naive) 1-byte test: 85s
GCC bit-twiddle: 27s
A_strlen: 24s
My SSE2 strlen: 13.3s
128 byte length (+1 byte NULL = 129 bytes total)
Traditional (naive) 1-byte test: 718.1s
GCC bit-twiddle: 171.3s
A_strlen: 63s
My SSE2 strlen: 45.5s
Unaligned memory (1-byte offset)
0 bytes:
Traditional (naive) 1-byte test: 8s
GCC bit-twiddle: 9s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
NOTE: The GCC/GLIBC implementation is faster here because of its unaligned pre-test: it reads the first byte and returns immediately.
1 byte:
Traditional (naive) 1-byte test: 13.2s
GCC bit-twiddle: 15.9s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
2 bytes:
Traditional (naive) 1-byte test: 17.4s
GCC bit-twiddle: 19.3s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
3 bytes:
Traditional (naive) 1-byte test: 25.6s
GCC bit-twiddle: 24.7s
A_strlen: 27.5s
My SSE2 strlen: 10.5s
127 bytes:
Traditional (naive) 1-byte test: 715s
GCC bit-twiddle: 180s
A_strlen: 65s
My SSE2 strlen: 47s