


Optimized strlen() for x64 (SSE2-based strlen)


#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse2( const char *str )
{
    /* Round the start address down to a 16-byte boundary; every load below
       is therefore an aligned load that never crosses a page boundary. */
    const __m128i *ptr = (const __m128i *)( (uintptr_t) str & ~(uintptr_t) 0xf );
    const __m128i zero = _mm_setzero_si128();

    /* First (possibly partial) block: compare all 16 bytes against 0, then
       shift off the mask bits that belong to bytes before `str`. */
    int mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
    mask >>= ( (uintptr_t) str & 0xf );
    if ( mask ) return __builtin_ctz( (unsigned int) mask );
    ptr++;

    /* Remaining blocks: fully aligned 16-byte loads until a NUL is found. */
    for ( ; ; )
    {
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
        if ( mask ) return (size_t)( (const char *)ptr - str ) + __builtin_ctz( (unsigned int) mask );
        ptr++;
    }
    __builtin_unreachable();
}
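
A quick sanity check against the libc strlen(), covering a range of lengths and starting misalignments. This test is illustrative only; the buffer size and offsets are arbitrary, and it assumes strlen_sse2 above is in the same file:

#include <assert.h>
#include <stdlib.h>
#include <string.h>

int main( void )
{
    /* 16-byte aligned backing buffer so every (offset, length) pair stays in bounds. */
    char *buf = aligned_alloc( 16, 256 );
    if ( !buf ) return 1;
    for ( size_t offset = 0; offset < 16; offset++ )
    {
        for ( size_t len = 0; len < 200; len++ )
        {
            memset( buf + offset, 'x', len );
            buf[offset + len] = '\0';
            assert( strlen_sse2( buf + offset ) == strlen( buf + offset ) );
        }
    }
    free( buf );
    return 0;
}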


Written by Trevor Herselman


Features:


*) NO penalty for misaligned addresses (the `mask >>= unaligned bit shift` takes care of it)
*) NO alignment checks / jumps - ALL reads are already pre-aligned to 16 bytes
*) The cost of the misalignment handling is amortized inside the first 16-byte read, for both aligned and unaligned starting addresses
*) The function uses ONLY aligned reads (_mm_load_si128), NEVER a single unaligned read, not even for 1 byte
*) Plain C, so the compiler is free to optimize around the function when it is inlined
*) Small, tight code size: less than 1 CPU cache line (about 40 bytes) more machine code than the default bit-twiddling implementation, for a significant speed boost
*) Compatible with ALL x64 CPUs, because they are ALL required to support at least SSE2
*) NO misaligned-read penalties on older hardware
*) Faster than ALL the other SSE2 ~ SSE4.2 implementations I tested, especially for smaller strings
*) The size/cost/performance benefit starts at a string length of 2 bytes; meaning, if the average string length is 2 bytes or more, this implementation is faster than any other method I benchmarked
*) Benchmarked against GLIBC (bit-twiddling), Agner Fog's A_strlen() and all other methods I could find
*) Needs no (4K) page-boundary checks and never reads beyond the page that contains the NULL terminator, because an aligned 16-byte load can never cross a page boundary

The main benefit is that there are NO alignment checks! Any mask bits that correspond to bytes before the start of the string are simply shifted off the first mask. So there is just a single `shr` instead of an `if (str & 0xF)` check compiled to a `cmp` and a conditional jump. If the address is already aligned, the shift is by 0 and does nothing, but either way the alignment branch is eliminated.
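
To make the shift concrete, here is a small worked example with a made-up address and byte values (purely illustrative, not part of the benchmarked code):

#include <assert.h>
#include <stdint.h>

int main( void )
{
    /* Suppose str == 0x1003: the aligned load covers 0x1000..0x100f, so the
       first 3 bytes of the block lie before the string. */
    uintptr_t str_addr = 0x1003;
    unsigned misalign = (unsigned)( str_addr & 0xf );   /* 3 */

    /* Pretend byte 0x1001 happened to be 0: cmpeq + movemask would set bit 1,
       a "phantom" NUL match in the bytes that precede the string. */
    unsigned mask = 1u << 1;

    /* The shift throws away exactly the lanes before str, so the phantom match
       disappears and only NUL bytes at or after str can ever be reported. */
    mask >>= misalign;
    assert( mask == 0 );
    return 0;
}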

Note: I wasn't able to benchmark it against the current GLIBC AVX2 or SSE2 implementations, because I can't compile or test them. Having looked at their code, though, I believe they will be slower in general unless you are working with very large strings, because they read 4 XMM registers at a time; that is a significant cost when the extra data is never needed, and I believe shorter strings are the more common case. Their code size is also significantly larger.


My name is Trevor Herselman, and I give the developers of GCC, GLIBC and anyone else permission to use this algorithm under the MIT license. Please credit me where possible!



Benchmarks: 10,000,000,000 strings, warm cache, 3rd-gen Core i5; lower is better.

The `GCC/GLIBC bit-twiddling` technique listed below is the current
GCC/GLIBC implementation with a 1-byte pre-alignment test and 64-bit
(8-byte long long) reads.
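
The harness itself isn't included here; the sketch below is only the kind of timing loop I have in mind, with an illustrative iteration count and timing calls rather than the exact ones behind the numbers below (it assumes strlen_sse2 from above is in the same file):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main( void )
{
    enum { ITERS = 100000000 };          /* the real runs used 10,000,000,000 */
    static char buf[256];
    memset( buf, 'x', 15 );              /* 15-byte string + NULL terminator */

    const char * volatile p = buf;       /* stop the call from being hoisted out of the loop */
    volatile size_t sink = 0;            /* stop the results from being optimized away */

    struct timespec t0, t1;
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    for ( long i = 0; i < ITERS; i++ )
        sink += strlen_sse2( p );
    clock_gettime( CLOCK_MONOTONIC, &t1 );

    double secs = ( t1.tv_sec - t0.tv_sec ) + ( t1.tv_nsec - t0.tv_nsec ) / 1e9;
    printf( "%.2fs (checksum %zu)\n", secs, (size_t) sink );
    return 0;
}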

Aligned memory

0 byte length (empty string, only NULL terminator)

Traditional (naive) 1-byte test: 5.35s
GCC/GLIBC bit-twiddling: 14s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

1 byte length

Traditional (naive) 1-byte test: 9s
GCC/GLIBC bit-twiddling: 17s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

2 byte length

Traditional (naive) 1-byte test: 15s
GCC/GLIBC bit-twiddling: 20s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

3 byte length

Traditional (naive) 1-byte test: 19s
GCC/GLIBC bit-twiddling: 23s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

7 byte length (+1 byte NULL = 8 bytes total)

Traditional (naive) 1-byte test: 44s
GCC/GLIBC bit-twiddling: 33.4s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

8 byte length (+1 byte NULL = 9 bytes total)

Traditional (naive) 1-byte test: 46s
GCC/GLIBC bit-twiddling: 19s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

15 byte length (+1 byte NULL = 16 bytes total)

Traditional (naive) 1-byte test: 85s
GCC/GLIBC bit-twiddling: 34s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

16 byte length (+1 byte NULL = 17 bytes total)

Traditional (naive) 1-byte test: 85s
GCC/GLIBC bit-twiddling: 27s
A_strlen: 24s
My SSE2 strlen: 13.3s


128 byte length (+1 byte NULL = 129 bytes total)

Traditional (naive) 1-byte test: 718.1s
GCC/GLIBC bit-twiddling: 171.3s
A_strlen: 63s
My SSE2 strlen: 45.5s



Unaligned memory (1-byte offset)

0 bytes:

Traditional (naive) 1-byte test: 8s
GCC/GLIBC bit-twiddling: 9s
A_strlen: 27.5s
My SSE2 strlen: 10.5s

NOTE: The GCC/GLIBC implementation is faster here because of its
pre-alignment test: it reads the first byte, sees the NULL terminator,
and returns immediately.

1 byte:

Traditional (naive) 1-byte test: 13.2s
GCC/GLIBC bit-twiddling: 15.9s
A_strlen: 27.5s
My SSE2 strlen: 10.5s


2 bytes:

Traditional (naive) 1-byte test: 17.4s
GCC/GLIBC bit-twiddling: 19.3s
A_strlen: 27.5s
My SSE2 strlen: 10.5s


3 bytes:

Traditional (naive) 1-byte test: 25.6s
GCC/GLIBC bit-twiddling: 24.7s
A_strlen: 27.5s
My SSE2 strlen: 10.5s


127 bytes:

Traditional (naive) 1-byte test: 715s
GCC/GLIBC bit-twiddling: 180s
A_strlen: 65s
My SSE2 strlen: 47s

