This describes how string functions are optimized on x64 and tradeoffs taken.
There are two common usage patterns. First is filename handling(or dictionary of words). Second is split file into lines and handle them separately. In both cases strings encountered are small. Headers below exploit these statistical properties.
String functions generally scan string and do something. They share much structure. When distribution of strings is taken into account I found templates below most effective.
Templates can be compiled with gcc -S and then fill in logic into assembly.
Basic loop
if (((unsigned long)s) & 4095 <4096-64) /* This is tight bound as we align string to 16 bytes. */ { // Test first 16 unaligned bytes. s += 16; s = ALIGN_DOWN (s, 16); finish: // Test 64 aligned bytes. } else { s = ALIGN_DOWN (s, 64); // Test 64 aligned bytes and use mask for bytes before start. } s = ALIGN_DOWN (s, 64); while (1) { s += 64; // Do quick test that assumes match is unlikely. if (test) goto finish; }
Basic loop with limit
For strn* functions handling end condition can be tricky.
Following code is mainly designed to minimize difference from non-n version. Second priority is small overhead on expected path.
if (n >= 80 && (((unsigned long)s) & 4095 < 4096 - 64)) { // Test first 16 unaligned bytes. s += 16; s = ALIGN_DOWN (s, 16); // Test 64 aligned bytes. } else { s = ALIGN_DOWN (s, 64); finish: if (n == 0) return // Special case when we do not read anything. // Test 64 aligned bytes and use mask for bytes before start. // Include test for end condition. } s = ALIGN_DOWN (s, 64); char *end = s + n - 1; // We need -1 not to read past page boundary. if (end < s) end = 0; // Needed to handle strnfoo(s,SIZE_MAX) and similar. char *end_aligned = ALIGN_DOWN (end, 64); while (1) { s += 64 if (s == end_aligned) goto finish; // Do quick test that assumes match is unlikely. if (test) goto finish; }
Comments
Unaligned loads - on older architectures slow but faster than alternatives. This applies only for first 16 bytes where we minimize latency.
Prefetching - It helps for big strings but most loops that I considered do more harm than good by unnecessary prefetch and slower running time for shorter strings. As shorter strings are more likely I decided not use it. (A possible way is enable it only by optimizer based on profile feedback.)
Specific functions
memset
See http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile/results/result.html for memset characteristic.
This charecteristic suggests that primary use of memset is in calloc/malloc+memset.
These informations give us following optimization: As in memset on my load both dest and dest+n are almost always 8-byte aligned and 16-byte aligned in 83.01% of cases we can use it.
If machine allows unaligned loads then using them will not in common case incur additional runtime cost.
For machines without unaligned loads we first check if ((int)dest|n) is multiple of 8 and then we can use 8-byte loads.