This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] improve generic strlen.
- From: Rich Felker <dalias at libc dot org>
- To: libc-alpha at sourceware dot org
- Date: Tue, 23 Dec 2014 14:40:44 -0500
- Subject: Re: [RFC] improve generic strlen.
- Authentication-results: sourceware.org; auth=none
- References: <20141223185714 dot GA7889 at domone>
On Tue, Dec 23, 2014 at 07:57:14PM +0100, OndÅej BÃlka wrote:
> Hi
>
> As I promised earlier here is working version of improved strlen.
> For long strings on x64 its twice as fast as current C version. Or 50%
> slower when there is 1/128 probability of character 128.
>
> For simplicity it assumes unaligned loads. For aligned ones I would need
> to add define to mask first bytes.
>
> Could you try this on arm or other arch to see what improvement is
> possible?
>
> Also while header is most important it needs platform specific
> optimizations so if this is improvement next step is take assembly and
> fix gcc mistakes.
FYI, the fastest strlen I've found on 32-bit x86 (and perhaps other
32-bit archs; I don't remember) is this:
size_t unrolled_strlen(const char *s) {
const char *p = s;
for (;;) {
if (!p[0]) return p-s;
if (!p[1]) return p-s+1;
if (!p[2]) return p-s+2;
if (!p[3]) return p-s+3;
if (!p[4]) return p-s+4;
if (!p[5]) return p-s+5;
if (!p[6]) return p-s+6;
if (!p[7]) return p-s+7;
if (!p[8]) return p-s+8;
if (!p[9]) return p-s+9;
if (!p[10]) return p-s+10;
if (!p[11]) return p-s+11;
if (!p[12]) return p-s+12;
if (!p[13]) return p-s+13;
if (!p[14]) return p-s+14;
if (!p[15]) return p-s+15;
p += 16;
}
}
The nice thing is that it really is pure C -- no UB hacks like reading
past the end of the string or aliasing violations. Sadly it can't keep
up with larger-than-32-bit vectorization though, but it's still pretty
close. Also sadly GCC cannot do the unrolling properly automatically;
it generates some very slow code with loop overhead repeated between
every load-and-branch.
On my atoms it runs at 1.15 cycles per byte. I don't have the figures
for other cpus handy.
Rich