This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [RFC] Statistics of non-ASCII characters in strings

From: Richard Earnshaw <rearnsha at arm dot com>
To: Florian Weimer <fweimer at redhat dot com>, Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
Date: Tue, 23 Dec 2014 16:13:25 +0000
Subject: Re: [RFC] Statistics of non-ASCII characters in strings
Authentication-results: sourceware.org; auth=none
References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <54997DBF dot 6070305 at redhat dot com>

On 23/12/14 14:35, Florian Weimer wrote:
> On 12/22/2014 03:46 PM, Wilco Dijkstra wrote:
>> Does anyone have statistics of how often strings contain non-ASCII characters? I'm asking because
>> it's feasible to make many string functions faster if they are predominantly ASCII by using a
>> different check for the null byte.
> 
> Why can't you do the equivalent of
> 
>    X = ((X & 0x80) >> 1) | (X & 0x7F);
> 
> before the new check?  Does this lengthen the dependency chain too much?
> 

That sequence would (at least on AArch64) require 4 instructions, only
two of which could be dual-issued.

A better sequence (given that we have a bit-clear operation (AND A ~B)) is

(X - 1) & (~X) & 0x80

Which can be transformed into

(X - 1) & ~(X | 0x7f)

That's just three instructions, two of which can be done in parallel,
and all of which are trivial.

The point of this question is that we clearly know that if strings very
rarely have any non-ascii characters we can test blocks of characters
with even fewer instructions.

It doesn't matter much if strings nearly always have non-ascii
characters either, since the branch handling the switch to full 8-bit
chars will be very predictable (always taken).  The problem case is when
strings have a moderately low number of non-ascii characters, such that
the probability of any hunk of data containing a non-ascii char is ~50%.
 At that point the switch behaviour branch becomes very expensive since
it can't be accurately predicted by hardware.

References:
- [RFC] Statistics of non-ASCII characters in strings
  - From: Wilco Dijkstra
- Re: [RFC] Statistics of non-ASCII characters in strings
  - From: Florian Weimer

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]