This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: Rich Felker <dalias at libc dot org>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 22 Dec 2014 17:50:44 -0500
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- Authentication-results: sourceware.org; auth=none
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <E1Y34Yu-0004LC-KH at fencepost dot gnu dot org> <A610E03AD50BFC4D95529A36D37FA55E38C7897324 at GEORGE dot Emea dot Arm dot com>
On Mon, Dec 22, 2014 at 10:18:17PM +0000, Wilco Dijkstra wrote:
> > Before even bothering to research this I think you should have numbers
> > on how much faster it would make these functions. I don't think the
> > difference is noteworthy.
> I already have an implementation of it in a highly optimized strlen for AArch64,
> and the result of the trick is ~30% speedup on long strings (overall speedup >80%
> on average on random sized/aligned strings vs the existing strlen).
> > In any case, I think it would be a regression for programs processing
> > large volumes of non-English text to become slower just because
> > someone thought it would be clever to optimize for ASCII only...
> There is no slowdown for non-ASCII; it just doesn't get a speed boost. So the choice
> is 75% of the original time for ASCII and 100% for non-ASCII, versus 100% of the time
> without the optimization. If ASCII is used for 90% of the strings then that's a
> great speedup, but if it's only 10% then it doesn't seem worth it.
> So I want to test it with a realistic set of strings (not just 100% ASCII or 100%
> non-ASCII like the braindead GLIBC test). Obviously it's going to be 100% ASCII in
> any English speaking country, so there you always get the full speedup, but I have
> no idea what it would be on a Chinese or Japanese Linux PC. This seems like
> something that must have come up before, so that's why I asked.
It's not quite clear to me from your reply, but I get the impression
you're comparing an ASCII-optimized strlen to a non-optimized one,
rather than comparing it to the alternate optimization that also works
for non-ASCII bytes. This is a standard implementation trade-off, not
AArch64-specific, and in general the right answer is to use the
slightly more expensive code that works for all bytes rather than the
version that has to take a slow-path when it gets a false-positive nul
terminator on non-ASCII bytes.