This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [RFC] Statistics of non-ASCII characters in strings

From: Rich Felker <dalias at libc dot org>
To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
Cc: libc-alpha at sourceware dot org
Date: Mon, 22 Dec 2014 19:25:21 -0500
Subject: Re: [RFC] Statistics of non-ASCII characters in strings
Authentication-results: sourceware.org; auth=none
References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <E1Y34Yu-0004LC-KH at fencepost dot gnu dot org> <A610E03AD50BFC4D95529A36D37FA55E38C7897324 at GEORGE dot Emea dot Arm dot com> <20141222225044 dot GU4574 at brightrain dot aerifal dot cx> <A610E03AD50BFC4D95529A36D37FA55E38C7897326 at GEORGE dot Emea dot Arm dot com>

On Mon, Dec 22, 2014 at 11:53:39PM +0000, Wilco Dijkstra wrote:
> Rich Felker wrote:
> > It's not quite clear to me from your reply, but I get the impression
> > you're comparing an ASCII-optimized strlen to a non-optimized one,
> > rather than comparing it to the alternate optimization that also works
> > for non-ASCII bytes. This is a standard implementation trade-off, not
> > AArch64-specific, and in general the right answer is to use the
> > slightly more expensive code that works for all bytes rather than the
> > version that has to take a slow-path when it gets a false-positive nul
> > terminator on non-ASCII bytes.
> 
> No, the comparison is between the existing optimized version (which
> already processes 16 characters per iteration) and an even more optimized
> version. Some of the speedup is due to my trick to process ASCII strings faster.
> Non-ASCII strings are still at least as fast as the original version (usually much
> faster), so it's definitely not a slow path, but rather a fast path plus an
> extremely fast path.

Can you clarify how you're doing this? Since you can't know a priori
whether the data will be ASCII or not, I don't understand how you can
first use a test that gives the wrong result for non-ASCII and then a
secondary test to get the right result without hurting performance for
non-ASCII.
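
To make the trade-off under discussion concrete, here is a minimal sketch in
C of the two word-at-a-time tests and of one conceivable way to chain them.
It is an illustration only, not the AArch64 code being discussed: the names
ONES, HIGHS, load8, cheap_test, exact_test and two_tier_len are hypothetical,
and the buffer is assumed to be padded to a multiple of 8 bytes so that no
load runs past its end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Load 8 bytes without alignment or aliasing problems. */
static uint64_t load8(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Cheap test: nonzero if the word contains a zero byte, but also
   nonzero (a false positive) if any byte is 0x81 or above. */
static uint64_t cheap_test(uint64_t x)
{
    return (x - ONES) & HIGHS;
}

/* Exact test: nonzero if and only if the word contains a zero byte.
   Costs one extra operation (the "& ~x") per word. */
static uint64_t exact_test(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Hypothetical two-tier scan. Assumes cap is a multiple of 8; returns
   cap if no zero byte is found within the first cap bytes. */
static size_t two_tier_len(const unsigned char *buf, size_t cap)
{
    size_t i = 0;
    while (i < cap) {
        uint64_t w = load8(buf + i);
        /* Run the exact test only when the cheap filter fires. */
        if (cheap_test(w) && exact_test(w))
            break;
        i += 8;
    }
    /* Find the terminator bytewise inside the flagged word. */
    while (i < cap && buf[i] != 0)
        i++;
    return i;
}
```

In this sketch a pure-ASCII word costs only the cheap test, while a word
containing bytes of 0x81 or above pays for both tests, which is the extra
work for non-ASCII data that the question above is about.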

> The really fast path is well worth it for any English-speaking country; however,
> I'm wondering how often it would be used in other cases. I'd like to base the
> decision on some hard numbers rather than assuming that the Japanese will
> still call their Linux directories /usr/local/bin etc.

Of course the standard filesystem paths are the same, but the result
of things like strlen("/usr/local/bin") is not the important case.
What's much more likely to matter is string operations encountered
while processing symbol tables/relocations, parsing source code or
text-based data, etc. These are cases where you're going to encounter
a lot of ASCII-only strings regardless of the user's language.

> (I believe Ondřej had some
> results that show string functions are most often used on path names).

Even if this is true, it's likely to be irrelevant, since strlen calls on
pathnames are almost certainly accompanied by syscalls on those
pathnames, and the syscall overhead dominates the cost of the strlen.
(I wonder if that result was for the kernel rather than userspace,
though?)

Rich
