This is the mail archive of the mailing list for the glibc project.


Re: [RFC] Statistics of non-ASCII characters in strings

On Mon, Dec 22, 2014 at 11:53:39PM +0000, Wilco Dijkstra wrote:
> Rich Felker wrote:
> > It's not quite clear to me from your reply, but I get the impression
> > you're comparing an ASCII-optimized strlen to a non-optimized one,
> > rather than comparing it to the alternate optimization that also works
> > for non-ASCII bytes. This is a standard implementation trade-off, not
> > Aarch64-specific, and in general the right answer is to use the
> > slightly more expensive code that works for all bytes rather than the
> > version that has to take a slow-path when it gets a false-positive nul
> > terminator on non-ASCII bytes.
> No, the comparison is between the existing optimized version (which
> already processes 16 characters per iteration) and an even more optimized
> version. Some of the speedup is due to my trick to process ASCII strings faster.
> Non-ASCII strings are still at least as fast as the original version (usually much
> faster), so it's definitely not a slow path, but rather a fast path plus an
> extremely fast path.
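
[Illustration only: the generic word-at-a-time trade-off described in the
quoted text, sketched in C. This is not the AArch64 code under discussion,
whose details are not given in this message. The names ONES, HIGHS,
maybe_zero_byte, has_zero_byte and load8 are hypothetical. The cheap test
saves one operation but can also fire on bytes with the top bit set, so a
hit has to be confirmed with the exact test.]

```c
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/* Cheap test: nonzero whenever the word contains a zero byte, but it can
   also be nonzero when some byte has its top bit set (non-ASCII data). */
static uint64_t maybe_zero_byte(uint64_t x)
{
    return (x - ONES) & HIGHS;
}

/* Exact test: nonzero if and only if the word contains a zero byte.
   The extra "& ~x" masks out bytes whose top bit was already set. */
static uint64_t has_zero_byte(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Load 8 bytes without alignment or aliasing problems. */
static uint64_t load8(const char *s)
{
    uint64_t x;
    memcpy(&x, s, sizeof x);
    return x;
}
```

[With these helpers, the 8 bytes "usrlocal" pass both tests as "no nul";
the 8 bytes "caf\xc3\xa9" "xyz" (UTF-8 for "café" followed by "xyz") make
maybe_zero_byte report a possible nul while has_zero_byte correctly reports
none; and a word that really contains a nul makes both tests fire.]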

Can you clarify how you're doing this? Since you can't know a priori
whether the data will be ASCII or not, I don't understand how you can
first use a test that gives the wrong result for non-ASCII and then a
secondary test to get the right result without hurting performance for

> The really fast path is well worth it for any English-speaking country; however,
> I'm wondering how often it would be used in other cases. I'd like to base the
> decision on some hard numbers rather than assuming that the Japanese will
> still call their Linux directories /usr/local/bin etc.

Of course the standard filesystem paths are the same, but the result
of things like strlen("/usr/local/bin") is not the important case.
What's much more likely to matter is string operations encountered
processing symbol tables/relocations, parsing source code or
text-based data, etc. These are cases where you're going to encounter
a lot of ASCII-only strings regardless of the user's language.

> (I believe Ondřej had some
> results that show string functions are most often used on path names).

Even if this is true, it's likely to be irrelevant, since strlen calls on
pathnames are almost certainly accompanied by syscalls on those
pathnames, and the syscall overhead dominates the cost of the strlen.
(I wonder if that result was for the kernel rather than userspace,

