This is the mail archive of the
mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: Rich Felker <dalias at libc dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Wilco Dijkstra <wdijkstr at arm dot com>, libc-alpha at sourceware dot org
- Date: Wed, 24 Dec 2014 12:30:04 -0500
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <20141224130834 dot GA20212 at domone>
On Wed, Dec 24, 2014 at 02:08:34PM +0100, Ondřej Bílka wrote:
> On Mon, Dec 22, 2014 at 02:46:24PM -0000, Wilco Dijkstra wrote:
> > Does anyone have statistics of how often strings contain non-ASCII characters? I'm asking because
> > it's feasible to make many string functions faster if they are predominantly ASCII by using a
> > different check for the null byte. So if say 80-90% of strings in strcpy/strlen are ASCII then it
> > would be well worth optimizing for it.
> I just realized that you do not have to worry about these, as you could
> use runtime profiling with zero overhead in the ASCII case.
> For that you need to add a PLT-rewriting function to the dynamic linker;
> without it, the overhead is a few cycles per call to check that a
> variable is zero.
> You can use this pattern. You need a fast way to get the time and to
> adjust the threshold; if getting the time is slow, you need to increase
> the number of false positives between successive checks.
This sounds like a horrible design. The alternative of having both loops
and switching to the universal loop as soon as the loop that gets
tricked by 0x80 has a false hit is much saner (no global state or
interaction with the PLT) and is probably going to give much better
overall performance, since applications that process large volumes of
both types of strings are not going to get stuck using the slower
strlen for all of them.
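
For illustration, here is a minimal sketch of the two-loop idea. It is not
code from glibc or any other libc. The names hybrid_strnlen, maybe_zero,
has_zero and load64 are made up for this example, and the cheap test shown
is just one plausible choice of "different check for the null byte". The
function is length-bounded (strnlen-like) only so that the sketch never
reads outside the caller's buffer; a real strlen would presumably rely on
aligned word loads instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Cheap test: fires if the word contains a zero byte, but also fires
   (a false hit) if it contains any byte in the range 0x81..0xFF. */
static int maybe_zero(uint64_t w) { return ((w - ONES) & HIGHS) != 0; }

/* Exact test: fires only if the word contains a zero byte. */
static int has_zero(uint64_t w) { return ((w - ONES) & ~w & HIGHS) != 0; }

/* Unaligned-safe 8-byte load. */
static uint64_t load64(const char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Hypothetical bounded strlen: s must point to at least maxlen bytes. */
size_t hybrid_strnlen(const char *s, size_t maxlen)
{
    size_t i = 0;

    assert(s != NULL);

    /* Fast loop: assumes ASCII and uses only the cheap test. */
    while (i + 8 <= maxlen) {
        uint64_t w = load64(s + i);
        if (maybe_zero(w)) {
            if (has_zero(w))
                goto tail;          /* real terminator in this word */
            i += 8;                 /* false hit: non-ASCII byte seen */
            goto universal;         /* switch loops for this call only */
        }
        i += 8;
    }
    goto tail;

universal:
    /* Universal loop: exact test, correct for any byte values. */
    while (i + 8 <= maxlen) {
        if (has_zero(load64(s + i)))
            break;
        i += 8;
    }

tail:
    /* Locate the terminator (or the bound) byte by byte. */
    while (i < maxlen && s[i] != '\0')
        i++;
    return i;
}
```

The decision to leave the fast loop is local to a single call, so there is
no global state and nothing to rewrite in the PLT: the next call starts in
the fast loop again regardless of what earlier strings looked like.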