This is the mail archive of the
mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Wilco Dijkstra <wdijkstr at arm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 23 Dec 2014 11:44:21 +0100
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- Authentication-results: sourceware.org; auth=none
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com>
On Mon, Dec 22, 2014 at 02:46:24PM -0000, Wilco Dijkstra wrote:
> Does anyone have statistics of how often strings contain non-ASCII characters? I'm asking because
> it's feasible to make many string functions faster if they are predominantly ASCII by using a
> different check for the null byte. So if say 80-90% of strings in strcpy/strlen are ASCII then it
> would be well worth optimizing for it.
I do not know, as it depends on the encoding; you could collect that
percentage from the strcoll benchmark.
For string functions the ASCII/non-ASCII percentage alone is not enough;
a more refined statistic will tell you much more.
For strlen you only need to know the probability of byte 128, which is
quite small in practice.
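The thread does not spell out the exact check, so the following is only a
sketch of one word-at-a-time test with the property described above: it
looks at the low 7 bits of each byte, so it fires on byte 0 and on byte 128
and on nothing else. The names maybe_zero and sketch_strlen, and the
64-bit word size, are assumptions made for illustration; this is not the
glibc code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS (ONES * 0x80)   /* 0x80 in every byte */
#define LOW7  (ONES * 0x7f)   /* 0x7f in every byte */

/* Nonzero iff some byte of w is 0x00 or 0x80 (hypothetical helper).
   Per byte, (b & 0x7f) + 0x7f sets the high bit exactly when the low
   7 bits are nonzero, and the sum never carries into the next byte. */
static inline uint64_t maybe_zero(uint64_t w)
{
    return ~((w & LOW7) + LOW7) & HIGHS;
}

/* Illustrative strlen built on the approximate check. */
size_t sketch_strlen(const char *s)
{
    const char *p = s;

    /* Go byte by byte until p is word aligned. */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    for (;;) {
        uint64_t w;
        /* Aligned 8-byte load. It may read past the terminator inside
           the same aligned word, as word-at-a-time string code usually
           does; strictly speaking that is outside ISO C. */
        memcpy(&w, p, sizeof w);
        if (maybe_zero(w)) {
            /* Candidate word: confirm with an exact byte scan. */
            for (size_t i = 0; i < sizeof w; i++)
                if (p[i] == '\0')
                    return (size_t)(p - s) + i;
            /* False positive caused by a 0x80 byte; keep going. */
        }
        p += sizeof w;
    }
}
```

In this sketch the exact byte scan is only reached for words that contain a
0 or a 128 byte, so the cost of the false positives scales with how often
byte 128 shows up in the data.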
For strchr it's more tricky, as you need to know the probability of the
x/x+128 pair along with 0/128. Here the fact that x varies is an advantage:
for most pairs that ratio is small, so the weighted average will be limited.
You cannot have 11 characters each occurring with 10% probability.
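To make the x/x+128 pair concrete, here is a sketch under the same
assumption about the check (only the low 7 bits of each byte are tested).
XORing the word with the searched byte x turns x into 0 and x+128 into 128,
so the approximate test fires on either; the terminator test fires on 0 and
128. The names low7_is_zero and sketch_strchr are hypothetical, and this is
not the glibc implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte value into all 8 bytes of a word. */
#define BROADCAST(b) (UINT64_C(0x0101010101010101) * (uint64_t)(b))

/* Nonzero iff some byte of w is 0x00 or 0x80 (hypothetical helper). */
static inline uint64_t low7_is_zero(uint64_t w)
{
    const uint64_t low7 = BROADCAST(0x7f);
    return ~((w & low7) + low7) & BROADCAST(0x80);
}

/* Illustrative strchr built on the approximate check. */
char *sketch_strchr(const char *s, int c)
{
    const unsigned char ch = (unsigned char)c;
    const uint64_t pattern = BROADCAST(ch);
    const unsigned char *p = (const unsigned char *)s;

    /* Go byte by byte until p is word aligned. */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == ch)
            return (char *)p;
        if (*p == 0)
            return NULL;
        p++;
    }
    for (;;) {
        uint64_t w;
        /* Aligned load; may read past the terminator within the same
           aligned word, which is strictly outside ISO C. */
        memcpy(&w, p, sizeof w);
        /* First term: terminator candidates (bytes 0 and 128).
           Second term: match candidates (bytes ch and ch ^ 0x80). */
        if (low7_is_zero(w) | low7_is_zero(w ^ pattern)) {
            /* Confirm with an exact byte scan. */
            for (size_t i = 0; i < sizeof w; i++) {
                if (p[i] == ch)
                    return (char *)(p + i);
                if (p[i] == 0)
                    return NULL;
            }
            /* Only false positives in this word; keep going. */
        }
        p += sizeof w;
    }
}
```

The false-positive rate of this sketch depends on which x is searched for,
which is why a single ASCII/non-ASCII percentage does not predict it. The
last remark above also bounds it: the probabilities of all byte values sum
to 1, so at most 10 values can each reach 10%, and for most choices of x
the byte x+128 has to be rare.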
The ASCII/non-ASCII ratio would help to estimate strcasecmp
performance. Here the implementation already assumes that it's dealing
with ASCII; when it needs to convert non-ASCII it will be slow no matter what you
I have a generic C implementation of strlen using that trick; I will send