This is the mail archive of the mailing list for the glibc project.

Re: [RFC] Statistics of non-ASCII characters in strings

On Mon, Dec 22, 2014 at 02:46:24PM -0000, Wilco Dijkstra wrote:
> Does anyone have statistics of how often strings contain non-ASCII characters? I'm asking because
> it's feasible to make many string functions faster if they are predominantly ASCII by using a
> different check for the null byte. So if say 80-90% of strings in strcpy/strlen are ASCII then it
> would be well worth optimizing for it.
I do not know, as it depends on the encoding; you could collect that
percentage from the strcoll benchmark.

For string functions, the ASCII/non-ASCII percentage alone is not enough;
a more refined statistic will tell you much more.

For strlen you only need to know the probability of byte 128, which is
quite small in practice.
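
A minimal sketch of one word-at-a-time check that is consistent with the
byte-128 remark above. The thread does not spell out the exact trick, so the
mask below is an assumption: it flags a word when some byte is 0x00 or 0x80,
and a byte-wise pass then separates the real terminator from a 0x80 false
positive. The names maybe_zero_mask and sketch_strlen and the cap parameter
(readable buffer size, a multiple of 8) are hypothetical and only there to
keep the example free of out-of-bounds reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES UINT64_C(0x0101010101010101)
#define LOW7 (ONES * 0x7f)   /* 0x7f in every byte */
#define HIGH (ONES * 0x80)   /* 0x80 in every byte */

/* Assumed cheap check: nonzero iff some byte of x is 0x00 or 0x80.
   (b & 0x7f) + 0x7f never exceeds 0xfe, so no carry crosses a byte. */
static uint64_t maybe_zero_mask(uint64_t x)
{
    return ~((x & LOW7) + LOW7) & HIGH;
}

/* Hypothetical helper: buf holds a NUL-terminated string and has cap
   readable bytes, with cap a multiple of 8. Returns cap if no NUL. */
static size_t sketch_strlen(const char *buf, size_t cap)
{
    assert(cap % 8 == 0);
    for (size_t i = 0; i + 8 <= cap; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);          /* alignment-safe word load */
        if (maybe_zero_mask(w)) {
            /* Slow path: the word holds 0x00 or 0x80; check exactly. */
            for (size_t j = 0; j < 8; j++)
                if (buf[i + j] == '\0')
                    return i + j;
            /* Only 0x80 bytes were present: false positive, go on. */
        }
    }
    return cap;
}
```

With this shape the slow path is entered once for the terminator and once
per word containing byte 0x80, which is why only the frequency of that one
byte value matters for the cost.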

For strchr it is trickier, as you need to know the probability of the
x/x+128 pair along with that of 0/128. Here the fact that x varies is an
advantage: for most pairs that ratio is small, so the weighted average
will be limited. You cannot have 11 characters each occurring with 10%
probability.
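
To illustrate where the x/x+128 and 0/128 pairs come from, here is a sketch
under the same assumption about the check (a mask that fires on bytes 0x00
and 0x80). XOR-ing the word with the searched byte c turns a match into 0x00
and a byte equal to c+128 (mod 256) into 0x80, so both trigger the slow path;
the terminator test contributes the 0/128 pair. The names maybe_zero_mask
and sketch_strchr and the cap parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES UINT64_C(0x0101010101010101)
#define LOW7 (ONES * 0x7f)   /* 0x7f in every byte */
#define HIGH (ONES * 0x80)   /* 0x80 in every byte */

/* Assumed cheap check: nonzero iff some byte of x is 0x00 or 0x80. */
static uint64_t maybe_zero_mask(uint64_t x)
{
    return ~((x & LOW7) + LOW7) & HIGH;
}

/* Hypothetical helper: buf holds a NUL-terminated string and has cap
   readable bytes, with cap a multiple of 8. */
static const char *sketch_strchr(const char *buf, size_t cap, int c)
{
    assert(cap % 8 == 0);
    unsigned char uc = (unsigned char)c;
    uint64_t pattern = ONES * uc;        /* c broadcast to every byte */
    for (size_t i = 0; i + 8 <= cap; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);
        /* Fires on bytes c, c^0x80 (first term) and 0x00, 0x80 (second). */
        if (maybe_zero_mask(w ^ pattern) | maybe_zero_mask(w)) {
            for (size_t j = 0; j < 8; j++) {
                unsigned char b = (unsigned char)buf[i + j];
                if (b == uc)
                    return buf + i + j;  /* also handles c == 0 */
                if (b == 0)
                    return NULL;
            }
            /* Only c^0x80 or 0x80 bytes were present: false positive. */
        }
    }
    return NULL;
}
```

The slow path cost then depends on how often bytes c+128 and 128 occur in
the input, averaged over the values of c that callers actually search for.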

The ASCII/non-ASCII ratio would help to estimate strcasecmp
performance. Here the implementation already assumes that it is dealing
with ASCII; when it needs to convert non-ASCII it will be slow no matter what you

I have a generic C implementation of strlen using that trick; I will send
