This is the mail archive of the mailing list for the glibc project.


Re: [RFC] Statistics of non-ASCII characters in strings

On Tue, Dec 23, 2014 at 11:44:21AM +0100, Ondřej Bílka wrote:
> On Mon, Dec 22, 2014 at 02:46:24PM -0000, Wilco Dijkstra wrote:
> > Does anyone have statistics of how often strings contain non-ASCII characters? I'm asking because
> > it's feasible to make many string functions faster if they are predominantly ASCII by using a
> > different check for the null byte. So if say 80-90% of strings in strcpy/strlen are ASCII then it
> > would be well worth optimizing for it.
> > 
> I do not know, as it depends on the encoding; you could collect that
> percentage from the strcoll benchmark.
> For string functions the ASCII/non-ASCII percentage alone is not enough; a
> more refined statistic will tell you much more.
> For strlen you only need to know the probability of byte 128, which is
> quite small in practice.
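
For readers following along: the thread does not show the check being discussed. One candidate that matches the "only byte 128" description is the well-known word-at-a-time zero test with its final `| w` term dropped. The sketch below is an assumption about what is meant, not code from this thread or from glibc; the names `zero_exact`, `zero_or_128` and `example_word` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ONES UINT64_C(0x0101010101010101)
#define LOW7 (ONES * 0x7f)   /* 0x7f in every byte lane */
#define HIGH (ONES * 0x80)   /* 0x80 in every byte lane */

/* Exact test: sets the high bit of every lane whose byte is 0x00. */
static inline uint64_t zero_exact(uint64_t w)
{
    return ~(((w & LOW7) + LOW7) | w) & HIGH;
}

/* Cheaper test (one OR fewer): sets the high bit of every lane whose
   low seven bits are zero, i.e. lanes holding 0x00 or 0x80. */
static inline uint64_t zero_or_128(uint64_t w)
{
    return ~((w & LOW7) + LOW7) & HIGH;
}

/* Lanes, lowest first: 0x61 0x62 0x80 0x00 0x00 0x00 0x00 0x00. */
static inline void example_word(void)
{
    uint64_t w = UINT64_C(0x0000000000806261);
    assert(zero_exact(w)  == UINT64_C(0x8080808080000000));
    assert(zero_or_128(w) == UINT64_C(0x8080808080800000));
}
```

With a check of this shape, a strlen loop would presumably need to re-verify any flagged word with the exact test, so every byte 128 in the input costs an extra pass through the slow path.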

If it's that optimization, note that in UTF-8 all characters of the
bit form xxxx000000xxxxxx contain byte 128. There actually aren't many
languages made up entirely of such characters; apparently only
Burmese/Myanmar. Otherwise it's mostly punctuation and a small portion
of CJK characters. Of course characters where 128 is the low byte also
appear once every 64 positions throughout Unicode, and there are
non-PUA characters.
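
To make the bit pattern concrete, here is a minimal sketch that encodes a BMP code point as UTF-8 and reports whether any resulting byte equals 128 (0x80). The helper names `utf8_encode_bmp` and `has_byte_128` are hypothetical, and the encoder does not reject surrogates; it is only meant to illustrate the point above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point in U+0000..U+FFFF as UTF-8.
   Returns the number of bytes written (1 to 3). */
static inline size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* middle six bits */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* low six bits */
    return 3;
}

/* Nonzero if the UTF-8 encoding of cp contains the byte 0x80. */
static inline int has_byte_128(uint32_t cp)
{
    unsigned char buf[3];
    size_t n = utf8_encode_bmp(cp, buf);
    for (size_t i = 0; i < n; i++)
        if (buf[i] == 0x80)
            return 1;
    return 0;
}
```

For example, U+1000 (Myanmar) encodes as E1 80 80 and U+2014 (a punctuation character) as E2 80 94, so both contain byte 128, while U+4E2D encodes as E4 B8 AD and does not. U+0100 encodes as C4 80, which is the "low byte" case that recurs every 64 code points.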

Still I think you'd risk making things slower with this optimization.

> For strchr it's trickier, as you need to know the x/x+128 pair probability
> along with 0/128. Here the fact that x varies is an advantage, as for most
> pairs that ratio is small, so the weighted average will be limited.
> You cannot have 11 characters each occurring with 10% probability.
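
The x/x+128 pairing can be illustrated the same way. Assuming the approximate test is one that ignores the top bit of each byte (an assumption, since the thread does not show the code), XOR-ing the word with the search byte x in every lane and then applying that test flags lanes holding either x or x+128 (mod 256). The name `eq_or_flip128` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ONES UINT64_C(0x0101010101010101)
#define LOW7 (ONES * 0x7f)   /* 0x7f in every byte lane */
#define HIGH (ONES * 0x80)   /* 0x80 in every byte lane */

/* Sets the high bit of every lane of w whose byte equals c or c ^ 0x80.
   The XOR turns "equals c" into "equals 0"; the remaining test only
   looks at the low seven bits, hence the false positive on c ^ 0x80. */
static inline uint64_t eq_or_flip128(uint64_t w, unsigned char c)
{
    uint64_t x = w ^ (ONES * c);
    return ~((x & LOW7) + LOW7) & HIGH;
}

/* Lanes, lowest first: 'a' (0x61), 0xE1, 'b' (0x62), then five zero bytes. */
static inline void example_strchr_word(void)
{
    uint64_t w = UINT64_C(0x000000000062E161);
    /* Lane 0 is a true match for 'a'; lane 1 (0xE1 == 'a' ^ 0x80) is a
       false positive that an exact check would have to filter out. */
    assert(eq_or_flip128(w, 'a') == UINT64_C(0x0000000000008080));
}
```

A strchr built this way would also need the terminator test, which under the same assumption flags 0 and 128, so both pairs mentioned above contribute false positives.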

Is there any trivial transformation so that the affected byte would be
255 rather than 128? I ask because byte 255 will never appear in UTF-8
so it would not matter except for non-text strings (which are still a
valid usage of string functions) or people running legacy encodings
(and IMO these should be deprecated and not considered a performance

> The ASCII/non-ASCII ratio would help to estimate strcasecmp
> performance. Here the implementation already assumes that it's dealing with
> ASCII; when it needs to convert non-ASCII it will be slow no matter what you
> do.
> I have a generic C implementation of strlen using that trick; I will send
> it.

I'd be interested in seeing it.

