This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: Florian Weimer <fweimer at redhat dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>, Wilco Dijkstra <wdijkstr at arm dot com>, libc-alpha at sourceware dot org
- Date: Tue, 23 Dec 2014 17:33:04 +0100
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <54998EA5 dot 3020606 at redhat dot com>
On 12/23/2014 04:47 PM, Carlos O'Donell wrote:
> On 12/22/2014 09:46 AM, Wilco Dijkstra wrote:
>> Does anyone have statistics of how often strings contain non-ASCII
>> characters? I'm asking because it's feasible to make many string
>> functions faster if they are predominantly ASCII by using a different
>> check for the null byte. So if say 80-90% of strings in strcpy/strlen
>> are ASCII then it would be well worth optimizing for it.
>
> I don't know that anyone has this data.
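
For readers unfamiliar with the trick being discussed, here is a minimal sketch of what "a different check for the null byte" could look like. It is an illustration under assumptions, not glibc's code and not necessarily the exact check Wilco has in mind. The names has_zero_exact, has_zero_maybe and buf_find_nul are hypothetical. The idea: when every byte in a word is ASCII (high bit clear), the usual word-at-a-time zero test can drop one operation. The cheaper test never misses a zero byte, but it can fire spuriously on non-ASCII bytes, so a hit is confirmed with the exact test.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/* Exact test: nonzero if and only if some byte of x is zero. */
static inline uint64_t has_zero_exact(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Cheaper test (one operation fewer). It never misses a zero byte,
   but it may also fire on words that contain non-ASCII bytes. */
static inline uint64_t has_zero_maybe(uint64_t x)
{
    return (x - ONES) & HIGHS;
}

/* Hypothetical helper: index of the first NUL in buf[0..len), or len
   if there is none. All len bytes must be readable. A real strlen
   would handle alignment instead of taking a length. */
size_t buf_find_nul(const char *buf, size_t len)
{
    size_t i = 0;

    while (i + 8 <= len) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);   /* safe unaligned load */
        /* Fast path for ASCII words: only the cheap test runs.
           A hit is confirmed with the exact test. */
        if (has_zero_maybe(w) && has_zero_exact(w))
            break;
        i += 8;
    }
    /* Locate the NUL within the final word, or scan the tail. */
    while (i < len && buf[i] != '\0')
        i++;
    return i;
}
```

Whether the shorter fast path pays off depends on how often the confirmation step runs, which is exactly why the share of non-ASCII strings matters.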
The OpenJDK folks are collecting somewhat similar data as part of this
project:
<http://openjdk.java.net/jeps/8054307>
The question is slightly different (how many strings exist which contain
non-ASCII characters, and how many of them are not even ISO-8859-1?).
Even though the application behavior under consideration is less dynamic
(you can get that from a heap dump), it's difficult to obtain such data.
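
To make the three categories in that question concrete (ASCII only, fits in ISO-8859-1, neither), here is a rough sketch of a classifier one could run over a sample of C strings. It is not how the OpenJDK project gathers its data, since Java strings are UTF-16 and are inspected in heap dumps. This sketch ASSUMES the input is valid UTF-8, and the names str_class and classify_utf8 are made up for illustration. In valid UTF-8, code points up to U+00FF use only lead bytes 0xC2 and 0xC3, so any byte of 0xC4 or above means the string does not fit in ISO-8859-1.

```c
#include <stddef.h>

enum str_class { STR_ASCII, STR_LATIN1, STR_OTHER };

/* Classify a NUL-terminated string, ASSUMING it is valid UTF-8. */
enum str_class classify_utf8(const char *s)
{
    enum str_class c = STR_ASCII;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != 0; p++) {
        if (*p >= 0xC4)
            return STR_OTHER;   /* lead byte of a code point >= U+0100 */
        if (*p >= 0x80)
            c = STR_LATIN1;     /* part of a code point in U+0080..U+00FF */
    }
    return c;
}
```

Tallying the three results over the strings actually passed to strlen or strcpy would give the kind of statistic asked for at the start of the thread. Collecting a representative sample is the hard part.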
--
Florian Weimer / Red Hat Product Security