This is the mail archive of the
mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Wilco Dijkstra <wdijkstr at arm dot com>, libc-alpha at sourceware dot org
- Date: Tue, 23 Dec 2014 10:47:49 -0500
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com>
On 12/22/2014 09:46 AM, Wilco Dijkstra wrote:
> Does anyone have statistics of how often strings contain non-ASCII
> characters? I'm asking because it's feasible to make many string
> functions faster if they are predominantly ASCII by using a different
> check for the null byte. So if say 80-90% of strings in strcpy/strlen
> are ASCII then it would be well worth optimizing for it.
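
For readers unfamiliar with the idea, here is a minimal sketch of what "a different check for the null byte" could look like. It is not taken from glibc or from any posted patch. It assumes the optimization in question is the well-known word-at-a-time zero-byte test with the `& ~x` term dropped. The names `has_zero_exact`, `has_zero_or_high` and `sketch_strlen` are made up for illustration, and the `cap` parameter exists only so the sketch never reads outside the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/* Exact test: nonzero if and only if some byte of x is zero. */
static inline uint64_t has_zero_exact(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Cheaper test (one operation fewer): nonzero whenever some byte of x
   is zero, but it may also fire when a byte is above 0x80, i.e. on
   non-ASCII data.  It never misses a zero byte. */
static inline uint64_t has_zero_or_high(uint64_t x)
{
    return (x - ONES) & HIGHS;
}

/* Hypothetical strlen-like loop.  buf must have cap readable bytes and
   contain a NUL somewhere within them. */
static size_t sketch_strlen(const char *buf, size_t cap)
{
    size_t i = 0;

    assert(buf != NULL);
    while (i + 8 <= cap) {
        uint64_t w;
        memcpy(&w, buf + i, 8);          /* alignment-safe 8-byte load */
        /* Fast path: pure ASCII words without a NUL fail this test.
           Only when it fires do we pay for the exact test. */
        if (has_zero_or_high(w) && has_zero_exact(w))
            break;
        i += 8;
    }
    /* Locate the NUL byte by byte (also handles the tail). */
    while (i < cap && buf[i] != '\0')
        i++;
    return i;
}
```

The cheap test only pays off if the slow path is rare, which is why the fraction of ASCII-only strings in real workloads matters.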
I don't know that anyone has this data.
However, it brings us to a discussion on whole system benchmarking and
microbenchmarks.
In glibc we consciously chose to start small, and create a microbenchmark.
The community was not familiar with this idea (some individuals were),
and the goal was to move that forward to the point where we had consensus.
I think everyone agrees that microbenchmarks are useful now, and the
biggest disagreement is on the quality and process for making those
measurements. The microbenchmark, though, is unable to answer your question.
For that we need to add whole system benchmarking. Within the framework
of whole system benchmarking I see a future where the benchmark gathers
data on all parameters of arguments to glibc functions. This data can
be fed into a data-driven microbenchmark as workloads.
Your particular question is about the average workload, for which there
is no real consensus yet. Note that Ondrej has posted patches for a whole
system benchmarking framework based on his LD_PRELOAD libraries. I think
that or a systemtap-based framework are sensible solutions. I don't care
which goes forward really, but with such a path forward we might start
getting users to run the whole system benchmark in data-gathering mode
with a global LD_PRELOAD and provide us with raw or aggregate data.
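
To make the data-gathering idea concrete, here is a minimal sketch of the kind of per-string record such a mode could keep. It does not reproduce Ondrej's patches or any existing framework. The struct `str_stats` and the functions `stats_record`, `stats_ascii_fraction` and `stats_dump` are hypothetical names. In an LD_PRELOAD library, an interposed `strlen` or `strcpy` would presumably call something like `stats_record` before forwarding to the real function (for example one looked up with `dlsym(RTLD_NEXT, ...)`); that wiring is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical aggregate record; the field names are illustrative. */
struct str_stats {
    unsigned long calls;        /* number of strings seen */
    unsigned long non_ascii;    /* strings with at least one byte >= 0x80 */
    unsigned long total_bytes;  /* sum of string lengths */
};

/* Classify one NUL-terminated string and update the aggregate. */
static void stats_record(struct str_stats *st, const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    int saw_high = 0;

    assert(st != NULL && s != NULL);
    for (; p[n] != 0; n++)
        if (p[n] & 0x80)
            saw_high = 1;

    st->calls++;
    st->non_ascii += (unsigned long)saw_high;
    st->total_bytes += n;
}

/* Fraction of recorded strings that were pure ASCII (0.0 if none seen). */
static double stats_ascii_fraction(const struct str_stats *st)
{
    if (st->calls == 0)
        return 0.0;
    return 1.0 - (double)st->non_ascii / (double)st->calls;
}

/* Emit the aggregate in a simple text form a user could send upstream. */
static void stats_dump(const struct str_stats *st, FILE *out)
{
    fprintf(out, "calls=%lu non_ascii=%lu total_bytes=%lu ascii_fraction=%.3f\n",
            st->calls, st->non_ascii, st->total_bytes,
            stats_ascii_fraction(st));
}
```

Keeping only aggregates like these, rather than the strings themselves, is one way users could share data without exposing string contents.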
If anyone ever complains about our optimizations being skewed, our response
would be: Did you benchmark your workload and submit the data? No? Please
do so now so we can evaluate what is different about your workload.
All of this data can then be used as artificial inputs to the data-driven
microbenchmarks to see how the functions perform.
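
As a rough sketch of what "data-driven" could mean here, the function below times `strlen` over a list of string lengths that stands in for a recorded workload. It is not the glibc benchtests harness. The name `bench_strlen` and its parameters are assumptions made for illustration, and `clock()` is used only because it is standard C.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Run strlen over strings whose lengths come from a recorded workload.
   Returns elapsed CPU seconds, or -1.0 on allocation failure.  The sum
   of all returned lengths is stored in *checksum so the caller can
   verify that the work was really done. */
static double bench_strlen(const size_t *lens, size_t n, int iters,
                           size_t *checksum)
{
    /* Call through a volatile pointer so the compiler cannot fold or
       hoist the strlen calls out of the timed loop. */
    size_t (*volatile fn)(const char *) = strlen;
    size_t max = 0, sum = 0;
    char *buf;
    clock_t t0, t1;

    assert(lens != NULL && checksum != NULL);
    for (size_t i = 0; i < n; i++)
        if (lens[i] > max)
            max = lens[i];

    buf = malloc(max + 1);
    if (buf == NULL)
        return -1.0;
    memset(buf, 'a', max);
    buf[max] = '\0';

    t0 = clock();
    for (int it = 0; it < iters; it++)
        for (size_t i = 0; i < n; i++)
            /* The tail of buf starting here has length lens[i]. */
            sum += fn(buf + (max - lens[i]));
    t1 = clock();

    free(buf);
    *checksum = sum;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

A real harness would also want to replay recorded alignments and byte contents (ASCII or not), since this sketch varies only the length.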