This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: Wilco Dijkstra <wdijkstr at arm dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Tue, 23 Dec 2014 13:11:00 -0500
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <54998EA5 dot 3020606 at redhat dot com> <CAMe9rOogs+LDys9h=mcaFy0Q=ND28Fmqj2rB_JfyG217F1wEYQ at mail dot gmail dot com>
On 12/23/2014 11:50 AM, H.J. Lu wrote:
> On Tue, Dec 23, 2014 at 7:47 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 12/22/2014 09:46 AM, Wilco Dijkstra wrote:
>>> Does anyone have statistics of how often strings contain non-ASCII
>>> characters? I'm asking because it's feasible to make many string
>>> functions faster if they are predominantly ASCII by using a different
>>> check for the null byte. So if say 80-90% of strings in strcpy/strlen
>>> are ASCII then it would be well worth optimizing for it.
>>
>> I don't know that anyone has this data.
>>
>> However, it brings us to a discussion on whole system benchmarking and
>> data gathering.
>>
>> Your particular question is about the average workload, for which there
>> is no real consensus yet. Note that Ondrej has posted patches for a whole
>> system benchmarking framework based on his LD_PRELOAD libraries. I think
>> that or a systemtap-based framework are sensible solutions. I don't care
>> which goes forward really, but with such a path forward we might start
>> getting users to run the whole system benchmark in data-gathering mode
>> with a global LD_PRELOAD and provide us with raw or aggregate data.
>>
>
> You can use LD_AUDIT to collect such information on your
> system.
Agreed, that is another way to do it.
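
Whichever hook ends up being used (LD_PRELOAD, LD_AUDIT, or systemtap), the
statistic itself is simple to state. The following is only a rough sketch of
the per-string bookkeeping such a hook might do. The names ascii_stats,
has_non_ascii, record_string and report are hypothetical; they are not taken
from glibc or from any posted patch, and the interposition glue is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counters, for illustration only. */
struct ascii_stats {
    unsigned long total;      /* strings examined */
    unsigned long non_ascii;  /* strings with at least one byte >= 0x80 */
};

/* Return 1 if S contains a byte with the top bit set, 0 otherwise. */
static int has_non_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++)
        if (*p & 0x80)
            return 1;
    return 0;
}

/* Record one string; a data-gathering hook would call this once per
   intercepted strlen/strcpy argument (assumption: the hook exists). */
static void record_string(struct ascii_stats *st, const char *s)
{
    assert(st != NULL && s != NULL);
    st->total++;
    st->non_ascii += (unsigned long)has_non_ascii(s);
}

/* Print the aggregate figure that users could send back. */
static void report(const struct ascii_stats *st, FILE *out)
{
    double pct = st->total ? 100.0 * (double)st->non_ascii / (double)st->total
                           : 0.0;
    fprintf(out, "%lu of %lu strings (%.1f%%) contain non-ASCII bytes\n",
            st->non_ascii, st->total, pct);
}
```

Only two counters are kept, so the aggregate can be reported without
shipping any string contents back to us.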
Keep in mind this will be run by non-experts, so we need a lot more
fluffy stuff around the bits we deliver to help non-experts collect
data and return that to us.
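
For anyone wondering why the data matters: the thread does not spell out
which "different check for the null byte" is meant, so the following is only
one plausible reading, written as a sketch. It compares the usual exact
word-at-a-time zero-byte test with a cheaper one that drops the "& ~x" term.
The cheaper test never misses a zero byte, but it also fires on bytes above
0x80, so a hit has to be confirmed by the exact test. The function names
are hypothetical and a 64-bit word is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/* Exact test: nonzero if and only if some byte of X is zero. */
static inline uint64_t has_zero_exact(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Cheaper test: nonzero if some byte of X is zero, but also if some
   byte is above 0x80.  Zero for a word of nonzero ASCII bytes. */
static inline uint64_t has_zero_or_high(uint64_t x)
{
    return (x - ONES) & HIGHS;
}

/* Fast path for ASCII data, with the exact test as a fallback. */
static inline int word_has_zero(uint64_t x)
{
    if (has_zero_or_high(x) == 0)
        return 0;                   /* common case if strings are ASCII */
    return has_zero_exact(x) != 0;  /* rule out a non-ASCII false hit */
}

/* Small usage check on three sample words. */
static void self_check(void)
{
    assert(word_has_zero(UINT64_C(0x6162636465666768)) == 0); /* ASCII */
    assert(word_has_zero(UINT64_C(0x6162636400666768)) == 1); /* has NUL */
    assert(word_has_zero(UINT64_C(0x61626364c3666768)) == 0); /* 0xc3 byte */
}
```

The fallback is taken for every word that contains a byte above 0x80, which
is why the fraction of non-ASCII strings decides whether this pays off.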
Cheers,
Carlos.