This is the mail archive of the
mailing list for the glibc project.
Re: [RFC] further speedup of strcoll
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Leonhard Holz <leonhard dot holz at web dot de>, libc-alpha at sourceware dot org
- Date: Wed, 11 Feb 2015 13:16:39 -0500
- Subject: Re: [RFC] further speedup of strcoll
- Authentication-results: sourceware.org; auth=none
- References: <54DB22FA dot 3030208 at web dot de>
On 02/11/2015 04:38 AM, Leonhard Holz wrote:
> The strcoll method can probably be made faster by fast-forwarding to
> the position of the first difference of the two strings via bytewise
> comparison so that the expensive location-aware comparison has to
> deal with the interesting part of the strings only. But this has some
> implications so I ask for your feedback.
> First issue is that I do not know of a function which returns the
> position of the first difference of two strings. This functionality
> is actually included in strcmp but not accessible. I could implement
> a new __ function in C and later carve out platform dependent
> versions from strcmp.
It's always OK to start with an internal function.
> Then it needs to be encoding aware as for anything but 8bit encodings
> the bytewise comparison can stop in the middle of a multi-byte value.
> Actually there is no easy way get the encoding type (one could check
> for "UTF-8" in the locale name, but this is way to expensive) and
> probably __locale_t has to be extended with an int encoding_type that
> e.g. states 0 for other, 1 for 8-bit, 2 for UTF-8 and so on. Is this
The size and alignment of locale_t must not change. That's the hard
requirement from the ABI perspective.
> And finally it is necessary to check if the difference is inside a
> backward sequence since then there could be a difference after the
> found difference that is logically before it. And since there is no
> way to find the start of a backward sequence from within the
> comparison has to start from the beginning of the strings and the
> effort to get the position of the first difference is wasted. But
> this should be overweighted by the speedup in the other cases,
> especially if the strings are actually equal.
We'll need real data and benchmarking to validate that.