This is the mail archive of the
mailing list for the glibc project.
Re: [RFC][BZ #16009] Memory handling in strxfrm_l()
- From: OndÅej BÃlka <neleai at seznam dot cz>
- To: Paul Eggert <eggert at cs dot ucla dot edu>
- Cc: Leonhard Holz <leonhard dot holz at web dot de>, libc-alpha at sourceware dot org
- Date: Tue, 25 Nov 2014 17:21:03 +0100
- Subject: Re: [RFC][BZ #16009] Memory handling in strxfrm_l()
- Authentication-results: sourceware.org; auth=none
- References: <546DCC25 dot 4020808 at web dot de> <20141121040317 dot GT22465 at brightrain dot aerifal dot cx> <546F12A3 dot 5030204 at web dot de> <20141122233907 dot GA2516 at domone> <5472F9AF dot 3060005 at web dot de> <54737BB3 dot 9060102 at cs dot ucla dot edu>
On Mon, Nov 24, 2014 at 10:40:51AM -0800, Paul Eggert wrote:
> On 11/24/2014 01:26 AM, Leonhard Holz wrote:
> >strxfrm is a function for pre-computing things that need to be
> >fast somewhere else
> That may be true in theory, but in practice strxfrm is pretty much
> useless everywhere. When I tried to use it in GNU sort, I gave up
> in frustration, as strxfrm was soooo sllooooow that it was much
> better to simply use strcoll and be smart about it. I don't know of
> any practical application that uses glibc strxfrm in a real way, and
> my advice for optimizating strxfrm is to not bother, and to focus
> one's efforts on making strcoll go faster.
> Unless perhaps you're thinking of rewriting strxfrm from scratch, in
> which case we can talk about what's needed.
A strxfrm is flawed design as most of time one needs to know just first
ten characters to do comparison. As one cannot generate small prefix
because strxfrm needs to report size of entire string and content is
unspecificed by standard it needs to do lot of unnecessary work.
With strcoll improvement that I posted earlier a possible strcoll speedup
is quite big, when I tested a time of 'find | sort' command sort runs
around ten times faster. However is a use case with long common prefixes
which gives that speedup, it would be less when sorting english strings.
What would help is add another function, say strtrim. That function
would delete characters ignored by comparison by string. That would fix
weakness of my approach where comparing long prefixes that differ only
in ignored characters is still slow.
As sorting goes we should add a dedicated algorithm for strcoll, one
wants to use radixsort which is possible but it needs to access locale
specification to get weigths and deal with specific cases.