This is the mail archive of the
mailing list for the glibc project.
Thoughts on bug 15884
- From: Leonhard Holz <leonhard dot holz at web dot de>
- To: libc-alpha at sourceware dot org
- Date: Mon, 15 Sep 2014 10:41:10 +0200
- Subject: Thoughts on bug 15884
- Authentication-results: sourceware.org; auth=none
since I'm new to a the list: My name is Leonhard Holz and I'm a
German/Berliner software developer with some free time to spend on
hunting bugs in open source software.
Bug 15884 (https://sourceware.org/bugzilla/show_bug.cgi?id=15884)
complains about bad runtime performance of the strcoll function. This
probably has some relevance beyond this bug report as e.g. databases use
this function to sort string values. Previous investigators figured out
that the strlen function called to calculate the size of the cache for
the collation sequences is dominating the measured time of strcoll.
Thoughts on this:
* the strlen function is not needed by the comparison algorithm itself,
so it could be removed if
* the cache is removed
* the cache is resized during comparison as needed
* the first pass / ruleset is used to fill the cache but does not
benefit from it
* two strings have to be the same size and very similar (e.g. vary only
in accents or capitalization) to get more than one pass (please correct
if I got this wrong)
-> in most cases the cache will not be used (notice the difference to
strxfrm() which basicly uses the same algorithm)
* caching adds overhead to parsing & comparison
* caching could cause memory problems - it tries to allocate five times
the length of both strings
So I wonder if it's not the best option to completly remove the cache
feature. It will make some scenarios slower but probably most real world
scenarios a lot faster.
BTW: Does anybody have a benchmark script for strcoll()?