This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Thoughts on bug 15884


Hi everybody,

since I'm new to a the list: My name is Leonhard Holz and I'm a German/Berliner software developer with some free time to spend on hunting bugs in open source software.

Bug 15884 (https://sourceware.org/bugzilla/show_bug.cgi?id=15884) complains about bad runtime performance of the strcoll function. This probably has some relevance beyond this bug report as e.g. databases use this function to sort string values. Previous investigators figured out that the strlen function called to calculate the size of the cache for the collation sequences is dominating the measured time of strcoll.

Thoughts on this:

* the strlen function is not needed by the comparison algorithm itself, so it could be removed if
	* the cache is removed
	* the cache is resized during comparison as needed
* the first pass / ruleset is used to fill the cache but does not benefit from it * two strings have to be the same size and very similar (e.g. vary only in accents or capitalization) to get more than one pass (please correct if I got this wrong) -> in most cases the cache will not be used (notice the difference to strxfrm() which basicly uses the same algorithm)
* caching adds overhead to parsing & comparison
* caching could cause memory problems - it tries to allocate five times the length of both strings

So I wonder if it's not the best option to completly remove the cache feature. It will make some scenarios slower but probably most real world scenarios a lot faster.

BTW: Does anybody have a benchmark script for strcoll()?

Kind regards,
Leonhard


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]