This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Thoughts on bug 15884

On 09/15/2014 04:41 AM, Leonhard Holz wrote:
> Hi everybody,
> since I'm new to a the list: My name is Leonhard Holz and I'm a
> German/Berliner software developer with some free time to spend on
> hunting bugs in open source software.

Nice to meet you Leonhard.

> Bug 15884 (
> complains about bad runtime performance of the strcoll function. This
> probably has some relevance beyond this bug report as e.g. databases
> use this function to sort string values. Previous investigators
> figured out that the strlen function called to calculate the size of
> the cache for the collation sequences is dominating the measured time
> of strcoll.
> Thoughts on this:
> * the strlen function is not needed by the comparison algorithm itself, so it could be removed if
>     * the cache is removed
>     * the cache is resized during comparison as needed
> * the first pass / ruleset is used to fill the cache but does not benefit from it
> * two strings have to be the same size and very similar (e.g. vary only in accents or capitalization) to get more than one pass (please correct if I got this wrong)
> -> in most cases the cache will not be used (notice the difference to strxfrm() which basicly uses the same algorithm)
> * caching adds overhead to parsing & comparison
> * caching could cause memory problems - it tries to allocate five times the length of both strings
> So I wonder if it's not the best option to completly remove the cache
> feature. It will make some scenarios slower but probably most real
> world scenarios a lot faster.
> BTW: Does anybody have a benchmark script for strcoll()?

The reason why nobody has fixed this issue is that we don't
know what performance impact this will have in the real world.

I agree that removing the cache may result in fixing this
code, but the resulting performance loss is also unknown.

However, given that the original authors never bothered documenting
the use cases for the optimizations we have no knowledge.

Therefore in general I suggest:

(a) Write a microbenchmark for strcoll.
	- See benchtests/
	- Have the microbenchmark cover several languages.
	- Have the microbenchmark cover as many use cases as you can think.
	- Post microbenchmark patches for review and get feedback.

(b) Attempt to remove the cache and look at the benchmark behaviour.
	- Using the newly checked in microbenchmark compare:
		- Removing cache.
		- Profiling to see the hotest code and what can be changed.

The results of (b) will dictate what kind of solution you
plan to pursue.

Thus in the future when someone changes strcoll they will have
the microbenchmark as a reference for the use cases that mattered
at the time.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]