This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Thoughts on bug 15884


Hi folks,

thanks for the suggestions. Attached is a first attempt for a microbenchmark. It simply sorts words from a text file via strcoll. It's not adapted to the glibc build environment (yet).

As input I have choosen "lorem ipsum" texts since they mirror normal language and can be generated in many charsets / languages. See http://generator.lorem-ipsum.info.

Maybe there are other use case ideas or input sources? And should I vary the collation? Actually C.UTF-8 is choosen.

Best,
Leonhard

Am 15.09.2014 19:14, schrieb Carlos O'Donell:
On 09/15/2014 04:41 AM, Leonhard Holz wrote:
Hi everybody,

since I'm new to a the list: My name is Leonhard Holz and I'm a
German/Berliner software developer with some free time to spend on
hunting bugs in open source software.

Nice to meet you Leonhard.

Bug 15884 (https://sourceware.org/bugzilla/show_bug.cgi?id=15884)
complains about bad runtime performance of the strcoll function. This
probably has some relevance beyond this bug report as e.g. databases
use this function to sort string values. Previous investigators
figured out that the strlen function called to calculate the size of
the cache for the collation sequences is dominating the measured time
of strcoll.

Thoughts on this:

* the strlen function is not needed by the comparison algorithm itself, so it could be removed if
     * the cache is removed
     * the cache is resized during comparison as needed
* the first pass / ruleset is used to fill the cache but does not benefit from it
* two strings have to be the same size and very similar (e.g. vary only in accents or capitalization) to get more than one pass (please correct if I got this wrong)
-> in most cases the cache will not be used (notice the difference to strxfrm() which basicly uses the same algorithm)
* caching adds overhead to parsing & comparison
* caching could cause memory problems - it tries to allocate five times the length of both strings

So I wonder if it's not the best option to completly remove the cache
feature. It will make some scenarios slower but probably most real
world scenarios a lot faster.

BTW: Does anybody have a benchmark script for strcoll()?

The reason why nobody has fixed this issue is that we don't
know what performance impact this will have in the real world.

I agree that removing the cache may result in fixing this
code, but the resulting performance loss is also unknown.

However, given that the original authors never bothered documenting
the use cases for the optimizations we have no knowledge.

Therefore in general I suggest:

(a) Write a microbenchmark for strcoll.
	- See benchtests/
	- Have the microbenchmark cover several languages.
	- Have the microbenchmark cover as many use cases as you can think.
	- Post microbenchmark patches for review and get feedback.

(b) Attempt to remove the cache and look at the benchmark behaviour.
	- Using the newly checked in microbenchmark compare:
		- Removing cache.
		vs.
		- Profiling to see the hotest code and what can be changed.

The results of (b) will dictate what kind of solution you
plan to pursue.

Thus in the future when someone changes strcoll they will have
the microbenchmark as a reference for the use cases that mattered
at the time.

Cheers,
Carlos.

Attachment: strcoll_bench.tar.bz2
Description: Binary data


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]