This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
Hi folks,thanks for the suggestions. Attached is a first attempt for a microbenchmark. It simply sorts words from a text file via strcoll. It's not adapted to the glibc build environment (yet).
As input I have choosen "lorem ipsum" texts since they mirror normal language and can be generated in many charsets / languages. See http://generator.lorem-ipsum.info.
Maybe there are other use case ideas or input sources? And should I vary the collation? Actually C.UTF-8 is choosen.
Best, Leonhard Am 15.09.2014 19:14, schrieb Carlos O'Donell:
On 09/15/2014 04:41 AM, Leonhard Holz wrote:Hi everybody, since I'm new to a the list: My name is Leonhard Holz and I'm a German/Berliner software developer with some free time to spend on hunting bugs in open source software.Nice to meet you Leonhard.Bug 15884 (https://sourceware.org/bugzilla/show_bug.cgi?id=15884) complains about bad runtime performance of the strcoll function. This probably has some relevance beyond this bug report as e.g. databases use this function to sort string values. Previous investigators figured out that the strlen function called to calculate the size of the cache for the collation sequences is dominating the measured time of strcoll. Thoughts on this: * the strlen function is not needed by the comparison algorithm itself, so it could be removed if * the cache is removed * the cache is resized during comparison as needed * the first pass / ruleset is used to fill the cache but does not benefit from it * two strings have to be the same size and very similar (e.g. vary only in accents or capitalization) to get more than one pass (please correct if I got this wrong) -> in most cases the cache will not be used (notice the difference to strxfrm() which basicly uses the same algorithm) * caching adds overhead to parsing & comparison * caching could cause memory problems - it tries to allocate five times the length of both strings So I wonder if it's not the best option to completly remove the cache feature. It will make some scenarios slower but probably most real world scenarios a lot faster. BTW: Does anybody have a benchmark script for strcoll()?The reason why nobody has fixed this issue is that we don't know what performance impact this will have in the real world. I agree that removing the cache may result in fixing this code, but the resulting performance loss is also unknown. However, given that the original authors never bothered documenting the use cases for the optimizations we have no knowledge. Therefore in general I suggest: (a) Write a microbenchmark for strcoll. - See benchtests/ - Have the microbenchmark cover several languages. - Have the microbenchmark cover as many use cases as you can think. - Post microbenchmark patches for review and get feedback. (b) Attempt to remove the cache and look at the benchmark behaviour. - Using the newly checked in microbenchmark compare: - Removing cache. vs. - Profiling to see the hotest code and what can be changed. The results of (b) will dictate what kind of solution you plan to pursue. Thus in the future when someone changes strcoll they will have the microbenchmark as a reference for the use cases that mattered at the time. Cheers, Carlos.
Attachment:
strcoll_bench.tar.bz2
Description: Binary data
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |