Bug 18441 - Performance regression due to strcoll_l changes
Summary: Performance regression due to strcoll_l changes
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: locale (show other bugs)
Version: 2.21
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-05-21 07:26 UTC by Daniel Lichtenberger
Modified: 2019-05-09 20:44 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Thai sample text from wikipedia (UTF-8) (1.80 KB, text/plain)
2015-05-25 11:42 UTC, Daniel Lichtenberger
Details
Small strcoll benchmark with thai strings (728 bytes, text/x-csrc)
2015-06-07 16:39 UTC, Daniel Lichtenberger
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Lichtenberger 2015-05-21 07:26:45 UTC
After upgrading from Ubuntu 14.10 (glibc 2.19) to 15.04 (glibc 2.21), we are experiencing a big performance regression in one particular PostgreSQL index creation. The index creation now takes more than 10 minutes, while before it was done in about 30 seconds.

When attaching "perf record" to the postgres process while it creates the index, it shows 99.76% of the time spent in __strcoll_l. Given the recent changes in the strcoll_l implementation due to issue #15884, I assume that we are hitting some issue with the new implementation. 

I also checked an instance that runs on glibc 2.17 (CentOS 7). It spends 72% in get_next_seq, so it seems that the workload _is_ dominated by string collation - it's just that the new implementation is so much slower.

This also happens with a "portable" binary of PostgreSQL 9.1, so the postgres binary in Ubuntu is not to blame. 

The particular table has ~500k rows with strings in about 20 languages, including languages such as Chinese, Japanese, Russian, or Bulgarian. The string length varies between a few characters and goes up to about 5000 characters, the data is encoded as UTF-8, and the system locale is en_US.UTF-8.
Comment 1 Carlos O'Donell 2015-05-21 15:01:47 UTC
We have a microbenchmark for strcoll with a variety of inputs, and we are actively tracking improvements to that API.

For example, this enhancement:
https://sourceware.org/ml/libc-alpha/2015-04/msg00229.html

Is commit f13c2a8dff2329c6692a80176262ceaaf8a6f74e, and was checked in on May 12th.

Do you have any way to test glibc master? Like running your application on Fedora Rawhide?

Alternatively we'd love to get a indicative microbenchmark of the workloads your running, such that when we make future changes we'll know if it regresses for your workload.
Comment 2 Daniel Lichtenberger 2015-05-22 12:51:16 UTC
I ran it on Fedora Rawhide (at glibc development release 2.21.90), but it still seemed way slower than before.

I will try to come up with a meaningful sample of input data. Is the benchmark mentioned in the commit messages available in the glibc source tree?
Comment 3 Carlos O'Donell 2015-05-22 14:43:08 UTC
(In reply to Daniel Lichtenberger from comment #2)
> I ran it on Fedora Rawhide (at glibc development release 2.21.90), but it
> still seemed way slower than before.
> 
> I will try to come up with a meaningful sample of input data. Is the
> benchmark mentioned in the commit messages available in the glibc source
> tree?

Yes it is. It's the "benchtests" directory under the top-level glibc checkout. See benchtets/README for more information. The benchmark tests can be run after a glibc build, so you need to build glibc (but not install it) to run the tests against your built libraries. In Fedora/RHEL we're working on packaging the benchmark libraries so you can run them without needing to do a glibc build, that way you could test your hardware quickly if you found a performance regression.

To build glibc see:
https://sourceware.org/glibc/wiki/Testing/Builds
Though it should be as easy as configure/make.
Comment 4 Daniel Lichtenberger 2015-05-25 11:42:12 UTC
Created attachment 8332 [details]
Thai sample text from wikipedia (UTF-8)

Thanks for the link. Something bad happens with UTF-8 encoded text in languages like Thai or Korean - I added a new input file based on random text grabbed from http://th.wikipedia.org (see attachment) and I'm seeing a huge increase in runtime for bench-strcoll between 2.19 and master:

2.19 (on openSuse):

   "wikipedia-th#en_US.UTF-8": {
    "duration": 2.37528e+09,
    "iterations": 16,
    "mean": 1.48455e+08
   }

master (via ./testrun.sh):

   "wikipedia-th#en_US.UTF-8": {
    "duration": 1.72295e+11,
    "iterations": 16,
    "mean": 1.07684e+10
   }

Korean also shows a regression but less severe. The other inputs (lorem ipsum) show improvements similar to those in the commit messages. 

I used en_US as the locale because th_TH wouldn't work with my local build, and it also represents the PostgreSQL workload which uses a single locale when building the index.
Comment 5 Daniel Lichtenberger 2015-06-07 16:39:55 UTC
Created attachment 8347 [details]
Small strcoll benchmark with thai strings

Attached a small benchmark that does the simplest collation test possible - calling strcoll on two medium-sized TH (thai) strings in a loop. I hope the upload doesn't mangle the UTF-8 encoding string literals.

On my desktop machine (a Core i5-3570K) and with glibc 2.19, 2000 calls complete in 0.4s. With the current git master (commit 711f67a789ba3505ae7b071453763e06590aa245), this takes an astonishing 40 seconds - or 20 ms per strcoll call.

I also used other (Western) strings for testing to make sure it wasn't my local glibc build - they showed better performance in the current version, as expected.

It's hard to follow the implementation in strcoll_l.c without a deeper understanding of Unicode string collation rules - I don't know if the removal of rule caching caused this issue, or a change in the locale tables that are used in the algorithm.
Comment 6 Daniel Lichtenberger 2015-06-08 11:20:18 UTC
I ran my test program against the commit in question (0742aef6e52a935f9ccd69594831b56d807feef3, "strcoll: improve performance by removing the cache (#15884)"), and it already shows the performance regression.

The previous commit (ee54ce44cb734f18fec4f6ccdfbe997d2574321e) is fine.
Comment 7 Daniel Lichtenberger 2015-06-14 16:37:53 UTC
I believe that the performance issue is caused by this code in get_next_seq in string/strcoll_l.c:

/* XXX Traverse BACKW sequences from the beginning of
   BACKW_STOP to get the next sequence.  Is ther a quicker way
   to do this?  */
size_t i = backw_stop;
us = seq->back_us;
while (i < backw)
  {
    int32_t tmp = findidx (table, indirect, extra, &us, -1);
    idx = tmp & 0xffffff;
    i++;
  }
--backw;
us = seq->us;

This introduces exponential behaviour if the enclosing loop does not bail out early (because of a non-zero weight): the backw counter is only decreased by one, but the whole range (backw_stop until backw - 1) will be checked again in the next iteration.

I think that the version before 0742aef6e52a935f9ccd69594831b56d807feef3 did not have this issue as it memoized the indices into the weight array.
Comment 8 Daniel Lichtenberger 2015-10-05 20:01:37 UTC
FYI - there is a patch to address this performance issue:

https://sourceware.org/ml/libc-alpha/2015-07/msg00170.html

It seems difficult to get it merged, though.
Comment 9 Carlos O'Donell 2015-10-07 16:13:07 UTC
(In reply to Daniel Lichtenberger from comment #8)
> FYI - there is a patch to address this performance issue:
> 
> https://sourceware.org/ml/libc-alpha/2015-07/msg00170.html
> 
> It seems difficult to get it merged, though.

Reviewed.
Comment 10 Patraphong Sukhonpitumart 2018-07-03 05:04:43 UTC
I believe this regression still affects glibc 2.27, and as a result, PostgreSQL 10.3 on Ubuntu 16.04. Since our project involves a lot of Thai text sorts, might I ask whether and when the patch is ready to get merged, of course, at your convenience?
Comment 11 Florian Weimer 2018-07-03 07:53:53 UTC
(In reply to Patraphong Sukhonpitumart from comment #10)
> I believe this regression still affects glibc 2.27, and as a result,
> PostgreSQL 10.3 on Ubuntu 16.04. Since our project involves a lot of Thai
> text sorts, might I ask whether and when the patch is ready to get merged,
> of course, at your convenience?

PostgreSQL 10 supports ICU for collations:

https://www.postgresql.org/docs/10/static/sql-altercollation.html

Maybe this is an alternative until this bug is fixed (which could turn out very difficult to implement)?
Comment 12 Patraphong Sukhonpitumart 2018-07-04 16:51:29 UTC
(In reply to Florian Weimer from comment #11)
> (In reply to Patraphong Sukhonpitumart from comment #10)
> > I believe this regression still affects glibc 2.27, and as a result,
> > PostgreSQL 10.3 on Ubuntu 16.04. Since our project involves a lot of Thai
> > text sorts, might I ask whether and when the patch is ready to get merged,
> > of course, at your convenience?
> 
> PostgreSQL 10 supports ICU for collations:
> 
> https://www.postgresql.org/docs/10/static/sql-altercollation.html
> 
> Maybe this is an alternative until this bug is fixed (which could turn out
> very difficult to implement)?

Works beautifully. Thanks!
Comment 13 vectoroc 2019-05-09 19:45:34 UTC
Hello. Is there any chance that the issues will be fixed? Unfortunately PostgreSQL Is unable to use ICU some base features (e.g in analyze operation).
Comment 14 Carlos O'Donell 2019-05-09 20:44:56 UTC
(In reply to vectoroc from comment #13)
> Hello. Is there any chance that the issues will be fixed? Unfortunately
> PostgreSQL Is unable to use ICU some base features (e.g in analyze
> operation).

We haven't had anyone working on strcoll_l performance improvements. So it's unlikely that this will get merged or reviewed any time soon.