This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[BZ #18441] strcoll performance regression
- From: Leonhard Holz <leonhard dot holz at web dot de>
- To: libc-alpha at sourceware dot org
- Date: Thu, 28 May 2015 22:58:46 +0200
- Subject: [BZ #18441] strcoll performance regression
Hello,
The trigger for the regression is that the locale has no information about the
sort order of the given characters. With the th_TH locale, strcoll is fairly quick:
"strcoll": {
  "wikipedia-th#th_TH.UTF-8": {
    "duration": 4.31123e+06,
    "iterations": 16,
    "mean": 269452
  }
}
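For readers who want to reproduce the comparison outside a benchmark harness, a minimal timing sketch might look like the following. The helper name time_strcoll and its parameters are invented for illustration and are not part of glibc; only setlocale, strcoll, and clock_gettime are standard calls. It assumes the requested locale is installed and returns -1.0 if it is not.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not a glibc function): average nanoseconds per
   strcoll (a, b) call under `locale_name`, or -1.0 if setlocale fails,
   for example because the locale is not installed.  */
double
time_strcoll (const char *locale_name, const char *a, const char *b,
              long iterations)
{
  assert (a != NULL && b != NULL && iterations > 0);

  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return -1.0;

  struct timespec start, end;
  volatile int sink = 0;   /* keeps the calls from being optimized away */

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (long i = 0; i < iterations; i++)
    sink = strcoll (a, b);
  clock_gettime (CLOCK_MONOTONIC, &end);
  (void) sink;

  double ns = (double) (end.tv_sec - start.tv_sec) * 1e9
              + (double) (end.tv_nsec - start.tv_nsec);
  return ns / (double) iterations;
}
```

Calling it once with "th_TH.UTF-8" and once with an English locale such as "en_US.UTF-8" on the same pair of Thai strings should make the gap visible. Note that the helper leaves LC_COLLATE set to the requested locale.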
The English locale needs four passes to determine the sort order. In each of the first three
passes the lookup reports a single recognized sequence of length zero, regardless of the Thai
word given. At the fourth level it recognizes the characters, but they are all
considered equal, so the string length effectively determines the sort order.
The former version had a cache that avoided lookups in the locale data tables for
passes > 1, which probably helped in this scenario (but slowed down all others).
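The multi-pass behaviour described above can be illustrated with a toy model. This is only a sketch of the general idea of level-by-level comparison, not glibc's actual strcoll code: the names toy_collate, weight_fn, and unknown_script_weight are hypothetical, and the weight function simply encodes the assumption from this mail that every character is ignored on the first three levels and gets the same weight on the fourth.

```c
#include <assert.h>
#include <stddef.h>

#define LEVELS 4

/* Hypothetical lookup: weight of character `c` at `level` (0-based).
   A weight of 0 means the character is ignored at that level.  */
typedef unsigned (*weight_fn) (unsigned char c, int level);

/* Toy multi-pass comparison: one full walk over both strings per level,
   with a weight lookup for every character on every pass.  */
int
toy_collate (const char *a, const char *b, weight_fn weight)
{
  assert (a != NULL && b != NULL && weight != NULL);

  for (int level = 0; level < LEVELS; level++)
    {
      size_t i = 0, j = 0;
      for (;;)
        {
          /* Skip characters that carry no weight at this level.  */
          while (a[i] != '\0' && weight ((unsigned char) a[i], level) == 0)
            i++;
          while (b[j] != '\0' && weight ((unsigned char) b[j], level) == 0)
            j++;

          if (a[i] == '\0' || b[j] == '\0')
            {
              if (a[i] != '\0')
                return 1;    /* only a has weights left: a sorts after b */
              if (b[j] != '\0')
                return -1;   /* only b has weights left: a sorts before b */
              break;         /* tie at this level, try the next one */
            }

          unsigned wa = weight ((unsigned char) a[i], level);
          unsigned wb = weight ((unsigned char) b[j], level);
          if (wa != wb)
            return wa < wb ? -1 : 1;
          i++;
          j++;
        }
    }
  return 0;
}

/* Stand-in for a locale that does not know the input script: ignored on
   levels 1-3, identical weight on level 4 (an assumption, see above).  */
unsigned
unknown_script_weight (unsigned char c, int level)
{
  (void) c;
  return level == LEVELS - 1 ? 1u : 0u;
}
```

With unknown_script_weight, the first three passes walk both strings without finding anything to compare, and the fourth pass ties on every character until the shorter string ends. Every pass repeats the per-character lookup, which is the kind of work a cache for later passes would avoid.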
Still, the size of the difference is astonishing. Next I will investigate exactly how the
sequence lookup works to figure out why it takes so long. If anyone has an
idea and can point me in the right direction, please comment.
Best,
Leonhard