This is the mail archive of the
glibc-bugs@sources.redhat.com
mailing list for the glibc project.
[Bug libc/645] New: localedef does not respect rule definitions in LC_COLLATE
- From: "barbier at linuxfr dot org" <sourceware-bugzilla at sources dot redhat dot com>
- To: glibc-bugs at sources dot redhat dot com
- Date: 7 Jan 2005 22:49:20 -0000
- Subject: [Bug libc/645] New: localedef does not respect rule definitions in LC_COLLATE
- Reply-to: sourceware-bugzilla at sources dot redhat dot com
Executive summary: several bugs in ld-collate.c make localedef produce
wrong collation data. A detailed analysis and a patch follow.
Sorting with French locales is special because diacritics are considered
from right to left, as described in ISO-14651 and many other documents.
And indeed, localedata/locales/iso14651_t1 contains
order_start <LATIN>;forward;backward;forward;forward,position
An example is available at
http://www.open-std.org/jtc1/sc22/wg20/docs/n602.htm#AnnexC
but fr_FR sorts this text as if the backward directive had no effect.
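To illustrate what the backward directive is supposed to change, here is
a minimal sketch in C. It is not glibc code: compare_level and the 0/1
accent weights are hypothetical and are not taken from iso14651_t1. It
only shows that scanning the second level from the right puts "côte"
before "coté", while scanning from the left puts it after.

```c
#include <assert.h>
#include <stddef.h>

/* Toy secondary-level weights: 0 = unaccented letter, 1 = accented letter.
   These weights are illustrative and are not taken from iso14651_t1. */

/* Compare two weight sequences of equal length n at a single level.
   With backward == 0 the scan runs left to right, otherwise right to
   left.  Returns <0, 0 or >0 like strcmp. */
static int
compare_level (const int *a, const int *b, size_t n, int backward)
{
  assert (a != NULL && b != NULL);
  for (size_t i = 0; i < n; i++)
    {
      size_t pos = backward ? n - 1 - i : i;
      if (a[pos] != b[pos])
        return a[pos] < b[pos] ? -1 : 1;
    }
  return 0;
}

/* Secondary weights for four words sharing the primary key "cote". */
static const int cote[4]    = { 0, 0, 0, 0 };  /* cote */
static const int cote_o[4]  = { 0, 1, 0, 0 };  /* côte */
static const int cote_e[4]  = { 0, 0, 0, 1 };  /* coté */
static const int cote_oe[4] = { 0, 1, 0, 1 };  /* côté */
```

With backward set, the four words come out as cote, côte, coté, côté,
which is the French order; with forward, coté sorts before côte.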
I wrote simple tests to debug this problem. The xx_XX.tmpl locale file
defines the characters a and A with the rule forward;forward;forward;forward,
and b and B with the rule forward;backward;forward;forward.
The tst-coll-rule program gets pairs of characters (with the same
primary level but different secondary level) as arguments, and
displays the direction of the 2nd level (f=forward, b=backward) for each
pair.
$ export LOCPATH=$(mktemp -d /tmp/localedef.XXXXXX)
$ localedef -i xx_XX.tmpl -f ISO-8859-1 $LOCPATH/xx_XX
$ LC_ALL=xx_XX ./tst-coll-rule aA bB
bb
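The source of tst-coll-rule is not included in this message, so the
following is only one plausible way to detect the direction of a level,
not the actual program. The idea: if x and y differ only at the level of
interest, then comparing "xy" with "yx" gives the same sign as comparing
"x" with "y" when the level is scanned forward, and the opposite sign
when it is scanned backward. All names here (coll_fn, level_direction,
toy_cmp, toy_forward, toy_backward) are hypothetical; the toy comparators
treat letter case as the second level.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Comparator with the same contract as strcoll(3). */
typedef int (*coll_fn) (const char *, const char *);

/* Return 'f' if the level that distinguishes x from y is scanned
   forward, 'b' if it is scanned backward.  Assumes x and y have the
   same primary weight and that cmp ("x", "y") != 0. */
static char
level_direction (coll_fn cmp, char x, char y)
{
  char sx[2] = { x, '\0' }, sy[2] = { y, '\0' };
  char xy[3] = { x, y, '\0' }, yx[3] = { y, x, '\0' };
  int single = cmp (sx, sy);
  int pair = cmp (xy, yx);
  assert (single != 0 && pair != 0);
  return ((single < 0) == (pair < 0)) ? 'f' : 'b';
}

/* Toy two-level comparator for equal-length strings: the primary level
   ignores case, the secondary level ranks lowercase before uppercase. */
static int
toy_cmp (const char *a, const char *b, int backward)
{
  size_t n = strlen (a);
  assert (n == strlen (b));
  for (size_t i = 0; i < n; i++)
    {
      int pa = tolower ((unsigned char) a[i]);
      int pb = tolower ((unsigned char) b[i]);
      if (pa != pb)
        return pa < pb ? -1 : 1;
    }
  for (size_t i = 0; i < n; i++)
    {
      size_t pos = backward ? n - 1 - i : i;
      int sa = isupper ((unsigned char) a[pos]) != 0;
      int sb = isupper ((unsigned char) b[pos]) != 0;
      if (sa != sb)
        return sa < sb ? -1 : 1;
    }
  return 0;
}

static int
toy_forward (const char *a, const char *b)
{
  return toy_cmp (a, b, 0);
}

static int
toy_backward (const char *a, const char *b)
{
  return toy_cmp (a, b, 1);
}
```

Passing strcoll as the comparator, after a call to setlocale
(LC_COLLATE, ""), would apply the same check to the installed locale.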
After switching definitions for S1 and S2:
$ localedef -i xx_XX.tmpl -f ISO-8859-1 $LOCPATH/xx_XX
$ LC_ALL=xx_XX ./tst-coll-rule aA bB
ff
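So the last definition wins and overwrites the other one. This is
due to the optimization of rulesets in ld-collate.c: the length passed
to memcmp is a number of rules, not a number of bytes, so only part of
each rule array is compared. Line 1843 needs to be changed from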
memcmp (osect->rules, sect->rules, nrules) == 0
to
memcmp (osect->rules, sect->rules, nrules * sizeof (*osect->rules)) == 0
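The effect of the missing sizeof can be reproduced outside localedef.
The sketch below is not the ld-collate.c code: rule_t, the RULE_*
values and both helper functions are hypothetical, and a 4-byte rule
type is assumed (typical for a C enum on common ABIs, but not
guaranteed). With four rules, the buggy call compares only 4 bytes,
i.e. the first rule, so the two rule sets used in xx_XX.tmpl look
identical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the rule type in ld-collate.c; a fixed
   4-byte type is assumed here so that the effect is reproducible. */
typedef uint32_t rule_t;
enum { RULE_FORWARD = 1, RULE_BACKWARD = 2 };  /* illustrative values */

/* Buggy variant: the third argument is an element count, so only the
   first nrules bytes are compared. */
static int
rules_equal_buggy (const rule_t *a, const rule_t *b, size_t nrules)
{
  assert (a != NULL && b != NULL);
  return memcmp (a, b, nrules) == 0;
}

/* Fixed variant: the length is scaled to bytes and covers every rule. */
static int
rules_equal_fixed (const rule_t *a, const rule_t *b, size_t nrules)
{
  assert (a != NULL && b != NULL);
  return memcmp (a, b, nrules * sizeof (*a)) == 0;
}

/* forward;forward;forward;forward */
static const rule_t rules_a[4] =
  { RULE_FORWARD, RULE_FORWARD, RULE_FORWARD, RULE_FORWARD };
/* forward;backward;forward;forward */
static const rule_t rules_b[4] =
  { RULE_FORWARD, RULE_BACKWARD, RULE_FORWARD, RULE_FORWARD };
```

Under these assumptions rules_equal_buggy (rules_a, rules_b, 4) reports
the two sets as equal, while rules_equal_fixed tells them apart.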
With this patch applied and xx_XX.tmpl reverted to its initial state,
we now get:
$ localedef -i xx_XX.tmpl -f ISO-8859-1 $LOCPATH/xx_XX
$ LC_ALL=xx_XX ./tst-coll-rule aA bB
bb
Huh? This patch alone is not enough (the expected output is fb), and
more digging in ld-collate.c is needed. There are named sections, at most one unnamed
section (defined without script name, e.g. order_start forward;forward)
and a symbol section, which stores symbols if they are read before the
first rule.
The test-collate.sh shell script defines all combinations of two-level
scripts and runs tst-coll-rule to check whether the stored collation data
match their definition. The output is:
1st field: LC_COLLATE definition
    s: there is a symbol section, i.e. symbols are defined before the
       first order_start keyword.
    N: order_start <script_name>;forward;forward
    n: order_start <script_name>;forward;backward
    U: order_start forward;forward
    u: order_start forward;backward
2nd field: output of "LC_ALL=xx_XX tst-coll-rule aA bB", or **
    when localedef segfaults.
3rd field: expected output
4th field: 0=match 1=mismatch *=localedef segfaults
Current CVS version:
snn bb bb 0 | sNn bb fb 1 | nn ** bb * | Nn ** fb *
snu bb bb 0 | sNu bb fb 1 | nu bb bb 0 | Nu bb fb 1
snN ff bf 1 | sNN ff ff 0 | nN ** bf * | NN ** ff *
snU ff bf 1 | sNU ff ff 0 | nU ff bf 1 | NU ff ff 0
sun bb bb 0 | sUn bb fb 1 | un bb bb 0 | Un bb fb 1
suN ff bf 1 | sUN ff ff 0 | uN ff bf 1 | UN ff ff 0
After applying the one-line patch described above:
snn bb bb 0 | sNn bb fb 1 | nn ** bb * | Nn ** fb *
snu bb bb 0 | sNu fb fb 0 | nu bb bb 0 | Nu bb fb 1
snN ff bf 1 | sNN ff ff 0 | nN ** bf * | NN ** ff *
snU bf bf 0 | sNU ff ff 0 | nU ff bf 1 | NU ff ff 0
sun bb bb 0 | sUn bb fb 1 | un bb bb 0 | Un bb fb 1
suN ff bf 1 | sUN ff ff 0 | uN ff bf 1 | UN ff ff 0
After applying ld-collate.patch:
snn bb bb 0 | sNn fb fb 0 | nn bb bb 0 | Nn fb fb 0
snu bb bb 0 | sNu fb fb 0 | nu bb bb 0 | Nu fb fb 0
snN bf bf 0 | sNN ff ff 0 | nN bf bf 0 | NN ff ff 0
snU bf bf 0 | sNU ff ff 0 | nU bf bf 0 | NU ff ff 0
sun bb bb 0 | sUn fb fb 0 | un bb bb 0 | Un fb fb 0
suN bf bf 0 | sUN ff ff 0 | uN bf bf 0 | UN ff ff 0
which looks much better. And indeed, my French locale now sorts
the sample file as expected, great.
--
Summary: localedef does not respect rule definitions in
LC_COLLATE
Product: glibc
Version: 2.3.4
Status: NEW
Severity: normal
Priority: P2
Component: libc
AssignedTo: gotom at debian dot or dot jp
ReportedBy: barbier at linuxfr dot org
CC: glibc-bugs at sources dot redhat dot com
http://sources.redhat.com/bugzilla/show_bug.cgi?id=645