Bug 368 - localedef fails with complex LC_COLLATE rules
Summary: localedef fails with complex LC_COLLATE rules
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Petter Reinholdtsen
URL:
Keywords:
Duplicates: 307
Depends on:
Blocks:
 
Reported: 2004-09-05 20:49 UTC by Pablo Saratxaga
Modified: 2005-10-14 22:57 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
sample dz_BT locale (with several lines commented out with "%%%%" that should be enabled) (8.48 KB, text/plain)
2004-09-05 20:50 UTC, Pablo Saratxaga
allow more than 256 collating-element definitions (487 bytes, patch)
2005-01-02 23:26 UTC, Denis Barbier
C source file for the tst-strcoll program (363 bytes, text/plain)
2005-01-17 22:33 UTC, Denis Barbier
C source file for the tst-wcscoll program (442 bytes, text/plain)
2005-01-17 22:33 UTC, Denis Barbier
dz_BT Collation - generated automatically from CLDR * (14.56 KB, text/plain)
2005-08-03 11:37 UTC, Christopher Fynn

Description Pablo Saratxaga 2004-09-05 20:49:01 UTC
I reached what seems to be a limitation in the number of LC_COLLATE collating-elements.

I was trying to build a dz_BT locale (Dzongkha language, Bhutan);
the sorting rules are quite special: for example, next to the <ka> entry are words starting with a prefix attached to the ka radical, e.g. <da>-<ka>, <ba>-<ka>, etc., which come just after words starting with <ka>, and not with words starting with <da>, <ba>, etc.
Put another way, the base collating elements are the 30 base letters, plus 103 prefix-radical collating elements.
Now, it is even more complex than that; some letter sequences are prefix-radical or not depending on what follows them; e.g. <da>-<ga> is a prefix if followed by <ga>, <nga>, <da>,... but not otherwise.
That is, it is necessary to define collating elements consisting of the prefix element and the next char, which are then sorted as a digraph; e.g.:
collating-element <rad-ga-d-ga> from "<U0F51><U0F42><U0F42>"
...
<rad-ga-d-ga>  "<TIB-GA-R_D><TIB-GA>";....

That means there are a lot of collating-elements to define; 303 in total.
But if I use more than 265, the locale doesn't compile (localedef just runs forever, taking 90% of CPU resources while doing nothing); if I comment some of them out so that no more than 265 are in use, then it compiles nicely.

I attach the preliminary dz_BT locale I was working on; some entries are commented with %%%% (four percent signs), so that the file can compile; but to have the rules complete, all those lines commented out with "%%%%" should be enabled as well.
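
As a rough illustration of the contextual collating elements described above, here is a minimal Python sketch of longest-match splitting of a string into collating elements. It is not how localedef or glibc implements collation; the table name COLLATING_ELEMENTS and the function split_collating_elements are hypothetical, and the table holds only the single element quoted in the report. Anything not in the table is left as an individual character.

```python
# Hypothetical table: maps a character sequence to a collating-element name.
# Only the one element quoted in the report is listed here.
COLLATING_ELEMENTS = {
    "\u0f51\u0f42\u0f42": "rad-ga-d-ga",
}

# Longest sequence in the table, so the longest candidate is tried first.
MAX_LEN = max(len(seq) for seq in COLLATING_ELEMENTS)


def split_collating_elements(text):
    """Split text into collating elements, preferring the longest match."""
    elements = []
    pos = 0
    while pos < len(text):
        for length in range(MAX_LEN, 1, -1):
            chunk = text[pos:pos + length]
            if chunk in COLLATING_ELEMENTS:
                elements.append(COLLATING_ELEMENTS[chunk])
                pos += length
                break
        else:
            # No multi-character element starts here: keep the single char.
            elements.append(text[pos])
            pos += 1
    return elements
```

With this table, the three-character sequence is reported as one element, while the same two leading characters with nothing after them stay separate. This is the "depends on what follows" behaviour that forces so many elements to be defined.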
Comment 1 Pablo Saratxaga 2004-09-05 20:50:22 UTC
Created attachment 187 [details]
sample dz_BT locale (with several lines commented out with "%%%%" that should be enabled)
Comment 2 Denis Barbier 2005-01-02 23:26:05 UTC
Created attachment 332 [details]
allow more than 256 collating-element definitions

I could not find why elem_size has to be less than 257, and thus dropped
this constraint.  Then elem_size had to be computed more accurately in
order to prevent allocation of large unused data.
But your dz_BT file still did not compile because the secondary hashing
function seems to do a poor job: iter was zero, so the probe index never
advanced and the loop never ended.  A better secondary hashing function
would add 1 to the current one, but the functions which read collation
data would need to be fixed too.  Instead, I chose to enlarge the table
when such a loop is encountered.
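
The failure mode described in this comment can be sketched in a few lines of Python. This is only an illustration of open addressing with a secondary hash, not the code in the patch or in localedef: the names probe_slot and insert_all, the step formula hash % (size - 2), and the "grow by one slot" policy are all assumptions made for the example. When the step is zero after a collision, the probe never leaves the occupied slot, so the sketch reports failure and retries with a larger table.

```python
def probe_slot(table, hash_value):
    """Return a free index for hash_value, or None if probing cannot succeed."""
    size = len(table)
    idx = hash_value % size
    if table[idx] is None:
        return idx
    # Assumed secondary hash; it can be 0, which is the problematic case.
    step = hash_value % (size - 2)
    if step == 0:
        return None  # idx would never change: an endless loop in naive code
    start = idx
    while True:
        idx = (idx + step) % size
        if table[idx] is None:
            return idx
        if idx == start:
            return None  # visited every reachable slot without success


def insert_all(hashes, size, max_size=1 << 20):
    """Insert distinct hash values, enlarging the table when a probe fails."""
    if size < 3:
        raise ValueError("size must be at least 3")
    while size <= max_size:
        table = [None] * size
        for h in hashes:
            slot = probe_slot(table, h)
            if slot is None:
                break  # give up on this size
            table[slot] = h
        else:
            return table
        size += 1  # assumed growth policy; real code may pick a prime
    raise RuntimeError("no workable table size found")
```

For instance, with a table of 7 slots the values 35 and 70 both hash to slot 0 and both give a step of 0, so the second insertion cannot make progress until the table is enlarged.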
Comment 3 Denis Barbier 2005-01-17 21:38:30 UTC
As this patch only changes the multi-byte sequence, we can check
whether wide-char and multi-byte collations give the same results,
in which case this patch is certainly right.
I created a file containing sequences of 2 Tibetan characters:
  $ for i in `seq 0x0F00 0x0FCF`; do
      for j in `seq 0x0F00 0x0FCF`; do
        printf "0: %08x %08x 0000000a " $i $j | xxd -r -g4
      done
    done | iconv -f ucs4 -t utf8 > input_file
Then ran
  $ LC_ALL=en_US.UTF-8 ./tst-wcscoll < input_file > out.wc-en_US
  $ LC_ALL=en_US.UTF-8 ./tst-strcoll < input_file > out.mb-en_US
  $ cmp out.wc-en_US out.mb-en_US
  $

So the results are identical.  But to show that this patch allows
more than 256 collating elements, we need to check with more complex
LC_COLLATE sections.  I took Pablo's locale file, s/^%%%%</</ to have
more than 256 collating elements, and re-ran this test:
  $ export LOCPATH=`mktemp -d /tmp/test.XXXXXX`
  $ localedef.patched -i dz_BT -f UTF-8 $LOCPATH/dz_BT
  $ LC_ALL=dz_BT ./tst-wcscoll < input_file > out.wc-dz_BT
  $ LC_ALL=dz_BT ./tst-strcoll < input_file > out.mb-dz_BT
  $ cmp out.wc-dz_BT out.mb-dz_BT
  $
Looks good.

Note that tst-strcoll is much slower than tst-wcscoll, which seems
quite logical since the primary key is the first UTF-8 byte and does
not change in the range 0x0F00-0x0FCF.
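
The remark about the first UTF-8 byte can be checked with a short Python sketch. It rebuilds the same kind of input as the shell loop above (every pair of code points in 0x0F00-0x0FCF, one pair per line) and collects the leading UTF-8 byte of each code point. The function names build_input and leading_bytes are made up for this example, and the bytes should match input_file only on the assumption that iconv passes unassigned code points in this range through unchanged.

```python
FIRST, LAST = 0x0F00, 0x0FCF  # range used in the test above


def build_input():
    """Return UTF-8 bytes: one line per pair of code points in the range."""
    cps = range(FIRST, LAST + 1)
    text = "".join(chr(i) + chr(j) + "\n" for i in cps for j in cps)
    return text.encode("utf-8")


def leading_bytes():
    """Return the set of first UTF-8 bytes over the whole range."""
    return {chr(cp).encode("utf-8")[0] for cp in range(FIRST, LAST + 1)}
```

Every code point in the range encodes to three bytes starting with 0xE0, so a byte-oriented lookup cannot distinguish any two of these characters by their first byte alone.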
Comment 4 Denis Barbier 2005-01-17 22:33:00 UTC
Created attachment 372 [details]
C source file for the tst-strcoll program

This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.
Comment 5 Denis Barbier 2005-01-17 22:33:50 UTC
Created attachment 373 [details]
C source file for the tst-wcscoll program

This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.
Comment 6 Christopher Fynn 2005-08-03 10:54:12 UTC
localedef *still* handles only 256 collating-element definitions.

Culturally correct (standard dictionary order) collation for languages like Dzongkha (dz_BT)
and Tibetan (bo_CN) *requires* over 350 elements in LC_COLLATE.
Comment 7 Christopher Fynn 2005-08-03 11:37:34 UTC
Created attachment 567 [details]
dz_BT Collation - generated automatically from CLDR *
Comment 8 Ulrich Drepper 2005-10-14 21:11:40 UTC
*** Bug 307 has been marked as a duplicate of this bug. ***
Comment 9 Ulrich Drepper 2005-10-14 22:56:41 UTC
The ld-collate patch is wrong.  I fixed it myself.

I checked in the first locale.  The second one is completely useless.  If there
are bugs in the file in CVS, file a new bug and justify the change.

As for the test programs: they work just fine the way they are.