This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.
2009/11/30 Takao Fujiwara <tfujiwar@redhat.com>:
Hi,
I'm attaching two patches and would like your ideas on them: glibc-xx-errno-strcoll.diff and glibc-xx-set-undefined.diff
I'm currently looking at how to sort UTF-8 strings in GNOME/GDM. GDM uses g_utf8_collate() to sort the UTF-8 language names.
The following link is the source code of g_utf8_collate(): http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c
g_utf8_collate() uses wcscoll() internally.
----------------------------
gint
g_utf8_collate (const gchar *str1,
                const gchar *str2)
{
  ...
  result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm);
  ...
}
----------------------------
However, if the characters are not defined in the locale's collation, the returned value is not correct.
You need to define correctness in terms of a standard or existing practice in some other C library. What do C libraries on other operating systems do?
I'm attaching a test program (a.c) to this mail to explain the problem. It compares Korean characters with ASCII characters. If you run the test program in ja_JP.UTF-8, wcscoll() returns "Korean chars < ASCII chars". But I would expect "Korean chars > ASCII chars" in ja_JP.UTF-8.
You need to explain why you expect this. Is it done this way on another system?
So I think the return value is undefined for the ja_JP collation in this case, and that it would be good to set errno when a character is not defined in the collation. For example, LC_COLLATE in glibc/localedata/locales/ja_JP doesn't include U+D55C, so I think the ja_JP.UTF-8 collation table doesn't contain all UTF-8 characters.
I don't understand this paragraph. Perhaps you could expand the explanation and take the reader through the logic?
Regarding WCSCOLL(3P):
----------------------------
RETURN VALUE
    On error, wcscoll() shall set errno, but no return value is reserved to indicate an error.

ERRORS
    The wcscoll() function may fail if:

    EINVAL The ws1 or ws2 arguments contain wide-character codes outside the domain of the collating sequence.
----------------------------
You don't need to quote the manpage. Simply describe the expected return value as defined in the appropriate standard.
However, as far as I could find, this might be the only relevant definition in POSIX. I tested the behavior by passing the actual code points to wcscoll().
The attached glibc-xx-errno-strcoll.diff sets EINVAL if the value is outside the table. My understanding is that __collidx_table_lookup() in libc.so checks whether the character is defined in the collation table, so my suggestion is to set errno if the character is not defined in the table. If wcscoll/strcoll/wcsxfrm/strxfrm set errno, I could enhance g_utf8_collate(_key) later. E.g. if wcscoll() returns an undefined value and sets errno, wcscmp() could be called afterwards as a fallback.
Have you tested this patch by running the glibc testsuite?
However, somebody might say the ja_JP collation table should contain all UTF-8 characters, but the ja_JP file actually does not.
Who might say this and why?
If a character is not defined in glibc/localedata/locales/ja_JP, __collidx_table_lookup() returns 0 in libc.so. If we could use __collseq_table_lookup() instead, it would return the maximum value for the undefined character and that would resolve this problem. But I think we need to use __collidx_table_lookup() for wcscoll() since the size of the locale collation is unclear.
The problem is that when we receive 0, U+0 is actually defined in LC_COLLATE of glibc/localedata/locales/ja_JP, and as a result the undefined characters are always collated before the defined characters in wcscoll().
E.g. if a is an ASCII char, b is a Japanese char, and c is a Korean char, the collation would be c < a < b on ja_JP.UTF-8, since U+0 is defined in the ja_JP file.
But if you look at the ja_JP file, it also defines "UNDEFINED" in LC_COLLATE. UNDEFINED characters should be collated last. But the keyword "UNDEFINED" seems to be used only by the localedef program. When we run wcscoll(), we don't know which index of weight[] holds the UNDEFINED value.
This is not a coherent description of the solution. Internal details are not important right now; what is important is explaining the two solutions clearly.
So I'm attaching another solution (glibc-xx-set-undefined.diff).
My solution is: if wcscoll() receives 0 from findidx(), wcscoll() uses USTRING_MAX instead of weight[].
If I look at the zh_CN file, U+0 is not defined. The undefined characters are always collated before the defined characters in wcscoll() because the following line affects the result in wcscoll():
result = seq1len == 0 ? -1 : 1;
seq1len is 0, but in this case that is not because the string is shorter than the other one; it is because the string is not defined in the locale collation.
I've modified this part in glibc-xx-set-undefined.diff.
It would probably be good for wcscoll() to follow the 'UNDEFINED' keyword in the locale collation file, and I think 'UNDEFINED' should be placed at the end of LC_COLLATE.
You need to expand on why you think this is needed.
Thanks, fujiwara
Cheers, Carlos.