This is the mail archive of the
libc-help@sourceware.org
mailing list for the glibc project.
Re: [PATCH] wcscoll/strcoll needs to check if the char is out of the locale collation
On Wed, Dec 2, 2009 at 12:32 PM, Takao Fujiwara <tfujiwar@redhat.com> wrote:
>> You need to define correctness in terms of a standard or existing
>> practice in some other C library. What do C libraries on other
>> operating systems do?
>
> I'm not sure if I understand the question.
> As I quoted in this original mail, the manpage explains the behavior. And
> the manpage is also available in http://opengroup.org.
> Unfortunately I could not find the specification between language chars but
> I have tested some chars on UNIX and Windows and the collation result is
> good for me except for GNU libc.
>
> In UNIX, some of non-UTF-8 collation is not clear for me but UTF-8 collation
> is good in the attached test case at least. So errno might be a good
> solution for non-UTF-8 since glibc/localedata/locales/ja_JP is not used for
> ja_JP.UTF-8 only but also for ja_JP.eucJP, ja_JP.SJIS.
This is better. You should list which other operating systems you
tested on and found the results of the scort to be satisfactory.
>>
>>> I'm attaching the test program (a.c) in this mail to explain the problem.
>>> It compares the Korean chars and ASCII chars.
>>> If you run the test program on ja_JP.UTF-8, wcscoll() returns "Korean
>>> chars< ?ASCII chars".
>>> But I would expect "Korean chars> ?ASCII chars" on ja_JP.UTF-8.
>>
>> You need to explain why you expect this, is it done this way on another
>> system?
>
> The main reason of mine is, if the expectation is not the result, the sorted
> strings looks ugly on GUI(e.g. GDM language name list dialog). The current
> behavior is, some of non-latin chars are collated in front of latin chars
> and all Japanese chars are collated after latin chars and other chars are
> collated after all Japanese chars.
> I think the sort might need the common collation and locale specific
> collation likes UCA, ICU.
This is a good explanation of the problem. It should come first in your post.
> My test case works with this way on UNIX and MS-Windows.
Which UNIX?
>>
>>> Then I would think the returned value is not defined in ja_JP collation
>>> and I thought setting errno would be good if the char is not defined in
>>> the collation.
>>> E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C so
>>> I think the ja_JP.UTF-8 collation table doesn't contain all UTF-8 chars.
>>
>> I don't understand this paragraph, perhaps you could expand the
>> explanation and take the reader through the logic?
>
> If you have a look at glibc/localedata/locales/ja_JP, Japanese LC_COLLATE is
> defined in the lines between order_start forward and order_end. almost
> Japanese char collations are defined in the ja_JP file but any Korean char
> collations are not found in ja_JP file, e.g. U+D55C .
> wcscoll() uses __collidx_table_lookup() for the collation data and
> __collidx_table_lookup() returns '0' in U+D55C and I think the reason is
> that the code point is not defined in the ja_JP file.
> It mean I think the current ja_JP collation doesn't include all UTF-8 chars
> on ja_JP.UTF-8.
You should find out whether or not the ja_JP collation includes all
UTF-8 chars. That way you can be certain.
>>
>>> Regarding to WCSCOLL(3P):
>>>
>>> ----------------------------
>>> RETURN VALUE
>>> On ?error, ?wcscoll() ?shall ?set
>>> ? ? ? errno, but no return value is reserved to indicate an error.
>>>
>>> ERRORS
>>> ? ? ? The wcscoll() function may fail if:
>>>
>>> ? ? ? EINVAL The ?ws1 ?or ?ws2 arguments contain wide-character codes
>>> outside
>>> ? ? ? ? ? ? ?the domain of the collating sequence.
>>> ----------------------------
>>
>> You don't need to quote the manpage, simply describe the expected
>> return value as defined in the appropriate standard.
>
> However this might be the only definition in POSIX when I searched.
> I tested the behavior to assign the actual code points in wcscoll().
Then you need to explain that "This is the definition that describes
the error behaviour for wcscoll."
>>
>>> The attachment glibc-xx-errno-strcoll.diff sets EINVAL if the value is
>>> out of the table.
>>> My understanding is, __collidx_table_lookup() checks in libc.so if the
>>> char is defined in the collation table so my suggestion is to set errno
>>> if the char is not defined in the table.
>>> If wcscoll/strcoll/wscxfrm/strxfrm would set errno, I could enhance
>>> g_utf8_collate(_key) later.
>>> E.g. if wcscoll() returns undefined value with errno, wcscmp() could be
>>> called later.
>>
>> Have you tested this patch by running the glibc testsuite?
>
> I'm not familiar with the glibc testsuite... so I just tested some code
> points with wcscoll().
> If any links are available, I'd try it.
Did you build glibc with your patch?
Once you build glibc with your patch you can run "make -k check >&
result.log" to execute the testsuite. Then you must verify that your
patch has not introduced regressions in the testsuite.
>>
>>> However somebody might say ja_JP collation table should have all UTF-8
>>> chars but actually ja_JP file is not so.
>>
>> Who might say this and why?
>
> Actually I'm not sure if the actual people might say this. However the first
> patch sets errno if __collidx_table_lookup() returns '0' and my
> understanding is, the return value of __collidx_table_lookup() depends on
> the glib/localedata/locales/ja_JP file on ja_JP.UTF-8 locale.
You need to determine if this is a bug in the ja_JP collation table.
You must determine if it should have all UTF-8 characters.
>>
>>> if a char is not defined in glib/localedata/locales/ja_JP,
>>> __collidx_table_lookup() returns 0 in libc.so.
>>> If we could use __collseq_table_lookup() instead, it would return the
>>> max value for the undefined char and I could resolve this problem.
>>> But I think we need to use __collidx_table_lookup() for wcscoll() since
>>> the size of locale collation is unclear.
>>>
>>> But the problem is when we receive 0, U+0 is actually defined in
>>> glib/localedata/locales/ja_JP LC_COLLATION and the result is, the
>>> undefined chars are always collated in front of the defined chars in
>>> wcscoll().
>>>
>>> E.g. If I think a is ASCII char, b is a Japanese char, c is a Korean
>>> char, the collation would be c< ?a< ?b on ja_JP.UTF-8 since U+0 is
>>> defined in ja_JP file.
>>>
>>> But if you look at ja_JP file, the file also defines "UNDEFINED" in
>>> LC_COLLATE.
>>> UNDEFINED char should be collated at last.
>>> But the word "UNDEFINED" seems to be used in localedef program only.
>>> If we run wcscoll(), we don't know which index of weight[] is the
>>> UNDEFINED value.
>>
>> This is not a coherent description of the solution, internal details
>> are not important right now, what is important is explaining clearly
>> the two solutions.
>
> OK, if the first patch glibc-xx-errno-strcoll.diff is acceptable, I could
> implement the following scenario:
>
> if wcscoll() returns a value with no errno on ja_JP.* locale, I can expect
> the value is defined in glib/localedata/locales/ja_JP and the returned value
> can be used for software applications with the locale collation.
> If wcscoll() returns a value with errno on ja_JP.*, I can expect the value
> is not defined in the ja_JP file and the collation is not available with the
> locale collation and then another function can be called, e.g. wcscmp(),
> without the locale collation.
> We need to call wcscoll(). E.g. if Composed 'a' + '`' (U+00E0) and ASCII 'b'
> is compared, we'd like to get 'a' + '`' < 'b' on some locales.
Try to make your description more generic e.g. "with no errno on a
locale". Your code must be applicable to all locales.
> Probably I think the errno solution is also good for non-UTF-8 whose char
> range is smaller than UTF-8. E.g. ISO8859-1 is very small code range but
> wcscoll() takes wchar_t* as the argument. WCHAR_MAX value is higher than the
> end of ISO8859-1 char so many wchar_t values are not used in ISO8859-1.
OK
>>
>>> Then I'm attaching another solution (glibc-xx-set-undefined.diff).
>>>
>>> So my solution is, if wcscoll() receives 0 from findidx(), wcscoll() use
>>> USTRING_MAX instead of weight[].
>>>
>>> If I see zh_CN file, U+0 is not defined. The undefined chars are always
>>> collated in front of the defined chars in wcscoll() because the
>>> following line effects the result in wcscoll():
>>>
>>> ? result = seq1len == 0 ? -1 : 1;
>>>
>>> seq1len is 0 but the string is not shorter than the other in this case.
>>> The string is not defined in the locale collation in this case actually.
>>>
>>> I'd modified this part in glibc-xx-set-undefined.diff.
>>>
>>> Probably it's good for wcscoll() to follow the 'UNDEFINED' keyword in
>>> the locale collation file and I think 'UNDEFINED' should be put in the
>>> last of the LC_COLLATE.
>>
>> You need to expand on why you think this is needed.
>
> If the second patch glibc-xx-set-undefined.diff is acceptable, I can apply
> the returned value of wcscoll() to the software applications simply and the
> GUI appearance looks better the current behavior because the patched libc
> can show ASCII < Japanese < other languages chars. I think it's easy for the
> GUI users to find one string in the sorted strings.
OK.
Cheers,
Carlos.