This is the mail archive of the
libc-help@sourceware.org
mailing list for the glibc project.
[PATCH] wcscoll/strcoll needs to check if the char is out of thelocale collation
- From: Takao Fujiwara <tfujiwar at redhat dot com>
- To: libc-help at sourceware dot org
- Cc: Takao Fujiwara <tfujiwar at redhat dot com>
- Date: Tue, 01 Dec 2009 10:00:23 +0900
- Subject: [PATCH] wcscoll/strcoll needs to check if the char is out of thelocale collation
- Reply-to: tfujiwar at redhat dot com
Hi,
I'm attaching the two kind of patches and need your ideas:
glibc-xx-errno-strcoll.diff
glibc-xx-set-undefined.diff
Currently I'm thinking how to sort UTF-8 strings on GNOME/GDM.
GDM uses g_utf8_collate() to sort the UTF-8 language names.
The following link is the source code of g_utf8_collate():
http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c
g_utf8_collate() uses wcscoll() internally.
----------------------------
70:gint
71: g_utf8_collate (const gchar *str1,
72: const gchar *str2)
73:{
...
109: result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm);
...
154:}
However if the chars are not defined in the locale collation, the
returned value is not correct.
I'm attaching the test program (a.c) in this mail to explain the problem.
It compares the Korean chars and ASCII chars.
If you run the test program on ja_JP.UTF-8, wcscoll() returns "Korean
chars < ASCII chars".
But I would expect "Korean chars > ASCII chars" on ja_JP.UTF-8.
Then I would think the returned value is not defined in ja_JP collation
and I thought setting errno would be good if the char is not defined in
the collation.
E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C so
I think the ja_JP.UTF-8 collation table doesn't contain all UTF-8 chars.
Regarding to WCSCOLL(3P):
----------------------------
RETURN VALUE
On error, wcscoll() shall set
errno, but no return value is reserved to indicate an error.
ERRORS
The wcscoll() function may fail if:
EINVAL The ws1 or ws2 arguments contain wide-character codes
outside
the domain of the collating sequence.
----------------------------
The attachment glibc-xx-errno-strcoll.diff sets EINVAL if the value is
out of the table.
My understanding is, __collidx_table_lookup() checks in libc.so if the
char is defined in the collation table so my suggestion is to set errno
if the char is not defined in the table.
If wcscoll/strcoll/wscxfrm/strxfrm would set errno, I could enhance
g_utf8_collate(_key) later.
E.g. if wcscoll() returns undefined value with errno, wcscmp() could be
called later.
However somebody might say ja_JP collation table should have all UTF-8
chars but actually ja_JP file is not so.
if a char is not defined in glib/localedata/locales/ja_JP,
__collidx_table_lookup() returns 0 in libc.so.
If we could use __collseq_table_lookup() instead, it would return the
max value for the undefined char and I could resolve this problem.
But I think we need to use __collidx_table_lookup() for wcscoll() since
the size of locale collation is unclear.
But the problem is when we receive 0, U+0 is actually defined in
glib/localedata/locales/ja_JP LC_COLLATION and the result is, the
undefined chars are always collated in front of the defined chars in
wcscoll().
E.g. If I think a is ASCII char, b is a Japanese char, c is a Korean
char, the collation would be c < a < b on ja_JP.UTF-8 since U+0 is
defined in ja_JP file.
But if you look at ja_JP file, the file also defines "UNDEFINED" in
LC_COLLATE.
UNDEFINED char should be collated at last.
But the word "UNDEFINED" seems to be used in localedef program only.
If we run wcscoll(), we don't know which index of weight[] is the
UNDEFINED value.
Then I'm attaching another solution (glibc-xx-set-undefined.diff).
So my solution is, if wcscoll() receives 0 from findidx(), wcscoll() use
USTRING_MAX instead of weight[].
If I see zh_CN file, U+0 is not defined. The undefined chars are always
collated in front of the defined chars in wcscoll() because the
following line effects the result in wcscoll():
result = seq1len == 0 ? -1 : 1;
seq1len is 0 but the string is not shorter than the other in this case.
The string is not defined in the locale collation in this case actually.
I'd modified this part in glibc-xx-set-undefined.diff.
Probably it's good for wcscoll() to follow the 'UNDEFINED' keyword in
the locale collation file and I think 'UNDEFINED' should be put in the
last of the LC_COLLATE.
Thanks,
fujiwara
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <errno.h>
#define CHECK_STRCOLL 1
int
main (int argc, char *argv[])
{
const static char *array[] = {
"æ±?è¯ (ä¸å?½)",
"��ย (��ย)",
"Albanian ",
"ä¸æ?? (å?°ç?£)",
"Î?λληνικά, ΣÏ?γÏ?Ï?ονα (Î?Ï?Ï?Ï?οÏ?)",
"Ã?slenska (Ã?sland)",
"í??êµì?´ (ë??í??민êµ)",
NULL };
const char *a, *b;
char result[100];
wchar_t ws_a[100], ws_b[100];
wchar_t ws_result[100];
mbstate_t mbs;
int i;
a = array [6];
b = array [2];
setlocale (LC_ALL, "");
errno = 0;
#ifdef CHECK_STRCOLL
i = strcoll (a, b);
#else
i = strxfrm (NULL, a, 0);
strxfrm (result, a, i + 1);
#endif
mbstowcs (ws_a, a, sizeof (ws_a) / sizeof (wchar_t));
mbstowcs (ws_b, b, sizeof (ws_b) / sizeof (wchar_t));
#ifdef CHECK_STRCOLL
wprintf (L"strcoll %ls %d %ls %d\n", ws_a, i, ws_b, errno);
#else
wprintf (L"strxfrm %ls %d %d\n", ws_a, i, errno);
#endif
errno = 0;
#ifdef CHECK_STRCOLL
i = wcscoll (ws_a, ws_b);
wprintf (L"wcscoll %ls %d %ls %d\n", ws_a, i, ws_b, errno);
#else
i = wcsxfrm (NULL, ws_a, 0);
wcsxfrm (ws_result, ws_a, i + 1);
wprintf (L"wcsxfrm %ls %d %d\n", ws_a, i, errno);
#endif
memset (&mbs, 0, sizeof (mbs));
i = mbrlen (a, strlen (a) + 1, &mbs);
memset (&mbs, 0, sizeof (mbs));
i = mbrlen (b, strlen (b) + 1, &mbs);
i = wcscmp (ws_a, ws_b);
wprintf (L"wcscmp %ls %d %ls %d %d\n", ws_a, i, ws_b, sizeof (ws_a), sizeof (wchar_t));
i = strcmp (a, b);
return 0;
}
--- glibc-2.10-355-g1abedcd/string/strcoll_l.c.orig 2009-11-12 15:21:16.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strcoll_l.c 2009-11-16 17:36:10.000000000 +0900
@@ -25,6 +25,7 @@
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
+#include <errno.h>
#ifndef STRING_TYPE
# define STRING_TYPE char
@@ -210,6 +211,8 @@ STRCOLL (s1, s2, l)
idx1arr[idx1max] = tmp & 0xffffff;
idx1cnt = idx1max++;
+ if (tmp == 0)
+ __set_errno (EINVAL);
if ((rulesets[rule1arr[idx1cnt] * nrules]
& sort_backward) == 0)
/* No more backward characters to push. */
--- glibc-2.10-355-g1abedcd/string/strxfrm_l.c.orig 2009-11-16 17:27:41.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strxfrm_l.c 2009-11-16 17:29:01.000000000 +0900
@@ -26,6 +26,7 @@
#include <stdlib.h>
#include <string.h>
#include <sys/param.h>
+#include <errno.h>
#ifndef STRING_TYPE
# define STRING_TYPE char
@@ -180,6 +181,8 @@ STRXFRM (STRING_TYPE *dest, const STRING
rulearr[idxmax] = tmp >> 24;
idxarr[idxmax] = tmp & 0xffffff;
+ if (tmp == 0)
+ __set_errno (EINVAL);
++idxmax;
}
while (*usrc != L('\0'));
--- glibc-2.10-355-g1abedcd/string/strcoll_l.c.orig 2009-11-12 15:21:16.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strcoll_l.c 2009-11-20 17:48:40.000000000 +0900
@@ -29,6 +29,7 @@
#ifndef STRING_TYPE
# define STRING_TYPE char
# define USTRING_TYPE unsigned char
+# define USTRING_MAX UCHAR_MAX
# define STRCOLL __strcoll_l
# define STRCMP strcmp
# define STRLEN strlen
@@ -304,9 +305,13 @@ STRCOLL (s1, s2, l)
first level. */
break;
- /* This means one string is shorter than the other. Find out
- which one and return an appropriate value. */
- result = seq1len == 0 ? -1 : 1;
+ /* if the char is not defined in the locale table,
+ findidx returns 0 but weights is 0 if U+0 is not defined
+ in the locale collation file. */
+ if (seq1len == 0)
+ result = USTRING_MAX - weights[idx2arr[idx2now]];
+ else
+ result = weights[idx1arr[idx1now]] - USTRING_MAX;
goto free_and_return;
}
@@ -322,8 +327,15 @@ STRCOLL (s1, s2, l)
{
if (weights[idx1arr[idx1now]] != weights[idx2arr[idx2now]])
{
- /* The sequences differ. */
- result = weights[idx1arr[idx1now]] - weights[idx2arr[idx2now]];
+ /* findidx returns 0 if the locale table doesn't define
+ the char and 0 is incremented to 1. */
+ if (idx1arr[idx1now] == 1)
+ result = USTRING_MAX - weights[idx2arr[idx2now]];
+ else if (idx2arr[idx2now] == 1)
+ result = weights[idx1arr[idx1now]] - USTRING_MAX;
+ else
+ /* The sequences differ. */
+ result = weights[idx1arr[idx1now]] - weights[idx2arr[idx2now]];
goto free_and_return;
}
--- glibc-2.10-355-g1abedcd/string/strxfrm_l.c.orig 2009-11-16 17:27:41.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strxfrm_l.c 2009-11-20 17:50:49.000000000 +0900
@@ -30,6 +30,7 @@
#ifndef STRING_TYPE
# define STRING_TYPE char
# define USTRING_TYPE unsigned char
+# define USTRING_MAX UCHAR_MAX
# define STRXFRM __strxfrm_l
# define STRCMP strcmp
# define STRLEN strlen
@@ -235,8 +236,14 @@ STRXFRM (STRING_TYPE *dest, const STRING
/* Now handle the forward element. */
len = weights[idxarr[idxcnt]++];
if (needed + len < n)
- while (len-- > 0)
- dest[needed++] = weights[idxarr[idxcnt]++];
+ while (len-- > 0) {
+ /* findidx returns 0 if the locale table doesn't
+ define the char and 0 is incremented to 1. */
+ if (idxarr[idxcnt] == 1)
+ dest[needed++] = USTRING_MAX;
+ else
+ dest[needed++] = weights[idxarr[idxcnt]++];
+ }
else
{
/* No more characters fit into the buffer. */
@@ -419,9 +426,11 @@ STRXFRM (STRING_TYPE *dest, const STRING
}
/* Finally store the byte to separate the passes or terminate
- the string. */
+ the string. if the char is not defined in the locale table,
+ findidx returns 0 and weights is 0 if U+0 is not defined in
+ the locale collation file. */
if (needed < n)
- dest[needed] = pass + 1 < nrules ? L('\1') : L('\0');
+ dest[needed] = pass + 1 < nrules ? USTRING_MAX : L('\0');
++needed;
}
--- glibc-2.10-355-g1abedcd/wcsmbs/wcscoll_l.c.orig 2009-11-20 17:13:25.000000000 +0900
+++ glibc-2.10-355-g1abedcd/wcsmbs/wcscoll_l.c 2009-11-20 17:13:36.000000000 +0900
@@ -23,6 +23,7 @@
#define STRING_TYPE wchar_t
#define USTRING_TYPE wint_t
+#define USTRING_MAX WCHAR_MAX
#define STRCOLL __wcscoll_l
#define STRCMP wcscmp
#define STRLEN __wcslen
--- glibc-2.10-355-g1abedcd/wcsmbs/wcsxfrm_l.c.orig 2009-11-20 17:11:21.000000000 +0900
+++ glibc-2.10-355-g1abedcd/wcsmbs/wcsxfrm_l.c 2009-11-20 17:11:52.000000000 +0900
@@ -22,6 +22,7 @@
#define STRING_TYPE wchar_t
#define USTRING_TYPE wint_t
+#define USTRING_MAX WCHAR_MAX
#define STRXFRM __wcsxfrm_l
#define STRCMP wcscmp
#define STRLEN __wcslen