This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[PATCH] wcscoll/strcoll needs to check if the char is out of thelocale collation

From: Takao Fujiwara <tfujiwar at redhat dot com>
To: libc-help at sourceware dot org
Cc: Takao Fujiwara <tfujiwar at redhat dot com>
Date: Tue, 01 Dec 2009 10:00:23 +0900
Subject: [PATCH] wcscoll/strcoll needs to check if the char is out of thelocale collation
Reply-to: tfujiwar at redhat dot com

Hi,

I'm attaching the two kind of patches and need your ideas:
glibc-xx-errno-strcoll.diff
glibc-xx-set-undefined.diff

Currently I'm thinking how to sort UTF-8 strings on GNOME/GDM.
GDM uses g_utf8_collate() to sort the UTF-8 language names.

The following link is the source code of g_utf8_collate():
http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c

g_utf8_collate() uses wcscoll() internally.

----------------------------
 70:gint
 71: g_utf8_collate (const gchar *str1,
 72:  const gchar *str2)
 73:{
...
109:  result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm);
...
154:}

However if the chars are not defined in the locale collation, the
returned value is not correct.
I'm attaching the test program (a.c) in this mail to explain the problem.
It compares the Korean chars and ASCII chars.
If you run the test program on ja_JP.UTF-8, wcscoll() returns "Korean
chars < ASCII chars".
But I would expect "Korean chars > ASCII chars" on ja_JP.UTF-8.
Then I would think the returned value is not defined in ja_JP collation
and I thought setting errno would be good if the char is not defined in
the collation.
E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C so
I think the ja_JP.UTF-8 collation table doesn't contain all UTF-8 chars.

Regarding to WCSCOLL(3P):

----------------------------
RETURN VALUE
On  error,  wcscoll()  shall  set
       errno, but no return value is reserved to indicate an error.

ERRORS
       The wcscoll() function may fail if:

       EINVAL The  ws1  or  ws2 arguments contain wide-character codes
outside
              the domain of the collating sequence.
----------------------------

The attachment glibc-xx-errno-strcoll.diff sets EINVAL if the value is
out of the table.
My understanding is, __collidx_table_lookup() checks in libc.so if the
char is defined in the collation table so my suggestion is to set errno
if the char is not defined in the table.

If wcscoll/strcoll/wscxfrm/strxfrm would set errno, I could enhance
g_utf8_collate(_key) later.
E.g. if wcscoll() returns undefined value with errno, wcscmp() could be
called later.

However somebody might say ja_JP collation table should have all UTF-8
chars but actually ja_JP file is not so.

if a char is not defined in glib/localedata/locales/ja_JP,
__collidx_table_lookup() returns 0 in libc.so.
If we could use __collseq_table_lookup() instead, it would return the
max value for the undefined char and I could resolve this problem.
But I think we need to use __collidx_table_lookup() for wcscoll() since
the size of locale collation is unclear.

But the problem is when we receive 0, U+0 is actually defined in
glib/localedata/locales/ja_JP LC_COLLATION and the result is, the
undefined chars are always collated in front of the defined chars in
wcscoll().

E.g. If I think a is ASCII char, b is a Japanese char, c is a Korean
char, the collation would be c < a < b on ja_JP.UTF-8 since U+0 is
defined in ja_JP file.

But if you look at ja_JP file, the file also defines "UNDEFINED" in
LC_COLLATE.
UNDEFINED char should be collated at last.
But the word "UNDEFINED" seems to be used in localedef program only.
If we run wcscoll(), we don't know which index of weight[] is the
UNDEFINED value.

Then I'm attaching another solution (glibc-xx-set-undefined.diff).

So my solution is, if wcscoll() receives 0 from findidx(), wcscoll() use
USTRING_MAX instead of weight[].

If I see zh_CN file, U+0 is not defined. The undefined chars are always
collated in front of the defined chars in wcscoll() because the
following line effects the result in wcscoll():

   result = seq1len == 0 ? -1 : 1;

seq1len is 0 but the string is not shorter than the other in this case.
The string is not defined in the locale collation in this case actually.

I'd modified this part in glibc-xx-set-undefined.diff.

Probably it's good for wcscoll() to follow the 'UNDEFINED' keyword in
the locale collation file and I think 'UNDEFINED' should be put in the
last of the LC_COLLATE.

Thanks,
fujiwara

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <errno.h>

#define CHECK_STRCOLL 1

int
main (int argc, char *argv[])
{
  const static char *array[] = {
    "æ±?è¯ (ä¸å?½)",
    "à¹?à¸?à¸¢ (à¹?à¸?à¸¢)",
    "Albanian ",
    "ä¸æ?? (å?°ç?£)",
    "Î?Î»Î»Î·Î½Î¹ÎºÎ¬, Î£Ï?Î³Ï?Ï?Î¿Î½Î± (Î?Ï?Ï?Ï?Î¿Ï?)",
    "Ã?slenska (Ã?sland)",
    "í??êµì?´ (ë??í??ë¯¼êµ)",
    NULL };
  const char *a, *b;
  char result[100];
  wchar_t ws_a[100], ws_b[100];
  wchar_t ws_result[100];
  mbstate_t mbs;
  int i;

  a = array [6];
  b = array [2];
  setlocale (LC_ALL, "");

  errno = 0;
#ifdef CHECK_STRCOLL
  i = strcoll (a, b);
#else
  i = strxfrm (NULL, a, 0);
  strxfrm (result, a, i + 1);
#endif

  mbstowcs (ws_a, a, sizeof (ws_a) / sizeof (wchar_t));
  mbstowcs (ws_b, b, sizeof (ws_b) / sizeof (wchar_t));

#ifdef CHECK_STRCOLL
  wprintf (L"strcoll %ls %d %ls %d\n", ws_a, i, ws_b, errno);
#else
  wprintf (L"strxfrm %ls %d %d\n", ws_a, i, errno);
#endif

  errno = 0;
#ifdef CHECK_STRCOLL
  i = wcscoll (ws_a, ws_b);
  wprintf (L"wcscoll %ls %d %ls %d\n", ws_a, i, ws_b, errno);
#else
  i = wcsxfrm (NULL, ws_a, 0);
  wcsxfrm (ws_result, ws_a, i + 1);
  wprintf (L"wcsxfrm %ls %d %d\n", ws_a, i, errno);
#endif

  memset (&mbs, 0, sizeof (mbs));
  i = mbrlen (a, strlen (a) + 1, &mbs);
  memset (&mbs, 0, sizeof (mbs));
  i = mbrlen (b, strlen (b) + 1, &mbs);

  i = wcscmp (ws_a, ws_b);
  wprintf (L"wcscmp %ls %d %ls %d %d\n", ws_a, i, ws_b, sizeof (ws_a), sizeof (wchar_t));

  i = strcmp (a, b);

  return 0;
}

--- glibc-2.10-355-g1abedcd/string/strcoll_l.c.orig	2009-11-12 15:21:16.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strcoll_l.c	2009-11-16 17:36:10.000000000 +0900
@@ -25,6 +25,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <string.h>
+#include <errno.h>
 
 #ifndef STRING_TYPE
 # define STRING_TYPE char
@@ -210,6 +211,8 @@ STRCOLL (s1, s2, l)
 		    idx1arr[idx1max] = tmp & 0xffffff;
 		    idx1cnt = idx1max++;
 
+		    if (tmp == 0)
+		      __set_errno (EINVAL);
 		    if ((rulesets[rule1arr[idx1cnt] * nrules]
 			 & sort_backward) == 0)
 		      /* No more backward characters to push.  */
--- glibc-2.10-355-g1abedcd/string/strxfrm_l.c.orig	2009-11-16 17:27:41.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strxfrm_l.c	2009-11-16 17:29:01.000000000 +0900
@@ -26,6 +26,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <sys/param.h>
+#include <errno.h>
 
 #ifndef STRING_TYPE
 # define STRING_TYPE char
@@ -180,6 +181,8 @@ STRXFRM (STRING_TYPE *dest, const STRING
       rulearr[idxmax] = tmp >> 24;
       idxarr[idxmax] = tmp & 0xffffff;
 
+      if (tmp == 0)
+	__set_errno (EINVAL);
       ++idxmax;
     }
   while (*usrc != L('\0'));

--- glibc-2.10-355-g1abedcd/string/strcoll_l.c.orig	2009-11-12 15:21:16.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strcoll_l.c	2009-11-20 17:48:40.000000000 +0900
@@ -29,6 +29,7 @@
 #ifndef STRING_TYPE
 # define STRING_TYPE char
 # define USTRING_TYPE unsigned char
+# define USTRING_MAX UCHAR_MAX
 # define STRCOLL __strcoll_l
 # define STRCMP strcmp
 # define STRLEN strlen
@@ -304,9 +305,13 @@ STRCOLL (s1, s2, l)
 	       first level.  */
 	    break;
 
-	  /* This means one string is shorter than the other.  Find out
-	     which one and return an appropriate value.  */
-	  result = seq1len == 0 ? -1 : 1;
+	  /* if the char is not defined in the locale table,
+	     findidx returns 0 but weights is 0 if U+0 is not defined
+	     in the locale collation file. */
+	  if (seq1len == 0)
+	    result = USTRING_MAX - weights[idx2arr[idx2now]];
+	  else
+	    result = weights[idx1arr[idx1now]] - USTRING_MAX;
 	  goto free_and_return;
 	}
 
@@ -322,8 +327,15 @@ STRCOLL (s1, s2, l)
 	{
 	  if (weights[idx1arr[idx1now]] != weights[idx2arr[idx2now]])
 	    {
-	      /* The sequences differ.  */
-	      result = weights[idx1arr[idx1now]] - weights[idx2arr[idx2now]];
+	      /* findidx returns 0 if the locale table doesn't define
+	         the char and 0 is incremented to 1. */
+	      if (idx1arr[idx1now] == 1)
+	        result = USTRING_MAX - weights[idx2arr[idx2now]];
+	      else if (idx2arr[idx2now] == 1)
+	        result = weights[idx1arr[idx1now]] - USTRING_MAX;
+	      else
+	        /* The sequences differ.  */
+	        result = weights[idx1arr[idx1now]] - weights[idx2arr[idx2now]];
 	      goto free_and_return;
 	    }
 
--- glibc-2.10-355-g1abedcd/string/strxfrm_l.c.orig	2009-11-16 17:27:41.000000000 +0900
+++ glibc-2.10-355-g1abedcd/string/strxfrm_l.c	2009-11-20 17:50:49.000000000 +0900
@@ -30,6 +30,7 @@
 #ifndef STRING_TYPE
 # define STRING_TYPE char
 # define USTRING_TYPE unsigned char
+# define USTRING_MAX UCHAR_MAX
 # define STRXFRM __strxfrm_l
 # define STRCMP strcmp
 # define STRLEN strlen
@@ -235,8 +236,14 @@ STRXFRM (STRING_TYPE *dest, const STRING
 		  /* Now handle the forward element.  */
 		  len = weights[idxarr[idxcnt]++];
 		  if (needed + len < n)
-		    while (len-- > 0)
-		      dest[needed++] = weights[idxarr[idxcnt]++];
+		    while (len-- > 0) {
+		      /* findidx returns 0 if the locale table doesn't
+		         define the char and 0 is incremented to 1. */
+		      if (idxarr[idxcnt] == 1)
+		        dest[needed++] = USTRING_MAX;
+		      else
+		        dest[needed++] = weights[idxarr[idxcnt]++];
+		    }
 		  else
 		    {
 		      /* No more characters fit into the buffer.  */
@@ -419,9 +426,11 @@ STRXFRM (STRING_TYPE *dest, const STRING
 	}
 
       /* Finally store the byte to separate the passes or terminate
-	 the string.  */
+	 the string.  if the char is not defined in the locale table,
+	 findidx returns 0 and weights is 0 if U+0 is not defined in
+	 the locale collation file. */
       if (needed < n)
-	dest[needed] = pass + 1 < nrules ? L('\1') : L('\0');
+	dest[needed] = pass + 1 < nrules ? USTRING_MAX : L('\0');
       ++needed;
     }
 
--- glibc-2.10-355-g1abedcd/wcsmbs/wcscoll_l.c.orig	2009-11-20 17:13:25.000000000 +0900
+++ glibc-2.10-355-g1abedcd/wcsmbs/wcscoll_l.c	2009-11-20 17:13:36.000000000 +0900
@@ -23,6 +23,7 @@
 
 #define STRING_TYPE wchar_t
 #define USTRING_TYPE wint_t
+#define USTRING_MAX WCHAR_MAX
 #define STRCOLL __wcscoll_l
 #define STRCMP wcscmp
 #define STRLEN __wcslen
--- glibc-2.10-355-g1abedcd/wcsmbs/wcsxfrm_l.c.orig	2009-11-20 17:11:21.000000000 +0900
+++ glibc-2.10-355-g1abedcd/wcsmbs/wcsxfrm_l.c	2009-11-20 17:11:52.000000000 +0900
@@ -22,6 +22,7 @@
 
 #define STRING_TYPE wchar_t
 #define USTRING_TYPE wint_t
+#define USTRING_MAX WCHAR_MAX
 #define STRXFRM __wcsxfrm_l
 #define STRCMP wcscmp
 #define STRLEN __wcslen

Follow-Ups:
- Re: [PATCH] wcscoll/strcoll needs to check if the char is out of the locale collation
  - From: Carlos O'Donell

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]