This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


[Patch v3 11/14] [BZ #14095] update collation data from Unicode / ISO 14651

From: Mike FABIAN <mfabian at redhat dot com>
To: libc-alpha at sourceware dot org
Cc: "Dmitry V. Levin" <ldv at altlinux dot org>
Date: Fri, 23 Feb 2018 11:24:39 +0100
Subject: [Patch v3 11/14] [BZ #14095] update collation data from Unicode / ISO 14651
Authentication-results: sourceware.org; auth=none

From 5c65168e569ba0c59ad43bbd88f37cdb356c16b6 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Tue, 23 Jan 2018 17:29:36 +0100
Subject: [PATCH 11/14] Fix test cases tst-fnmatch and tst-regexloc for the new
 iso14651_t1_common file.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

See:

http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html

> A range expression represents the set of collating elements that fall
> between two elements in the current collation sequence,
> inclusively. It is expressed as the starting point and the ending
> point separated by a hyphen (-).
>
> Range expressions must not be used in portable applications because
> their behaviour is dependent on the collating sequence. Ranges will be
> treated according to the current collating sequence, and include such
> characters that fall within the range based on that collating
> sequence, regardless of character values. This, however, means that
> the interpretation will differ depending on collating sequence. If,
> for instance, one collating sequence defines ä as a variant of a,
> while another defines it as a letter following z, then the expression
> [ä-z] is valid in the first language and invalid in the second.

Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
The new iso14651_t1_common lists upper case and lower case Latin characters
in a different order than the old one, which causes surprising results,
for example in the de_DE locale: [a-z] now includes A because A comes
after a in iso14651_t1_common, but does not include Z because Z comes
after z in iso14651_t1_common.
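
To see the effect on a given system, something like the following
sketch can be used.  It is only an illustration, not part of this patch
or of the glibc test suite: match_in_locale is a hypothetical helper
name, and whether a de_DE locale is installed is an assumption about
the machine at hand.  It sets the requested locale, runs fnmatch on one
pattern/string pair, and then switches back to the C locale.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not a glibc function): match STRING against
   PATTERN with LC_ALL set to LOCALE_NAME.
   Returns 1 on a match, 0 on no match, and -1 if the locale cannot
   be set or fnmatch reports an error.  */
int
match_in_locale (const char *locale_name, const char *pattern,
                 const char *string)
{
  int rc;

  assert (locale_name != NULL && pattern != NULL && string != NULL);

  /* The locale may not be installed; report that instead of
     silently testing in whatever locale was active before.  */
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  rc = fnmatch (pattern, string, 0);

  /* Return to the C locale so later calls start from a known state.  */
  setlocale (LC_ALL, "C");

  if (rc == 0)
    return 1;
  if (rc == FNM_NOMATCH)
    return 0;
  return -1;
}
```

In the C locale, match_in_locale ("C", "[a-z]", "A") returns 0 and
match_in_locale ("C", "[[:lower:]]", "a") returns 1.  Passing
"de_DE.UTF-8" instead shows how the range behaves with the collation
data installed on that system, while the character class [[:lower:]]
does not depend on the collation order.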

	* posix/tst-fnmatch.input: Use range expressions only in C locale.
	* posix/tst-regexloc.c: Do not use a range expression for
	de_DE.ISO-8859-1 locale.
---
 posix/tst-fnmatch.input | 40 ----------------------------------------
 posix/tst-regexloc.c    |  4 ++--
 2 files changed, 2 insertions(+), 42 deletions(-)

diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index 88b3f739a5..1e2f62c0ed 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -418,26 +418,6 @@ C		"-"			"[Z-\\]]"	       NOMATCH
 # Following are tests outside the scope of IEEE 2003.2 since they are using
 # locales other than the C locale.  The main focus of the tests is on the
 # handling of ranges and the recognition of character (vs bytes).
-de_DE.ISO-8859-1 "a"			"[a-z]"		       0
-de_DE.ISO-8859-1 "z"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ä"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ö"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ü"			"[a-z]"		       0
-de_DE.ISO-8859-1 "A"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "Z"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "Ä"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "Ö"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "Ü"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "a"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "z"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "ä"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "ö"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "ü"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "A"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Z"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ö"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ü"			"[A-Z]"		       0
 de_DE.ISO-8859-1 "a"			"[[:lower:]]"	       0
 de_DE.ISO-8859-1 "z"			"[[:lower:]]"	       0
 de_DE.ISO-8859-1 "ä"			"[[:lower:]]"	       0
@@ -510,26 +490,6 @@ de_DE.ISO-8859-1 "ba"			"[[.a.]]a"	       NOMATCH
 
 
 # And with a multibyte character set.
-de_DE.UTF-8	 "a"			"[a-z]"		       0
-de_DE.UTF-8	 "z"			"[a-z]"		       0
-de_DE.UTF-8	 "ä"			"[a-z]"		       0
-de_DE.UTF-8	 "ö"			"[a-z]"		       0
-de_DE.UTF-8	 "ü"			"[a-z]"		       0
-de_DE.UTF-8	 "A"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "Z"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "Ä"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "Ö"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "Ü"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "a"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "z"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "ä"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "ö"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "ü"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "A"			"[A-Z]"		       0
-de_DE.UTF-8	 "Z"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ä"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ö"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ü"			"[A-Z]"		       0
 de_DE.UTF-8	 "a"			"[[:lower:]]"	       0
 de_DE.UTF-8	 "z"			"[[:lower:]]"	       0
 de_DE.UTF-8	 "ä"			"[[:lower:]]"	       0
diff --git a/posix/tst-regexloc.c b/posix/tst-regexloc.c
index 60235b4d3b..7fbc496d0c 100644
--- a/posix/tst-regexloc.c
+++ b/posix/tst-regexloc.c
@@ -29,8 +29,8 @@ do_test (void)
 
   if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
     puts ("cannot set locale");
-  else if (regcomp (&re, "[a-f]*", 0) != REG_NOERROR)
-    puts ("cannot compile expression \"[a-f]*\"");
+  else if (regcomp (&re, "[abcdef]*", 0) != REG_NOERROR)
+    puts ("cannot compile expression \"[abcdef]*\"");
   else if (regexec (&re, "abcdefCDEF", 1, mat, 0) == REG_NOMATCH)
     puts ("no match");
   else
-- 
2.14.3
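
For readers who want to try the tst-regexloc.c change outside the glibc
tree, here is a minimal stand-alone sketch.  It is not the test itself:
first_match_length is a hypothetical helper name, and it compares
against 0 rather than the GNU-specific REG_NOERROR so that it also
builds with other C libraries.  It compiles a basic regular expression
in the requested locale and returns the length of the first match.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not a glibc function): compile PATTERN as a
   POSIX basic regular expression with LC_ALL set to LOCALE_NAME and
   return the length of the first match in TEXT.
   Returns -1 if the locale cannot be set, the pattern does not
   compile, or nothing matches.  */
long
first_match_length (const char *locale_name, const char *pattern,
                    const char *text)
{
  regex_t re;
  regmatch_t mat[1];
  long len = -1;

  assert (locale_name != NULL && pattern != NULL && text != NULL);

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  /* regcomp and regexec both return 0 on success.  */
  if (regcomp (&re, pattern, 0) == 0)
    {
      if (regexec (&re, text, 1, mat, 0) == 0)
        len = (long) (mat[0].rm_eo - mat[0].rm_so);
      regfree (&re);
    }

  /* Return to the C locale so later calls start from a known state.  */
  setlocale (LC_ALL, "C");
  return len;
}
```

In the C locale, both first_match_length ("C", "[abcdef]*",
"abcdefCDEF") and first_match_length ("C", "[a-f]*", "abcdefCDEF")
return 6.  The explicit list [abcdef] names exactly six characters in
every locale, whereas the set matched by the range [a-f] depends on the
collation data of the locale in use.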

Follow-Ups:
- Re: [Patch v3 11/14] [BZ #14095] update collation data from Unicode / ISO 14651
  - From: Carlos O'Donell
