This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0


On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:

> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
> Amd.1 was published).

Fixed in the patch below.
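
For reference, the C standard defines __STDC_ISO_10646__ as an integer
constant of the form yyyymmL.  A minimal Python sketch of that encoding
follows; the helper name is hypothetical and is not part of glibc:

```python
def stdc_iso_10646_value(year, month):
    """Format a publication date as a yyyymmL integer-constant string.

    Hypothetical helper for illustration only; glibc hard-codes the value
    in include/stdc-predef.h.
    """
    if not 1 <= month <= 12:
        raise ValueError('month must be in 1..12')
    return '{:04d}{:02d}L'.format(year, month)


# ISO/IEC 10646:2012 Amd.1 was published in April 2013.
print(stdc_iso_10646_value(2013, 4))
```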

On Feb 19, 2015, Mike FABIAN <mfabian@redhat.com> wrote:

> Mike Frysinger <vapier@gentoo.org> wrote:

>> module level constants should really be in CAPS.  and use a tuple to make it 
>> const.
>> -mike

> https://github.com/pravins/glibc-i18n/commit/53b81c58d220bfbb0e8faf8d4313c705826f4543

Thanks, integrated.  I also adjusted the copyright notices to use year
ranges, as requested.
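
To illustrate what the rename buys us, here is a small standalone sketch.
The tables and the divmod arithmetic are copied from utf8_gen.py as it
appears in the patch below; the names HANGUL_FIRST, HANGUL_LAST,
hangul_syllable_name and is_mutable are hypothetical and exist only for
this example:

```python
# Tables copied as they appear in utf8_gen.py in the patch.
# Index 5 of the final table is 'NI' there; Unicode's Jamo.txt gives the
# short name 'NJ' for that jamo.
JAMO_INITIAL_SHORT_NAME = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)

JAMO_MEDIAL_SHORT_NAME = (
    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
)

JAMO_FINAL_SHORT_NAME = (
    '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
    'P', 'H'
)

# Hypothetical names for the bounds of the Hangul syllable block.
HANGUL_FIRST = 0xAC00
HANGUL_LAST = 0xD7A3


def hangul_syllable_name(code_point):
    '''Build the character name of a precomposed Hangul syllable.'''
    if not HANGUL_FIRST <= code_point <= HANGUL_LAST:
        raise ValueError('not a precomposed Hangul syllable')
    # 28 final forms per medial, 21 medials per initial.
    index2, index3 = divmod(code_point - HANGUL_FIRST, 28)
    index1, index2 = divmod(index2, 21)
    return ('HANGUL SYLLABLE '
            + JAMO_INITIAL_SHORT_NAME[index1]
            + JAMO_MEDIAL_SHORT_NAME[index2]
            + JAMO_FINAL_SHORT_NAME[index3])


def is_mutable(seq):
    '''Return True if item assignment on seq succeeds.'''
    try:
        seq[0] = seq[0]
    except TypeError:
        return False
    return True


print(hangul_syllable_name(0xAC00))
print(is_mutable(JAMO_INITIAL_SHORT_NAME))
```

With tuples, an accidental item assignment to one of the tables raises
TypeError instead of silently changing the generated charmap.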

On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:

> One nit:

> -% Character width according to Unicode 5.0.0.
> +% Character width according to Unicode 7.0.0.
>  % - Default width is 1.
>  % - Double-width characters have width 2; generated from
>  %        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
> -%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
>  % - Non-spacing characters have width 0; generated from PropList.txt or
>  %   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
>  % - Format control characters have width 0; generated from
>  %   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
> -% - Zero width characters have width 0; generated from
> -%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

> Why even mention the `grep` to be used to generate this data?
> It should just say to use the scripts. Nobody should be confused
> that this data was actually generated by this method. Nor do I want
> anyone doing it this way ever again.

> Thus shouldn't `write_header_width` simply not output any of this
> stuff? I understand we're trying to minimize the initial diff, but
> in cleanup, we should remove all of this and just say:

> "% Character width according to Unicode 7.0.0."

I don't know enough about Unicode to tell whether we've extracted all of
the width information it encodes, but I have verified that the behavior
implemented in the Python script is equivalent to what the comments
describe, so I decided not to act on this right away.  We might want to
tweak the comments to make what's going on clearer, instead of just
dropping the info, although I wouldn't oppose dropping it either.
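
As a rough illustration of the rules those comments describe, here is a
sketch in Python.  It is not the code in the generator scripts: it uses
Python's unicodedata module as a stand-in for EastAsianWidth.txt and
UnicodeData.txt, so it reflects whatever Unicode version the interpreter
ships rather than 7.0.0.  The function name char_width and the order in
which the rules are applied are my assumptions; the comments do not say
which rule wins when more than one matches.

```python
import unicodedata


def char_width(char):
    '''Classify one character according to the header comment's rules.

    Hypothetical helper; assumes the zero-width rules take precedence
    over the double-width rule.
    '''
    # Format control characters (general category Cf) have width 0.
    if unicodedata.category(char) == 'Cf':
        return 0
    # Non-spacing characters (bidi class NSM) have width 0.
    if unicodedata.bidirectional(char) == 'NSM':
        return 0
    # East Asian Wide (W) and Fullwidth (F) characters have width 2.
    if unicodedata.east_asian_width(char) in ('W', 'F'):
        return 2
    # Default width is 1.
    return 1


for sample in ('a', '\u4e2d', '\u0301', '\u200b'):
    print('U+{:04X} {}'.format(ord(sample), char_width(sample)))
```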

Does anyone else have thoughts to share on this?

Mike FABIAN, should you want to tackle this, would you please submit a
patch to this list, with a proper ChangeLog entry, so that it can be
installed as written by yourself?


Here's the patch I'm testing.  Ok to install?


Amendments to Unicode 7 update.

From: Alexandre Oliva <aoliva@redhat.com>

for  ChangeLog

	* include/stdc-predef.h (__STDC_ISO_10646__): Update to
	201304L, for Unicode 7.

for  localedata/ChangeLog

	* unicode-gen/ctype_compatibility.py: Use date ranges in
	copyright notice.
	* unicode-gen/ctype_compatibility_test_cases.py: Likewise.
	* unicode-gen/gen_unicode_ctype.py: Likewise.
	* unicode-gen/utf8_compatibility.py: Likewise.
	* unicode-gen/utf8_gen.py: Likewise.  Use upper case for
	global variables, use tuples for global constant arrays.  From
	Mike FABIAN.  Suggested by Mike Frysinger <vapier@gentoo.org>.
---
 include/stdc-predef.h                              |   11 ++++++++---
 localedata/unicode-gen/ctype_compatibility.py      |    2 +-
 .../unicode-gen/ctype_compatibility_test_cases.py  |    2 +-
 localedata/unicode-gen/gen_unicode_ctype.py        |    2 +-
 localedata/unicode-gen/utf8_compatibility.py       |    2 +-
 localedata/unicode-gen/utf8_gen.py                 |   20 ++++++++++----------
 6 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/stdc-predef.h b/include/stdc-predef.h
index 1d6a4eb..e5f1139 100644
--- a/include/stdc-predef.h
+++ b/include/stdc-predef.h
@@ -49,9 +49,14 @@
 # define __STDC_IEC_559_COMPLEX__	1
 #endif
 
-/* wchar_t uses ISO/IEC 10646 (2nd ed., published 2011-03-15) /
-   Unicode 6.0.  */
-#define __STDC_ISO_10646__		201103L
+/* wchar_t uses Unicode 7.0.0.  Version 7.0 of the Unicode Standard is
+   synchronized with ISO/IEC 10646:2012, plus Amendments 1 (published
+   on April, 2013) and 2 (not yet published as of February, 2015).
+   Additionally, it includes the accelerated publication of U+20BD
+   RUBLE SIGN.  Therefore Unicode 7.0.0 is between 10646:2012 and
+   10646:2014, and so we use the date ISO/IEC 10646:2012 Amd.1 was
+   published.  */
+#define __STDC_ISO_10646__		201304L
 
 /* We do not support C11 <threads.h>.  */
 #define __STDC_NO_THREADS__		1
diff --git a/localedata/unicode-gen/ctype_compatibility.py b/localedata/unicode-gen/ctype_compatibility.py
index 19e9ee5..0d67f29 100755
--- a/localedata/unicode-gen/ctype_compatibility.py
+++ b/localedata/unicode-gen/ctype_compatibility.py
@@ -1,6 +1,6 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/ctype_compatibility_test_cases.py b/localedata/unicode-gen/ctype_compatibility_test_cases.py
index ab7f6dd..34e6de4 100644
--- a/localedata/unicode-gen/ctype_compatibility_test_cases.py
+++ b/localedata/unicode-gen/ctype_compatibility_test_cases.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
index 559af79..0c74f2a 100755
--- a/localedata/unicode-gen/gen_unicode_ctype.py
+++ b/localedata/unicode-gen/gen_unicode_ctype.py
@@ -1,7 +1,7 @@
 #!/usr/bin/python3
 #
 # Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 # Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
 #
diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
index e11327b..b84a1eb 100755
--- a/localedata/unicode-gen/utf8_compatibility.py
+++ b/localedata/unicode-gen/utf8_compatibility.py
@@ -1,6 +1,6 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 670a628..f1b88f5 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,6 +1,6 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
@@ -33,21 +33,21 @@ import re
 # Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
 # sections 3.11 and 4.4.
 
-jamo_initial_short_name = [
+JAMO_INITIAL_SHORT_NAME = (
     'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
     'C', 'K', 'T', 'P', 'H'
-]
+)
 
-jamo_medial_short_name = [
+JAMO_MEDIAL_SHORT_NAME = (
     'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
     'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
-]
+)
 
-jamo_final_short_name = [
+JAMO_FINAL_SHORT_NAME = (
     '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
     'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
     'P', 'H'
-]
+)
 
 def ucs_symbol(code_point):
     '''Return the UCS symbol string for a Unicode character.'''
@@ -74,9 +74,9 @@ def process_range(start, end, outfile, name):
             index2, index3 = divmod(i - 0xaC00, 28)
             index1, index2 = divmod(index2, 21)
             hangul_syllable_name = 'HANGUL SYLLABLE ' \
-                                   + jamo_initial_short_name[index1] \
-                                   + jamo_medial_short_name[index2] \
-                                   + jamo_final_short_name[index3]
+                                   + JAMO_INITIAL_SHORT_NAME[index1] \
+                                   + JAMO_MEDIAL_SHORT_NAME[index2] \
+                                   + JAMO_FINAL_SHORT_NAME[index3]
             outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 ucs_symbol(i), convert_to_hex(i),
                 hangul_syllable_name))


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

