This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: Joseph Myers <joseph at codesourcery dot com>, Pravin Satpute <psatpute at redhat dot com>, Siddhesh Poyarekar <siddhesh at redhat dot com>, Mike FABIAN <mfabian at redhat dot com>, libc-alpha at sourceware dot org, Jens Petersen <petersen at redhat dot com>
- Date: Mon, 23 Feb 2015 09:25:58 -0500
- Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
On 02/20/2015 06:31 PM, Alexandre Oliva wrote:
> On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:
>
>> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
>> Amd.1 was published).
>
> Fixed in the patch below.
This change looks good to me. OK to commit.
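
For readers unfamiliar with the macro: the C standard defines
__STDC_ISO_10646__ as an integer constant of the form yyyymmL. The sketch
below only illustrates how such a value splits into a year and a month; the
function name decode_iso_10646_macro is hypothetical and not part of glibc.

```python
def decode_iso_10646_macro(value):
    '''Split a yyyymm integer (the C macro value without its L suffix)
    into a (year, month) tuple.'''
    year, month = divmod(value, 100)
    if not 1 <= month <= 12:
        raise ValueError('not a yyyymm value: {}'.format(value))
    return year, month


OLD_VALUE = 201103  # value removed by the patch
NEW_VALUE = 201304  # value added by the patch

print(decode_iso_10646_macro(OLD_VALUE))
print(decode_iso_10646_macro(NEW_VALUE))
```

Because the value is an ordinary integer, a plain comparison such as
NEW_VALUE > OLD_VALUE is enough to order two revisions.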
> On Feb 19, 2015, Mike FABIAN <mfabian@redhat.com> wrote:
>
>> Mike Frysinger <vapier@gentoo.org> wrote:
>
>>> Module-level constants should really be in CAPS, and a tuple should be
>>> used to make them const.
>>> -mike
>
>> https://github.com/pravins/glibc-i18n/commit/53b81c58d220bfbb0e8faf8d4313c705826f4543
>
> Thanks, integrated. I also adjusted the copyright notices to use year
> ranges, as requested.
Thanks.
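
As a minimal sketch of why the tuple suggestion helps: a tuple rejects item
assignment, while a list accepts it silently. SAMPLE_LIST, SAMPLE_TUPLE and
can_assign_item are hypothetical names used only for this illustration.

```python
SAMPLE_LIST = ['G', 'GG', 'N']    # mutable: items can be overwritten
SAMPLE_TUPLE = ('G', 'GG', 'N')   # immutable: item assignment raises


def can_assign_item(sequence):
    '''Return True if sequence[0] can be overwritten, False otherwise.'''
    try:
        sequence[0] = 'X'
    except TypeError:
        return False
    return True


print(can_assign_item(SAMPLE_LIST))
print(can_assign_item(SAMPLE_TUPLE))
```

Note that the upper-case name is only a convention. Python does not stop
anyone from rebinding the name itself, so the tuple protects the contents
and the capitals document the intent.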
> On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:
>
>> One nit:
>
>> -% Character width according to Unicode 5.0.0.
>> +% Character width according to Unicode 7.0.0.
>> % - Default width is 1.
>> % - Double-width characters have width 2; generated from
>> % "grep '^[^;]*;[WF]' EastAsianWidth.txt"
>> -% and "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
>> % - Non-spacing characters have width 0; generated from PropList.txt or
>> % "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
>> % - Format control characters have width 0; generated from
>> % "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
>> -% - Zero width characters have width 0; generated from
>> -% "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
>
>> Why even mention the `grep` to be used to generate this data?
>> It should just say to use the scripts. Nobody should be confused
>> that this data was actually generated by this method. Nor do I want
>> anyone doing it this way ever again.
>
>> Thus shouldn't `write_header_width` simply not output any of this
>> stuff? I understand we're trying to minimize the initial diff, but
>> in cleanup, we should remove all of this and just say:
>
>> "% Character width according to Unicode 7.0.0."
>
> I don't know enough about Unicode to tell whether we've extracted all of
> the width information encoded in it, but I have verified that the behavior
> encoded in the Python script is equivalent to what is described in the
> comments, so I decided not to act on this right away. I guess we might
> want to tweak the comments to make what's going on clearer, instead of
> just dropping the info, although I wouldn't oppose that either.
>
> Does anyone else have thoughts to share on this?
>
> Mike FABIAN, should you want to tackle this, would you please submit a
> patch to this list, with a proper ChangeLog entry, so that it can be
> installed as written by yourself?
Yes, please take this up with Mike and make sure we clean it up.
My preference is to remove the comment entirely.
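
For context, the rules that the old header comment describes with grep can
be sketched in a few lines of Python. This is not the code in utf8_gen.py.
The sample lines are illustrative data in the style of EastAsianWidth.txt
and UnicodeData.txt, the function names are hypothetical, and letting
width 0 override width 2 is an assumption of this sketch.

```python
import re

# Illustrative sample lines, not an extract to rely on.
EAST_ASIAN_WIDTH_SAMPLE = (
    '# a comment line',
    '0041;Na',
    '1100..115F;W',
    'FF01..FF60;F',
)
UNICODE_DATA_SAMPLE = (
    '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
    '0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;',
    '200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;',
)

# The same patterns that the header comment passes to grep.
WIDE = re.compile(r'^[^;]*;[WF]')
NON_SPACING = re.compile(r'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;')
FORMAT_CONTROL = re.compile(r'^[^;]*;[^;]*;Cf;')


def code_points(field):
    '''Expand "XXXX" or "XXXX..YYYY" into a range of code points.'''
    start, _, end = field.partition('..')
    return range(int(start, 16), int(end or start, 16) + 1)


def build_widths(east_asian_lines, unicode_data_lines):
    '''Return {code_point: width} for every non-default width.'''
    widths = {}
    for line in east_asian_lines:
        if WIDE.match(line):
            for code_point in code_points(line.split(';')[0]):
                widths[code_point] = 2
    for line in unicode_data_lines:
        if NON_SPACING.match(line) or FORMAT_CONTROL.match(line):
            widths[int(line.split(';')[0], 16)] = 0
    return widths


WIDTHS = build_widths(EAST_ASIAN_WIDTH_SAMPLE, UNICODE_DATA_SAMPLE)


def width(code_point):
    '''Default width is 1 unless a rule above assigned 0 or 2.'''
    return WIDTHS.get(code_point, 1)
```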
> Here's the patch I'm testing. Ok to install?
Yes, OK to install.
> Amendments to Unicode 7 update.
>
> From: Alexandre Oliva <aoliva@redhat.com>
>
> for ChangeLog
>
> * include/stdc-predef.h (__STDC_ISO_10646__): Update to
> 201304L, for Unicode 7.
OK.
> for localedata/ChangeLog
>
> * unicode-gen/ctype_compatibility.py: Use date ranges in
> copyright notice.
> * unicode-gen/ctype_compatibility_test_cases.py: Likewise.
> * unicode-gen/gen_unicode_ctype.py: Likewise.
> * unicode-gen/utf8_compatibility.py: Likewise.
> * unicode-gen/utf8_gen.py: Likewise. Use upper case for
> global variables, use tuples for global constant arrays. From
> Mike FABIAN. Suggested by Mike Frysinger <vapier@gentoo.org>.
> ---
> include/stdc-predef.h | 11 ++++++++---
> localedata/unicode-gen/ctype_compatibility.py | 2 +-
> .../unicode-gen/ctype_compatibility_test_cases.py | 2 +-
> localedata/unicode-gen/gen_unicode_ctype.py | 2 +-
> localedata/unicode-gen/utf8_compatibility.py | 2 +-
> localedata/unicode-gen/utf8_gen.py | 20 ++++++++++----------
> 6 files changed, 22 insertions(+), 17 deletions(-)
>
> diff --git a/include/stdc-predef.h b/include/stdc-predef.h
> index 1d6a4eb..e5f1139 100644
> --- a/include/stdc-predef.h
> +++ b/include/stdc-predef.h
> @@ -49,9 +49,14 @@
> # define __STDC_IEC_559_COMPLEX__ 1
> #endif
>
> -/* wchar_t uses ISO/IEC 10646 (2nd ed., published 2011-03-15) /
> - Unicode 6.0. */
> -#define __STDC_ISO_10646__ 201103L
> +/* wchar_t uses Unicode 7.0.0. Version 7.0 of the Unicode Standard is
> + synchronized with ISO/IEC 10646:2012, plus Amendments 1 (published
> + on April, 2013) and 2 (not yet published as of February, 2015).
> + Additionally, it includes the accelerated publication of U+20BD
> + RUBLE SIGN. Therefore Unicode 7.0.0 is between 10646:2012 and
> + 10646:2014, and so we use the date ISO/IEC 10646:2012 Amd.1 was
> + published. */
OK. Excellent comment.
> +#define __STDC_ISO_10646__ 201304L
>
> /* We do not support C11 <threads.h>. */
> #define __STDC_NO_THREADS__ 1
> diff --git a/localedata/unicode-gen/ctype_compatibility.py b/localedata/unicode-gen/ctype_compatibility.py
> index 19e9ee5..0d67f29 100755
> --- a/localedata/unicode-gen/ctype_compatibility.py
> +++ b/localedata/unicode-gen/ctype_compatibility.py
> @@ -1,6 +1,6 @@
> #!/usr/bin/python3
> # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
> # This file is part of the GNU C Library.
> #
> # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/ctype_compatibility_test_cases.py b/localedata/unicode-gen/ctype_compatibility_test_cases.py
> index ab7f6dd..34e6de4 100644
> --- a/localedata/unicode-gen/ctype_compatibility_test_cases.py
> +++ b/localedata/unicode-gen/ctype_compatibility_test_cases.py
> @@ -1,5 +1,5 @@
> # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
> # This file is part of the GNU C Library.
> #
> # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
> index 559af79..0c74f2a 100755
> --- a/localedata/unicode-gen/gen_unicode_ctype.py
> +++ b/localedata/unicode-gen/gen_unicode_ctype.py
> @@ -1,7 +1,7 @@
> #!/usr/bin/python3
> #
> # Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
> # This file is part of the GNU C Library.
> # Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
> #
> diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
> index e11327b..b84a1eb 100755
> --- a/localedata/unicode-gen/utf8_compatibility.py
> +++ b/localedata/unicode-gen/utf8_compatibility.py
> @@ -1,6 +1,6 @@
> #!/usr/bin/python3
> # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
> # This file is part of the GNU C Library.
> #
> # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 670a628..f1b88f5 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -1,6 +1,6 @@
> #!/usr/bin/python3
> # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
> # This file is part of the GNU C Library.
> #
> # The GNU C Library is free software; you can redistribute it and/or
> @@ -33,21 +33,21 @@ import re
> # Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
> # sections 3.11 and 4.4.
>
> -jamo_initial_short_name = [
> +JAMO_INITIAL_SHORT_NAME = (
> 'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
> 'C', 'K', 'T', 'P', 'H'
> -]
> +)
>
> -jamo_medial_short_name = [
> +JAMO_MEDIAL_SHORT_NAME = (
> 'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
> 'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
> -]
> +)
>
> -jamo_final_short_name = [
> +JAMO_FINAL_SHORT_NAME = (
> '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
> 'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
> 'P', 'H'
> -]
> +)
>
> def ucs_symbol(code_point):
> '''Return the UCS symbol string for a Unicode character.'''
> @@ -74,9 +74,9 @@ def process_range(start, end, outfile, name):
> index2, index3 = divmod(i - 0xaC00, 28)
> index1, index2 = divmod(index2, 21)
> hangul_syllable_name = 'HANGUL SYLLABLE ' \
> - + jamo_initial_short_name[index1] \
> - + jamo_medial_short_name[index2] \
> - + jamo_final_short_name[index3]
> + + JAMO_INITIAL_SHORT_NAME[index1] \
> + + JAMO_MEDIAL_SHORT_NAME[index2] \
> + + JAMO_FINAL_SHORT_NAME[index3]
> outfile.write('{:<11s} {:<12s} {:s}\n'.format(
> ucs_symbol(i), convert_to_hex(i),
> hangul_syllable_name))
>
>
OK.
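
For anyone reading the last hunk without the rest of utf8_gen.py, here is a
self-contained sketch of the Hangul syllable name arithmetic it touches. The
tables are copied from the patch as posted; the function name
hangul_syllable_name and the range check are additions for this
illustration only.

```python
# Tables copied from the patch as posted.
JAMO_INITIAL_SHORT_NAME = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)

JAMO_MEDIAL_SHORT_NAME = (
    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
)

JAMO_FINAL_SHORT_NAME = (
    '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
    'P', 'H'
)

HANGUL_BASE = 0xAC00           # first precomposed syllable
HANGUL_COUNT = 19 * 21 * 28    # initials * medials * finals


def hangul_syllable_name(code_point):
    '''Build the syllable name from the three table indices.'''
    offset = code_point - HANGUL_BASE
    if not 0 <= offset < HANGUL_COUNT:
        raise ValueError('not a precomposed Hangul syllable')
    # Same two divmod steps as process_range in the patch.
    rest, index3 = divmod(offset, 28)
    index1, index2 = divmod(rest, 21)
    return ('HANGUL SYLLABLE '
            + JAMO_INITIAL_SHORT_NAME[index1]
            + JAMO_MEDIAL_SHORT_NAME[index2]
            + JAMO_FINAL_SHORT_NAME[index3])


print(hangul_syllable_name(0xAC00))
print(hangul_syllable_name(0xD55C))
```

The first divmod peels off the final consonant (28 choices) and the second
splits the remainder into initial and medial (21 medials per initial), which
is why the table lengths are 19, 21 and 28.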
Cheers,
Carlos.