This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0

From: "Carlos O'Donell" <carlos at redhat dot com>
To: Alexandre Oliva <aoliva at redhat dot com>
Cc: Joseph Myers <joseph at codesourcery dot com>, Pravin Satpute <psatpute at redhat dot com>, Siddhesh Poyarekar <siddhesh at redhat dot com>, Mike FABIAN <mfabian at redhat dot com>, libc-alpha at sourceware dot org, Jens Petersen <petersen at redhat dot com>
Date: Mon, 23 Feb 2015 09:25:58 -0500
Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
Authentication-results: sourceware.org; auth=none
References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9dy4qir6fu dot fsf at ari dot site> <orfvce7y90 dot fsf at free dot home> <s9d388duu5r dot fsf at ari dot site> <orioh35mbq dot fsf at free dot home> <20141223111038 dot GA5172 at spoyarek dot pnq dot redhat dot com> <119234933 dot 5523688 dot 1422972847328 dot JavaMail dot zimbra at redhat dot com> <or7fvnlbeo dot fsf at livre dot home> <orwq3njuvc dot fsf at livre dot home> <54E23EC9 dot 5020400 at redhat dot com> <ortwyig5xa dot fsf at livre dot home> <alpine dot DEB dot 2 dot 10 dot 1502190055460 dot 24016 at digraph dot polyomino dot org dot uk> <54E79E51 dot 70002 at redhat dot com> <or8ufscg8p dot fsf at livre dot home>

On 02/20/2015 06:31 PM, Alexandre Oliva wrote:
> On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:
> 
>> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
>> Amd.1 was published).
> 
> Fixed in the patch below.

This change looks good to me. OK to commit.
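
For readers less familiar with the macro: the C standard defines
__STDC_ISO_10646__ as an integer of the form yyyymmL, so 201304L reads
as April 2013. A minimal sketch of that reading follows; the helper name
decode_stdc_iso_10646 is made up for illustration and is not part of
glibc or of the patch.

```python
def decode_stdc_iso_10646(value):
    """Split a yyyymm-style integer (e.g. 201304) into (year, month)."""
    year, month = divmod(value, 100)
    if not 1 <= month <= 12:
        raise ValueError('not a yyyymm value: {}'.format(value))
    return year, month


OLD_VALUE = 201103  # value removed by the patch (Unicode 6.0)
NEW_VALUE = 201304  # value added by the patch (ISO/IEC 10646:2012 Amd.1)

for value in (OLD_VALUE, NEW_VALUE):
    year, month = decode_stdc_iso_10646(value)
    print('{}L -> year {}, month {}'.format(value, year, month))
```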

> On Feb 19, 2015, Mike FABIAN <mfabian@redhat.com> wrote:
> 
>> Mike Frysinger <vapier@gentoo.org> wrote:
> 
>>> module level constants should really be in CAPS.  and use a tuple to make it 
>>> const.
>>> -mike
> 
>> https://github.com/pravins/glibc-i18n/commit/53b81c58d220bfbb0e8faf8d4313c705826f4543
> 
> Thanks, integrated.  I also adjusted the copyright notices to use year
> ranges, as requested.

Thanks.
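
As a short illustration of the point about tuples: a tuple rejects item
assignment, while a list accepts it. The sketch below reuses one table
from the patch; the helper try_to_modify is hypothetical and exists only
for this demonstration.

```python
# Table copied from the utf8_gen.py hunk in this patch.
JAMO_INITIAL_SHORT_NAME = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)


def try_to_modify(table):
    """Return True if item assignment on the table succeeds."""
    try:
        table[0] = 'X'
    except TypeError:
        # Tuples do not support item assignment.
        return False
    return True


print(try_to_modify(list(JAMO_INITIAL_SHORT_NAME)))  # a list copy is mutable
print(try_to_modify(JAMO_INITIAL_SHORT_NAME))        # the tuple is not
```

Note that a tuple only protects the contents. The module-level name can
still be rebound, which is why the upper-case spelling matters as a
signal to readers.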

> On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:
> 
>> One nit:
> 
>> -% Character width according to Unicode 5.0.0.
>> +% Character width according to Unicode 7.0.0.
>>  % - Default width is 1.
>>  % - Double-width characters have width 2; generated from
>>  %        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
>> -%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
>>  % - Non-spacing characters have width 0; generated from PropList.txt or
>>  %   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
>>  % - Format control characters have width 0; generated from
>>  %   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
>> -% - Zero width characters have width 0; generated from
>> -%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
> 
>> Why even mention the `grep` to be used to generate this data?
>> It should just say to use the scripts. Nobody should be confused
>> that this data was actually generated by this method. Nor do I want
>> anyone doing it this way ever again.
> 
>> Thus shouldn't `write_header_width` simply not output any of this
>> stuff? I understand we're trying to minimize the initial diff, but
>> in cleanup, we should remove all of this and just say:
> 
>> "% Character width according to Unicode 7.0.0."
> 
> I don't know enough about Unicode to tell whether we've extracted all of
> the width information encoded in it, but I have verified that behavior
> encoded in the python script is equivalent to what is described in the
> comments, so I decided not to act on this right away.  I guess we might
> want to tweak the comments to make what's going on clearer, instead of
> just dropping the info, although I wouldn't oppose that either.
> 
> Does anyone else have thoughts to share on this?
> 
> Mike FABIAN, should you want to tackle this, would you please submit a
> patch to this list, with a proper ChangeLog entry, so that it can be
> installed as written by yourself?

Yes, please take this up with Mike and make sure we clean it up.
My preference is to remove the comment entirely. 
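
For context, the rules that the old header comment expresses as grep
patterns can be sketched in Python roughly as below. This is not the
code in utf8_gen.py: the function names, the sample lines, and the
choice to check zero width before double width are all assumptions made
for illustration, and the PropList.txt alternative mentioned in the
comment is not covered.

```python
def is_double_width(east_asian_line):
    """Approximate: grep '^[^;]*;[WF]' EastAsianWidth.txt"""
    fields = east_asian_line.split('#', 1)[0].split(';')
    return len(fields) > 1 and fields[1].strip()[:1] in ('W', 'F')


def is_zero_width(unicode_data_line):
    """Approximate the two UnicodeData.txt patterns from the comment."""
    fields = unicode_data_line.split(';')
    if len(fields) < 5:
        return False
    # Field 2 is the general category (Cf = format control);
    # field 4 is the bidi class (NSM = non-spacing mark).
    return fields[2] == 'Cf' or fields[4] == 'NSM'


def width_of(unicode_data_line, east_asian_line):
    """Return 0, 2, or the default of 1 (precedence is an assumption)."""
    if is_zero_width(unicode_data_line):
        return 0
    if is_double_width(east_asian_line):
        return 2
    return 1


# Hand-written sample lines in the two file formats.
SAMPLES = (
    ('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;', '0041;Na'),
    ('0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;;;;;', '0300;A'),
    ('200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;', '200B;A'),
    ('3000;IDEOGRAPHIC SPACE;Zs;0;WS;;;;;N;;;;;', '3000;F'),
)

for ucd_line, eaw_line in SAMPLES:
    print(ucd_line.split(';')[0], width_of(ucd_line, eaw_line))
```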

> Here's the patch I'm testing.  Ok to install?
 
Yes, OK to install.
 
> Amendments to Unicode 7 update.
> 
> From: Alexandre Oliva <aoliva@redhat.com>
> 
> for  ChangeLog
> 
> 	* include/stdc-predef.h (__STDC_ISO_10646__): Update to
> 	201304L, for Unicode 7.

OK.

> for  localedata/ChangeLog
> 
> 	* unicode-gen/ctype_compatibility.py: Use date ranges in
> 	copyright notice.
> 	* unicode-gen/ctype_compatibility_test_cases.py: Likewise.
> 	* unicode-gen/gen_unicode_ctype.py: Likewise.
> 	* unicode-gen/utf8_compatibility.py: Likewise.
> 	* unicode-gen/utf8_gen.py: Likewise.  Use upper case for
> 	global variables, use tuples for global constant arrays.  From
> 	Mike FABIAN.  Suggested by Mike Frysinger <vapier@gentoo.org>.
> ---
>  include/stdc-predef.h                              |   11 ++++++++---
>  localedata/unicode-gen/ctype_compatibility.py      |    2 +-
>  .../unicode-gen/ctype_compatibility_test_cases.py  |    2 +-
>  localedata/unicode-gen/gen_unicode_ctype.py        |    2 +-
>  localedata/unicode-gen/utf8_compatibility.py       |    2 +-
>  localedata/unicode-gen/utf8_gen.py                 |   20 ++++++++++----------
>  6 files changed, 22 insertions(+), 17 deletions(-)
> 
> diff --git a/include/stdc-predef.h b/include/stdc-predef.h
> index 1d6a4eb..e5f1139 100644
> --- a/include/stdc-predef.h
> +++ b/include/stdc-predef.h
> @@ -49,9 +49,14 @@
>  # define __STDC_IEC_559_COMPLEX__	1
>  #endif
>  
> -/* wchar_t uses ISO/IEC 10646 (2nd ed., published 2011-03-15) /
> -   Unicode 6.0.  */
> -#define __STDC_ISO_10646__		201103L
> +/* wchar_t uses Unicode 7.0.0.  Version 7.0 of the Unicode Standard is
> +   synchronized with ISO/IEC 10646:2012, plus Amendments 1 (published
> +   on April, 2013) and 2 (not yet published as of February, 2015).
> +   Additionally, it includes the accelerated publication of U+20BD
> +   RUBLE SIGN.  Therefore Unicode 7.0.0 is between 10646:2012 and
> +   10646:2014, and so we use the date ISO/IEC 10646:2012 Amd.1 was
> +   published.  */

OK. Excellent comment.

> +#define __STDC_ISO_10646__		201304L
>  
>  /* We do not support C11 <threads.h>.  */
>  #define __STDC_NO_THREADS__		1
> diff --git a/localedata/unicode-gen/ctype_compatibility.py b/localedata/unicode-gen/ctype_compatibility.py
> index 19e9ee5..0d67f29 100755
> --- a/localedata/unicode-gen/ctype_compatibility.py
> +++ b/localedata/unicode-gen/ctype_compatibility.py
> @@ -1,6 +1,6 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/ctype_compatibility_test_cases.py b/localedata/unicode-gen/ctype_compatibility_test_cases.py
> index ab7f6dd..34e6de4 100644
> --- a/localedata/unicode-gen/ctype_compatibility_test_cases.py
> +++ b/localedata/unicode-gen/ctype_compatibility_test_cases.py
> @@ -1,5 +1,5 @@
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
> index 559af79..0c74f2a 100755
> --- a/localedata/unicode-gen/gen_unicode_ctype.py
> +++ b/localedata/unicode-gen/gen_unicode_ctype.py
> @@ -1,7 +1,7 @@
>  #!/usr/bin/python3
>  #
>  # Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  # Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
>  #
> diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
> index e11327b..b84a1eb 100755
> --- a/localedata/unicode-gen/utf8_compatibility.py
> +++ b/localedata/unicode-gen/utf8_compatibility.py
> @@ -1,6 +1,6 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 670a628..f1b88f5 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -1,6 +1,6 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> @@ -33,21 +33,21 @@ import re
>  # Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
>  # sections 3.11 and 4.4.
>  
> -jamo_initial_short_name = [
> +JAMO_INITIAL_SHORT_NAME = (
>      'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
>      'C', 'K', 'T', 'P', 'H'
> -]
> +)
>  
> -jamo_medial_short_name = [
> +JAMO_MEDIAL_SHORT_NAME = (
>      'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
>      'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
> -]
> +)
>  
> -jamo_final_short_name = [
> +JAMO_FINAL_SHORT_NAME = (
>      '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
>      'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
>      'P', 'H'
> -]
> +)
>  
>  def ucs_symbol(code_point):
>      '''Return the UCS symbol string for a Unicode character.'''
> @@ -74,9 +74,9 @@ def process_range(start, end, outfile, name):
>              index2, index3 = divmod(i - 0xaC00, 28)
>              index1, index2 = divmod(index2, 21)
>              hangul_syllable_name = 'HANGUL SYLLABLE ' \
> -                                   + jamo_initial_short_name[index1] \
> -                                   + jamo_medial_short_name[index2] \
> -                                   + jamo_final_short_name[index3]
> +                                   + JAMO_INITIAL_SHORT_NAME[index1] \
> +                                   + JAMO_MEDIAL_SHORT_NAME[index2] \
> +                                   + JAMO_FINAL_SHORT_NAME[index3]
>              outfile.write('{:<11s} {:<12s} {:s}\n'.format(
>                  ucs_symbol(i), convert_to_hex(i),
>                  hangul_syllable_name))
> 
> 

OK.
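
To make the renamed tables easier to follow, here is a standalone sketch
of the index arithmetic from the process_range hunk above. The tables
and the two divmod steps are copied from the patch; the wrapper function
hangul_syllable_name and its range check are additions for illustration
only.

```python
JAMO_INITIAL_SHORT_NAME = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)

JAMO_MEDIAL_SHORT_NAME = (
    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
)

JAMO_FINAL_SHORT_NAME = (
    '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
    'P', 'H'
)


def hangul_syllable_name(code_point):
    """Build the name of a precomposed Hangul syllable (U+AC00..U+D7A3)."""
    if not 0xAC00 <= code_point <= 0xD7A3:
        raise ValueError('not a Hangul syllable: U+{:04X}'.format(code_point))
    # 28 finals per medial, 21 medials per initial.
    index2, index3 = divmod(code_point - 0xAC00, 28)
    index1, index2 = divmod(index2, 21)
    return ('HANGUL SYLLABLE '
            + JAMO_INITIAL_SHORT_NAME[index1]
            + JAMO_MEDIAL_SHORT_NAME[index2]
            + JAMO_FINAL_SHORT_NAME[index3])


print(hangul_syllable_name(0xAC00))  # HANGUL SYLLABLE GA
print(hangul_syllable_name(0xD55C))  # HANGUL SYLLABLE HAN
print(hangul_syllable_name(0xD7A3))  # HANGUL SYLLABLE HIH
```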

Cheers,
Carlos.
