This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
- Date: Mon, 08 Dec 2014 07:39:33 +0100
- Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- Authentication-results: sourceware.org; auth=none
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home>
I changed gen-unicode-ctype.py mostly according to your suggestions
(Thank you very much for your very detailed review):
(Other changes in the next mail …)
Alexandre Oliva <aoliva@redhat.com> wrote:
> On Nov 24, 2014, Pravin Satpute <psatpute@redhat.com> wrote:
>
>> 6. https://github.com/pravins/glibc-i18n
>
> I've reviewed the gen scripts in the git repo above.
>
> WRT gen-unicode-ctype.py, presumably intended to replace glibc's
> localedata/gen-unicode-ctype.c:
[...]
> - the script could probably use a few more comments; the C version is
> already poor in comments, but the script has even fewer; the comments
> present in field and function definitions in C aren't even in the script;
> it might be useful to add comments for reviewers that don't have the
> file format by heart, for example, show in comments an example of a pair
> of lines that defines a range in fill_attributes.
more comments:
https://github.com/pravins/glibc-i18n/commit/3ccaf53dd71a6d927d55eeb03fb08d26af3d4534
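For reference, such a range in UnicodeData.txt is defined by a
"First"/"Last" pair of lines roughly like this (the exact field
values here are only illustrative):

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Everything between the two code points gets the attributes from these
lines.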
> - I'm not sure it's wise for fill_attributes to load the entire file
> into memory just to be able to index the lines in an array. It doesn't
> look like reading the input file line by line would make the code worse.
https://github.com/pravins/glibc-i18n/commit/7ef5161898f54d2d66bb25f898310c68bc2d6577
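A rough sketch of the line-by-line variant (not the committed code,
and the handling of "First"/"Last" range lines is left out here):

    def fill_attributes(filename):
        '''Read UnicodeData.txt one line at a time instead of
        loading the whole file into memory first.'''
        attributes = {}
        with open(filename, mode='r') as unicode_data_file:
            for line in unicode_data_file:
                fields = line.strip().split(';')
                attributes[int(fields[0], 16)] = {
                    'name': fields[1],
                    'category': fields[2],
                }
        return attributes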
> - When processing ranges of characters in fill_attributes, the C version
> uses the fields from the End line, whereas the script uses those from
> the Start line. I guess it doesn't matter and they should have the same
> values for most fields, but perhaps it would make sense to check that
> they do.
It doesn't matter, but the above commit checks this as well.
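For illustration only, such a check could look roughly like this
(the names are made up, this is not the code from the commit):

    import sys

    def check_range_fields(fields_start, fields_end):
        '''Everything except the code point and the name should be
        identical on the "..., First>" and "..., Last>" line of a
        range.'''
        if fields_start[2:] != fields_end[2:]:
            sys.exit('Unexpected mismatch between First and Last line: '
                     + repr(fields_start) + ' versus ' + repr(fields_end))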
> - After processing a range of characters, the first character of the
> range will have its attributes filled in again, because the non-range
> call isn't prefixed by "else". It's benign, but wasteful.
https://github.com/pravins/glibc-i18n/commit/83cd925026093ebae4e0b7802b709b98e05c1227
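The shape of the fix is roughly the following simplified sketch (not
the exact committed code); the "elif"/"else" make sure the first code
point of a range is not filled in a second time:

    def fill_from_lines(lines):
        '''Simplified sketch: map code points to their general
        category, expanding "First"/"Last" ranges.'''
        attributes = {}
        fields_start = []
        for line in lines:
            fields = line.strip().split(';')
            if fields[1].endswith('First>'):
                fields_start = fields
            elif fields[1].endswith('Last>'):
                # Fill the whole range, including its first code point.
                for code_point in range(int(fields_start[0], 16),
                                        int(fields[0], 16) + 1):
                    attributes[code_point] = fields[2]
                fields_start = []
            else:
                # Only single, non-range lines end up here.
                attributes[int(fields[0], 16)] = fields[2]
        return attributes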
> - It's not obvious that is_alpha in the script, based on derived
> properties, is equivalent to the many conditions tested in the C
> program. Is there any other script that checks their equivalence?
It is *not* supposed to be equivalent. The is_alpha() in the
C program uses only data from UnicodeData.txt. This causes many
characters to be classified incorrectly: some are not classified as
alpha although they should be, and some are classified as alpha
although they should not be. See:
https://sourceware.org/bugzilla/show_bug.cgi?id=14010#c0
The many conditions in is_alpha() in the C program try to correct
some of these mistakes.
The better source for classifying characters as alpha or not is
DerivedCoreProperties.txt, so gen-unicode-ctype.py uses that. This
fixes several of the special cases in the old program and many more.
Some of the special cases in the old program were apparently
mistaken; I tried to ask Bruno Haible and Theppitak Karoonboonyanan
about this but got no response. But DerivedCoreProperties.txt alone
would not classify digits as alphabetic. Bruno's C program had
the comment:
    /* Consider all the non-ASCII digits as alphabetic.
       ISO C 99 forbids us to have them in category "digit",
       but we want iswalnum to return true on them.  */
which seems to make sense, so I kept that behaviour in is_alpha()
in gen-unicode-ctype.py.
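In other words, is_alpha() in gen-unicode-ctype.py boils down to
something like this simplified sketch (the argument names here are
made up; the real script keeps this data in dictionaries filled from
the Unicode files):

    def is_alpha(code_point, alphabetic_code_points, categories):
        '''alphabetic_code_points: set of code points which have the
        "Alphabetic" property in DerivedCoreProperties.txt;
        categories: code point -> general category from
        UnicodeData.txt.'''
        if code_point in alphabetic_code_points:
            return True
        # Consider all the non-ASCII digits as alphabetic: ISO C 99
        # forbids putting them in category "digit", but iswalnum()
        # should still return true on them.
        return (categories.get(code_point) == 'Nd'
                and not 0x0030 <= code_point <= 0x0039)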
> - It seems wasteful to append every char that fits the class to
> code_point_ranges. A similar effect could be attained by modifying the
> second element of the array, if it already exists, like:
>
>     if (code_point_ranges
>             and code_point_ranges[-1][-1] == code_point - 1):
>         if len(code_point_ranges[-1]) == 1:
>             code_point_ranges[-1].append(code_point)
>         else:
>             code_point_ranges[-1][-1] = code_point
>     else:
>         code_point_ranges.append([code_point])
https://github.com/pravins/glibc-i18n/commit/538f04c6d6b4ad295c98a91beb07a1d9cb7fa681
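As a toy example of what that merging produces (just an illustration,
not part of the script):

    code_point_ranges = []
    for code_point in [0x0041, 0x0042, 0x0043, 0x0045]:
        if (code_point_ranges
                and code_point_ranges[-1][-1] == code_point - 1):
            if len(code_point_ranges[-1]) == 1:
                code_point_ranges[-1].append(code_point)
            else:
                code_point_ranges[-1][-1] = code_point
        else:
            code_point_ranges.append([code_point])
    print(code_point_ranges)  # [[65, 67], [69]], i.e. U+0041..U+0043 and U+0045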
> - In output_charmap, calling map_function twice for each code_point
> seems wasteful. The C version does that too, but just because saving
> the mapped char in the table would make it much bigger. Since the
> script doesn't build the table, it might as well save the result of the
> first call and reuse it instead of the second, like:
>
>     mapped = map_function(code_point)
>     if code_point != mapped:
>         [...]
>             + ucs_symbol(mapped) \
>             + ')'
https://github.com/pravins/glibc-i18n/commit/33f94bcf70127aca77f912c8e9efef0eb3aa51ce
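Spelled out a bit more, the pattern is just "call once, reuse the
result"; a minimal sketch (ucs_symbol() and the surrounding output
code are omitted here):

    def mapped_pairs(map_function, code_points):
        '''Yield (code_point, mapped) pairs, calling map_function
        only once per code point and skipping identity mappings.'''
        for code_point in code_points:
            mapped = map_function(code_point)
            if code_point != mapped:
                yield (code_point, mapped)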
> - any reason to move down the output of "isxdigit", "blank" and both
> combining classes in the script, compared with the C program?
I wanted to keep the order of the character classes as they currently
are in the glibc/localedata/locales/i18n file.
https://github.com/pravins/glibc-i18n/commit/d14d12ac7e11b1af5ed7140b2849f9edb2e70321
> I see there have been changes since I started the review.
>
> The only one that caught my attention was in gen-unicode-ctype.py.
>
> I think the single loop in read_input_file would be better split into
> multiple loops, one for head, one to skip to tail, one for tail.
>
> Also consider using line.startswith instead of regexps for the
> end-of-head and beginning-of-tail markers, unless that risks matching
> unintended lines.
>
> Finally, consider returning a tuple instead of modifying the global
> variable.
>
> It could end up like this:
>
> def read_input_file(filename):
>     head = tail = ''
>     with open(filename, mode='r') as file:
>         for line in file:
>             match = re.match(
>                 r'^(?P<key>date\s+)(?P<value>"[0-9]{4}-[0-9]{2}-[0-9]{2}")',
>                 line)
>             if match:
>                 line = match.group('key') + '"{:s}"\n'.format(
>                     time.strftime('%Y-%m-%d'))
>             head = head + line
>             if line.startswith('LC_CTYPE'):
>                 break
>         for line in file:
>             if line.startswith('translit_start'):
>                 tail = line
>                 break
>         for line in file:
>             tail = tail + line
>     return (head, tail)
https://github.com/pravins/glibc-i18n/commit/e8a126e8d84f31444bb6ca315339abe079ccfb86
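With the tuple return, the call site then looks something like this
(the file name is only an example):

    head, tail = read_input_file('i18n')
    # head ends with the line starting the LC_CTYPE section, tail
    # starts at the translit_start line (and stays empty if there is
    # no such line).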
> On to utf8-compatibility.py:
>
> - same suggestions apply to fill_attribute* in this file that applied to
> those in gen-unicode-ctype.py, since they're identical
https://github.com/pravins/glibc-i18n/commit/2de8e481493c40575cf8137b963fabf036eae532
--
Mike FABIAN <mfabian@redhat.com>
Office: +49-69-365051027, internal 8875027
Lack of sleep is the enemy of good work.