This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0

I changed mostly according to your suggestions
(Thank you very much for your very detailed review):

(Other changes in the next mail â)

Alexandre Oliva <> ãããããããã:

> On Nov 24, 2014, Pravin Satpute <> wrote:
>> 6.
> I've reviewed the gen scripts in the git repo above.
> WRT, presumably intended to replace glibc's
> localedata/gen-unicode-ctype.c:


> - the script could probably use a few more comments; the C version is
> already poor in comments, but the script has even fewer; the comments
> present in field and function definitions in C aren't even the script;
> it might be useful to add comments for reviewers that don't have the
> file format by heart, for example, show in comments an example of a pair
> of lines that defines a range in fill_attributes.

more comments:

> - I'm not sure it's wise for fill_attributes to load the entire file
> into memory just to be able to index the lines in an array.  It doesn't
> look like reading the input file line by line would make the code worse.

> - When processing ranges of characters in fill_attributes, the C version
> uses the fields from the End line, whereas the script uses those from
> the Start line.  I guess it doesn't matter and they should have the same
> values for most fields, but perhaps it would make sense to check that
> they do.

Doesnât matter but the above commit checks  this as well.

> - After processing a range of characters, the first character of the
> range will have its attributes filled in again, because the non-range
> call isn't prefixed by âelseâ.  It's bening, but wasteful.

> - It's not obvious that is_alpha in the script, based on derived
> properties, is equivalent to the many conditions tested in the C
> program.  Is there any other script that checks their equivalence?

It is *not* supposed to be equivalent. The is_alpha() in the
C program uses only data from UnicodeData.txt. This causes
many characters incorrectly not classified as alpha or incorrectly
classified as alpha. See:

The many condidtions in is_alpha() in the C program try to correct
some of these mistakes.

The better source for classifying as alpha or not is
DerivedCoreProperties.txt. So uses this.  This
fixes several of the special cases in the old program and many
more. Some of the special cases in the old program were apparently
mistaken, I tried to ask Bruno Haible and Theppitak Karoonboonyanan
about this but got no response. I think DerivedCoreProperties.txt
is the better source for this. But DerivedCoreProperties.txt
would not classify digits as alphabetic. Brunoâs  C program had
the comment:

	      /* Consider all the non-ASCII digits as alphabetic.
		 ISO C 99 forbids us to have them in category "digit",
		 but we want iswalnum to return true on them.  */

which seems to make sense, therefore I kept that in is_alpha()

> - It seems wasteful to append every char that fits the class to
> code_point_ranges.  A similar effect could be attained by modifying the
> second element of the array, if it already exists, like:
>             if (code_point_ranges
>                 and code_point_ranges[-1][-1] == code_point - 1):
>                 if len(code_point_ranges[-1]) == 1:
>                     code_point_ranges[-1].append(code_point)
>                 else:
>                     code_point_ranges[-1][-1] = code_point
>             else:
>                 code_point_ranges.append([code_point])

> - In output_charmap, calling map_function twice for each code_point
> seems wasteful.  The C version does that too, but just because saving
> the mapped char in the table would make it much bigger.  Since ths
> script doesn't build the table, it might as well save the result of the
> first call and reuse it instead of the second, like:
>         mapped = map_function(code_point)
>         if code_point != mapped:
> [...]
>                          + ucs_symbol(mapped) \
>                          + ')'

> - any reason to move down the output of "isxdigit", "blank" and both
> combining classes in the script, compared with the C program?

I wanted to keep the order of the character classes as they currently
are in the glibc/localedata/locales/i18n file.

> I see there have been changes since I started the review.
> The only one that caught my attention was in
> I think the single loop in read_input_file would be better split into
> multiple loops, one for head, one to skip to tail, one for tail.
> Also consider using line.startswith instead of regexps for the
> end-of-head and beginning-of-tail markers, unless that risks matching
> unintended lines.
> Finally, consider returning a tuple instead of modifying the global
> variable.
> It could end up like this:
> def read_input_file(filename):
>   head = tail = ''
>   with open(filename, mode='r') as file:
>     for line in file:
>       match = re.match(
>          r'^(?P<key>date\s+)(?P<value>"[0-9]{4}-[0-9]{2}-[0-9]{2}")',
>          line)
>       if match:
>         line ='key') + '"{:s}"\n'.format(time.strftime('%Y-%m-%d'))
>       head = head + line
>       if (line.startswith('LC_CTYPE'))
>         break
>     for line in file:
>       if (line.startswith('translit_start'))
>         tail = line
>         break
>     for line in file:
>       tail = tail + line
>   return (head, tail)

> On to
> - same suggestions apply to fill_attribute* in this file that applied to
> those in, since they're identical

Mike FABIAN <>
â Office: +49-69-365051027, internal 8875027

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]