This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
- Date: Mon, 08 Dec 2014 07:39:33 +0100
- Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- Authentication-results: sourceware.org; auth=none
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home>
I changed gen-unicode-ctype.py mostly according to your suggestions
(Thank you very much for your very detailed review):
(Other changes in the next mail …)
Alexandre Oliva <aoliva@redhat.com> wrote:
> On Nov 24, 2014, Pravin Satpute <psatpute@redhat.com> wrote:
>
>> 6. https://github.com/pravins/glibc-i18n
>
> I've reviewed the gen scripts in the git repo above.
>
> WRT gen-unicode-ctype.py, presumably intended to replace glibc's
> localedata/gen-unicode-ctype.c:
[...]
> - the script could probably use a few more comments; the C version is
> already poor in comments, but the script has even fewer; the comments
> present in field and function definitions in C aren't even in the script;
> it might be useful to add comments for reviewers that don't have the
> file format by heart, for example, show in comments an example of a pair
> of lines that defines a range in fill_attributes.
more comments:
https://github.com/pravins/glibc-i18n/commit/3ccaf53dd71a6d927d55eeb03fb08d26af3d4534
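For reference, such a range in UnicodeData.txt is defined by a
"First"/"Last" pair of lines roughly like this (the exact field
values here are only illustrative):

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Everything between the two code points gets the attributes from these
lines.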
> - I'm not sure it's wise for fill_attributes to load the entire file
> into memory just to be able to index the lines in an array. It doesn't
> look like reading the input file line by line would make the code worse.
https://github.com/pravins/glibc-i18n/commit/7ef5161898f54d2d66bb25f898310c68bc2d6577
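A rough sketch of the line-by-line variant (not the committed code,
and the handling of "First"/"Last" range lines is left out here):

    def fill_attributes(filename):
        '''Read UnicodeData.txt one line at a time instead of
        loading the whole file into memory first.'''
        attributes = {}
        with open(filename, mode='r') as unicode_data_file:
            for line in unicode_data_file:
                fields = line.strip().split(';')
                attributes[int(fields[0], 16)] = {
                    'name': fields[1],
                    'category': fields[2],
                }
        return attributes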
> - When processing ranges of characters in fill_attributes, the C version
> uses the fields from the End line, whereas the script uses those from
> the Start line. I guess it doesn't matter and they should have the same
> values for most fields, but perhaps it would make sense to check that
> they do.
It doesn't matter, but the above commit checks this as well.
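For illustration only, such a check could look roughly like this
(the names are made up, this is not the code from the commit):

    import sys

    def check_range_fields(fields_start, fields_end):
        '''Everything except the code point and the name should be
        identical on the "..., First>" and "..., Last>" line of a
        range.'''
        if fields_start[2:] != fields_end[2:]:
            sys.exit('Unexpected mismatch between First and Last line: '
                     + repr(fields_start) + ' versus ' + repr(fields_end))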
> - After processing a range of characters, the first character of the
> range will have its attributes filled in again, because the non-range
> call isn't prefixed by "else". It's benign, but wasteful.
https://github.com/pravins/glibc-i18n/commit/83cd925026093ebae4e0b7802b709b98e05c1227
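The shape of the fix is roughly the following simplified sketch (not
the exact committed code); the "elif"/"else" make sure the first code
point of a range is not filled in a second time:

    def fill_from_lines(lines):
        '''Simplified sketch: map code points to their general
        category, expanding "First"/"Last" ranges.'''
        attributes = {}
        fields_start = []
        for line in lines:
            fields = line.strip().split(';')
            if fields[1].endswith('First>'):
                fields_start = fields
            elif fields[1].endswith('Last>'):
                # Fill the whole range, including its first code point.
                for code_point in range(int(fields_start[0], 16),
                                        int(fields[0], 16) + 1):
                    attributes[code_point] = fields[2]
                fields_start = []
            else:
                # Only single, non-range lines end up here.
                attributes[int(fields[0], 16)] = fields[2]
        return attributes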
> - It's not obvious that is_alpha in the script, based on derived
> properties, is equivalent to the many conditions tested in the C
> program. Is there any other script that checks their equivalence?
It is *not* supposed to be equivalent. The is_alpha() in the
C program uses only data from UnicodeData.txt. This causes many
characters to be classified incorrectly: some are not classified as
alpha although they should be, and some are classified as alpha
although they should not be. See:
https://sourceware.org/bugzilla/show_bug.cgi?id=14010#c0
The many conditions in is_alpha() in the C program try to correct
some of these mistakes.
The better source for classifying characters as alpha or not is
DerivedCoreProperties.txt, so gen-unicode-ctype.py uses that. This
fixes several of the special cases in the old program and many more.
Some of the special cases in the old program were apparently
mistaken; I tried to ask Bruno Haible and Theppitak Karoonboonyanan
about this but got no response. But DerivedCoreProperties.txt alone
would not classify digits as alphabetic. Bruno's C program had
the comment:
    /* Consider all the non-ASCII digits as alphabetic.
       ISO C 99 forbids us to have them in category "digit",
       but we want iswalnum to return true on them.  */
which seems to make sense, so I kept that behaviour in is_alpha()
in gen-unicode-ctype.py.
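In other words, is_alpha() in gen-unicode-ctype.py boils down to
something like this simplified sketch (the argument names here are
made up; the real script keeps this data in dictionaries filled from
the Unicode files):

    def is_alpha(code_point, alphabetic_code_points, categories):
        '''alphabetic_code_points: set of code points which have the
        "Alphabetic" property in DerivedCoreProperties.txt;
        categories: code point -> general category from
        UnicodeData.txt.'''
        if code_point in alphabetic_code_points:
            return True
        # Consider all the non-ASCII digits as alphabetic: ISO C 99
        # forbids putting them in category "digit", but iswalnum()
        # should still return true on them.
        return (categories.get(code_point) == 'Nd'
                and not 0x0030 <= code_point <= 0x0039)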
> - It seems wasteful to append every char that fits the class to
> code_point_ranges. A similar effect could be attained by modifying the
> second element of the array, if it already exists, like:
>
>     if (code_point_ranges
>             and code_point_ranges[-1][-1] == code_point - 1):
>         if len(code_point_ranges[-1]) == 1:
>             code_point_ranges[-1].append(code_point)
>         else:
>             code_point_ranges[-1][-1] = code_point
>     else:
>         code_point_ranges.append([code_point])
https://github.com/pravins/glibc-i18n/commit/538f04c6d6b4ad295c98a91beb07a1d9cb7fa681
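As a toy example of what that merging produces (just an illustration,
not part of the script):

    code_point_ranges = []
    for code_point in [0x0041, 0x0042, 0x0043, 0x0045]:
        if (code_point_ranges
                and code_point_ranges[-1][-1] == code_point - 1):
            if len(code_point_ranges[-1]) == 1:
                code_point_ranges[-1].append(code_point)
            else:
                code_point_ranges[-1][-1] = code_point
        else:
            code_point_ranges.append([code_point])
    print(code_point_ranges)  # [[65, 67], [69]], i.e. U+0041..U+0043 and U+0045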
> - In output_charmap, calling map_function twice for each code_point
> seems wasteful. The C version does that too, but just because saving
> the mapped char in the table would make it much bigger. Since the
> script doesn't build the table, it might as well save the result of the
> first call and reuse it instead of the second, like:
>
>     mapped = map_function(code_point)
>     if code_point != mapped:
>         [...]
>             + ucs_symbol(mapped) \
>             + ')'
https://github.com/pravins/glibc-i18n/commit/33f94bcf70127aca77f912c8e9efef0eb3aa51ce
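Spelled out a bit more, the pattern is just "call once, reuse the
result"; a minimal sketch (ucs_symbol() and the surrounding output
code are omitted here):

    def mapped_pairs(map_function, code_points):
        '''Yield (code_point, mapped) pairs, calling map_function
        only once per code point and skipping identity mappings.'''
        for code_point in code_points:
            mapped = map_function(code_point)
            if code_point != mapped:
                yield (code_point, mapped)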
> - any reason to move down the output of "isxdigit", "blank" and both
> combining classes in the script, compared with the C program?
I wanted to keep the order of the character classes as they currently
are in the glibc/localedata/locales/i18n file.
https://github.com/pravins/glibc-i18n/commit/d14d12ac7e11b1af5ed7140b2849f9edb2e70321
> I see there have been changes since I started the review.
>
> The only one that caught my attention was in gen-unicode-ctype.py.
>
> I think the single loop in read_input_file would be better split into
> multiple loops, one for head, one to skip to tail, one for tail.
>
> Also consider using line.startswith instead of regexps for the
> end-of-head and beginning-of-tail markers, unless that risks matching
> unintended lines.
>
> Finally, consider returning a tuple instead of modifying the global
> variable.
>
> It could end up like this:
>
> def read_input_file(filename):
>     head = tail = ''
>     with open(filename, mode='r') as file:
>         for line in file:
>             match = re.match(
>                 r'^(?P<key>date\s+)(?P<value>"[0-9]{4}-[0-9]{2}-[0-9]{2}")',
>                 line)
>             if match:
>                 line = match.group('key') + '"{:s}"\n'.format(
>                     time.strftime('%Y-%m-%d'))
>             head = head + line
>             if line.startswith('LC_CTYPE'):
>                 break
>         for line in file:
>             if line.startswith('translit_start'):
>                 tail = line
>                 break
>         for line in file:
>             tail = tail + line
>     return (head, tail)
https://github.com/pravins/glibc-i18n/commit/e8a126e8d84f31444bb6ca315339abe079ccfb86
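With the tuple return, the call site then looks something like this
(the file name is only an example):

    head, tail = read_input_file('i18n')
    # head ends with the line starting the LC_CTYPE section, tail
    # starts at the translit_start line (and stays empty if there is
    # no such line).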
> On to utf8-compatibility.py:
>
> - same suggestions apply to fill_attribute* in this file that applied to
> those in gen-unicode-ctype.py, since they're identical
https://github.com/pravins/glibc-i18n/commit/2de8e481493c40575cf8137b963fabf036eae532
--
Mike FABIAN <mfabian@redhat.com>
Office: +49-69-365051027, internal 8875027
Lack of sleep is the enemy of good work.