This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0

From: Alexandre Oliva <aoliva at redhat dot com>
To: Pravin Satpute <psatpute at redhat dot com>, Mike FABIAN <mfabian at redhat dot com>
Cc: libc-alpha at sourceware dot org
Date: Thu, 04 Dec 2014 11:33:44 -0200
Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
Authentication-results: sourceware.org; auth=none
References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com>
On Nov 24, 2014, Pravin Satpute <psatpute@redhat.com> wrote:

> 6. https://github.com/pravins/glibc-i18n

I've reviewed the gen scripts in the git repo above.

WRT gen-unicode-ctype.py, presumably intended to replace glibc's
localedata/gen-unicode-ctype.c:

- I like the readability of the Python script.  Good job!

- the script could probably use a few more comments; the C version is
already poor in comments, but the script has even fewer; the comments
present in field and function definitions in C aren't even the script;
it might be useful to add comments for reviewers that don't have the
file format by heart, for example, show in comments an example of a pair
of lines that defines a range in fill_attributes.

- I'm not sure it's wise for fill_attributes to load the entire file
into memory just to be able to index the lines in an array.  It doesn't
look like reading the input file line by line would make the code worse.

- When processing ranges of characters in fill_attributes, the C version
uses the fields from the End line, whereas the script uses those from
the Start line.  I guess it doesn't matter and they should have the same
values for most fields, but perhaps it would make sense to check that
they do.

- After processing a range of characters, the first character of the
range will have its attributes filled in again, because the non-range
call isn't prefixed by âelseâ.  It's bening, but wasteful.

- It's not obvious that is_alpha in the script, based on derived
properties, is equivalent to the many conditions tested in the C
program.  Is there any other script that checks their equivalence?

- It seems wasteful to append every char that fits the class to
code_point_ranges.  A similar effect could be attained by modifying the
second element of the array, if it already exists, like:

            if (code_point_ranges
                and code_point_ranges[-1][-1] == code_point - 1):
                if len(code_point_ranges[-1]) == 1:
                    code_point_ranges[-1].append(code_point)
                else:
                    code_point_ranges[-1][-1] = code_point
            else:
                code_point_ranges.append([code_point])

- In output_charmap, calling map_function twice for each code_point
seems wasteful.  The C version does that too, but just because saving
the mapped char in the table would make it much bigger.  Since ths
script doesn't build the table, it might as well save the result of the
first call and reuse it instead of the second, like:

        mapped = map_function(code_point)
        if code_point != mapped:
[...]
                         + ucs_symbol(mapped) \
                         + ')'

- any reason to move down the output of "isxdigit", "blank" and both
combining classes in the script, compared with the C program?

- thanks for integrating gen-unicode-ctype-dcp.py into
gen-unicode-ctype.py; the latter looks much nicer!


On to ctype-compatibility.py:

- I suppose it used to be named check-backcompatibility.py;
steps-to-generate-ctype needs updating

- get_lines_from_file doesn't check that current_line is empty at the
end of the loop.  that's fine, because it would indicate a file
terminated with a continuation line.  I'm not sure this is worth
checking and reporting an error, but maybe a comment would be in order

- I don't think there's any reason to read the entire input files into
arrays, instead of yielding individual (comment-stripped and
continuation-merged) lines from get_lines_from_file.  It's not like the
input files are huge, but it's probably good practice nevertheless

- I'm a little bit concerned with the regexp in
extract_character_classes_and_code_points.  I recommend using
re.escape(char_class) instead of plain char_class, so that, if the
char_classes ever contains rx-active characters, we won't get unintended
matches or rx compilation failures.  Furthermore, the current rx could
match such extraneous line prefixes as Âclass "upper Â or Âlower"; Â.
Maybe drop the '?'s from the current rx and accept
'|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
between '^' and '\s+', or check that if any of the before- and
after-char_class groups is empty, then so is the other, reporting an
error otherwise.  (Am I being too paranoid?)

- process_chars has no error reporting if none of the regexps match, and
it could be reworked to use a single regexp to cover all cases, like:
'^<U(?P<codepoint>...)>(?:(?:\.\.(?:\(2\)\.\.)?)<U(?P<codepoint2>...)>)$'
+ '|^\(<U(?P<codepointf>...)>,<U(?P<codepointm>...)>)\)$' and then
testing the match.group()s to tell the cases apart.  Or maybe have two
separate regexps, but breaking up the handling of classes and of maps
into separate functions, so that you don't have to repeatedly test for
one case or the other.  It would also make the code far more readable:
it took me a while to realize why the comma separator was handled so
differently from the two other kinds of separators, before I finally hit
the parentheses in the regexp ;-)

- cpcheck needlessly expands ranges; I think this could be avoided by
rearranging the loop so as to use ranges as generators, rather than
expanding them, like (untested):

    for r in code_point_list_with_ranges:
      for code_point in [r] if type(r) == type(int()) else range(r[0], r[1]+1):
        for c in char_classes:

- there are a couple of âcharcterâ typos in tests

- the conversion of ctype_dict to ctype_dict2 tests whether key is in a
list; I *think* testing the type of value might be more efficient, while
avoiding hard-coding the list of maps.  Also, pulling this test out of
the loop can avoid repeating the test for every loop iteration, even if
duplicating the inner loop in the if and else cases.

- tests (and cpcheck) have a large number of code blocks that go:

    if (...):
      number_of_errors += 1
      print(...%{...})

  it would be nice and less error prone to factor out at least part of
this pattern.  For example, define a function that increments
number_of_errors and prints the given string, so that this could be
rewritten as:

    if (...):
      cperror(...%{...})

  it might be even nicer if the expression, the format string and the
bindings dictionary could be turned into a lambda or somesuch in a list
of tests, but I can't think of a clean way to do that without needlessly
applying the format operator when the test passes.  Something to think
about...


On to utf8-gen.py...

- some vertical space before the comments that introduce process_charmap
would be welcome; if the comments included a sample input line (like
process_range), reviewing the code would be easier.

- as in ctype-compatibility.py, I suggest reworking the reading of the
input files into using generator functions that yield individual
(filtered) lines, rather than reading entire input files into lists that
will be iterater over only once.  The iteration over the lists with
indices, rather than with iterators, aggravates the inefficiency.

- write_comments(flag) is very odd; any reason to not write two separate
functions, each one outputting the text that should precede each portion
of the output file?

- the handling of surrogates in process_charmap is quite confusing; it
appears that in the end the script only outputs comments for them;
couldn't we just skip them in the first place, or use the same output
logic, but setting a prefix string to an empty string or the comment
character to it to avoid the ugly duplication?  (The prefix string could
even be modified afterwards to contain the extra zeros required for the
len(w[0]) != 4 case, like:

    if $is_surrogate:
      prefix = '%'
    else
      prefix = ''

    prefix = prefix + '<U'
    if len(w) > 4:
      prefix = prefix + ('0' * (8 - len(w)))

    outfile.write(prefix + ...

- I'm a bit concerned about using Python's UTF-8 encoder instead of
open-coding it.  The fact that its encoder discards surrogates, rather
than blindly applying the conversion rules for any inputs, means it
might very well rule out other characters that future versions of
Unicode might introduce, or even convert them incorrectly.

- convert_to_hex is working too hard.  unihex can be iterated on, so
  this should work:

    for c in unihex:
      hexword = hexword + "/x" + ('%02x' % c)

  or even better:

  hexword = ''.join(['/x%02x' % c for c in unihex])

- how about factoring the inner loop out of process_charmaps, so that
the logic that converts the code point to hexword can be called from
process_range too?

- factoring out the logic to turn a code point number into a <U...>
string would also reduce the amount of ugly duplication.  I envision
something like this:

  else:
    iend = int(end, 16)
    for i in range(int(start, 16), iend, 64):
      outfile.write(mapline(cpi2Us(i) + ".." + cpi2Us(min(i + 63, iend)),
                            chr(i), name.split(",")[0]))

def cp2Us(s):
  return '<U' + s + '>'
def cpi2Us(i):
  return cp2Us(('%04X' if i <= 0xffff else '%08X') % i)
def cps2Us(w):
  if len(w) > 4:
    w = '0' * (8 - len(w))
  return cp2Us(w)
def mapline(prefix, c, name):
  unihex = c.encode("UTF-8")
  hexword = convert_to_hex(unihex)

  return prefix + "     " + hexword + "         " + name + "\n"

- the comments at the top of process_width appear to be out of sync with
what the function does

- I suggest passing an empty width_dict to process_charmap, and having
it fill it in as it scans the UnicodeData.txt input with the same logic
at the top of process_width, so that the file is only scanned once

- then, process_width could have its main loop simplified to (note the
  +1 added to the range start):

    w = l.split(";")
    wc = w[0].split("..")
    if len(wc) == 1:
      rend = ''
    else:
      rend = '...' + cp2Us(wc[1])
      for key in range(int(wc[0], 16)+1, int(wc[1], 16)+1):
        if key in width_dict:
          del width_dict[key]
    width_dict[int(wc[0], 16)] = cp2Us(wc[0]) + rend + '\t2'


On to utf8-compatibility.py:

- same suggestions apply to fill_attribute* in this file that applied to
those in gen-unicode-ctype.py, since they're identical

- fill_east_asian_widths would probably benefit in readability and
performance from a single regexp that could match individual codepoints
and ranges:

            match = re.match(
                r'^(?P<codepoint1>[0-9A-F]{4,6})(?:\.\.(?P<codepoint2>[0-9A-F]{4,6}))?\s*;\s*(?P<property>[a-zA-Z]+)',
                line)
            if not match:
              continue
            start = match.group('codepoint')
            end = match.group('codepoint2')
            if not end:
              end = start
            for code_point in range(int(start, 16), int(end, 16)+1):
              east_asian_widths[code_point] = match.group('property')

- instead of reading entire old and new files, how about pass the file
handle to check_charmap and check_width, so that âfor line in file:â can
be used?  Then introduce âadvance_past_line('CHARMAP')â to position both
streams after that line.  create_charmap_dictionary can then return when
it reaches line 'END' or 'WIDTH', and neither it nor
create_width_dictionary would have to skip lines any more.  (removing
the ugly âstart == Falseâ, that had better be spelt ânot startâ :-)

- it's probably better to keep comments out of the dictionaries, if
that's what the key !='%' is about.  Test w[0][0] != '%' before
inserting the key.

- then, in check_charmap, combine the two error messages into one with:

    if not (key in ncharmap and ncharmap[key] == ocharmap[key]):

or spell them out differently, which probably makes more sense, since
different mapped-to character is not evidence that a char is missing,
but rather that it is incorrectly mapped, no?

- also, shouldn't we test for mappings in ncharmap but not in ocharmap?

- simplify insertion in width_dictionary (ranges and non-ranges):

   wc = w[0].split("...")
   for i in range(int(wc[0][2:-1], 16), int(wc[-1][2:-1], 16) + 1):
     width_dictionary[i] = int(w[1])

- in check_width, the two loops over owidth could be simplified to:

    for key in owidth:
        ow = owidth[key]
        if key in nwidth:
          nw = nwidth[key]
        else:
          nw = 1
        if ow != nw:
          changed_width[key] = (ow, nw)

- in check_width, it seems to me that the tests for missing and added
characters are redundant with the change tests, since missing entries
have width 1.


Great stuff, thanks for working on this!

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
Follow-Ups:
- Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
  - From: Alexandre Oliva
- Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
  - From: Mike FABIAN
- improvements to ctype-compatibility.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
  - From: Mike FABIAN
- utf8-gen.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
  - From: Mike FABIAN
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]