This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

improvements to (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)

Alexandre Oliva <> ãããããããã:

> On to
> - I suppose it used to be named;
> steps-to-generate-ctype needs updating

> - get_lines_from_file doesn't check that current_line is empty at the
> end of the loop.  that's fine, because it would indicate a file
> terminated with a continuation line.  I'm not sure this is worth
> checking and reporting an error, but maybe a comment would be in order

> - I don't think there's any reason to read the entire input files into
> arrays, instead of yielding individual (comment-stripped and
> continuation-merged) lines from get_lines_from_file.  It's not like the
> input files are huge, but it's probably good practice nevertheless

I think it is better readable as it is and the performance is really
not a problem.

> - I'm a little bit concerned with the regexp in
> extract_character_classes_and_code_points.  I recommend using
> re.escape(char_class) instead of plain char_class, so that, if the
> char_classes ever contains rx-active characters, we won't get unintended
> matches or rx compilation failures.  Furthermore, the current rx could
> match such extraneous line prefixes as Âclass "upper  or Âlower"; Â.
> Maybe drop the '?'s from the current rx and accept
> '|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
> between '^' and '\s+', or check that if any of the before- and
> after-char_class groups is empty, then so is the other, reporting an
> error otherwise.  (Am I being too paranoid?)

Looks much uglier though.

> - process_chars has no error reporting if none of the regexps match,

> and
> it could be reworked to use a single regexp to cover all cases, like:
> '^<U(?P<codepoint>...)>(?:(?:\.\.(?:\(2\)\.\.)?)<U(?P<codepoint2>...)>)$'
> + '|^\(<U(?P<codepointf>...)>,<U(?P<codepointm>...)>)\)$' and then
> testing the to tell the cases apart.  Or maybe have two
> separate regexps, but breaking up the handling of classes and of maps
> into separate functions, so that you don't have to repeatedly test for
> one case or the other.  It would also make the code far more readable:
> it took me a while to realize why the comma separator was handled so
> differently from the two other kinds of separators, before I finally hit
> the parentheses in the regexp ;-)

To me it looks very readable as it is, better then a single more
complicated regexp to cover all cases.

> - cpcheck needlessly expands ranges; I think this could be avoided by
> rearranging the loop so as to use ranges as generators, rather than
> expanding them, like (untested):
>     for r in code_point_list_with_ranges:
>       for code_point in [r] if type(r) == type(int()) else range(r[0], r[1]+1):
>         for c in char_classes:

> - there are a couple of âcharcterâ typos in tests

> - the conversion of ctype_dict to ctype_dict2 tests whether key is in a
> list; I *think* testing the type of value might be more efficient, while
> avoiding hard-coding the list of maps.  Also, pulling this test out of
> the loop can avoid repeating the test for every loop iteration, even if
> duplicating the inner loop in the if and else cases.

> - tests (and cpcheck) have a large number of code blocks that go:
>     if (...):
>       number_of_errors += 1
>       print(...%{...})
>   it would be nice and less error prone to factor out at least part of
> this pattern.  For example, define a function that increments
> number_of_errors and prints the given string, so that this could be
> rewritten as:
>     if (...):
>       cperror(...%{...})

>   it might be even nicer if the expression, the format string and the
> bindings dictionary could be turned into a lambda or somesuch in a list
> of tests, but I can't think of a clean way to do that without needlessly
> applying the format operator when the test passes.  Something to think
> about...

Mike FABIAN <>
â Office: +49-69-365051027, internal 8875027

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]