This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
improvements to ctype-compatibility.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
- Date: Mon, 08 Dec 2014 12:24:05 +0100
- Subject: improvements to ctype-compatibility.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
- Authentication-results: sourceware.org; auth=none
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home>
Alexandre Oliva <aoliva@redhat.com> ãããããããã:
> On to ctype-compatibility.py:
>
> - I suppose it used to be named check-backcompatibility.py;
> steps-to-generate-ctype needs updating
https://github.com/pravins/glibc-i18n/commit/b701c3b99906168af5dc4d8d76328f34e2715033
> - get_lines_from_file doesn't check that current_line is empty at the
> end of the loop. that's fine, because it would indicate a file
> terminated with a continuation line. I'm not sure this is worth
> checking and reporting an error, but maybe a comment would be in order
https://github.com/pravins/glibc-i18n/commit/02eec85d6a16712f1522b8e62d9866701d3d00d7
> - I don't think there's any reason to read the entire input files into
> arrays, instead of yielding individual (comment-stripped and
> continuation-merged) lines from get_lines_from_file. It's not like the
> input files are huge, but it's probably good practice nevertheless
I think it is better readable as it is and the performance is really
not a problem.
> - I'm a little bit concerned with the regexp in
> extract_character_classes_and_code_points. I recommend using
> re.escape(char_class) instead of plain char_class, so that, if the
> char_classes ever contains rx-active characters, we won't get unintended
> matches or rx compilation failures. Furthermore, the current rx could
> match such extraneous line prefixes as Âclass "upper  or Âlower"; Â.
> Maybe drop the '?'s from the current rx and accept
> '|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
> between '^' and '\s+', or check that if any of the before- and
> after-char_class groups is empty, then so is the other, reporting an
> error otherwise. (Am I being too paranoid?)
https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82
Looks much uglier though.
> - process_chars has no error reporting if none of the regexps match,
https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82
> and
> it could be reworked to use a single regexp to cover all cases, like:
> '^<U(?P<codepoint>...)>(?:(?:\.\.(?:\(2\)\.\.)?)<U(?P<codepoint2>...)>)$'
> + '|^\(<U(?P<codepointf>...)>,<U(?P<codepointm>...)>)\)$' and then
> testing the match.group()s to tell the cases apart. Or maybe have two
> separate regexps, but breaking up the handling of classes and of maps
> into separate functions, so that you don't have to repeatedly test for
> one case or the other. It would also make the code far more readable:
> it took me a while to realize why the comma separator was handled so
> differently from the two other kinds of separators, before I finally hit
> the parentheses in the regexp ;-)
To me it looks very readable as it is, better then a single more
complicated regexp to cover all cases.
> - cpcheck needlessly expands ranges; I think this could be avoided by
> rearranging the loop so as to use ranges as generators, rather than
> expanding them, like (untested):
>
> for r in code_point_list_with_ranges:
> for code_point in [r] if type(r) == type(int()) else range(r[0], r[1]+1):
> for c in char_classes:
https://github.com/pravins/glibc-i18n/commit/2d58578a35352941ebf76def9f07d594a4b066cb
> - there are a couple of âcharcterâ typos in tests
https://github.com/pravins/glibc-i18n/commit/bee6c0580a6a4b10a012aee2dc8d31a1a01beac5
> - the conversion of ctype_dict to ctype_dict2 tests whether key is in a
> list; I *think* testing the type of value might be more efficient, while
> avoiding hard-coding the list of maps. Also, pulling this test out of
> the loop can avoid repeating the test for every loop iteration, even if
> duplicating the inner loop in the if and else cases.
https://github.com/pravins/glibc-i18n/commit/5725c7c0f8740b48b316b3476e358ad05e271c73
> - tests (and cpcheck) have a large number of code blocks that go:
>
> if (...):
> number_of_errors += 1
> print(...%{...})
>
> it would be nice and less error prone to factor out at least part of
> this pattern. For example, define a function that increments
> number_of_errors and prints the given string, so that this could be
> rewritten as:
>
> if (...):
> cperror(...%{...})
https://github.com/pravins/glibc-i18n/commit/0b6ce0661347b8a8ab4c25facbacc9c4820db024
> it might be even nicer if the expression, the format string and the
> bindings dictionary could be turned into a lambda or somesuch in a list
> of tests, but I can't think of a clean way to do that without needlessly
> applying the format operator when the test passes. Something to think
> about...
--
Mike FABIAN <mfabian@redhat.com>
â Office: +49-69-365051027, internal 8875027
ççäèãããääãæãã