This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

improvements to ctype-compatibility.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)

From: Mike FABIAN <mfabian at redhat dot com>
To: Alexandre Oliva <aoliva at redhat dot com>
Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
Date: Mon, 08 Dec 2014 12:24:05 +0100
Subject: improvements to ctype-compatibility.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
Authentication-results: sourceware.org; auth=none
References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home>

Alexandre Oliva <aoliva@redhat.com> ãããããããã:

> On to ctype-compatibility.py:
>
> - I suppose it used to be named check-backcompatibility.py;
> steps-to-generate-ctype needs updating

https://github.com/pravins/glibc-i18n/commit/b701c3b99906168af5dc4d8d76328f34e2715033

> - get_lines_from_file doesn't check that current_line is empty at the
> end of the loop.  that's fine, because it would indicate a file
> terminated with a continuation line.  I'm not sure this is worth
> checking and reporting an error, but maybe a comment would be in order


https://github.com/pravins/glibc-i18n/commit/02eec85d6a16712f1522b8e62d9866701d3d00d7

> - I don't think there's any reason to read the entire input files into
> arrays, instead of yielding individual (comment-stripped and
> continuation-merged) lines from get_lines_from_file.  It's not like the
> input files are huge, but it's probably good practice nevertheless

I think it is better readable as it is and the performance is really
not a problem.

> - I'm a little bit concerned with the regexp in
> extract_character_classes_and_code_points.  I recommend using
> re.escape(char_class) instead of plain char_class, so that, if the
> char_classes ever contains rx-active characters, we won't get unintended
> matches or rx compilation failures.  Furthermore, the current rx could
> match such extraneous line prefixes as Âclass "upper Â or Âlower"; Â.
> Maybe drop the '?'s from the current rx and accept
> '|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
> between '^' and '\s+', or check that if any of the before- and
> after-char_class groups is empty, then so is the other, reporting an
> error otherwise.  (Am I being too paranoid?)

https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82

Looks much uglier though.

> - process_chars has no error reporting if none of the regexps match,

https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82

> and
> it could be reworked to use a single regexp to cover all cases, like:
> '^<U(?P<codepoint>...)>(?:(?:\.\.(?:\(2\)\.\.)?)<U(?P<codepoint2>...)>)$'
> + '|^\(<U(?P<codepointf>...)>,<U(?P<codepointm>...)>)\)$' and then
> testing the match.group()s to tell the cases apart.  Or maybe have two
> separate regexps, but breaking up the handling of classes and of maps
> into separate functions, so that you don't have to repeatedly test for
> one case or the other.  It would also make the code far more readable:
> it took me a while to realize why the comma separator was handled so
> differently from the two other kinds of separators, before I finally hit
> the parentheses in the regexp ;-)

To me it looks very readable as it is, better then a single more
complicated regexp to cover all cases.

> - cpcheck needlessly expands ranges; I think this could be avoided by
> rearranging the loop so as to use ranges as generators, rather than
> expanding them, like (untested):
>
>     for r in code_point_list_with_ranges:
>       for code_point in [r] if type(r) == type(int()) else range(r[0], r[1]+1):
>         for c in char_classes:

https://github.com/pravins/glibc-i18n/commit/2d58578a35352941ebf76def9f07d594a4b066cb

> - there are a couple of âcharcterâ typos in tests

https://github.com/pravins/glibc-i18n/commit/bee6c0580a6a4b10a012aee2dc8d31a1a01beac5

> - the conversion of ctype_dict to ctype_dict2 tests whether key is in a
> list; I *think* testing the type of value might be more efficient, while
> avoiding hard-coding the list of maps.  Also, pulling this test out of
> the loop can avoid repeating the test for every loop iteration, even if
> duplicating the inner loop in the if and else cases.

https://github.com/pravins/glibc-i18n/commit/5725c7c0f8740b48b316b3476e358ad05e271c73

> - tests (and cpcheck) have a large number of code blocks that go:
>
>     if (...):
>       number_of_errors += 1
>       print(...%{...})
>
>   it would be nice and less error prone to factor out at least part of
> this pattern.  For example, define a function that increments
> number_of_errors and prints the given string, so that this could be
> rewritten as:
>
>     if (...):
>       cperror(...%{...})

https://github.com/pravins/glibc-i18n/commit/0b6ce0661347b8a8ab4c25facbacc9c4820db024

>   it might be even nicer if the expression, the format string and the
> bindings dictionary could be turned into a lambda or somesuch in a list
> of tests, but I can't think of a clean way to do that without needlessly
> applying the format operator when the test passes.  Something to think
> about...

-- 
Mike FABIAN <mfabian@redhat.com>
â Office: +49-69-365051027, internal 8875027
ççäèãããääãæãã

Follow-Ups:
- Re: improvements to ctype-compatibility.py
  - From: Alexandre Oliva

References:
- Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
  - From: Alexandre Oliva

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]