This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: improvements to ctype-compatibility.py
- From: Alexandre Oliva <aoliva at redhat dot com>
- To: Mike FABIAN <mfabian at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
- Date: Wed, 17 Dec 2014 20:50:38 -0200
- Subject: Re: improvements to ctype-compatibility.py
- Authentication-results: sourceware.org; auth=none
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9d388qqt9m dot fsf_-_ at ari dot site>
Hi, Mike,
Please disregard my earlier comment about simplifying with open: for: to
a single for; I think the latter won't close the file at the end of the
loop.
On Dec 8, 2014, Mike FABIAN <mfabian@redhat.com> wrote:
> Alexandre Oliva <aoliva@redhat.com> ãããããããã:
>> - get_lines_from_file
Its only use is as a loop generator, so you might as well ditch
all_lines and yield each full line as you build it, like:
if line.endswith('/'):
current_line += line[:-1]
else:
yield current_line + line
current_line = ''
if current_line: # file ends with a continuation line
yield current_line
>> - I'm a little bit concerned with the regexp in
>> extract_character_classes_and_code_points. I recommend using
>> re.escape(char_class) instead of plain char_class, so that, if the
>> char_classes ever contains rx-active characters, we won't get unintended
>> matches or rx compilation failures. Furthermore, the current rx could
>> match such extraneous line prefixes as Âclass "upper  or Âlower"; Â.
>> Maybe drop the '?'s from the current rx and accept
>> '|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
>> between '^' and '\s+', or check that if any of the before- and
>> after-char_class groups is empty, then so is the other, reporting an
>> error otherwise. (Am I being too paranoid?)
> https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82
> Looks much uglier though.
It does indeed.
I wonder if it would make sense to construct a regexp that matched all
possibilities once, and use it throughout the loop. Something along the
lines of:
classes = '|'.join([re.escape_char(x) for x in [
'upper', 'lower', 'alpha'... 'combining_level3']])
maps = '|'.join([re.escape_char(x) for x in [
'toupper, 'tolower', 'totitle']])
rx = re.compile(r'^(?:' + '|'.join([
'(?P<noclass>' + classes + ')',
'class "(?P<class>' + classes + ')"',
'(?P<nomap>' + maps + ')',
'map "(?P<map>' + maps + ')"']) + ')')
for line in get_lines_from_file(filename):
match = re.match(rx, line)
if match:
for group in ['noclass', 'class', 'nomap', 'map']:
char_class = match.group(group)
if char_class:
break
# use char_class like before
but, yuck... This would only make sense if we wanted to
distinguish between classes and maps and make sure the names follow the
expected kinds, I guess. Here's an alternative:
classes = '|'.join([re.escape_char(x) for x in [
'upper', 'lower', 'alpha'... 'totitle]])
rx = re.compile(r'(?:(?:class|map) (?P<quote>"))?(?P<class>' + classes + ')(?P=quote)')
for line in get_lines_from_file(filename):
match = re.match(rx, line)
if match:
char_class = match.group('class')
# use char_class like before
and this could still be easily extended to tell classes and maps apart,
if we so wanted.
>> and
>> it could be reworked to use a single regexp to cover all cases, like:
>> '^<U(?P<codepoint>...)>(?:(?:\.\.(?:\(2\)\.\.)?)<U(?P<codepoint2>...)>)$'
>> + '|^\(<U(?P<codepointf>...)>,<U(?P<codepointm>...)>)\)$' and then
>> testing the match.group()s to tell the cases apart. Or maybe have two
>> separate regexps, but breaking up the handling of classes and of maps
>> into separate functions, so that you don't have to repeatedly test for
>> one case or the other. It would also make the code far more readable:
>> it took me a while to realize why the comma separator was handled so
>> differently from the two other kinds of separators, before I finally hit
>> the parentheses in the regexp ;-)
> To me it looks very readable as it is, better then a single more
> complicated regexp to cover all cases.
I still dislike the code duplication in cases 2 and 3 (and 1?), but I
guess I can live with that ;-)
>> if (...):
>> number_of_errors += 1
>> print(...%{...})
>>
>> if (...):
>> cperror(...%{...})
So... this didn't work out as well as anticipated, because we're still
using the global variable. I guess avoiding global variables will
require changing the idiom a bit, either passing explicitly an object
holding the counter, or having functions return an error count to be
totaled by the caller, like:
errors = 0
if (...)
errors += cperror(...)
return errors
:-(
> https://github.com/pravins/glibc-i18n/commit/0b6ce0661347b8a8ab4c25facbacc9c4820db024
>> it might be even nicer if the expression, the format string and the
>> bindings dictionary could be turned into a lambda or somesuch in a list
>> of tests, but I can't think of a clean way to do that without needlessly
>> applying the format operator when the test passes. Something to think
>> about...
We might have to think about this harder, for I'm pretty sure this
function exceeds both the complexity and module size limits :-(
--
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer