This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: improvements to ctype-compatibility.py
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org, Jens Petersen <petersen at redhat dot com>
- Date: Thu, 18 Dec 2014 12:34:35 +0100
- Subject: Re: improvements to ctype-compatibility.py
- Authentication-results: sourceware.org; auth=none
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9d388qqt9m dot fsf_-_ at ari dot site> <orbnn199hd dot fsf at free dot home>
Alexandre Oliva <aoliva@redhat.com> ãããããããã:
>>> - get_lines_from_file
>
> Its only use is as a loop generator, so you might as well ditch
> all_lines and yield each full line as you build it, like:
>
> if line.endswith('/'):
> current_line += line[:-1]
> else:
> yield current_line + line
> current_line = ''
> if current_line: # file ends with a continuation line
> yield current_line
https://github.com/pravins/glibc-i18n/commit/e6e14931e1abbccd55821fcc3df8dd3f4e14cf30
>>> - I'm a little bit concerned with the regexp in
>>> extract_character_classes_and_code_points. I recommend using
>>> re.escape(char_class) instead of plain char_class, so that, if the
>>> char_classes ever contains rx-active characters, we won't get unintended
>>> matches or rx compilation failures. Furthermore, the current rx could
>>> match such extraneous line prefixes as Âclass "upper  or Âlower"; Â.
>>> Maybe drop the '?'s from the current rx and accept
>>> '|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
>>> between '^' and '\s+', or check that if any of the before- and
>>> after-char_class groups is empty, then so is the other, reporting an
>>> error otherwise. (Am I being too paranoid?)
>
>> https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82
>
>> Looks much uglier though.
>
> It does indeed.
>
> I wonder if it would make sense to construct a regexp that matched all
> possibilities once, and use it throughout the loop. Something along the
> lines of:
>
> classes = '|'.join([re.escape_char(x) for x in [
> 'upper', 'lower', 'alpha'... 'combining_level3']])
> maps = '|'.join([re.escape_char(x) for x in [
> 'toupper, 'tolower', 'totitle']])
>
> rx = re.compile(r'^(?:' + '|'.join([
> '(?P<noclass>' + classes + ')',
> 'class "(?P<class>' + classes + ')"',
> '(?P<nomap>' + maps + ')',
> 'map "(?P<map>' + maps + ')"']) + ')')
>
> for line in get_lines_from_file(filename):
> match = re.match(rx, line)
> if match:
> for group in ['noclass', 'class', 'nomap', 'map']:
> char_class = match.group(group)
> if char_class:
> break
> # use char_class like before
>
> but, yuck...
I also think this looks even uglier.
> This would only make sense if we wanted to
> distinguish between classes and maps and make sure the names follow the
> expected kinds, I guess. Here's an alternative:
>
> classes = '|'.join([re.escape_char(x) for x in [
> 'upper', 'lower', 'alpha'... 'totitle]])
>
> rx = re.compile(r'(?:(?:class|map) (?P<quote>"))?(?P<class>' + classes + ')(?P=quote)')
>
> for line in get_lines_from_file(filename):
> match = re.match(rx, line)
> if match:
> char_class = match.group('class')
> # use char_class like before
>
> and this could still be easily extended to tell classes and maps apart,
> if we so wanted.
Iâd rather leave it as it is, I tried the above and it is actually
slightly slower (probably because of the long regexp with the many
classes joined by | in the regexp).
Here is the slighly slower and uglier code I tested (pushed to a
different branch because I liked the original better):
https://github.com/pravins/glibc-i18n/commits/for-alexandre-oliva
https://github.com/pravins/glibc-i18n/commit/27e7f1960eb77df486a384822061a9b9e8e42122
>>> if (...):
>>> number_of_errors += 1
>>> print(...%{...})
>>>
>>> if (...):
>>> cperror(...%{...})
>
> So... this didn't work out as well as anticipated, because we're still
> using the global variable. I guess avoiding global variables will
> require changing the idiom a bit, either passing explicitly an object
> holding the counter, or having functions return an error count to be
> totaled by the caller, like:
>
> errors = 0
> if (...)
> errors += cperror(...)
> return errors
>
> :-(
But that really doesnât make it any better.
I think the rules in the glibc pylintrc are too strict here.
--
Mike FABIAN <mfabian@redhat.com>
â Office: +49-69-365051027, internal 8875027
ççäèãããääãæãã