This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: improvements to ctype-compatibility.py

From: Mike FABIAN <mfabian at redhat dot com>
To: Alexandre Oliva <aoliva at redhat dot com>
Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org, Jens Petersen <petersen at redhat dot com>
Date: Thu, 18 Dec 2014 12:34:35 +0100
Subject: Re: improvements to ctype-compatibility.py
Authentication-results: sourceware.org; auth=none
References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9d388qqt9m dot fsf_-_ at ari dot site> <orbnn199hd dot fsf at free dot home>

Alexandre Oliva <aoliva@redhat.com> ãããããããã:

>>> - get_lines_from_file
>
> Its only use is as a loop generator, so you might as well ditch
> all_lines and yield each full line as you build it, like:
>
>               if line.endswith('/'):
>                 current_line += line[:-1]
>             else:
>                 yield current_line + line
>                 current_line = ''
>     if current_line: # file ends with a continuation line
>         yield current_line

https://github.com/pravins/glibc-i18n/commit/e6e14931e1abbccd55821fcc3df8dd3f4e14cf30

>>> - I'm a little bit concerned with the regexp in
>>> extract_character_classes_and_code_points.  I recommend using
>>> re.escape(char_class) instead of plain char_class, so that, if the
>>> char_classes ever contains rx-active characters, we won't get unintended
>>> matches or rx compilation failures.  Furthermore, the current rx could
>>> match such extraneous line prefixes as Âclass "upper Â or Âlower"; Â.
>>> Maybe drop the '?'s from the current rx and accept
>>> '|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
>>> between '^' and '\s+', or check that if any of the before- and
>>> after-char_class groups is empty, then so is the other, reporting an
>>> error otherwise.  (Am I being too paranoid?)
>
>> https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82
>
>> Looks much uglier though.
>
> It does indeed.
>
> I wonder if it would make sense to construct a regexp that matched all
> possibilities once, and use it throughout the loop.  Something along the
> lines of:
>
>   classes = '|'.join([re.escape_char(x) for x in [
>                'upper', 'lower', 'alpha'... 'combining_level3']])
>   maps = '|'.join([re.escape_char(x) for x in [
>                'toupper, 'tolower', 'totitle']])
>
>   rx = re.compile(r'^(?:' + '|'.join([
>      '(?P<noclass>' + classes + ')',
>      'class "(?P<class>' + classes + ')"',
>      '(?P<nomap>' + maps + ')',
>      'map "(?P<map>' + maps + ')"']) + ')')
>
>   for line in get_lines_from_file(filename):
>     match = re.match(rx, line)
>     if match:
>       for group in ['noclass', 'class', 'nomap', 'map']:
>         char_class = match.group(group)
>         if char_class:
>           break
>       # use char_class like before
>
> but, yuck...

I also think this looks even uglier.

> This would only make sense if we wanted to
> distinguish between classes and maps and make sure the names follow the
> expected kinds, I guess.  Here's an alternative:
>
>   classes = '|'.join([re.escape_char(x) for x in [
>                'upper', 'lower', 'alpha'... 'totitle]])
>
>   rx = re.compile(r'(?:(?:class|map) (?P<quote>"))?(?P<class>' + classes + ')(?P=quote)')
>
>   for line in get_lines_from_file(filename):
>     match = re.match(rx, line)
>     if match:
>       char_class = match.group('class')
>       # use char_class like before
>
> and this could still be easily extended to tell classes and maps apart,
> if we so wanted.

Iâd rather leave it as it is, I tried the above and it is actually
slightly slower (probably because of the long regexp with the many
classes joined by | in the regexp).

Here is the slighly slower and uglier code I tested (pushed to a
different branch because I liked the original better):

https://github.com/pravins/glibc-i18n/commits/for-alexandre-oliva
https://github.com/pravins/glibc-i18n/commit/27e7f1960eb77df486a384822061a9b9e8e42122

>>> if (...):
>>>   number_of_errors += 1
>>>   print(...%{...})
>>> 
>>> if (...):
>>>   cperror(...%{...})
>
> So...  this didn't work out as well as anticipated, because we're still
> using the global variable.  I guess avoiding global variables will
> require changing the idiom a bit, either passing explicitly an object
> holding the counter, or having functions return an error count to be
> totaled by the caller, like:
>
>   errors = 0
>   if (...)
>     errors += cperror(...)
>   return errors
>
> :-(

But that really doesnât make it any better.

I think the rules in the glibc pylintrc are too strict here.

-- 
Mike FABIAN <mfabian@redhat.com>
â Office: +49-69-365051027, internal 8875027
ççäèãããääãæãã

References:
- Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
  - From: Alexandre Oliva
- improvements to ctype-compatibility.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
  - From: Mike FABIAN
- Re: improvements to ctype-compatibility.py
  - From: Alexandre Oliva

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]