This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: improvements to ctype-compatibility.py

From: Alexandre Oliva <aoliva at redhat dot com>
To: Mike FABIAN <mfabian at redhat dot com>
Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
Date: Wed, 17 Dec 2014 20:50:38 -0200
Subject: Re: improvements to ctype-compatibility.py
Authentication-results: sourceware.org; auth=none
References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9d388qqt9m dot fsf_-_ at ari dot site>

Hi, Mike,

Please disregard my earlier comment about simplifying with open: for: to
a single for; I think the latter won't close the file at the end of the
loop.

On Dec  8, 2014, Mike FABIAN <mfabian@redhat.com> wrote:

> Alexandre Oliva <aoliva@redhat.com> ãããããããã:

>> - get_lines_from_file

Its only use is as a loop generator, so you might as well ditch
all_lines and yield each full line as you build it, like:

              if line.endswith('/'):
                current_line += line[:-1]
            else:
                yield current_line + line
                current_line = ''
    if current_line: # file ends with a continuation line
        yield current_line


>> - I'm a little bit concerned with the regexp in
>> extract_character_classes_and_code_points.  I recommend using
>> re.escape(char_class) instead of plain char_class, so that, if the
>> char_classes ever contains rx-active characters, we won't get unintended
>> matches or rx compilation failures.  Furthermore, the current rx could
>> match such extraneous line prefixes as Âclass "upper Â or Âlower"; Â.
>> Maybe drop the '?'s from the current rx and accept
>> '|'+re.escape(char_class)+'\s+' as an *alternative* to the partiall rx
>> between '^' and '\s+', or check that if any of the before- and
>> after-char_class groups is empty, then so is the other, reporting an
>> error otherwise.  (Am I being too paranoid?)

> https://github.com/pravins/glibc-i18n/commit/c01b5da2a7b9679c8aa3c2ba9a676e16a5465e82

> Looks much uglier though.

It does indeed.

I wonder if it would make sense to construct a regexp that matched all
possibilities once, and use it throughout the loop.  Something along the
lines of:

  classes = '|'.join([re.escape_char(x) for x in [
               'upper', 'lower', 'alpha'... 'combining_level3']])
  maps = '|'.join([re.escape_char(x) for x in [
               'toupper, 'tolower', 'totitle']])

  rx = re.compile(r'^(?:' + '|'.join([
     '(?P<noclass>' + classes + ')',
     'class "(?P<class>' + classes + ')"',
     '(?P<nomap>' + maps + ')',
     'map "(?P<map>' + maps + ')"']) + ')')

  for line in get_lines_from_file(filename):
    match = re.match(rx, line)
    if match:
      for group in ['noclass', 'class', 'nomap', 'map']:
        char_class = match.group(group)
        if char_class:
          break
      # use char_class like before

but, yuck...  This would only make sense if we wanted to
distinguish between classes and maps and make sure the names follow the
expected kinds, I guess.  Here's an alternative:

  classes = '|'.join([re.escape_char(x) for x in [
               'upper', 'lower', 'alpha'... 'totitle]])

  rx = re.compile(r'(?:(?:class|map) (?P<quote>"))?(?P<class>' + classes + ')(?P=quote)')

  for line in get_lines_from_file(filename):
    match = re.match(rx, line)
    if match:
      char_class = match.group('class')
      # use char_class like before

and this could still be easily extended to tell classes and maps apart,
if we so wanted.

>> and
>> it could be reworked to use a single regexp to cover all cases, like:
>> '^<U(?P<codepoint>...)>(?:(?:\.\.(?:\(2\)\.\.)?)<U(?P<codepoint2>...)>)$'
>> + '|^\(<U(?P<codepointf>...)>,<U(?P<codepointm>...)>)\)$' and then
>> testing the match.group()s to tell the cases apart.  Or maybe have two
>> separate regexps, but breaking up the handling of classes and of maps
>> into separate functions, so that you don't have to repeatedly test for
>> one case or the other.  It would also make the code far more readable:
>> it took me a while to realize why the comma separator was handled so
>> differently from the two other kinds of separators, before I finally hit
>> the parentheses in the regexp ;-)

> To me it looks very readable as it is, better then a single more
> complicated regexp to cover all cases.

I still dislike the code duplication in cases 2 and 3 (and 1?), but I
guess I can live with that ;-)

>> if (...):
>>   number_of_errors += 1
>>   print(...%{...})
>> 
>> if (...):
>>   cperror(...%{...})

So...  this didn't work out as well as anticipated, because we're still
using the global variable.  I guess avoiding global variables will
require changing the idiom a bit, either passing explicitly an object
holding the counter, or having functions return an error count to be
totaled by the caller, like:

  errors = 0
  if (...)
    errors += cperror(...)
  return errors

:-(

> https://github.com/pravins/glibc-i18n/commit/0b6ce0661347b8a8ab4c25facbacc9c4820db024

>> it might be even nicer if the expression, the format string and the
>> bindings dictionary could be turned into a lambda or somesuch in a list
>> of tests, but I can't think of a clean way to do that without needlessly
>> applying the format operator when the test passes.  Something to think
>> about...

We might have to think about this harder, for I'm pretty sure this
function exceeds both the complexity and module size limits :-(

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

Follow-Ups:
- Re: improvements to ctype-compatibility.py
  - From: Mike FABIAN

References:
- Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
  - From: Alexandre Oliva
- improvements to ctype-compatibility.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
  - From: Mike FABIAN

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]