This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- From: Alexandre Oliva <aoliva at redhat dot com>
- To: Mike FABIAN <mfabian at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
- Date: Wed, 17 Dec 2014 19:38:35 -0200
- Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9dy4qir6fu dot fsf at ari dot site>
On Dec 8, 2014, Mike FABIAN <email@example.com> wrote:
> I changed gen-unicode-ctype.py mostly according to your suggestions
> Alexandre Oliva <firstname.lastname@example.org> wrote:
>> - I'm not sure it's wise for fill_attributes to load the entire file
>> into memory just to be able to index the lines in an array. It doesn't
>> look like reading the input file line by line would make the code worse.
In fill_attributes and fill_derived_core_properties, any reason not to
rewrite

  with open(...) as ..._file:
    ...1... # doesn't refer to ..._file
    for line in ..._file:
      ...2... # doesn't refer to ..._file

as a loop directly over the opened file?

  for line in open(...):
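
A minimal sketch of reading the input line by line instead of loading it
all into a list. The helper name iter_fields is hypothetical, and the
input is assumed to be semicolon-separated like UnicodeData.txt. This
variant keeps the with block so the file is closed deterministically; the
bare "for line in open(...)" form leaves closing to the garbage collector.

```python
def iter_fields(path):
    """Yield the semicolon-separated fields of each non-empty line.

    Hypothetical helper, not part of gen-unicode-ctype.py: it reads the
    file lazily, one line at a time, instead of indexing a list of lines.
    """
    with open(path, encoding='utf-8') as input_file:
        for line in input_file:
            line = line.strip()
            if line:
                yield line.split(';')
```

A caller would then write "for fields in iter_fields(filename):" and never
hold more than one line in memory.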
>> - It's not obvious that is_alpha in the script, based on derived
>> properties, is equivalent to the many conditions tested in the C
>> program. Is there any other script that checks their equivalence?
> It is *not* supposed to be equivalent.
> /* Consider all the non-ASCII digits as alphabetic.
> ISO C 99 forbids us to have them in category "digit",
> but we want iswalnum to return true on them. */
> which seems to make sense, therefore I kept that in is_alpha()
> in gen-unicode-ctype.py.
*nod*. Speaking of which... There are at least four occurrences of the
test for code_points in the '0'..'9' range. Would it make sense to
factor them all out into a single function?
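
Something along these lines, perhaps; the name is_ascii_digit is only a
suggestion, and the sketch assumes code_point is an integer as elsewhere
in the script:

```python
def is_ascii_digit(code_point):
    """Return True if code_point is one of the ASCII digits '0'..'9'.

    Hypothetical helper name; it would replace the range test that is
    currently repeated inline.
    """
    return 0x0030 <= code_point <= 0x0039
```

Each of the existing range tests would then become a call such as
is_ascii_digit(code_point).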
There are a few uses of 'if 0:' that IMHO wouldn't hurt the eye as much
;-) if written 'if False:'
There's at least one occurrence of '%s...1...'%...2... that might be
more efficiently written as ...2...+'...1...'.
len(a+b) is probably more efficient if written as len(a)+len(b); there
are at least two occurrences of the former.
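
To illustrate the two rewrites above with placeholder values (the variable
names are made up for the example): the results are identical as long as
the formatted value is already a string, and the second form of each pair
avoids building an intermediate object.

```python
name = 'LATIN SMALL LETTER A'   # placeholder values for illustration
prefix, suffix = 'abc', 'defg'

# '%s<literal>' % value builds the same string as value + '<literal>',
# provided value is a str (the % form would also accept non-strings).
assert '%s;' % name == name + ';'

# len(a + b) first allocates the concatenation; len(a) + len(b) does not.
assert len(prefix + suffix) == len(prefix) + len(suffix)
```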
IIRC the 'verifications' function is exceeding the complexity limit set
by our pylintrc, and one of the scripts is exceeding the size limit.
verifications could be simplified by turning each test into a function
that takes a code_point as argument, performs the test and prints a
failure message if appropriate. verifications would then iterate, for
each code_point, over the list of functions, calling each one in turn.
This would reduce the complexity, as presumably intended by the set limit.
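
A rough sketch of that structure follows. The is_* predicates here are
simplified ASCII-only stand-ins, the check names and the CHECKS list are
hypothetical, and returning a boolean from each check is an extra
assumption so that failures can be counted:

```python
def is_alpha(code_point):
    """Stand-in predicate for illustration only (ASCII letters)."""
    return 0x41 <= code_point <= 0x5A or 0x61 <= code_point <= 0x7A

def is_upper(code_point):
    """Stand-in predicate for illustration only (ASCII upper case)."""
    return 0x41 <= code_point <= 0x5A

def is_digit(code_point):
    """Stand-in predicate for illustration only (ASCII digits)."""
    return 0x30 <= code_point <= 0x39

def check_alpha_not_digit(code_point):
    """A character must not be both alpha and digit."""
    if is_alpha(code_point) and is_digit(code_point):
        print('%04X is alpha and digit' % code_point)
        return False
    return True

def check_upper_is_alpha(code_point):
    """Every upper-case character must also be alpha."""
    if is_upper(code_point) and not is_alpha(code_point):
        print('%04X is upper but not alpha' % code_point)
        return False
    return True

# Hypothetical registry: one small function per test.
CHECKS = [check_alpha_not_digit, check_upper_is_alpha]

def verifications(code_points, checks=CHECKS):
    """Run every check on every code point; return the failure count."""
    failures = 0
    for code_point in code_points:
        for check in checks:
            if not check(code_point):
                failures += 1
    return failures
```

Adding a test then means adding one function to the list, and the loop in
verifications stays trivially simple.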
As for the script size limit, the solution really is modularization.
Consider moving the parsing of UnicodeData.txt and DerivedProperties.txt
each to a separate module, that can then be reused by all scripts that
need to deal with this data. Even the is_* functions might be turned
into a module of their own, if that makes sense.
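
As a starting point, a shared parsing module might look roughly like the
sketch below. The module and function names are hypothetical, only the
name and general category are extracted, and range records (the
"..., First>" and "..., Last>" pairs) are not expanded, which a real
module would have to do:

```python
"""unicode_data.py: hypothetical shared parsing module (sketch only)."""

def parse_unicode_data_line(line):
    """Split one UnicodeData.txt record into (code_point, attributes).

    Each record has 15 semicolon-separated fields: the first is the code
    point in hexadecimal, the second the character name and the third
    the general category.
    """
    fields = line.strip().split(';')
    if len(fields) != 15:
        raise ValueError('unexpected field count: %r' % line)
    return int(fields[0], 16), {'name': fields[1], 'category': fields[2]}


def load_unicode_data(lines):
    """Build {code_point: attributes} from an iterable of lines.

    Blank lines are skipped. Range records are stored as two ordinary
    entries here rather than being expanded.
    """
    attributes = {}
    for line in lines:
        if line.strip():
            code_point, fields = parse_unicode_data_line(line)
            attributes[code_point] = fields
    return attributes
```

Since load_unicode_data takes any iterable of lines, it can be fed an open
file object directly, which also fits the line-by-line reading suggested
earlier.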
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer