This is the mail archive of the libc-alpha mailing list for the glibc project.
Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- From: Alexandre Oliva <aoliva at redhat dot com>
- To: Mike FABIAN <mfabian at redhat dot com>, siddhesh at redhat dot com
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org, Jens Petersen <petersen at redhat dot com>
- Date: Mon, 22 Dec 2014 20:52:41 -0200
- Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
- Authentication-results: sourceware.org; auth=none
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9dy4qir6fu dot fsf at ari dot site> <orfvce7y90 dot fsf at free dot home> <s9d388duu5r dot fsf at ari dot site>
On Dec 18, 2014, Mike FABIAN <firstname.lastname@example.org> wrote:
> One might think so because len(a+b) seems to create a new string.
> But even then I would prefer it over len(a)+len(b), it just seems
> to look nicer to me. Nevertheless I tried to time it and surprisingly
> len(a+b) seems even faster:
I would guess the "compiler" that converts python into the internal
representation of the program used by the interpreter is pre-processing
the concatenation of the string literals or somesuch, because len(a+b)
has to do at least as much work as len(a)+len(b). Anyway, if you prefer
len(a+b), then there are other occurrences of len(a)+len(b) elsewhere
that you might want to change to the preferred form.
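That guess is easy to check in CPython with the dis module (a quick sketch of mine, not part of the patch): when both operands are string literals, the compiler folds the concatenation into a single constant, so the timed len(a+b) never pays for a run-time concatenation.

```python
import dis

# Compile the expression the way the interpreter would; CPython's
# compiler constant-folds the concatenation of two string literals,
# so len("foo" + "bar") performs no concatenation at run time.
code = compile('len("foo" + "bar")', '<example>', 'eval')

# The folded result appears directly among the code object's constants.
print(code.co_consts)   # contains 'foobar'

# The disassembly shows a single LOAD_CONST of 'foobar' and no
# binary-add instruction.
dis.dis(code)
```

With variables instead of literals, of course, the concatenation does happen and allocates a temporary string, so the two forms are not equivalent in general.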
I have re-reviewed utf8_gen.py and utf8_compatibility.py, and no further
suggestions occurred to me.
> But I really think we should not worry about such tiny details.
> more important is that the code is correct (generates the correct
> character classes).
Conformance to a project's standards is important for long-term
maintenance; so is avoiding duplicates that could lead to fixes that
change one copy but not others. Designing for reuse, introducing
components/modules with well-defined interfaces, is also good software
engineering practice in general.
While these scripts might be perceived as self-contained and one-shot
uses, there are significant portions that could be turned into modules
and reused by third parties, or even by ourselves as part of testsuites
and whatnot. So, in spite of the discussion I'm starting below, I
encourage you to reconsider the idea of turning UnicodeData.txt, and
DerivedCoreProperties.txt, and the ctype interfaces each into a separate
module, so that the glibc file generators use these modules to output
the glibc-specific files. Even if the generators have stand-alone
logic, they should ideally also have a module interface exposing all
the generation logic, so that other programs could experiment with
generating modified files out of modified data structures, rather than
just reading from the .txt files as the modules would do if run as
stand-alone scripts.
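As a concrete illustration of the kind of module interface I have in mind (the module layout, function name, and sample record below are mine and purely hypothetical, not taken from the patch):

```python
def parse_unicode_data(lines):
    """Parse UnicodeData.txt-style records into a dict keyed by code point.

    Each record is one line of semicolon-separated fields; field 0 is
    the code point in hex, field 1 the character name, field 2 the
    general category.  Only those three fields are used here.
    """
    table = {}
    for line in lines:
        fields = line.strip().split(';')
        if len(fields) < 3 or not fields[0]:
            continue
        code_point = int(fields[0], 16)
        table[code_point] = {'name': fields[1], 'category': fields[2]}
    return table

if __name__ == '__main__':
    # Run stand-alone, this would read the real UnicodeData.txt; used
    # as a module, callers can feed modified data structures instead.
    sample = ['0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;']
    print(parse_unicode_data(sample))
```

The point is that the parser and the glibc-specific output generation become separately importable, so testsuites or third parties can exercise either half on synthetic data.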
> That limitation to 1000 lines in a module seems completely arbitrary
> to me.
I had somehow got the idea that there had been broad discussion about
our python standards, particularly about pylint rules and limits, but
all I could find was this section, without any references backing it up:
and the thread in which the pylintrc file was proposed and introduced
doesn't seem to have discussed these specific limits at all:
Siddhesh, since you pointed Mike and me to pylint, and you installed
the file and presumably the section above in the wiki, can you provide
any pointers or reasons to justify the limits set forth in the pylintrc?
I'm particularly interested in rationales behind the limits on file size
and function complexity (branch count), that appear to be so narrow as
to discourage detailed self-testing.
Mike is hitting them hard and, although I see value in *some*
modularization, artificially breaking up the large number of tests into
multiple functions and then into multiple modules, just so as to avoid
hitting the pylint limits, doesn't seem desirable or even sensible.
In case we end up agreeing that these limits may be inadequate for some
scripts, should we make individual exceptions for these Unicode scripts,
or bump the limits up so as to not need individual exceptions for them?
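For reference, both routes are mechanically simple; the snippet below sketches them (the file name utf8_gen.py is from this thread, the chosen limit values are mine and illustrative):

```python
# Individual exception: a per-file override at the top of, say,
# utf8_gen.py silences the two checks in question for that file only.

# pylint: disable=too-many-lines,too-many-branches

# Project-wide alternative: raise the defaults in the shared pylintrc
# (max-module-lines defaults to 1000, max-branches to 12), e.g.:
#
#   [FORMAT]
#   max-module-lines=2000
#
#   [DESIGN]
#   max-branches=25
```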
Meanwhile, Mike, in case you have not yet done so, could you review the
glibc style and conventions for python, and Mike Frysinger's review of
Siddhesh's patch in the thread above, and raise any concerns you might
have about the standards there? If we're going to revisit glibc's python coding
standards WRT file size and function complexity, we might as well
revisit other issues that AFAICT have not been discussed, and then
review the newly-proposed code so as to fit whatever consensus emerges.
One issue that springs to mind is the requirement for python2.7
compatibility. I'm pretty sure we have used python3-only features, and
I hope we don't have to rewrite them into less readable variants just
for python2.7 compatibility, for scripts that would only have to be
rerun at Unicode version updates. I'm assuming we'll keep on holding
the generated ctype and utf8 files in the source repository.
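One example of such a feature, directly relevant to these Unicode scripts (my illustration, not taken from them): python3's chr() covers the full code point range, while python2.7's unichr() raises ValueError for astral code points on narrow builds.

```python
# python3: chr() accepts any code point up to 0x10FFFF, so astral
# characters need no special casing.
emoji = chr(0x1F600)          # U+1F600 GRINNING FACE
assert emoji == '\U0001F600'
assert len(emoji) == 1        # one code point, one character

# Under python2.7 the equivalent is unichr(0x1F600), which fails with
# ValueError on narrow builds; portable code needs a workaround such
# as decoding an escape sequence, which is exactly the kind of less
# readable variant I hope we can avoid.
print(hex(ord(emoji)))
```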
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer