This is the mail archive of the
mailing list for the glibc project.
utf8-gen.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, libc-alpha at sourceware dot org
- Date: Mon, 08 Dec 2014 22:59:35 +0100
- Subject: utf8-gen.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)
- Authentication-results: sourceware.org; auth=none
- References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home>
Alexandre Oliva <email@example.com> ãããããããã:
> On to utf8-gen.py...
> - some vertical space before the comments that introduce process_charmap
> would be welcome; if the comments included a sample input line (like
> process_range), reviewing the code would be easier.
> - as in ctype-compatibility.py, I suggest reworking the reading of the
> input files into using generator functions that yield individual
> (filtered) lines, rather than reading entire input files into lists that
> will be iterater over only once. The iteration over the lists with
> indices, rather than with iterators, aggravates the inefficiency.
The whole script takes less than half a second on my machine and
I think readability is more important than efficiency here.
I made a small change which seems to make it better readable
(at least for me):
> - write_comments(flag) is very odd; any reason to not write two separate
> functions, each one outputting the text that should precede each portion
> of the output file?
> - the handling of surrogates in process_charmap is quite confusing; it
> appears that in the end the script only outputs comments for them;
> couldn't we just skip them in the first place, or use the same output
> logic, but setting a prefix string to an empty string or the comment
> character to it to avoid the ugly duplication? (The prefix string could
> even be modified afterwards to contain the extra zeros required for the
> len(w) != 4 case, like:
> if $is_surrogate:
> prefix = '%'
> prefix = ''
> prefix = prefix + '<U'
> if len(w) > 4:
> prefix = prefix + ('0' * (8 - len(w)))
> outfile.write(prefix + ...
> - I'm a bit concerned about using Python's UTF-8 encoder instead of
> open-coding it. The fact that its encoder discards surrogates, rather
> than blindly applying the conversion rules for any inputs, means it
> might very well rule out other characters that future versions of
> Unicode might introduce, or even convert them incorrectly.
I didnât yet change that but changed convert_to_hex() and made
it the only place where that encoding is done, so it can be fixed
in a single place in convert_to_hex() now if desired. See next
change for convert_to_hex().
> - convert_to_hex is working too hard. unihex can be iterated on, so
> this should work:
> for c in unihex:
> hexword = hexword + "/x" + ('%02x' % c)
> or even better:
> hexword = ''.join(['/x%02x' % c for c in unihex])
> - how about factoring the inner loop out of process_charmaps, so that
> the logic that converts the code point to hexword can be called from
> process_range too?
The above commit changing convert_to_hex() does this as well.
> - factoring out the logic to turn a code point number into a <U...>
> string would also reduce the amount of ugly duplication. I envision
> something like this:
> iend = int(end, 16)
> for i in range(int(start, 16), iend, 64):
> outfile.write(mapline(cpi2Us(i) + ".." + cpi2Us(min(i + 63, iend)),
> chr(i), name.split(",")))
> def cp2Us(s):
> return '<U' + s + '>'
> def cpi2Us(i):
> return cp2Us(('%04X' if i <= 0xffff else '%08X') % i)
> def cps2Us(w):
> if len(w) > 4:
> w = '0' * (8 - len(w))
> return cp2Us(w)
> def mapline(prefix, c, name):
> unihex = c.encode("UTF-8")
> hexword = convert_to_hex(unihex)
> return prefix + " " + hexword + " " + name + "\n"
We had a similar function ucs_symbol() in gen-unicode-ctype.py already,
I reused it here, that makes the script more similar:
> - the comments at the top of process_width appear to be out of sync with
> what the function does
> - I suggest passing an empty width_dict to process_charmap, and having
> it fill it in as it scans the UnicodeData.txt input with the same logic
> at the top of process_width, so that the file is only scanned once
That doesnât seem to enhance the readability for me and speed is
not so important here.
> - then, process_width could have its main loop simplified to (note the
> +1 added to the range start):
> w = l.split(";")
> wc = w.split("..")
> if len(wc) == 1:
> rend = ''
> rend = '...' + cp2Us(wc)
> for key in range(int(wc, 16)+1, int(wc, 16)+1):
> if key in width_dict:
> del width_dict[key]
> width_dict[int(wc, 16)] = cp2Us(wc) + rend + '\t2'
Mike FABIAN <firstname.lastname@example.org>
â Office: +49-69-365051027, internal 8875027