This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

utf8-gen.py (was: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0)


Alexandre Oliva <aoliva@redhat.com> ãããããããã:

> On to utf8-gen.py...
>
> - some vertical space before the comments that introduce process_charmap
> would be welcome; if the comments included a sample input line (like
> process_range), reviewing the code would be easier.

https://github.com/pravins/glibc-i18n/commit/eaa3fd4cac231017d3291e8949f130bee4c946be

> - as in ctype-compatibility.py, I suggest reworking the reading of the
> input files into using generator functions that yield individual
> (filtered) lines, rather than reading entire input files into lists that
> will be iterater over only once.  The iteration over the lists with
> indices, rather than with iterators, aggravates the inefficiency.

The whole script takes less than half a second on my machine and
I think readability is more important than efficiency here.

I made a small change  which seems to make it better readable
(at least for me):

https://github.com/pravins/glibc-i18n/commit/2ff51246d59640003adbf5da8a781f53e3ca1afa

> - write_comments(flag) is very odd; any reason to not write two separate
> functions, each one outputting the text that should precede each portion
> of the output file?

https://github.com/pravins/glibc-i18n/commit/f68cb958b5af000e2fce52be7924d503e53918b5

> - the handling of surrogates in process_charmap is quite confusing; it
> appears that in the end the script only outputs comments for them;
> couldn't we just skip them in the first place, or use the same output
> logic, but setting a prefix string to an empty string or the comment
> character to it to avoid the ugly duplication?  (The prefix string could
> even be modified afterwards to contain the extra zeros required for the
> len(w[0]) != 4 case, like:
>
>     if $is_surrogate:
>       prefix = '%'
>     else
>       prefix = ''
>
>     prefix = prefix + '<U'
>     if len(w) > 4:
>       prefix = prefix + ('0' * (8 - len(w)))
>
>     outfile.write(prefix + ...

https://github.com/pravins/glibc-i18n/commit/94d7fd4224730e115cc04cf3399de4df2f23024f

> - I'm a bit concerned about using Python's UTF-8 encoder instead of
> open-coding it.  The fact that its encoder discards surrogates, rather
> than blindly applying the conversion rules for any inputs, means it
> might very well rule out other characters that future versions of
> Unicode might introduce, or even convert them incorrectly.

I didnât yet change that but changed convert_to_hex() and made
it the only place where that encoding is done, so it can be fixed
in a single place in convert_to_hex() now if desired. See next
change for convert_to_hex().

> - convert_to_hex is working too hard.  unihex can be iterated on, so
>   this should work:
>
>     for c in unihex:
>       hexword = hexword + "/x" + ('%02x' % c)
>
>   or even better:
>
>   hexword = ''.join(['/x%02x' % c for c in unihex])

https://github.com/pravins/glibc-i18n/commit/f69ccab21b64ec4f9945fef5d39ee6b814397cb3

> - how about factoring the inner loop out of process_charmaps, so that
> the logic that converts the code point to hexword can be called from
> process_range too?

The above commit changing convert_to_hex() does this as well.

> - factoring out the logic to turn a code point number into a <U...>
> string would also reduce the amount of ugly duplication.  I envision
> something like this:
>
>   else:
>     iend = int(end, 16)
>     for i in range(int(start, 16), iend, 64):
>       outfile.write(mapline(cpi2Us(i) + ".." + cpi2Us(min(i + 63, iend)),
>                             chr(i), name.split(",")[0]))
>
> def cp2Us(s):
>   return '<U' + s + '>'
> def cpi2Us(i):
>   return cp2Us(('%04X' if i <= 0xffff else '%08X') % i)
> def cps2Us(w):
>   if len(w) > 4:
>     w = '0' * (8 - len(w))
>   return cp2Us(w)
> def mapline(prefix, c, name):
>   unihex = c.encode("UTF-8")
>   hexword = convert_to_hex(unihex)
>
>   return prefix + "     " + hexword + "         " + name + "\n"

We had a similar function ucs_symbol() in gen-unicode-ctype.py already,
I reused it here, that makes the script more similar:

https://github.com/pravins/glibc-i18n/commit/013ebf466773f8633e4254c4eaa4dc4167b5e6d3

> - the comments at the top of process_width appear to be out of sync with
> what the function does

https://github.com/pravins/glibc-i18n/commit/e8d86b5d5f321217e423f97400526befec19c948

> - I suggest passing an empty width_dict to process_charmap, and having
> it fill it in as it scans the UnicodeData.txt input with the same logic
> at the top of process_width, so that the file is only scanned once

That doesnât seem to enhance the readability for me and speed is
not so important here.

> - then, process_width could have its main loop simplified to (note the
>   +1 added to the range start):
>
>     w = l.split(";")
>     wc = w[0].split("..")
>     if len(wc) == 1:
>       rend = ''
>     else:
>       rend = '...' + cp2Us(wc[1])
>       for key in range(int(wc[0], 16)+1, int(wc[1], 16)+1):
>         if key in width_dict:
>           del width_dict[key]
>     width_dict[int(wc[0], 16)] = cp2Us(wc[0]) + rend + '\t2'

-- 
Mike FABIAN <mfabian@redhat.com>
â Office: +49-69-365051027, internal 8875027
ççäèãããääãæãã


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]