[PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
Florian Weimer
fweimer@redhat.com
Mon Jun 29 09:42:45 GMT 2020
* Carlos O'Donell via Libc-alpha:
> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
> index c23e50944f..d89d788a9b 100644
> --- a/locale/programs/charmap.c
> +++ b/locale/programs/charmap.c
> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
> @@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
> enum token_t ellipsis = 0;
> int step = 1;
>
> + /* POSIX explicitly requires that ellipsis processing do the
> + following: "Bytes shall be treated as unsigned octets, and carry
> + shall be propagated between the bytes as necessary to represent the
> + range." It then goes on to say that such a declaration should
> + never be specified because it creates NULL bytes. Therefore we
> + error on this condition (see charmap_new_char). However this still
> + leaves a problem for encodings which use less than the full 8-bits,
> + like UTF-8, and in such encodings you can use an ellipsis to
> + silently and accidentally create invalid ranges. In UTF-8 you have
> + only the first 6-bits of the first byte and if your ellipsis covers
> + a code point range larger than this 64 code point block the output
> + is going to be an invalid non-UTF-8 multi-byte sequence. Thus for
> + UTF-8 we add a speical ellipsis handling loop that can increment
> + UTF-8 multi-byte output effectively and for UTF-8 we allow larger
> + ellipsis ranges without error. There may still be other encodings
> + for which the ellipsis will still generate invalid multi-byte
> + output, but not for UTF-8. The only alternative would be to call
> + gconv for each Unicode code point in the loop to convert it to the
> + appropriate multi-byte output, but that would be slow. */
Typo: speical
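For illustration only (this is not code from the patch): a minimal sketch of why the POSIX octet-with-carry rule breaks for UTF-8. Each UTF-8 continuation byte carries only 6 payload bits, so the last byte must stay within 0x80..0xBF. The helper names posix_increment and is_continuation are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the patch: advance a multi-byte
   sequence the way POSIX describes for ellipsis ranges, treating the
   bytes as unsigned octets and propagating carry toward the first
   byte.  */
static void
posix_increment (unsigned char *bytes, size_t n)
{
  assert (bytes != NULL);
  while (n-- > 0)
    if (++bytes[n] != 0)
      break;			/* No carry out of this byte, so stop.  */
}

/* Hypothetical check: a UTF-8 continuation byte has the form
   10xxxxxx, i.e. it lies in the range 0x80..0xBF.  */
static int
is_continuation (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}
```

Starting from C2 BF (U+00BF, the last code point of its 64-entry block), posix_increment produces C2 C0, where the second byte is no longer a continuation byte. The correct encoding of U+00C0 is C3 80.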
> @@ -1039,11 +1134,52 @@ hexadecimal range format should use only capital characters"));
> for (cnt = from_nr; cnt <= to_nr; cnt += step)
> {
> char *name_end;
> + unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
> obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
> prefix_len, from, len1 - prefix_len, cnt);
> obstack_1grow (ob, '\0');
> name_end = obstack_finish (ob);
>
> + /* Either we have a UTF-8 charmap, and we compute the bytes (see comment
> + above), or we have a non-UTF-8 charmap and we follow POSIX rules as
> + further below for incrementing the bytes in an ellipsis. */
> + if (is_utf8)
> + {
> + int nubytes;
> +
> + /* Direclty convert the code point to the UTF-8 encoded bytes. */
> + nubytes = output_utf8_bytes (cnt, 4, ubytes);
Typo: Direclty
There are some overlong lines here, please fix.
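For reference, a rough sketch of what direct conversion of a code point to UTF-8 bytes can look like. This is an assumption about the general technique, not the patch's output_utf8_bytes (whose body is not quoted here); the name encode_utf8_sketch and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the patch's output_utf8_bytes: encode code
   point CP as UTF-8 into OUT (room for 4 bytes) and return the number
   of bytes written, or 0 if CP is a surrogate or above U+10FFFF.  */
static size_t
encode_utf8_sketch (uint32_t cp, unsigned char out[4])
{
  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;			/* Surrogates are not encodable.  */
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;			/* Beyond the Unicode range.  */
}
```

Computing the bytes from the code point on each loop iteration sidesteps the carry problem, since every continuation byte is rebuilt from its own 6 bits.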
> diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
> new file mode 100644
> index 0000000000..70ab2bbac7
> --- /dev/null
> +++ b/localedata/C.UTF-8.in
> @@ -0,0 +1,852388 @@
I do not think it's a good idea to check in this file. It's large and
it's dormant during regular builds.
Thanks,
Florian