[PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)

Carlos O'Donell carlos@redhat.com
Mon Jun 29 19:47:02 GMT 2020


On 6/29/20 5:42 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
>> index c23e50944f..d89d788a9b 100644
>> --- a/locale/programs/charmap.c
>> +++ b/locale/programs/charmap.c
>> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
> 
>> @@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
>>    enum token_t ellipsis = 0;
>>    int step = 1;
>>  
>> +  /* POSIX explicitly requires that ellipsis processing do the
>> +     following: "Bytes shall be treated as unsigned octets, and carry
>> +     shall be propagated between the bytes as necessary to represent the
>> +     range."  It then goes on to say that such a declaration should
>> +     never be specified because it creates NULL bytes.  Therefore we
>> +     error on this condition (see charmap_new_char).  However this still
>> +     leaves a problem for encodings which use less than the full 8-bits,
>> +     like UTF-8, and in such encodings you can use an ellipsis to
>> +     silently and accidentally create invalid ranges.  In UTF-8 you have
>> +     only the first 6-bits of the first byte and if your ellipsis covers
>> +     a code point range larger than this 64 code point block the output
>> +     is going to be an invalid non-UTF-8 multi-byte sequence.  Thus for
>> +     UTF-8 we add a speical ellipsis handling loop that can increment
>> +     UTF-8 multi-byte output effectively and for UTF-8 we allow larger
>> +     ellipsis ranges without error.  There may still be other encodings
>> +     for which the ellipsis will still generate invalid multi-byte
>> +     output, but not for UTF-8.  The only alternative would be to call
>> +     gconv for each Unicode code point in the loop to convert it to the
>> +     appropriate multi-byte output, but that would be slow.  */
> 
> Typo: speical
> 
> 
>> @@ -1039,11 +1134,52 @@ hexadecimal range format should use only capital characters"));
>>    for (cnt = from_nr; cnt <= to_nr; cnt += step)
>>      {
>>        char *name_end;
>> +      unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
>>        obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
>>  		      prefix_len, from, len1 - prefix_len, cnt);
>>        obstack_1grow (ob, '\0');
>>        name_end = obstack_finish (ob);
>>  
>> +      /* Either we have a UTF-8 charmap, and we compute the bytes (see comment
>> +	 above), or we have a non-UTF-8 charmap and we follow POSIX rules as
>> +	 further below for incrementing the bytes in an ellipsis.  */
>> +      if (is_utf8)
>> +	{
>> +	  int nubytes;
>> +
>> +	  /* Direclty convert the code point to the UTF-8 encoded bytes.  */
>> +	  nubytes = output_utf8_bytes (cnt, 4, ubytes);
> 
> Typo: Direclty
> 
> There are some overlong lines here, please fix.
> 
>> diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
>> new file mode 100644
>> index 0000000000..70ab2bbac7
>> --- /dev/null
>> +++ b/localedata/C.UTF-8.in
>> @@ -0,0 +1,852388 @@
> 
> I do not think it's a good idea to check in this file.  It's large and
> it's dormant during regular builds.

I accept that. Until we enable C.UTF-8 more broadly, we won't be using it.

My worry here is that as soon as we enable this in Debian and Fedora
we'll be shipping a working C.UTF-8 that consumes 28 MiB installed.

Should we limit collation to ASCII only for C.UTF-8 until we've fixed
the collation table size?

* Submit a C.UTF-8.in with just ASCII in LC_COLLATE.
* Add C.UTF-8 to SUPPORTED.
* Test C.UTF-8.

-- 
Cheers,
Carlos.


