[RFC] Add new C.UTF-8 locale.

Florian Weimer fweimer@redhat.com
Mon Jun 22 21:33:16 GMT 2020


* Carlos O'Donell:

> However, after considering this more deeply I think we can actually
> handle this differently.
>
> Consider the following:
>
> (a) Currently the full collation with weights is 28MiB of data.
>     This is too big for most container deployments of C.UTF-8.
>
> (b) If we agree that surrogate pairs would be invalid UTF-8 anyway,
>     then we can use the equivalent of LC_COLLATE set to C to get code
>     point ordering, with the understanding that surrogate pairs if
>     present would sort into their code point ordering.
>
> In general this would allow a full C.UTF-8 with code point ordering
> that doesn't take up 28MiB with weight data that isn't really required.
>
> This suggestion was made by Rich Felker (musl) and Peter
> Eisentraut (postgresql).
>
> I'm going to see if I can hack up a C.UTF-8 that uses only sorting of
> the first byte to get full code point sorting.
>
> Thoughts?

I'm worried you still need tables to get a working wcscoll.  But
otherwise, the plan sounds fine.
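
For the multibyte side, here is a minimal sketch (not glibc code) of
why byte ordering already gives code point ordering for well-formed
UTF-8. The helper names decode_utf8, cmp_codepoints and
utf8_order_agrees are made up for illustration, and the decoder
assumes valid input and does no validation. It says nothing about
wcscoll, which operates on wchar_t strings and is the part I am
unsure about.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
   Assumes well-formed input; performs no validation.
   Returns the number of bytes consumed. */
size_t decode_utf8(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6)
              | (s[2] & 0x3Fu);
        return 3;
    }
    *cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12)
          | ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
    return 4;
}

/* Hypothetical helper: compare two UTF-8 strings by decoded code
   point.  Returns -1, 0 or 1. */
int cmp_codepoints(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    while (*pa && *pb) {
        uint32_t ca, cb;
        pa += decode_utf8(pa, &ca);
        pb += decode_utf8(pb, &cb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    if (*pa)
        return 1;   /* b is a proper prefix of a */
    if (*pb)
        return -1;  /* a is a proper prefix of b */
    return 0;
}

/* Returns 1 if strcmp (byte order) and the code point comparison
   agree on the ordering of a and b, 0 otherwise. */
int utf8_order_agrees(const char *a, const char *b)
{
    int byte_order = strcmp(a, b);
    int byte_sign = (byte_order > 0) - (byte_order < 0);
    return byte_sign == cmp_codepoints(a, b);
}
```

Calling utf8_order_agrees on any pair of well-formed UTF-8 strings
should return 1, which is the property that would let strcoll fall
back to a plain byte comparison without weight tables.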

Thanks,
Florian


