[RFC] Add new C.UTF-8 locale.
Florian Weimer
fweimer@redhat.com
Mon Jun 22 21:33:16 GMT 2020
* Carlos O'Donell:
> However, after considering this more deeply I think we can actually
> handle this differently.
>
> Consider the following:
>
> (a) Currently the full collation with weights is 28MiB of data.
> This is too big for most container deployments of C.UTF-8.
>
> (b) If we agree that surrogate pairs would be invalid UTF-8 anyway,
> then we can use the equivalent of LC_COLLATE set to C to get code
> point ordering, with the understanding that surrogate pairs if
> present would sort into their code point ordering.
>
> In general this would allow a full C.UTF-8 with code point ordering
> that doesn't take up 28MiB with weight data that isn't really required.
>
> This suggestion was made by Rich Felker (musl) and Peter
> Eisentraut (postgresql).
>
> I'm going to see if I can hack up a C.UTF-8 that uses only sorting of
> the first byte to get full code point sorting.
>
> Thoughts?
I'm worried you still need tables to get a working wcscoll. But
otherwise, the plan sounds fine.
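
For reference, the byte-side property the plan relies on is that comparing
valid UTF-8 byte by byte (as strcmp does, treating bytes as unsigned char)
already yields code point order. A minimal sketch, assuming well-formed
input; utf8_encode and utf8_order are hypothetical helpers for illustration,
not glibc functions, and this says nothing about what wcscoll needs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode one code point as NUL-terminated UTF-8.
   Assumes 0 < cp <= 0x10FFFF; returns the encoded length in bytes. */
static size_t utf8_encode(uint32_t cp, char buf[5])
{
    size_t n;

    assert(cp != 0 && cp <= 0x10FFFF);
    if (cp < 0x80) {
        buf[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    buf[n] = '\0';
    return n;
}

/* Hypothetical helper: sign (-1, 0, 1) of strcmp applied to the UTF-8
   encodings of code points a and b. */
static int utf8_order(uint32_t a, uint32_t b)
{
    char ba[5], bb[5];
    int r;

    utf8_encode(a, ba);
    utf8_encode(b, bb);
    r = strcmp(ba, bb);
    return (r > 0) - (r < 0);
}
```

With these helpers, utf8_order(0xFFFF, 0x10000) is negative and
utf8_order(0x1F600, 0x20AC) is positive, matching the numeric order of the
code points across the 3-byte/4-byte encoding boundary.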
Thanks,
Florian