[PATCH 0/2] Initial C.UTF-8 support.

Carlos O'Donell carlos@redhat.com
Mon Jun 29 04:07:06 GMT 2020


The initial C.UTF-8 support is something I've been working on for a
long time now, and I can see why nobody has tried to get this working
well in the past.

There was least one ellipsis bug (bug 22668) which needed fixing first
and was immediately obvious when you go through the code. It is
surprising that we don't use more ellipsis, but like all things if
it doesn't work then the authors just change the source file format
to work around the issue.  In this case I don't want to work around
the issue because it makes the C locale source file ungainly (requires
listing all collation elements one-by-one).

Next I had to learn that the POSIX collation ellipsis rules don't
work for UTF-8 and that we have no error reporting for such corruption.
Rather than have to break up the UTF-8 charmap into 64-symbol chunks
to avoid the POSIX rules breaking the UTF-8 output, I just added special
processing for UTF-8 input. That is to say if the charmap is called
"UTF-8" then special code for generating the multi-byte sequences are
used. This special code can generate the output quickly and efficiently,
and compiling the locale is very fast. I will in the future add a special
warning pass here to cross check between gconv and the data in the charmap
since such a cross-check would have revealed the problem. Then when we
do our own testing we can run such cross checks.

In the end we succeed in implementing C.UTF-8, but it's 28MiB in size and
there is no easy way to short-circuit the weights table. What we need is
to remove the collation weights and instead just use strcmp internally
since UTF-8 was designed this way.

I would still like to commit C.UTF-8, without adding it to SUPPORTED,
and without adding it to test-input. I want to do this to make incremental
progress and allow other developers the opportunity to work on some of
the changes.

The first patch in the series fixes the ellipsis range handling.

The second patch implements C.UTF-8.

The next steps after inclusion would be:
- Enable sort-test for C.UTF-8 by parallelizing the testing to hide the
  ~5-7 minutes of testing required for C.UTF-8 full code point sorting.
- Add code to remove C.UTF-8 collation weights and just use strcmp instead.
- Enable C.UTF-8 in SUPPORTED.
- Add warning pass to collation support to look for corrupt output.

-- 
Cheers,
Carlos.



More information about the Libc-alpha mailing list