[PATCH v9] POSIX locale covers every byte [BZ# 29511]

наб nabijaczleweli@nabijaczleweli.xyz
Wed Apr 26 18:54:12 GMT 2023

Hi! Long time, apologies.

On Mon, Feb 13, 2023 at 03:52:06PM +0100, Florian Weimer wrote:
> > This largely duplicates the ASCII code with the error path changed
> >
> > There are two user-facing changes:
> >   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> >   * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
> >
> > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
> >   (a) is 1-byte, stateless, and contains 256 characters
> >   (b) they collate in byte order
> >   (c) the first 128 characters are equivalent to ASCII (like previous)
> > cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> > changes to the standard;
> > in short, this means that mbrtowc() must never fail and must return
> >   b if b <= 0x7F else ab+c for all bytes b
> >   where c is some constant >=0x80
> >     and a is a positive integer constant
> >
> > By strategically picking c=<UDF00> we land at the tail-end of the
> > Unicode Low Surrogate Area at DC00-DFFF, described as
> >   > Isolated surrogate code points have no interpretation;
> >   > consequently, no character code charts or names lists
> >   > are provided for this range.
> > and match musl
> I've thought about this some more, and I don't think this is the
> direction we should be going in.
> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>   the Python style).  It should have the property that it can encode
>   every byte string as a string of wchar_t characters, and convert the
>   result back.  It's not entirely trivial because we need to handle
>   partial UTF-8 sequences at the end of the buffer carefully.  There
>   might be some warts regarding EILSEQ handling lurking there.  Like the
>   Python approach, it is somewhat imperfect because it's not preserving
>   identity under string concatenation, i.e. f(x) || f(y) is not always
>   equal to f(x || y), but that's just unavoidable.
> * Switch the charset for the default C locale to UTF-8SE.  This matches
>   the POSIX requirement that every byte can be encoded.
The main point of LC_CTYPE=POSIX as specified is that it allows you to
process paths (which are sequences of bytes, not characters) in a sane
way ‒ part of that is that collation needs to be correct, so maybe, as a
smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".

  >>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
  'Ŀ'
  >>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
  '\udcc4\udcc0'
  >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
  [319]
  >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
  [56516, 56512]
which, I mean, sure, maybe that's sensible (I wouldn't say so), but
  >>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
  '\uffff'
  >>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
  '\udcef\udcbf\udcc0'
  >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
  [65535]
  >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
  [56559, 56511, 56512]

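Here b'\xef\xbf\xbf' sorts before b'\xef\xbf\xc0' as bytes, but it decodes
to the single code point U+FFFF (65535), which sorts after the escaped
surrogates of the second string, so collation no longer follows byte
order. Which means you can't process arbitrary data (pathnames) in a way
that makes sense. In my opinion this would be /worse/ than the current
behaviour: behaving erratically in the presence of Some Data instead of
simply not supporting it.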
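
To make the comparison concrete, here is a rough sketch of the smoke test
applied to both mappings. It assumes the per-byte mapping from the patch
description (b if b <= 0x7F else <UDF00>+b) and uses Python's
surrogateescape handler as a stand-in for the proposed UTF-8SE; the
function names posix_decode and utf8se_decode are made up for illustration
and are not part of glibc or of the patch.

```python
from typing import List


def posix_decode(data: bytes) -> List[int]:
    """Per-byte mapping from the patch description (assumed c = 0xDF00)."""
    return [b if b <= 0x7F else 0xDF00 + b for b in data]


def utf8se_decode(data: bytes) -> List[int]:
    """Python-style surrogateescape decoding, standing in for UTF-8SE."""
    return [ord(ch) for ch in data.decode('UTF-8', errors='surrogateescape')]


# Two byte strings that differ only in the last byte, so lo < hi bytewise.
lo, hi = b'\xef\xbf\xbf', b'\xef\xbf\xc0'
assert lo < hi

# The per-byte mapping is monotonic, so the decoded order follows byte order.
print(posix_decode(lo) < posix_decode(hi))     # True

# surrogateescape turns lo into the single code point U+FFFF, which sorts
# after the escaped surrogates that hi decodes to.
print(utf8se_decode(lo) < utf8se_decode(hi))   # False
```

The first comparison holds for any pair of byte strings because the
per-byte mapping is strictly increasing in b; the second fails for this
pair, so a bytewise sort and a sort on the decoded strings can disagree.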

> * Work with POSIX to drop the requirement that the C locale needs to be
>   a single-byte locale.
That's not going to happen because it's the /only/ way to process paths.
Indeed, XBD 8.2 puts it nicely:
  Users may use the following environment variables to announce specific
  localization requirements to applications.
As a user, I want to be able to announce "each byte is a character,
in natural ordering". This is what LC_CTYPE=C lets me do. I hope
you'll agree this is a good feature to support.

POSIX, also, explicitly says that (XBD 8.2):
5499  1. If the LC_ALL environment variable is defined and is not null, the value of LC_ALL shall
5500     be used.
5501  2. If the LC_* environment variable (LC_COLLATE, LC_CTYPE, LC_MESSAGES,
5502     LC_MONETARY, LC_NUMERIC, LC_TIME) is defined and is not null, the value of the
5503     environment variable shall be used to initialize the category that corresponds to the
5504     environment variable.
5505  3. If the LANG environment variable is defined and is not null, the value of the LANG
5506     environment variable shall be used.
5507  4. If the LANG environment variable is not set or is set to the empty string, the
5508     implementation-defined default locale shall be used.
and XBD 7.2:
3643  All implementations shall define a locale as the default locale, to be invoked when no
3644  environment variables are set, or set to the empty string. This default locale can be the POSIX
3645  locale or any other implementation-defined locale. Some implementations may provide facilities
3646  for local installation administrators to set the default locale, customizing it for each location.
3647  POSIX.1-202x does not require such a facility.
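
For reference, the precedence quoted from XBD 8.2 can be sketched roughly
as below. This is only an illustration of the four quoted rules:
resolve_category and its parameters are made-up names, not a glibc or
POSIX interface, and default_locale stands for whatever the
implementation-defined default locale ends up being.

```python
def resolve_category(env, category, default_locale):
    """Pick the locale name for one category, in the order quoted from XBD 8.2.

    env            -- mapping of environment variable names to values
    category       -- e.g. 'LC_CTYPE' or 'LC_COLLATE'
    default_locale -- the implementation-defined default (an assumption here)
    """
    for name in ('LC_ALL', category, 'LANG'):
        value = env.get(name)
        if value:  # "defined and is not null": unset and empty are both skipped
            return value
    return default_locale


# The user explicitly asks for the bytewise locale for LC_CTYPE only.
print(resolve_category({'LANG': 'pl_PL.UTF-8', 'LC_CTYPE': 'C'},
                       'LC_CTYPE', 'POSIX'))   # C

# Nothing configured at all: the implementation-defined default applies.
print(resolve_category({}, 'LC_CTYPE', 'POSIX'))   # POSIX
```

Only the last fallback is up to the implementation, which is the knob the
rest of this mail is about.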

To that end, how's about:
  * invent UTF-8SE encoding as you say
  * invent POSIX   encoding like in this patch
    (but move the area to match UTF-8SE probably, it's a good precedent)
  * hook up POSIX to POSIX as in here
  * change the implementation-defined default locale to POSIX-but-UTF-8SE
  * (maybe) change the default locale on entry to main() to POSIX-but-UTF-8SE

POSIX requires that LC_ALL=POSIX is the default on entry to main().
That said, I wouldn't mind violating /that/, since anything we do with it
is backwards-compatible. Maybe it makes sense to do that for programs that
don't call setlocale() at all, and they'll behave better when used
internationally. Or not.

Logically, this translates to:
  * if the user has their native locale selected, use that
  * if the user has explicitly selected the bytewise locale, use that
  * if the user hasn't configured their locales at all,
    assume they want UTF-8 but degrade sensibly
  * (maybe) if the program hasn't been written with locales in mind,
            assume the user will be using it with UTF-8 input but
            degrade sensibly

I think this leaves the wolf full and the sheep alive ‒ the default
behaviour is UTF-8(ish), and can be overridden to full UTF-8 or bytes,
per the user's requirements.

Existing users will thus gain the ability to:
  * process data that's UTF-8 but skip over/retain
    illegal/otherwise-encoded bytes losslessly
    (this makes the sample above a killer feature instead of non-sensible,
     so long as it's an encoding in its own right)
  * correctly process arbitrarily-encoded data as bytes
