This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Note on encodings (and locales) with shift state

From: Florian Weimer <fweimer at redhat dot com>
To: Zack Weinberg <zackw at panix dot com>
Cc: Joseph Myers <joseph at codesourcery dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Wed, 08 May 2019 11:40:36 +0200
Subject: Re: Note on encodings (and locales) with shift state
References: <87k1f39zuo.fsf@oldenburg2.str.redhat.com> <CAKCAbMg-HRP7pc7PB2CkEJE1X99W2bYBC=nqJO_RGBU_t-mxOw@mail.gmail.com> <87sgtq924w.fsf@oldenburg2.str.redhat.com> <alpine.DEB.2.21.1905071546540.3445@digraph.polyomino.org.uk> <CAKCAbMhuU=-yfoBZr-yxhCzmPBAbqk69U4Tyb96z93SUkrUFEQ@mail.gmail.com>

* Zack Weinberg:

> More importantly, the set of precomposed Hebrew letters with vowel
> points in Unicode is incomplete.  Any nontrivial text with vowel
> points is going to need to use decomposed forms.  And none of the
> precomposed characters are the NFC normal form for their glyph, and
> they're all in the "Alphabetic Presentation Forms" Unicode block,
> which IIUC is not meant to be used for normal text.  So I think it'd
> be unlikely to do any harm if we changed our CP1255-to-Unicode mapping
> so that, for instance, E0 C7 would become U+05D0 U+05B7 instead of
> U+FB2E.

This is good to know, thanks.
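
To make the proposed mapping concrete, here is a small sketch in Python
(not glibc code).  It assumes Python's own cp1255 codec and unicodedata
module, and says nothing about what the glibc converter does today.  It
shows that a byte-wise decode of E0 C7 yields the decomposed pair, and
that NFC normalization of U+FB2E arrives at the same sequence:

```python
import unicodedata

# CP1255 bytes from the example above: ALEF (0xE0) followed by PATAH (0xC7).
cp1255_bytes = b"\xe0\xc7"

# Python's cp1255 codec is a plain byte-to-code-point table, so it never
# produces a precomposed presentation form.
decoded = cp1255_bytes.decode("cp1255")

# U+FB2E (HEBREW LETTER ALEF WITH PATAH) is excluded from composition,
# so NFC normalization decomposes it instead of keeping it.
precomposed = "\ufb2e"
nfc = unicodedata.normalize("NFC", precomposed)

print([f"U+{ord(c):04X}" for c in decoded])
print([f"U+{ord(c):04X}" for c in nfc])
```

If both lists match, a converter that emits U+05D0 U+05B7 is already
producing NFC output for this input.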

>> >> (2), (4) and (5) appear to require genuine (shift) state support.  The
>> >> only locale we actually have in this category is zh_HK.
>
> N.B. the encoding used for this locale is Big5-HKSCS, which is one of
> the very few non-UTF multibyte encodings that the WhatWG Encoding
> Standard (https://encoding.spec.whatwg.org/) considers to be worth
> bothering to support.  (I generally think WhatWG has been too
> aggressive in dropping support for legacy encodings, but I also
> generally think we have been too *conservative* about that, so any
> point of agreement is probably Doing It Right.)

Interesting.  It also shows the multi-codepoint generation much more
clearly than the glibc sources.
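
As a rough illustration of the multi-codepoint generation, the
following Python sketch decodes the four byte pairs that the WhatWG Big5
decoder special-cases.  It relies on Python's big5hkscs codec as a
stand-in (an assumption on my part; the glibc BIG5-HKSCS module is a
separate implementation), and the names in it are made up for the
example:

```python
# Byte pairs that the WHATWG Big5 decoder maps to two code points each
# (a base letter followed by a combining mark).
MULTI_CODEPOINT_PAIRS = [b"\x88\x62", b"\x88\x64", b"\x88\xa3", b"\x88\xa5"]


def decode_pairs(pairs):
    """Decode each byte pair and return (raw bytes, list of code points)."""
    return [(raw, [ord(ch) for ch in raw.decode("big5hkscs")]) for raw in pairs]


results = decode_pairs(MULTI_CODEPOINT_PAIRS)
for raw, code_points in results:
    print(raw.hex(), " ".join(f"U+{cp:04X}" for cp in code_points))
```

Going the other way, an encoder that has just seen U+00CA or U+00EA
cannot emit anything until it knows whether a combining mark follows,
which is where the need for conversion state comes from.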

WhatWG also lists ISO-2022-JP.  I'm not sure to what extent you can use
that with the multibyte functions.  The current glibc manual wouldn't
consider it sufficiently ASCII-transparent, considering things like
this:

$ echo ¨ | iconv -t ISO-2022-JP | xxd
00000000: 1b24 4221 2f1b 2842 0a                   .$B!/.(B.

The encoding is not safe for file names because it introduces spurious /
characters (and the liberal use of shell metacharacters obviously does
not help, either).
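
The same effect can be checked without iconv.  This is only a sketch
using Python's iso2022_jp codec, which I am assuming behaves like the
glibc converter for this input; the variable names are illustrative:

```python
# U+00A8 DIAERESIS plus the newline that echo appends.
text = "\u00a8\n"
encoded = text.encode("iso2022_jp")

# Every byte of ISO-2022-JP output is in the 7-bit range, so bytes that
# belong to a two-byte character are indistinguishable from ASCII.
all_seven_bit = all(byte < 0x80 for byte in encoded)

# The input has no slash, yet the encoded form contains the byte 0x2F.
has_spurious_slash = "/" not in text and b"/" in encoded

print(encoded.hex())
print(all_seven_bit, has_spurious_slash)
```

A path component encoded this way would be split by the kernel at the
0x2F byte, regardless of what the locale says.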

There's an open project here to make sure that our charset converters
match the WhatWG specification.

>> > I tend to think the entire <wchar.h> API set is not particularly
>> > useful anyway.  The <locale.h> and <langinfo.h> APIs are useful, but I
>> > wonder how much would break if we made these changes:
>> >
>> > * In ALL locales, narrow C strings are encoded in UTF-8 and wide
>> >   strings are encoded in UTF-32.  (Therefore, the mbstowcs family of
>> >   functions only needs to handle the conversion between these two
>> >   encodings.)
>>
>> I think this is what musl is doing.
>>
>> But maybe this goes too far.  Supporting simple single-byte locales *and*
>> UTF-8 may cover more use cases and would not impact other projects much
>> (such as rational ranges).
>
> I'm not sure how to assess the risk here.  Considering the entire set
> of changes I listed as a whole, the most important potential problem I
> could think of was, there are probably a lot of programs that assume
> the "b" flag has no effect, so they fopen binary files without "b" and
> if that starts trying to convert, I dunno, PNG data to UTF-8, they're
> going to break.

Ahh.  This wasn't clear to me.  I think no one expects UTF-8 encoding or
decoding for text files accessed using narrow streams.  And we obviously
cannot enforce proper UTF-8 encoding even in a UTF-8 locale, so
applications cannot rely on it anyway.

>> > * The CODESET property of a locale is used for only one purpose: it
>> >   specifies the encoding to/from which both narrow and wide strings
>> >   are converted when written to/read from FILEs opened in text mode,
>> >   unless overridden by the ",ccs=" mode extension.  (If it isn't
>> >   already, it becomes an error to specify ",ccs=" together with the
>> >   "b" flag.)
>>
>> I don't think this is feasible.  CODESET really has to reflect the
>> encoding of file names, not just file contents.  (Maybe we should have
>> separate knobs for both, but that ship has sailed.)
>
> I thought there was consensus that filenames had to be UTF-8
> regardless.

That's not how Emacs, OpenJDK and others behave.  (I don't know what the
expected behavior of desktop environments such as GNOME is these days.)
They follow CODESET for file name encoding.

Windows and Mac are different, of course.  Windows has UCS-2 at the
storage layer, and Mac apparently uses byte strings, but enforces UTF-8
in a particular normalization form somewhere in the file system stack.

>> > * FILEs opened in text mode never acquire a width orientation: you can
>> >   always apply either wide or narrow functions to them, regardless of
>> >   previous actions, and fwide(fp, mode) always returns 0.
>>
>> Is that conforming?  I don't think it is.  There has to be an
>> orientation, but I think we can still allow the narrow functions on
>> wide streams (in theory, in practice this could be difficult for
>> backwards compatibility reasons).
>
> Hmm, I think you're right that fwide() can't always return 0: N1570
> 7.21.2p4 has no wiggle room, a stream must acquire an orientation upon
> the first use of a narrow or wide function with it.  However, applying
> narrow functions to a wide stream, or vice versa, has undefined
> behavior (7.21.2p5, violation of "shall" requirement not within a
> "constraints" section) so I don't think there should be any
> theoretical obstacle to us _defining_ that behavior.  What backward
> compatibility reasons did you have in mind?

Purely related to the libio data layout.  Many of the _IO_* symbol
versions for GLIBC_2.2 are related to wide stream support, so there was
at one point an expectation that this is part of the ABI.

On the other hand, I have never seen any compatibility problems related
to these, unlike for the narrow interfaces.  It might be the case that
there was never a libstdc++ with glibc-shared libio with wide character
support in wide circulation.  Compatibility was likely impossible once
the vtable pointer position changed in the C++ ABI.  (Before the Itanium
C++ ABI, the ABI was rarely stable across GCC releases, and there wasn't
even a flag to switch back to older ABIs.)

> On Tue, May 7, 2019 at 11:51 AM Joseph Myers <joseph@codesourcery.com> wrote:
>>
>> toupper / tolower in single-byte locales, and towupper / towlower in
>> general, however, do have to be locale-sensitive to behave correctly in
>> Turkish / Azerbaijani / ... (tr_TR and locales with 'copy "tr_TR"' in
>> LC_CTYPE) locales.
>
> Yah, I forgot about toupper/tolower/towupper/towlower.  But I don't
> think there should be any problem with isupper('İ') and islower('ı')
> being true in all locales.

I think you mean iswupper and iswlower, but otherwise, I agree.
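
A quick sketch of the distinction, in Python because its str methods
are locale-independent (so this models the "same answer in all locales"
behavior, not what glibc does under tr_TR).  The variable names are
only for illustration:

```python
dotted_capital_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Classification can plausibly give the same answer in every locale.
classified_ok = dotted_capital_i.isupper() and dotless_small_i.islower()

# Case mapping cannot: without locale information, i maps to I, whereas
# a Turkish locale needs i to map to the dotted capital instead.
default_upper_of_i = "i".upper()

# In UTF-8 the dotted capital takes two bytes, so it does not fit the
# single unsigned char that isupper accepts; in ISO-8859-9 it does.
utf8_length = len(dotted_capital_i.encode("utf-8"))
latin5_bytes = dotted_capital_i.encode("iso8859_9")

print(classified_ok, default_upper_of_i, utf8_length, latin5_bytes.hex())
```

So the narrow isupper/islower question only makes sense for these
letters in a single-byte charset; for UTF-8 it has to be the wide
functions.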

Thanks,
Florian
