This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Note on encodings (and locales) with shift state

From: Zack Weinberg <zackw at panix dot com>
To: Joseph Myers <joseph at codesourcery dot com>, Florian Weimer <fweimer at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Tue, 7 May 2019 16:37:40 -0400
Subject: Re: Note on encodings (and locales) with shift state
References: <87k1f39zuo.fsf@oldenburg2.str.redhat.com> <CAKCAbMg-HRP7pc7PB2CkEJE1X99W2bYBC=nqJO_RGBU_t-mxOw@mail.gmail.com> <87sgtq924w.fsf@oldenburg2.str.redhat.com> <alpine.DEB.2.21.1905071546540.3445@digraph.polyomino.org.uk>

On Tue, May 7, 2019 at 3:04 AM Florian Weimer <fweimer@redhat.com> wrote:
> * Zack Weinberg:
> > On Mon, May 6, 2019 at 2:56 PM Florian Weimer <fweimer@redhat.com> wrote:
...
> I suppose some research would be required to see if CP1255 is even
> useful for teaching or scholarly use.  (For a start, the set of vowel
> points in CP1255 appears to be incomplete compared to what's in
> Unicode.)  But I don't think it's a relevant question, so I'm not going
> to investigate it.

I *think* the diacritics in Unicode but not in CP1255 are not vowel
points but "cantillation marks", required only for texts that will be
recited in the traditional fashion during a Jewish religious service.

More importantly, the set of precomposed Hebrew letters with vowel
points in Unicode is incomplete.  Any nontrivial text with vowel
points is going to need to use decomposed forms.  And none of the
precomposed characters are the NFC normal form for their glyph, and
they're all in the "Alphabetic Presentation Forms" Unicode block,
which IIUC is not meant to be used for normal text.  So I think it'd
be unlikely to do any harm if we changed our CP1255-to-Unicode mapping
so that, for instance, E0 C7 would become U+05D0 U+05B7 instead of U+FB2E.
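To make the proposed mapping change concrete, here is a minimal sketch in C of expanding a precomposed presentation form into base letter plus point.  It is not glibc code: the helper name decompose_hebrew and the table hebrew_decomp are hypothetical, and the table lists only three of the canonical decompositions from the U+FB2x/U+FB4x range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not taken from glibc's CP1255 converter.
   Each entry pairs a precomposed "Alphabetic Presentation Forms"
   character with its canonical decomposition (base letter, point).  */
struct decomp { uint32_t composed, base, point; };

static const struct decomp hebrew_decomp[] = {
  { 0xFB2E, 0x05D0, 0x05B7 },  /* ALEF WITH PATAH  */
  { 0xFB2F, 0x05D0, 0x05B8 },  /* ALEF WITH QAMATS */
  { 0xFB4B, 0x05D5, 0x05B9 },  /* VAV WITH HOLAM   */
};

/* Store the decomposed form of c in out[] and return the number of
   code points written: 2 if c is in the table, otherwise 1 (c itself).  */
size_t decompose_hebrew(uint32_t c, uint32_t out[2])
{
  assert(out != NULL);
  for (size_t i = 0; i < sizeof hebrew_decomp / sizeof hebrew_decomp[0]; i++)
    if (hebrew_decomp[i].composed == c) {
      out[0] = hebrew_decomp[i].base;
      out[1] = hebrew_decomp[i].point;
      return 2;
    }
  out[0] = c;
  return 1;
}
```

With this table, decompose_hebrew(0xFB2E, out) yields U+05D0 U+05B7, which is the output suggested above for the CP1255 bytes E0 C7.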

> >> (2), (4) and (5) appear to require genuine (shift) state support.  The
> >> only locale we actually have in this category is zh_HK.

N.B. the encoding used for this locale is Big5-HKSCS, which is one of
the very few non-UTF multibyte encodings that the WhatWG Encoding
Standard (https://encoding.spec.whatwg.org/) considers to be worth
bothering to support.  (I generally think WhatWG has been too
aggressive in dropping support for legacy encodings, but I also
generally think we have been too *conservative* about that, so any
point of agreement is probably Doing It Right.)

> > I tend to think the entire <wchar.h> API set is not particularly
> > useful anyway.  The <locale.h> and <langinfo.h> APIs are useful, but I
> > wonder how much would break if we made these changes:
> >
> > * In ALL locales, narrow C strings are encoded in UTF-8 and wide
> >   strings are encoded in UTF-32.  (Therefore, the mbstowcs family of
> >   functions only needs to handle the conversion between these two
> >   encodings.)
>
> I think this is what musl is doing.
>
> But maybe this goes too far.  Support simple single-byte locales *and*
> UTF-8 may cover more use cases and would not impact other projects much
> (such as rational ranges).

I'm not sure how to assess the risk here.  Considering the entire set
of changes I listed as a whole, the most important potential problem I
can think of is that a lot of programs probably assume the "b" flag
has no effect.  They fopen binary files without "b", and if that
starts trying to convert, say, PNG data to UTF-8, they're going to
break.
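For a sense of scale, the one conversion that the mbstowcs family would still have to perform under the first quoted proposal (UTF-8 to UTF-32) can be sketched in a few lines of C.  This is an assumption-laden illustration, not glibc's or musl's implementation; the function name utf8_decode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decode one UTF-8 sequence from s (n bytes
   available) into *out.  Returns the number of bytes consumed, or 0
   if the input is empty, truncated, or not valid UTF-8.  */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
  assert(s != NULL && out != NULL);
  if (n == 0)
    return 0;

  uint32_t c = s[0];
  uint32_t min;   /* smallest code point allowed for this length */
  size_t len;

  if (c < 0x80) { *out = c; return 1; }
  else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; min = 0x80; }
  else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; min = 0x800; }
  else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; min = 0x10000; }
  else return 0;  /* stray continuation byte or invalid lead byte */

  if (n < len)
    return 0;
  for (size_t i = 1; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80)
      return 0;
    c = (c << 6) | (s[i] & 0x3F);
  }

  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *out = c;
  return len;
}
```

The decoder carries no shift state between calls, which is the property that makes the UTF-8-only design attractive in this discussion.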

> > * The CODESET property of a locale is used for only one purpose: it
> >   specifies the encoding to/from which both narrow and wide strings
> >   are converted when written to/read from FILEs opened in text mode,
> >   unless overridden by the ",ccs=" mode extension.  (If it isn't
> >   already, it becomes an error to specify ",ccs=" together with the
> >   "b" flag.)
>
> I don't think this is feasible.  CODESET really has to reflect the
> encoding of file names, not just file contents.  (Maybe we should have
> separate knobs for both, but that ship has sailed.)

I thought there was consensus that filenames had to be UTF-8 regardless.

> > * FILEs opened in text mode never acquire a width orientation: you can
> >   always apply either wide or narrow functions to them, regardless of
> >   previous actions, and fwide(fp, mode) always returns 0.
>
> Is that conforming?  I don't think it is.  There has to be an
> orientation, but I think we can still allow the narrow functions on
> wide streams (in theory, in practice this could be difficult for
> backwards compatibility reasons).

Hmm, I think you're right that fwide() can't always return 0: N1570
7.21.2p4 has no wiggle room; a stream must acquire an orientation upon
the first use of a narrow or wide function with it.  However, applying
narrow functions to a wide stream, or vice versa, has undefined
behavior (7.21.2p5, violation of "shall" requirement not within a
"constraints" section) so I don't think there should be any
theoretical obstacle to us _defining_ that behavior.  What backward
compatibility reasons did you have in mind?
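The orientation rule cited above can be observed with standard C alone.  The following is a small sketch; the helper names orientation and orientation_after are hypothetical, and it deliberately stops short of mixing narrow and wide calls on one stream, since that is the undefined case under discussion.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Query the orientation without changing it: fwide(fp, 0) returns a
   negative value for byte-oriented, positive for wide-oriented, and
   0 for a stream with no orientation yet.  Normalized to -1/0/+1.  */
int orientation(FILE *fp)
{
  assert(fp != NULL);
  int w = fwide(fp, 0);
  return (w > 0) - (w < 0);
}

/* Open a scratch stream, perform at most one I/O call on it, and
   report the resulting orientation.
   first_op: 0 = no I/O, 1 = narrow fputs, 2 = wide fputws.
   Returns -2 if the scratch stream cannot be created.  */
int orientation_after(int first_op)
{
  FILE *fp = tmpfile();
  if (fp == NULL)
    return -2;
  if (first_op == 1)
    fputs("x", fp);
  else if (first_op == 2)
    fputws(L"x", fp);
  int r = orientation(fp);
  fclose(fp);
  return r;
}
```

A fresh stream reports no orientation, and the first narrow or wide call fixes it, which is why fwide() cannot be made to return 0 unconditionally.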

I have been tempted before to propose this change by itself, just
because it would allow us to eliminate a bunch of workarounds within
glibc for the possibility that stderr might be wide-oriented.

On Tue, May 7, 2019 at 11:51 AM Joseph Myers <joseph@codesourcery.com> wrote:
>
> toupper / tolower in single-byte locales, and towupper / towlower in
> general, however, do have to be locale-sensitive to behave correctly in
> Turkish / Azerbaijani / ... (tr_TR and locales with 'copy "tr_TR"' in
> LC_CTYPE) locales.

Yah, I forgot about toupper/tolower/towupper/towlower.  But I don't
think there should be any problem with isupper('İ') and islower('ı')
being true in all locales.

zw

