This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Note on encodings (and locales) with shift state

From: Florian Weimer <fweimer at redhat dot com>
To: Zack Weinberg <zackw at panix dot com>
Cc: GNU C Library <libc-alpha at sourceware dot org>
Date: Tue, 07 May 2019 09:04:15 +0200
Subject: Re: Note on encodings (and locales) with shift state
References: <87k1f39zuo.fsf@oldenburg2.str.redhat.com> <CAKCAbMg-HRP7pc7PB2CkEJE1X99W2bYBC=nqJO_RGBU_t-mxOw@mail.gmail.com>

* Zack Weinberg:

> On Mon, May 6, 2019 at 2:56 PM Florian Weimer <fweimer@redhat.com> wrote:
> ...
>> Looking at this list, (1) is possibly a fringe use case.  It may not
>> even be necessary to produce pre-composed Unicode code points for
>> correctness.  CP1255 is a variant of ISO8859-8, which does not do
>> pre-composition.

(ISO 8859-8 does not have vowel points, which explains why there is no
pre-composition.)

>> We use CP1255 with yi_US (Yiddish), and ISO8859-8 with
>> he_IL (Hebrew).  The difference probably does not matter for
>> contemporary Hebrew because it does not generally use vowel-points, but
>> Yiddish might (preferences appear to vary).  But then the Yiddish
>> Wikipedia does not seem to use pre-composed characters.
>
> In Modern Hebrew, AFAIK, the primary use case for vowel points is
> texts intended for people learning the written form of the language
> (whether or not they are native speakers).  They might also be usable
> to indicate the proper pronunciation of texts in Ancient Hebrew, but I
> don't know if the historical sound changes were limited to the vowels.
> (I know you *can't* do that to handle the differences between Sephardi
> and Ashkenazi pronunciation, which definitely does affect
> consonants.)  Linguists would probably use IPA anyway.

I suppose some research would be required to see if CP1255 is even
useful for teaching or scholarly use.  (For a start, the set of vowel
points in CP1255 appears to be incomplete compared to what's in
Unicode.)  But I don't think it's a relevant question, so I'm not going
to investigate it.

>> (2), (4) and (5) appear to require genuine (shift) state support.  The
>> only locale we actually have in this category is zh_HK.
>>
>> That's a lot of code in the library to support what is essentially a
>> single locale.  But as of today, we definitely need the shift state
>> support.
>
> I tend to think the entire <wchar.h> API set is not particularly
> useful anyway.  The <locale.h> and <langinfo.h> APIs are useful, but I
> wonder how much would break if we made these changes:
>
> * In ALL locales, narrow C strings are encoded in UTF-8 and wide
>   strings are encoded in UTF-32.  (Therefore, the mbstowcs family of
>   functions only needs to handle the conversion between these two
>   encodings.)

I think this is what musl is doing.

But maybe this goes too far.  Supporting simple single-byte locales
*and* UTF-8 may cover more use cases and would not have much impact on
other projects (such as rational ranges).

> * In ALL locales, the <ctype.h> functions only recognize ASCII
>   characters (this is a consequence of narrow C strings always being
>   UTF-8; only ASCII characters fit in a single 'char' anymore) and the
>   <wctype.h> functions' behavior is locale-invariant and defined
>   strictly in terms of Unicode character properties.  LC_CTYPE blocks
>   in locale definition files are either ignored or rejected (with the
>   possible exception of transliteration specs).

I think we should do this for isdigit and a few other functions anyway.
They really do not have to be locale-sensitive because a
C-conforming/POSIX-conforming locale cannot change the tables.
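
To illustrate the point, a locale-independent digit test could be as
small as the sketch below.  The name ascii_isdigit is hypothetical and
not an existing glibc interface.  The range comparison relies only on
the C guarantee that '0' through '9' have consecutive values.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper, not a glibc interface.  Classifies a decimal
   digit without consulting the current locale.  The argument follows
   the <ctype.h> contract: EOF or a value representable as unsigned
   char.  */
static bool
ascii_isdigit (int c)
{
  assert (c == EOF || (c >= 0 && c <= UCHAR_MAX));
  /* C guarantees that '0'..'9' are contiguous and ascending.  */
  return c >= '0' && c <= '9';
}
```

Because nothing here reads locale data, the result is the same before
and after any setlocale call.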

> * strcoll, wcscoll, strxfrm, and wcsxfrm, however, continue to have
>   locale-specific behavior.

Yes, that is somewhat unavoidable unfortunately.

> * iconv continues to be able to convert among all the encodings it
>   currently supports.

Makes sense.

> * The CODESET property of a locale is used for only one purpose: it
>   specifies the encoding to/from which both narrow and wide strings
>   are converted when written to/read from FILEs opened in text mode,
>   unless overridden by the ",ccs=" mode extension.  (If it isn't
>   already, it becomes an error to specify ",ccs=" together with the
>   "b" flag.)

I don't think this is feasible.  CODESET really has to reflect the
encoding of file names, not just file contents.  (Maybe we should have
separate knobs for both, but that ship has sailed.)

I think it would be more consistent to have only UTF-8 locales.  (But
see above.)
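
For reference, CODESET is the value applications read back through
nl_langinfo.  A minimal sketch of such a query follows; codeset_of is
a hypothetical helper, and the string reported for the "C" locale
differs between C libraries.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: select LOCALE_NAME for LC_CTYPE and return the
   codeset name the C library reports for it, or NULL if the locale is
   not available.  Note that setlocale changes process-wide state.  */
static const char *
codeset_of (const char *locale_name)
{
  assert (locale_name != NULL);
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return NULL;
  /* The returned string may be overwritten by later calls.  */
  return nl_langinfo (CODESET);
}
```

A program that wants to interpret file names according to the locale
would typically pass the result to iconv_open.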

> * FILEs opened in text mode never acquire a width orientation: you can
>   always apply either wide or narrow functions to them, regardless of
>   previous actions, and fwide(fp, mode) always returns 0.

Is that conforming?  I don't think it is.  There has to be an
orientation, but I think we can still allow the narrow functions on
wide streams (in theory; in practice this could be difficult for
backwards compatibility reasons).
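
The behavior in question can be observed with a small probe.  This is a
sketch only: orientation_after_narrow_write is a hypothetical helper,
and it assumes tmpfile can create a stream in the test environment.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical probe.  Performs one narrow write on a fresh stream and
   returns what fwide reports afterwards: negative for byte-oriented,
   positive for wide-oriented, zero for no orientation (or if no stream
   could be created).  If REQUEST_WIDE is nonzero, the final fwide call
   asks for wide orientation instead of merely querying.  */
static int
orientation_after_narrow_write (int request_wide)
{
  FILE *fp = tmpfile ();
  if (fp == NULL)
    return 0;

  /* A mode of 0 only queries; a freshly opened stream is unoriented.  */
  assert (fwide (fp, 0) == 0);

  int result = 0;
  if (fputs ("x", fp) != EOF)
    /* The narrow write fixed the orientation; a later request for wide
       orientation does not change it.  */
    result = fwide (fp, request_wide ? 1 : 0);

  fclose (fp);
  return result;
}
```

Under the current rules both calls, with and without the wide request,
report a byte-oriented stream, which is why always returning 0 from
fwide would be a visible change.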

Thanks,
Florian

Follow-Ups:
- Re: Note on encodings (and locales) with shift state
  - From: Joseph Myers

References:
- Note on encodings (and locales) with shift state
  - From: Florian Weimer
- Re: Note on encodings (and locales) with shift state
  - From: Zack Weinberg
