This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Note on encodings (and locales) with shift state
- From: Zack Weinberg <zackw at panix dot com>
- To: Florian Weimer <fweimer at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Mon, 6 May 2019 16:40:03 -0400
- Subject: Re: Note on encodings (and locales) with shift state
- References: <87k1f39zuo.fsf@oldenburg2.str.redhat.com>
On Mon, May 6, 2019 at 2:56 PM Florian Weimer <fweimer@redhat.com> wrote:
...
> Looking at this list, (1) is possibly a fringe use case. It may not
> even be necessary to produce pre-composed Unicode code points for
> correctness. CP1255 is a variant of ISO8859-8, which does not do
> pre-composition. We use CP1255 with yi_US (Yiddish), and ISO8859-8 with
> he_IL (Hebrew). The difference probably does not matter for
> contemporary Hebrew because it does not generally use vowel-points, but
> Yiddish might (preferences appear to vary). But then the Yiddish
> Wikipedia does not seem to use pre-composed characters.
In Modern Hebrew, AFAIK, the primary use case for vowel points is
texts intended for people learning the written form of the language
(whether or not they are native speakers). They might also be usable
to indicate the proper pronunciation of texts in Ancient Hebrew, but I
don't know if the historical sound changes were limited to the vowels.
(I know you *can't* use them to handle the differences between Sephardi
and Ashkenazi pronunciation, which definitely do affect
consonants.) Linguists would probably use IPA anyway.
> (2), (4) and (5) appear to require genuine (shift) state support. The
> only locale we actually have in this category is zh_HK.
>
> That's a lot of code in the library to support what is essentially a
> single locale. But as of today, we definitely need the shift state
> support.
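For reference, the "shift state" here is what the standard C API
carries in an mbstate_t object between calls. A minimal sketch of how
a caller threads that state through mbrtowc follows; count_mb_chars is
a made-up helper name for illustration, not anything in glibc, and the
result depends on whatever LC_CTYPE locale is in effect.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a glibc interface): count the characters in
   the multibyte string S under the current LC_CTYPE locale.  Returns
   (size_t) -1 if S contains an invalid or incomplete sequence, or if
   it ends outside the initial shift state.  */
size_t
count_mb_chars (const char *s)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);   /* all-zero is the initial state */

  size_t remaining = strlen (s);
  size_t count = 0;
  while (remaining > 0)
    {
      wchar_t wc;
      /* The state object carries any pending shift information from
         one call to the next.  */
      size_t r = mbrtowc (&wc, s, remaining, &st);
      if (r == (size_t) -1 || r == (size_t) -2)
        return (size_t) -1;     /* invalid or truncated sequence */
      if (r == 0)
        break;                  /* decoded a null character */
      s += r;
      remaining -= r;
      count++;
    }

  /* In a stateful encoding, a well-formed string ends back in the
     initial shift state.  */
  return mbsinit (&st) ? count : (size_t) -1;
}
```

In a stateless encoding such as UTF-8 the final mbsinit check only
catches truncated input; it is the stateful charsets that make the
state object carry real information across calls.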
I tend to think the entire <wchar.h> API set is not particularly
useful anyway. The <locale.h> and <langinfo.h> APIs are useful, but I
wonder how much would break if we made these changes:
* In ALL locales, narrow C strings are encoded in UTF-8 and wide
strings are encoded in UTF-32. (Therefore, the mbstowcs family of
functions only needs to handle the conversion between these two
encodings.)
* In ALL locales, the <ctype.h> functions only recognize ASCII
characters (this is a consequence of narrow C strings always being
UTF-8; only ASCII characters fit in a single 'char' anymore) and the
<wctype.h> functions' behavior is locale-invariant and defined
strictly in terms of Unicode character properties. LC_CTYPE blocks
in locale definition files are either ignored or rejected (with the
possible exception of transliteration specs).
* strcoll, wcscoll, strxfrm, and wcsxfrm, however, continue to have
locale-specific behavior.
* iconv continues to be able to convert among all the encodings it
currently supports.
* The CODESET property of a locale is used for only one purpose: it
specifies the encoding to/from which both narrow and wide strings
are converted when written to/read from FILEs opened in text mode,
unless overridden by the ",ccs=" mode extension. (If it isn't
already, it becomes an error to specify ",ccs=" together with the
"b" flag.)
* FILEs opened in text mode never acquire a width orientation: you can
always apply either wide or narrow functions to them, regardless of
previous actions, and fwide(fp, mode) always returns 0.
* FILEs opened in binary mode are automatically made narrow-oriented;
there is no way to override this.
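
To make the first bullet concrete: if narrow strings were always UTF-8
and wide strings always UTF-32, the mbstowcs family would reduce to
roughly the following. This is only a sketch under that assumption,
not a proposed patch; utf8_decode_one and utf8_to_utf32 are
hypothetical names, and uint32_t stands in for wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence from S (at most N
   bytes) into *OUT.  Returns the number of bytes consumed, or 0 if
   the input is empty, truncated, or ill-formed.  */
size_t
utf8_decode_one (const unsigned char *s, size_t n, uint32_t *out)
{
  assert (s != NULL && out != NULL);
  if (n == 0)
    return 0;

  uint32_t c = s[0];
  size_t len;
  uint32_t min;                 /* smallest value this length may encode */
  if (c < 0x80)
    {
      *out = c;
      return 1;
    }
  else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; min = 0x80; }
  else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; min = 0x800; }
  else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; min = 0x10000; }
  else
    return 0;                   /* stray continuation or invalid lead byte */

  if (n < len)
    return 0;
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *out = c;
  return len;
}

/* Hypothetical stand-in for mbstowcs: convert the NUL-terminated UTF-8
   string SRC into DST, which has room for DSTLEN elements including
   the terminator.  Returns the number of code points written, or
   (size_t) -1 on ill-formed input or insufficient space.  */
size_t
utf8_to_utf32 (const char *src, uint32_t *dst, size_t dstlen)
{
  const unsigned char *s = (const unsigned char *) src;
  size_t remaining = strlen (src);
  size_t count = 0;

  if (dstlen == 0)
    return (size_t) -1;
  while (remaining > 0)
    {
      uint32_t cp;
      size_t used = utf8_decode_one (s, remaining, &cp);
      if (used == 0 || count + 1 >= dstlen)
        return (size_t) -1;
      dst[count++] = cp;
      s += used;
      remaining -= used;
    }
  dst[count] = 0;
  return count;
}
```

Note that there is no mbstate_t anywhere in this: UTF-8 has no shift
states, so the only "state" is a partially read sequence, which a
whole-string conversion never has to expose.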
zw