This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Note on encodings (and locales) with shift state


On Mon, May 6, 2019 at 2:56 PM Florian Weimer <fweimer@redhat.com> wrote:
...
> Looking at this list, (1) is possibly a fringe use case.  It may not
> even be necessary to produce pre-composed Unicode code points for
> correctness.  CP1255 is a variant of ISO8859-8, which does not do
> pre-composition.  We use CP1255 with yi_US (Yiddish), and ISO8859-8 with
> he_IL (Hebrew).  The difference probably does not matter for
> contemporary Hebrew because it does not generally use vowel-points, but
> Yiddish might (preferences appear to vary).  But then the Yiddish
> Wikipedia does not seem to use pre-composed characters.

In Modern Hebrew, AFAIK, the primary use case for vowel points is
texts intended for people learning the written form of the language
(whether or not they are native speakers).  They might also be usable
to indicate the proper pronunciation of texts in Ancient Hebrew, but I
don't know if the historical sound changes were limited to the vowels.
(I know you *can't* use them to handle the differences between
Sephardi and Ashkenazi pronunciation, which definitely do affect
consonants.)  Linguists would probably use IPA anyway.

> (2), (4) and (5) appear to require genuine (shift) state support.  The
> only locale we actually have in this category is zh_HK.
>
> That's a lot of code in the library to support what is essentially a
> single locale.  But as of today, we definitely need the shift state
> support.

I tend to think the entire <wchar.h> API set is not particularly
useful anyway.  The <locale.h> and <langinfo.h> APIs are useful, but I
wonder how much would break if we made these changes:

* In ALL locales, narrow C strings are encoded in UTF-8 and wide
  strings are encoded in UTF-32.  (Therefore, the mbstowcs family of
  functions only needs to handle the conversion between these two
  encodings.)

* In ALL locales, the <ctype.h> functions only recognize ASCII
  characters (this is a consequence of narrow C strings always being
  UTF-8; only ASCII characters fit in a single 'char' anymore) and the
  <wctype.h> functions' behavior is locale-invariant and defined
  strictly in terms of Unicode character properties.  LC_CTYPE blocks
  in locale definition files are either ignored or rejected (with the
  possible exception of transliteration specs).

* strcoll, wcscoll, strxfrm, and wcsxfrm, however, continue to have
  locale-specific behavior.

* iconv continues to be able to convert among all the encodings it
  currently supports.

* The CODESET property of a locale is used for only one purpose: it
  specifies the encoding to/from which both narrow and wide strings
  are converted when written to/read from FILEs opened in text mode,
  unless overridden by the ",ccs=" mode extension.  (If it isn't
  already, it becomes an error to specify ",ccs=" together with the
  "b" flag.)

* FILEs opened in text mode never acquire a width orientation: you can
  always apply either wide or narrow functions to them, regardless of
  previous actions, and fwide(fp, mode) always returns 0.

* FILEs opened in binary mode are automatically made narrow-oriented;
  there is no way to override this.
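
To illustrate the first point in the list above: if narrow strings were
always UTF-8 and wide strings always UTF-32, the conversion would need
neither shift state nor any locale lookup.  What follows is only a
minimal sketch under that assumption, not glibc code; the names
utf8_decode_one and utf8_to_utf32 are hypothetical and made up for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a glibc interface): decode one UTF-8
   sequence from s, which holds n bytes, into *out.  Returns the number
   of bytes consumed, or 0 if the input is invalid or truncated. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t len;
    uint32_t cp, min;

    if (b0 < 0x80) {                    /* ASCII: one byte, one code point */
        *out = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        len = 2; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        len = 3; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        len = 4; cp = b0 & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                       /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}

/* Hypothetical helper: convert srclen bytes of UTF-8 into at most
   dstcap UTF-32 code points.  Returns the number of code points
   written, or (size_t)-1 on invalid input or insufficient space. */
static size_t utf8_to_utf32(const char *src, size_t srclen,
                            uint32_t *dst, size_t dstcap)
{
    assert(src != NULL && dst != NULL);

    const unsigned char *s = (const unsigned char *)src;
    size_t written = 0;

    while (srclen > 0) {
        uint32_t cp;
        size_t used = utf8_decode_one(s, srclen, &cp);
        if (used == 0 || written == dstcap)
            return (size_t)-1;
        dst[written++] = cp;
        s += used;
        srclen -= used;
    }
    return written;
}
```

For example, calling utf8_to_utf32("A\xD7\xA9", 3, buf, 4) with a
four-element uint32_t buffer would store U+0041 and U+05E9 (HEBREW
LETTER SHIN) and return 2.  The only thing a restartable variant would
have to carry between calls is a partially read sequence, which is much
less than what a stateful encoding needs.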

zw