This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Note on encodings (and locales) with shift state

From: Florian Weimer <fweimer at redhat dot com>
To: libc-alpha at sourceware dot org
Date: Mon, 06 May 2019 20:55:59 +0200
Subject: Note on encodings (and locales) with shift state

I ran the following program on Fedora 29:

#include <err.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main (void)
{
  FILE *fp = popen ("locale -a", "r");
  if (fp == NULL)
    err (1, "locale -a");

  char *buffer = NULL;
  size_t buffer_length = 0;
  while (true)
    {
      ssize_t ret = getline (&buffer, &buffer_length, fp);
      if (ferror (fp))
        err (1, "getline");
      if (feof (fp))
        break;
      if (ret <= 0)
        errx (1, "getline");
      {
        char *nl = strchr (buffer, '\n');
        if (nl != NULL)
          *nl = '\0';
      }

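      /* setlocale returns the name of the newly selected locale (not
         the previous one), or NULL if the locale cannot be selected.  */
      if (setlocale (LC_ALL, buffer) == NULL)
        err (1, "setlocale (\"%s\")", buffer);
      /* mblen (NULL, 0) is non-zero if the encoding is state-dependent.  */
      if (mblen (NULL, 0) != 0)
        printf ("%s: %s\n", buffer, nl_langinfo (CODESET));
      /* Go back to the initial "C" locale before the next iteration.  */
      if (setlocale (LC_ALL, "C") == NULL)
        err (1, "setlocale (\"C\")");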
    }

  if (pclose (fp) != 0)
    errx (1, "pclose");

  return 0;
}

Here's the output I got:

yi_US: CP1255
yi_US.cp1255: CP1255
zh_HK: BIG5-HKSCS
zh_HK.big5hkscs: BIG5-HKSCS

Fedora builds all locales listed in localedata/SUPPORTED, so the shift
state code is there to support only two locales.

This prompted me to look into how shift state is used.  gconv converters
that use state need to define SAVE_RESET_STATE.  The usages broadly fall
into these categories:

1 Pre-composing characters for encodings which are nominally
  single-byte.  This is the CP1255 case, where Hebrew letters followed
  by vowel-points are translated into a single Unicode codepoint.  This
  requires state because a Hebrew letter at the end of the input is not
  an incomplete multi-byte sequence, so the multi-byte routines cannot
  unconditionally request additional input bytes in this case.

  CP1258 and TCVN5712-1 appear to be the same and are ASCII-transparent,
  so we could create locales with them.

2 Work-around for limitations of the C multi-byte API with internal UCS4
  encoding.  The problem here is that the API assumes that one
  multi-byte sequence produces one wide character.  (See mbtowc; this
  function is particularly clear in this regard.)  This is the
  BIG5-HKSCS case: The prefix of a multi-byte BIG5-HKSCS sequence does
  not determine the prefix of the UCS4 sequence, so the whole sequence
  has to be processed, and for the sake of functions such as mbtowc, the
  second UCS4 character has to be saved for a subsequent call.

  I believe this applies to EUC-JISX0213 and SHIFT_JISX0213 as well.

3 Multi-byte API limitations with expansive UCS4 encoding.  TSCII is
  probably the most drastic example of this: the byte 0x82 expands to
  four wide characters.  It is a variant of the previous case, but the
  key point is that the decoded UCS4 string contains *more* codepoints
  than the input contains bytes, even starting from an initial
  shift state.  This probably causes vulnerabilities (buffer overflows
  due to miscalculated string lengths).

  We do not ship a TSCII locale, but TSCII is ASCII-transparent, so we
  could.

4 Traditional shift states.  This is what applies to various IBM
  charsets.  None of those are even remotely ASCII-transparent as far as
  I can tell, and can therefore not be used with the C multi-byte APIs.
  However, it is conceivable that an ASCII-transparent locale with a
  similar encoding style could be created.

  UTF-7 is probably fairly close to a traditional shift state encoding,
  with '+' as the shift-in byte and '-' as the shift-out byte, except
  that it's even less ASCII-transparent than SJIS, which we consider
  problematic and not officially supported.

5 ISO-2022-style encodings. These have ISO-2022-* names.  I think they
  could be considered ASCII-transparent to some degree, but we do not
  list them as encodings for any supported locales.  These are fairly
  close to traditional shift states, except that the shift sequence is
  itself an ASCII multi-byte sequence (starting with ESC).

SJIS (Shift-JIS) is curiously NOT among these charsets.  In fact,
Shift-JIS does not use shift states, despite the name.
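
To make categories (2) and (3) concrete, here is a minimal sketch of an
mbtowc-like decoder that hands out one wide character per call and keeps
the rest in its state.  Everything in it is hypothetical: the encoding is
made up (byte 0x82 stands for the four characters "abcd", every other
byte maps to itself), and toy_state, toy_decode and toy_decode_all are
illustration names, not glibc interfaces.  Unlike mbrtowc, a return value
of 0 here means "no input consumed, the character came from the state".

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical conversion state: wide characters that were already
   decoded but have not been handed to the caller yet.  */
struct toy_state
{
  wchar_t pending[3];
  size_t pending_count;
};

/* Store one wide character in *PWC.  Returns the number of input bytes
   consumed from S (which has N bytes left); 0 means the character was
   taken from STATE.  */
size_t
toy_decode (wchar_t *pwc, const unsigned char *s, size_t n,
            struct toy_state *state)
{
  assert (pwc != NULL && state != NULL);
  if (state->pending_count > 0)
    {
      /* Hand out the oldest saved character, shift the rest down.  */
      *pwc = state->pending[0];
      for (size_t i = 1; i < state->pending_count; ++i)
        state->pending[i - 1] = state->pending[i];
      --state->pending_count;
      return 0;
    }
  assert (s != NULL && n > 0);
  if (s[0] == 0x82)
    {
      /* One byte, four wide characters: return the first, save three.  */
      *pwc = L'a';
      state->pending[0] = L'b';
      state->pending[1] = L'c';
      state->pending[2] = L'd';
      state->pending_count = 3;
    }
  else
    *pwc = s[0];
  return 1;
}

/* Decode all N bytes of S into OUT, which has room for OUT_SIZE wide
   characters.  Returns the number of wide characters written, or
   (size_t) -1 if OUT is too small.  */
size_t
toy_decode_all (wchar_t *out, size_t out_size,
                const unsigned char *s, size_t n)
{
  struct toy_state state = { { 0 }, 0 };
  size_t pos = 0;
  size_t count = 0;
  while (pos < n || state.pending_count > 0)
    {
      if (count == out_size)
        return (size_t) -1;
      pos += toy_decode (&out[count], s + pos, n - pos, &state);
      ++count;
    }
  return count;
}
```

With the three input bytes 'x', 0x82, 'y', toy_decode_all produces six
wide characters, so an output buffer sized by the input byte count would
be too small.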

Looking at this list, (1) is possibly a fringe use case.  It may not
even be necessary to produce pre-composed Unicode code points for
correctness.  CP1255 is a variant of ISO8859-8, which does not do
pre-composition.  We use CP1255 with yi_US (Yiddish), and ISO8859-8 with
he_IL (Hebrew).  The difference probably does not matter for
contemporary Hebrew because it does not generally use vowel-points, but
Yiddish might (preferences appear to vary).  But then the Yiddish
Wikipedia does not seem to use pre-composed characters.
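
The reason pre-composition needs state can be sketched with a toy
single-byte encoding.  This is not the CP1255 table: the byte values, the
single composable pair ('e' followed by a made-up combining byte 0x80
becomes U+00E9), and the names compose_state, compose_feed and
compose_flush are all assumptions for illustration.  The point is only
that a base letter has to be held back until the next byte is seen, and
that end of input needs an explicit flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Made-up byte that acts as a combining acute accent.  */
#define TOY_ACUTE 0x80

/* Hypothetical state: true if a base letter 'e' has been read but not
   emitted yet, because the next byte might combine with it.  */
struct compose_state
{
  bool have_base;
};

/* Feed one input byte.  Writes up to two wide characters to OUT and
   returns how many were written (0, 1 or 2).  */
size_t
compose_feed (struct compose_state *state, unsigned char byte,
              wchar_t out[2])
{
  assert (state != NULL && out != NULL);
  size_t count = 0;
  if (state->have_base)
    {
      state->have_base = false;
      if (byte == TOY_ACUTE)
        {
          /* Base plus accent: emit the pre-composed code point.  */
          out[0] = 0x00E9;
          return 1;
        }
      /* No accent followed, so the base letter stands alone.  */
      out[count++] = L'e';
    }
  if (byte == 'e')
    /* Hold the letter back; the next byte may combine with it.  */
    state->have_base = true;
  else if (byte == TOY_ACUTE)
    /* Accent without a composable base: emit U+0301 on its own.  */
    out[count++] = 0x0301;
  else
    out[count++] = byte;
  return count;
}

/* End of input: emit a held-back base letter, if any.  Returns the
   number of wide characters written to OUT (0 or 1).  */
size_t
compose_flush (struct compose_state *state, wchar_t out[1])
{
  assert (state != NULL && out != NULL);
  if (!state->have_base)
    return 0;
  state->have_base = false;
  out[0] = L'e';
  return 1;
}
```

A decoder that emitted the base letter immediately would never need the
state, which is the non-pre-composing behavior described above.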

(3) looks like a Unicode problem.  I think under the Unicode rules,
TSCII 0x82 should have received its own codepoint, to which it
translates, and similarly for the other expansive sequences.  It certainly
would help to avoid security vulnerabilities.  (I think an example I
wrote for the manual has such a buffer overflow.)
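
One way to avoid the length miscalculation is to not derive the output
size from the input byte count at all, and instead ask the conversion
function for the count first.  The following is a sketch of that pattern
using mbsrtowcs with a null destination; to_wide is a hypothetical helper
name.  I have not checked that the counting pass and the converting pass
agree for every expansive converter, so treat that as an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert the multi-byte string S (in the encoding of the current
   locale) to a newly allocated wide string.  Returns NULL if the
   conversion or the allocation fails.  The caller frees the result.  */
wchar_t *
to_wide (const char *s)
{
  assert (s != NULL);

  /* First pass: with a null destination, mbsrtowcs only counts the
     wide characters (excluding the terminator) and leaves SRC alone.  */
  mbstate_t state;
  memset (&state, 0, sizeof state);
  const char *src = s;
  size_t count = mbsrtowcs (NULL, &src, 0, &state);
  if (count == (size_t) -1)
    return NULL;

  /* Size the buffer from the reported count, not from strlen (s).  */
  wchar_t *result = malloc ((count + 1) * sizeof *result);
  if (result == NULL)
    return NULL;

  /* Second pass: convert for real, starting again from the initial
     shift state.  */
  memset (&state, 0, sizeof state);
  src = s;
  if (mbsrtowcs (result, &src, count + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

The cost is a second pass over the input, in exchange for not relying on
the "at most one wide character per byte" assumption.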

(2), (4) and (5) appear to require genuine (shift) state support.  The
only locale we actually have in this category is zh_HK.

That's a lot of code in the library to support what is essentially a
single locale.  But as of today, we definitely need the shift state
support.

Thanks,
Florian

Follow-Ups:
- Re: Note on encodings (and locales) with shift state
  - From: Zack Weinberg
