Bug 25734 - mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
Summary: mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that ...
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: locale (show other bugs)
Version: 2.31
: P2 normal
Target Milestone: 2.32
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-27 04:05 UTC by Tom Honermann
Modified: 2020-04-18 10:20 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2020-03-27 00:00:00
fw: security-


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Tom Honermann 2020-03-27 04:05:02 UTC
mbrtowc() with Big5-HKSCS fails to reset conversion state for double byte sequences (such as 0x88 0x62) that are converted to multiple Unicode code points (U+00CA U+0304) after writing the second code point.  This results in subsequent calls to mbrtowc() repeatedly writing the second code point without consuming new input.

$ cat t.c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main() {
  /* Attempt to translate Big5-HKSCS input containing a double byte sequence
     (0x88 0x62) that maps to two Unicode code points (U+00CA U+0304) followed
     by an ASCII character.  This test demonstrates that the converter fails to
     reset conversion state after writing the second code point such that each
     subsequent call of mbrtowc() writes the second code point again. */

  if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
    perror("setlocale");
    return 1;
  }

  const char *mbs;
  wchar_t wc;
  mbstate_t s;
  size_t result;

  mbs = "\x88\x62\x58";
  memset(&s, 0, sizeof(s));
  /* Translate the first code unit sequence.  This call to mbrtowc() consumes
     the first two bytes and writes the first Unicode code point. */
  result = mbrtowc(&wc, mbs, 3, &s);
  assert(result == (size_t) 2);
  mbs += 2;
  printf("1st wc: 0x%04X\n", (unsigned)wc);
  assert(wc == 0x00CA);
  /* This next call to mbrtowc() writes the second Unicode code point without
     consuming any input.  Since output was written, but no input was consumed,
     0 is returned.  This is a case where mbrtoc16() would return (size_t)-3,
     but mbrtowc() isn't specified to do so.  This behavior is a bit confusing
     because the return of 0 is specified to indicate that a null character
     was written; but that isn't the case here. */
  result = mbrtowc(&wc, mbs, 1, &s);
  assert(result == (size_t) 0);
  printf("2nd wc: 0x%04X\n", (unsigned)wc);
  assert(wc == 0x0304);
  /* This next call to mbrtowc() should now consume and write the ASCII
     character.  However, the failure of the converter to reset the conversion
     state results in the second Unicode code point being written again; again
     without consuming new input. */
  wc = 0; /* set wc to an arbitrary value. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("3rd wc: 0x%04X (should be 0x0058)\n", (unsigned)wc);
  printf("result: %d (should be 1)\n", (int)result);
  /* These two assertions should succeed, but currently fail. */
  assert(wc == 0x0058);
  assert(result == 1);
}

$ gcc t.c -o t

$ ./t
1st wc: 0x00CA
2nd wc: 0x0304
3rd wc: 0x0304 (should be 0x0058)
result: 0 (should be 1)
t: t.c:52: main: Assertion `wc == 0x0058' failed.
Aborted (core dumped)


As noted in the test and by the failed assertion, the third call to mbrtowc() should have consumed 1 new byte (0x58) and written one code point (U+0058), but it instead consumed no input and re-wrote the previously written code point.
Comment 1 Carlos O'Donell 2020-03-27 16:37:47 UTC
Andreas Schwab probably has the most experience with Big5-HKSCS. I'm CC'ing him on this this issue in case he has any thoughts about this.

I can confirm that /x88/x62 should map to <U00CA><U0304>.

Double checked that against Big5-HKSCS 2016 (which we don't support yet):
https://www.ogcio.gov.hk/en/our_work/business/tech_promotion/ccli/terms/doc/e_hkscs_2016.pdf

I can confirm the test case behaves as you indicated.

In glibc/iconvdata/big5hkscs.c we certainly handle the combining characters:

17843                 /* Check for special cases: combining characters.  */         \
17844                 if (idx == 195 + 0x22 /* 8862 */)                             \
17845                   {                                                           \
17846                     ch = 0xca;                                                \
17847                     ch2 = 0x304;                                              \
17848                   }                                                           \
17849                 else if (idx == 195 + 0x24 /* 8864 */)                        \
17850                   {                                                           \
17851                     ch = 0xca;                                                \
17852                     ch2 = 0x30c;                                              \
17853                   }                                                           \
17854                 else if (idx == 195 + 0x63 /* 88a3 */)                        \
17855                   {                                                           \
17856                     ch = 0xea;                                                \
17857                     ch2 = 0x304;                                              \
17858                   }                                                           \
17859                 else if (idx == 195 + 0x65 /* 88a5 */)                        \
17860                   {                                                           \
17861                     ch = 0xea;                                                \
17862                     ch2 = 0x30c;                                              \
17863                   }                                                           \
17864                 else                                                          \
17865                   /* This is illegal.  */                                     \
17866                   STANDARD_FROM_LOOP_ERR_HANDLER (1);                         \

I haven't debugged this in great detail though.

diff --git a/iconvdata/big5hkscs.c b/iconvdata/big5hkscs.c
index 01fcfeba76..70f84a5226 100644
--- a/iconvdata/big5hkscs.c
+++ b/iconvdata/big5hkscs.c
@@ -17898,6 +17898,7 @@ static struct
                                                                              \
     put32 (outptr, ch);                                                              \
     outptr += 4;                                                             \
+    *statep = 0;                                                             \
   }
 #define LOOP_NEED_FLAGS
 #define EXTRA_LOOP_DECLS       , int *statep

If we have a pushed back character we should clear statep.

I expect we need a test case to cover the case you outline.
Comment 2 Carlos O'Donell 2020-03-27 16:55:39 UTC
The quick fix appears to work.

GCONV_PATH=/home/carlos/build/glibc/iconvdata ./test
1st wc: 0x00CA
2nd wc: 0x0304
3rd wc: 0x0058 (should be 0x0058)
result: 1 (should be 1)

I need to dig into this a bit more.
Comment 3 Tom Honermann 2020-03-27 18:25:07 UTC
Thanks for confirming, Carlos.

I had independently arrived at the same quick fix, but you beat me to posting it.

I don't know glibc internals well.  Something I've been surprised by is that iconv doesn't seem to suffer this same issue.  I haven't tried to debug to determine why it is unaffected.

(In reply to Carlos O'Donell from comment #1)
> I expect we need a test case to cover the case you outline.

Agreed.  The patch below is what I had added locally as a test.  Feel free to use this (I've already signed a glibc contributor form).

diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile
index f02167fa58..2e58ac36e8 100644
--- a/wcsmbs/Makefile
+++ b/wcsmbs/Makefile
@@ -58,7 +58,8 @@ include ../Rules
 
 ifeq ($(run-built-tests),yes)
 LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 hr_HR.ISO-8859-2 \
-          ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9
+          ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9 \
+          zh_HK.BIG5-HKSCS
 include ../gen-locales.mk
 
 $(objpfx)tst-btowc.out: $(gen-locales)
diff --git a/wcsmbs/tst-mbrtowc.c b/wcsmbs/tst-mbrtowc.c
index 38a431983b..b2b3ff3744 100644
--- a/wcsmbs/tst-mbrtowc.c
+++ b/wcsmbs/tst-mbrtowc.c
@@ -150,6 +150,43 @@ utf8_test (void)
   return error;
 }
 
+static int
+big5_hkscs_test (void)
+{
+  const char *locale = "zh_HK.BIG5-HKSCS";
+  const char *mbs;
+  wchar_t wc;
+  mbstate_t s;
+
+  if (!setlocale (LC_CTYPE, locale))
+    {
+      fprintf (stderr, "locale '%s' not available!\n", locale);
+      exit (1);
+    }
+
+  /* This test exercises a special case conversion of a double byte sequence
+     to a pair of Unicode code points.  */
+  wc = 42;                     /* arbitrary number */
+  mbs = "\x88\x62";            /* 0x88 0x62 => U+00CA U+0304 */
+  memset (&s, 0, sizeof(s));
+  /* Translate the double byte sequence.  This call consumes the first two
+     bytes and writes the first of the two code points.  */
+  assert (mbrtowc (&wc, mbs, 3, &s) == (size_t) 2);
+  assert (wc == 0x00CA);
+  /* Attempt to translate the next sequence.  mbrtowc isn't specified to
+     return (size_t) -3 like mbrtoc16 is in the case where there are multiple
+     wide chars to write.  This call returns 0 despite not writing a null
+     character.  0 is returned because no new input is consumed; the call
+     writes the second code point. */
+  assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0);
+  assert (wc == 0x0304);
+  /* The following now consumes and converts the null character. */
+  assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0);
+  assert(wc == 0);
+
+  return 0;
+}
+
 
 static int
 do_test (void)
@@ -169,6 +206,8 @@ do_test (void)
   setlocale (LC_ALL, "ja_JP.EUC-JP");
   result |= check_ascii (setlocale (LC_ALL, NULL));
 
+  result |= big5_hkscs_test ();
+
   return result;
 }
Comment 4 Carlos O'Donell 2020-03-27 21:20:36 UTC
Tom,

I split the test out specifically for BIG5HKSCS and included all 4 conversions that create 2 wchar_ts.

Posted: https://sourceware.org/pipermail/libc-alpha/2020-March/112257.html
Comment 5 Carlos O'Donell 2020-04-16 02:40:51 UTC
Fixed in 2.32
Comment 6 Tom Honermann 2020-04-18 01:56:59 UTC
Verified fixed for 2.32 as of commit c580e6466d6da8262820cdbad19f32c5546226cf (https://sourceware.org/git/?p=glibc.git;a=commit;h=c580e6466d6da8262820cdbad19f32c5546226cf)
Comment 7 Florian Weimer 2020-04-18 10:20:55 UTC
Please leave the status as RESOLVED. There is some automation that depends on it.