mbrtowc() with Big5-HKSCS fails to reset conversion state for double byte sequences (such as 0x88 0x62) that are converted to multiple Unicode code points (U+00CA U+0304) after writing the second code point. This results in subsequent calls to mbrtowc() repeatedly writing the second code point without consuming new input. $ cat t.c #include <assert.h> #include <locale.h> #include <stdio.h> #include <string.h> #include <wchar.h> int main() { /* Attempt to translate Big5-HKSCS input containing a double byte sequence (0x88 0x62) that maps to two Unicode code points (U+00CA U+0304) followed by an ASCII character. This test demonstrates that the converter fails to reset conversion state after writing the second code point such that each subsequent call of mbrtowc() writes the second code point again. */ if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) { perror("setlocale"); return 1; } const char *mbs; wchar_t wc; mbstate_t s; size_t result; mbs = "\x88\x62\x58"; memset(&s, 0, sizeof(s)); /* Translate the first code unit sequence. This call to mbrtowc() consumes the first two bytes and writes the first Unicode code point. */ result = mbrtowc(&wc, mbs, 3, &s); assert(result == (size_t) 2); mbs += 2; printf("1st wc: 0x%04X\n", (unsigned)wc); assert(wc == 0x00CA); /* This next call to mbrtowc() writes the second Unicode code point without consuming any input. Since output was written, but no input was consumed, 0 is returned. This is a case where mbrtoc16() would return (size_t)-3, but mbrtowc() isn't specified to do so. This behavior is a bit confusing because the return of 0 is specified to indicate that a null character was written; but that isn't the case here. */ result = mbrtowc(&wc, mbs, 1, &s); assert(result == (size_t) 0); printf("2nd wc: 0x%04X\n", (unsigned)wc); assert(wc == 0x0304); /* This next call to mbrtowc() should now consume and write the ASCII character. However, the failure of the converter to reset the conversion state results in the second Unicode code point being written again; again without consuming new input. */ wc = 0; /* set wc to an arbitrary value. */ result = mbrtowc(&wc, mbs, 1, &s); printf("3rd wc: 0x%04X (should be 0x0058)\n", (unsigned)wc); printf("result: %d (should be 1)\n", (int)result); /* These two assertions should succeed, but currently fail. */ assert(wc == 0x0058); assert(result == 1); } $ gcc t.c -o t $ ./t 1st wc: 0x00CA 2nd wc: 0x0304 3rd wc: 0x0304 (should be 0x0058) result: 0 (should be 1) t: t.c:52: main: Assertion `wc == 0x0058' failed. Aborted (core dumped) As noted in the test and by the failed assertion, the third call to mbrtowc() should have consumed 1 new byte (0x58) and written one code point (U+0058), but it instead consumed no input and re-wrote the previously written code point.
Andreas Schwab probably has the most experience with Big5-HKSCS. I'm CC'ing him on this this issue in case he has any thoughts about this. I can confirm that /x88/x62 should map to <U00CA><U0304>. Double checked that against Big5-HKSCS 2016 (which we don't support yet): https://www.ogcio.gov.hk/en/our_work/business/tech_promotion/ccli/terms/doc/e_hkscs_2016.pdf I can confirm the test case behaves as you indicated. In glibc/iconvdata/big5hkscs.c we certainly handle the combining characters: 17843 /* Check for special cases: combining characters. */ \ 17844 if (idx == 195 + 0x22 /* 8862 */) \ 17845 { \ 17846 ch = 0xca; \ 17847 ch2 = 0x304; \ 17848 } \ 17849 else if (idx == 195 + 0x24 /* 8864 */) \ 17850 { \ 17851 ch = 0xca; \ 17852 ch2 = 0x30c; \ 17853 } \ 17854 else if (idx == 195 + 0x63 /* 88a3 */) \ 17855 { \ 17856 ch = 0xea; \ 17857 ch2 = 0x304; \ 17858 } \ 17859 else if (idx == 195 + 0x65 /* 88a5 */) \ 17860 { \ 17861 ch = 0xea; \ 17862 ch2 = 0x30c; \ 17863 } \ 17864 else \ 17865 /* This is illegal. */ \ 17866 STANDARD_FROM_LOOP_ERR_HANDLER (1); \ I haven't debugged this in great detail though. diff --git a/iconvdata/big5hkscs.c b/iconvdata/big5hkscs.c index 01fcfeba76..70f84a5226 100644 --- a/iconvdata/big5hkscs.c +++ b/iconvdata/big5hkscs.c @@ -17898,6 +17898,7 @@ static struct \ put32 (outptr, ch); \ outptr += 4; \ + *statep = 0; \ } #define LOOP_NEED_FLAGS #define EXTRA_LOOP_DECLS , int *statep If we have a pushed back character we should clear statep. I expect we need a test case to cover the case you outline.
The quick fix appears to work. GCONV_PATH=/home/carlos/build/glibc/iconvdata ./test 1st wc: 0x00CA 2nd wc: 0x0304 3rd wc: 0x0058 (should be 0x0058) result: 1 (should be 1) I need to dig into this a bit more.
Thanks for confirming, Carlos. I had independently arrived at the same quick fix, but you beat me to posting it. I don't know glibc internals well. Something I've been surprised by is that iconv doesn't seem to suffer this same issue. I haven't tried to debug to determine why it is unaffected. (In reply to Carlos O'Donell from comment #1) > I expect we need a test case to cover the case you outline. Agreed. The patch below is what I had added locally as a test. Feel free to use this (I've already signed a glibc contributor form). diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile index f02167fa58..2e58ac36e8 100644 --- a/wcsmbs/Makefile +++ b/wcsmbs/Makefile @@ -58,7 +58,8 @@ include ../Rules ifeq ($(run-built-tests),yes) LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 hr_HR.ISO-8859-2 \ - ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9 + ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9 \ + zh_HK.BIG5-HKSCS include ../gen-locales.mk $(objpfx)tst-btowc.out: $(gen-locales) diff --git a/wcsmbs/tst-mbrtowc.c b/wcsmbs/tst-mbrtowc.c index 38a431983b..b2b3ff3744 100644 --- a/wcsmbs/tst-mbrtowc.c +++ b/wcsmbs/tst-mbrtowc.c @@ -150,6 +150,43 @@ utf8_test (void) return error; } +static int +big5_hkscs_test (void) +{ + const char *locale = "zh_HK.BIG5-HKSCS"; + const char *mbs; + wchar_t wc; + mbstate_t s; + + if (!setlocale (LC_CTYPE, locale)) + { + fprintf (stderr, "locale '%s' not available!\n", locale); + exit (1); + } + + /* This test exercises a special case conversion of a double byte sequence + to a pair of Unicode code points. */ + wc = 42; /* arbitrary number */ + mbs = "\x88\x62"; /* 0x88 0x62 => U+00CA U+0304 */ + memset (&s, 0, sizeof(s)); + /* Translate the double byte sequence. This call consumes the first two + bytes and writes the first of the two code points. */ + assert (mbrtowc (&wc, mbs, 3, &s) == (size_t) 2); + assert (wc == 0x00CA); + /* Attempt to translate the next sequence. mbrtowc isn't specified to + return (size_t) -3 like mbrtoc16 is in the case where there are multiple + wide chars to write. This call returns 0 despite not writing a null + character. 0 is returned because no new input is consumed; the call + writes the second code point. */ + assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0); + assert (wc == 0x0304); + /* The following now consumes and converts the null character. */ + assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0); + assert(wc == 0); + + return 0; +} + static int do_test (void) @@ -169,6 +206,8 @@ do_test (void) setlocale (LC_ALL, "ja_JP.EUC-JP"); result |= check_ascii (setlocale (LC_ALL, NULL)); + result |= big5_hkscs_test (); + return result; }
Tom, I split the test out specifically for BIG5HKSCS and included all 4 conversions that create 2 wchar_ts. Posted: https://sourceware.org/pipermail/libc-alpha/2020-March/112257.html
Fixed in 2.32
Verified fixed for 2.32 as of commit c580e6466d6da8262820cdbad19f32c5546226cf (https://sourceware.org/git/?p=glibc.git;a=commit;h=c580e6466d6da8262820cdbad19f32c5546226cf)
Please leave the status as RESOLVED. There is some automation that depends on it.