The following test case demonstrates an issue with the Big5-HKSCS converter that occurs when certain double byte characters are consumed one byte at a time. Quoting from the C18 standard for convenience: 7.29.6.3.2p4 states: > The mbrtowc function returns the first of the following that applies > (given the current conversion state): > > 0 if the next n or fewer bytes complete the multibyte character that > corresponds to the null wide character (which is the value stored). > > between 1 and n inclusive if the next n or fewer bytes complete a valid > multibyte character (which is the value stored); the value returned is > the number of bytes that complete the multibyte character. > > (size_t) (−2) if the next n bytes contribute to an incomplete (but > potentially valid) multibyte character, and all n bytes have been > processed (no value is stored). > > (size_t)(-1) if an encoding error occurs, in which case the next n or > fewer bytes do not contribute to a complete and valid multibyte > character (no value is stored); the value of the macro EILSEQ is stored > in errno, and the conversion state is unspecified. The following test case converts the Big5-HKSCS double byte character 0x88 0x62 one byte at a time. This is one of the special double byte characters that converts to two Unicode code points (U+00CA U+0304). When consuming the second byte, mbrtowc() returns a value of 2, but 1 should be returned since only one byte was consumed; the wording quoted above doesn't allow for the return value to be larger than the value of 'n' that was passed in. I don't know if this issue only occurs for double byte characters that map to multiple Unicode code points. $ cat t.c #include <assert.h> #include <locale.h> #include <stdio.h> #include <string.h> #include <wchar.h> int main() { /* This test case demonstrates glibc's current (2.31) behavior when attempting to convert, one byte at a time, Big5-HKSCS input containing a double byte sequence that maps to multiple Unicode code points. The first call to mbrtowc() consumes the first byte and returns a value of -2 indicating an incomplete multibyte character as expected. However, the second call, with an input length of 1, consumes the second byte, recognizes completion of the previously incomplete character, writes the mapped Unicode code point, and then returns 2. The return value of 2 is surprising since only 1 byte was read. This seems to violate the C standard since the return value is greater than the input length. This issue does not occur for all double byte characters. */ if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) { perror("setlocale"); return 1; } const char *mbs; wchar_t wc; mbstate_t s; size_t result; mbs = "\x88\x62"; memset(&s, 0, sizeof(s)); /* This call to mbrtowc() consumes the first byte and returns -2 indicating that a potentially valid but incomplete character was read. This is expected behavior. */ result = mbrtowc(&wc, mbs, 1, &s); printf("1st mbrtowc call:\n"); printf(" result: %zd (-2 expected)\n", result); assert(result == (size_t) -2); mbs += 1; /* This call consumes the second byte to complete the double byte character and writes the Unicode code point. The C standard requires a return value of 1 since only one byte was consumed by this call, but glibs returns 2 (presumably corresponding to the total number of bytes that contributed to the multibyte character. */ result = mbrtowc(&wc, mbs, 1, &s); printf("2nd mbrtowc call:\n"); printf(" result: %zd (1 expected, glibc returns 2)\n", result); printf(" wc: 0x%04X (0x00CA expected)\n", (unsigned)wc); mbs += 1; assert(result == (size_t) 2); /* Current glibc behavior */ assert(wc == 0x00CA); /* This call writes the second Unicode code point, but does not consume any input. 0 is returned since no input is consumed. According to the C standard, a return of 0 is reserved for when a null character is written, but since the C standard doesn't acknowledge the existence of characters that can't be represented in a single wchar_t, we're already operating outside the scope of the standard. Returning 0 seems reasonable to me. Returning -3 as mbrtoc16() does would also be reasonable. */ result = mbrtowc(&wc, mbs, 1, &s); printf("3rd mbrtowc call:\n"); printf(" result: %zd (0 expected)\n", result); printf(" wc: 0x%04X (0x0304 expected)\n", (unsigned)wc); mbs += 0; assert(result == (size_t) 0); assert(wc == 0x0304); /* Attempting to process any further input would run afoul of a separate issue tracked by https://sourceware.org/bugzilla/show_bug.cgi?id=25734. */ } $ gcc t.c -o t $ ./t 1st mbrtowc call: result: -2 (-2 expected) 2nd mbrtowc call: result: 2 (1 expected, glibc returns 2) wc: 0x00CA (0x00CA expected) 3rd mbrtowc call: result: 0 (0 expected) wc: 0x0304 (0x0304 expected) As noted in the code comments and process output, the return value for the second call to mbrtowc() is unexpected.
(In reply to Tom Honermann from comment #0) > 2nd mbrtowc call: > result: 2 (1 expected, glibc returns 2) > wc: 0x00CA (0x00CA expected) > As noted in the code comments and process output, the return value for the > second call to mbrtowc() is unexpected. My opinion is that this is an implementation defect in the converter which has absolute control over how those return values are computed. My worry though is that changing this now might break existing code that is out there that is using BIG5HKSCS. If that's actually the case then we could create a distinct gconv module with the old behaviour and release that.
I don't think you can call mbrtowc with a different value for mbs after the first call returned -2, without resetting the state inbetween.
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html Options: (a) Do not change the current converter. We return 2 consumed bytes in the first conversion, and 0 on the second. (b) Look ahead and split the conversion. We return 1 consumed in the first conversion, and 1 on the second. This prevents us from returning 0 which may be interpreted as a L'\0'. This leads to false assumption that the user could stop the conversion at this point and modify the input. (c) Design something new. Return (size_t) -3 indicating a One-to-Many conversion is happening and that there is more output to be generated.
(In reply to Andreas Schwab from comment #2) > I don't think you can call mbrtowc with a different value for mbs after the > first call returned -2, without resetting the state inbetween. The standard is not clear here, but I think it is required to increment 'mbs' by 'n' in such cases. Existing tests in glibc for other encodings demonstrate this intent. For examples, see the tests for UTF-8 in wcsmbs/tst-mbrtowc.c.
`n' is what?
(In reply to Andreas Schwab from comment #5) > `n' is what? The value for 'n' passed to mbrtowc() and that corresponds to the specification for the -2 return value: > (size_t) (−2) if the next n bytes contribute to an incomplete (but > potentially valid) multibyte character, and all n bytes have been > processed (no value is stored).
I debugged and identified the root cause for this issue. The problem is that the Big5-HKSCS converter is failing to preserve the lowest 3 bits of the mbstate_t __count data member. The relevant code is below. In iconv/loop.c, lines 396-397 are responsible for unpacking previously cached bytes stored in 'state' into an internal buffer (bytebuf). The number of previously cached bytes is stored in the three lowest bits of 'state->__count'. (The Big5-HKSCS converter does not define UNPACK_BYTES). Lines 485-490 are responsible for the packing of processed bytes and the count of them when insufficient bytes were available to process a complete character. (The Big5-HKSCS converter does not define STORE_REST). The line most relevant to this issue is line 477 where the number of bytes to advance the input pointer provided by the caller is determined. The calculation is performed by determining how much of the internal buffer (bytebuf) was read by the converter, and then subtracting the number of bytes that were populated from the cache (lines 396-397). The Big5-HKSCS converter runs at line 446 in the expansion of BODY. Note that this occurs between the lines that unpack cached bytes (396-397) and the line that computes how far to advance the input pointer (477). (more after the code break). iconv/loop.c: 366 static inline int 367 __attribute ((always_inline)) 368 SINGLE(LOOPFCT) (struct __gconv_step *step, 369 struct __gconv_step_data *step_data, 370 const unsigned char **inptrp, const unsigned char *inend, 371 unsigned char **outptrp, unsigned char *outend, 372 size_t *irreversible EXTRA_LOOP_DECLS) 373 { ... 391 # ifdef UNPACK_BYTES ... 393 # else 394 /* Add the bytes from the state to the input buffer. */ 395 assert ((state->__count & 7) <= sizeof (state->__value)); 396 for (inlen = 0; inlen < (size_t) (state->__count & 7); ++inlen) 397 bytebuf[inlen] = state->__value.__wchb[inlen]; 398 # endif ... 441 inptr = bytebuf; 442 inend = &bytebuf[inlen]; ... 446 BODY ... 477 *inptrp += inend - bytebuf - (state->__count & 7); 478 # ifdef STORE_REST ... 482 # else 483 /* We don't have enough input for another complete input 484 character. */ 485 assert (inend - inptr > (state->__count & ~7)); 486 assert (inend - inptr <= sizeof (state->__value)); 487 state->__count = (state->__count & ~7) | (inend - inptr); 488 inlen = 0; 489 while (inptr < inend) 490 state->__value.__wchb[inlen++] = *inptr++; 491 # endif 492 } 493 494 return result; 495 } The Big5-HKSCS converter also stores information in 'state->__count'. It does not use the lower most 3 bits for its own state (where the number of cached characters is stored; it appears to acknowledge that they are reserved). The problem is that there are numerous places where the converter updates 'state->__count', but fails to preserve those bits (there are also cases where it fails to properly ignore them). When these bits are not preserved, line 477 above miscalulates the number of bytes consumed from the input. Cases where those bits are not properly handled are shown below. iconvdata/big5hkscs.c: 17773 #define EMIT_SHIFT_TO_INIT \ 17774 if (data->__statep->__count != 0) \ ..... 17783 data->__statep->__count = 0; ..... ..... 17797 data->__statep->__count = 0; \ ..... 17803 } ..... 17812 #define BODY \ 17813 { \ ..... 17883 *statep = ch2 << 3; \ ..... 17901 } ..... 17920 #define BODY \ 17921 { \ ..... 17948 *statep = 0; \ ..... 17961 *statep = 0; \ ..... 17998 *statep = ((cp[0] << 8) | cp[1]) << 3; \ ..... 18019 } A local update to preserve the lower 3 bits sufficed to resolve this issue. I'm working on a patch.
A patch for this issue has been submitted to the libc-alpha mailing list. - https://sourceware.org/pipermail/libc-alpha/2020-April/112595.html
The master branch has been updated by Adhemerval Zanella <azanella@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=598f790fb17bcfff7fedde5209933a82d7748328 commit 598f790fb17bcfff7fedde5209933a82d7748328 Author: Tom Honermann <tom@honermann.net> Date: Thu Jun 30 08:52:13 2022 -0400 gconv: Correct Big5-HKSCS conversion to preserve all state bits. [BZ #25744] This patch corrects the Big5-HKSCS converter to preserve the lowest 3 bits of the mbstate_t __count data member when the converter encounters an incomplete multibyte character. This fixes BZ #25744. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Fixed on 2.36.