Bug 25744 - mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
Summary: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte...
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: locale (show other bugs)
Version: 2.31
: P2 normal
Target Milestone: 2.36
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-29 04:28 UTC by Tom Honermann
Modified: 2022-07-06 14:25 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2020-03-29 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Tom Honermann 2020-03-29 04:28:58 UTC
The following test case demonstrates an issue with the Big5-HKSCS converter that occurs when certain double byte characters are consumed one byte at a time.

Quoting from the C18 standard for convenience: 7.29.6.3.2p4 states:

> The mbrtowc function returns the first of the following that applies
> (given the current conversion state):
>
> 0 if the next n or fewer bytes complete the multibyte character that
> corresponds to the null wide character (which is the value stored).
>
> between 1 and n inclusive if the next n or fewer bytes complete a valid
> multibyte character (which is the value stored); the value returned is
> the number of bytes that complete the multibyte character.
>
> (size_t) (−2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).
>
> (size_t)(-1) if an encoding error occurs, in which case the next n or
> fewer bytes do not contribute to a complete and valid multibyte
> character (no value is stored); the value of the macro EILSEQ is stored
> in errno, and the conversion state is unspecified.

The following test case converts the Big5-HKSCS double byte character 0x88 0x62 one byte at a time.  This is one of the special double byte characters that converts to two Unicode code points (U+00CA U+0304).  When consuming the second byte, mbrtowc() returns a value of 2, but 1 should be returned since only one byte was consumed; the wording quoted above doesn't allow for the return value to be larger than the value of 'n' that was passed in.

I don't know if this issue only occurs for double byte characters that map to multiple Unicode code points.

$ cat t.c 
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main() {
  /* This test case demonstrates glibc's current (2.31) behavior when
     attempting to convert, one byte at a time, Big5-HKSCS input
     containing a double byte sequence that maps to multiple Unicode
     code points.

     The first call to mbrtowc() consumes the first byte and returns
     a value of -2 indicating an incomplete multibyte character as
     expected.  However, the second call, with an input length of 1,
     consumes the second byte, recognizes completion of the previously
     incomplete character, writes the mapped Unicode code point, and
     then returns 2.  The return value of 2 is surprising since only
     1 byte was read.  This seems to violate the C standard since the
     return value is greater than the input length.

     This issue does not occur for all double byte characters. */

  if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
    perror("setlocale");
    return 1;
  }

  const char *mbs;
  wchar_t wc;
  mbstate_t s;
  size_t result;

  mbs = "\x88\x62";
  memset(&s, 0, sizeof(s));
  /* This call to mbrtowc() consumes the first byte and returns -2 indicating
     that a potentially valid but incomplete character was read.  This is
     expected behavior. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("1st mbrtowc call:\n");
  printf("  result: %zd (-2 expected)\n", result);
  assert(result == (size_t) -2);
  mbs += 1;
  /* This call consumes the second byte to complete the double byte character
     and writes the Unicode code point.  The C standard requires a return
     value of 1 since only one byte was consumed by this call, but glibs
     returns 2 (presumably corresponding to the total number of bytes that
     contributed to the multibyte character. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("2nd mbrtowc call:\n");
  printf("  result: %zd (1 expected, glibc returns 2)\n", result);
  printf("  wc: 0x%04X (0x00CA expected)\n", (unsigned)wc);
  mbs += 1;
  assert(result == (size_t) 2); /* Current glibc behavior */
  assert(wc == 0x00CA);
  /* This call writes the second Unicode code point, but does not consume
     any input.  0 is returned since no input is consumed.  According to
     the C standard, a return of 0 is reserved for when a null character is
     written, but since the C standard doesn't acknowledge the existence of
     characters that can't be represented in a single wchar_t, we're already
     operating outside the scope of the standard.  Returning 0 seems
     reasonable to me.  Returning -3 as mbrtoc16() does would also be
     reasonable. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("3rd mbrtowc call:\n");
  printf("  result: %zd (0 expected)\n", result);
  printf("  wc: 0x%04X (0x0304 expected)\n", (unsigned)wc);
  mbs += 0;
  assert(result == (size_t) 0);
  assert(wc == 0x0304);
  /* Attempting to process any further input would run afoul of a separate
     issue tracked by https://sourceware.org/bugzilla/show_bug.cgi?id=25734. */
}

$ gcc t.c -o t

$ ./t
1st mbrtowc call:
  result: -2 (-2 expected)
2nd mbrtowc call:
  result: 2 (1 expected, glibc returns 2)
  wc: 0x00CA (0x00CA expected)
3rd mbrtowc call:
  result: 0 (0 expected)
  wc: 0x0304 (0x0304 expected)

As noted in the code comments and process output, the return value for the second call to mbrtowc() is unexpected.
Comment 1 Carlos O'Donell 2020-03-29 16:59:57 UTC
(In reply to Tom Honermann from comment #0)
> 2nd mbrtowc call:
>   result: 2 (1 expected, glibc returns 2)
>   wc: 0x00CA (0x00CA expected)

> As noted in the code comments and process output, the return value for the
> second call to mbrtowc() is unexpected.

My opinion is that this is an implementation defect in the converter which has absolute control over how those return values are computed.

My worry though is that changing this now might break existing code that is out there that is using BIG5HKSCS.

If that's actually the case then we could create a distinct gconv module with the old behaviour and release that.
Comment 2 Andreas Schwab 2020-03-30 12:44:43 UTC
I don't think you can call mbrtowc with a different value for mbs after the first call returned -2, without resetting the state inbetween.
Comment 3 Carlos O'Donell 2020-03-30 15:13:14 UTC
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html

Options:

(a) Do not change the current converter. We return 2 consumed bytes in
    the first conversion, and 0 on the second.

(b) Look ahead and split the conversion. We return 1 consumed in the
    first conversion, and 1 on the second. This prevents us from returning
    0 which may be interpreted as a L'\0'. This leads to false assumption
    that the user could stop the conversion at this point and modify the
    input.

(c) Design something new. Return (size_t) -3 indicating a One-to-Many
    conversion is happening and that there is more output to be generated.
Comment 4 Tom Honermann 2020-03-30 17:31:08 UTC
(In reply to Andreas Schwab from comment #2)
> I don't think you can call mbrtowc with a different value for mbs after the
> first call returned -2, without resetting the state inbetween.

The standard is not clear here, but I think it is required to increment 'mbs' by 'n' in such cases.  Existing tests in glibc for other encodings demonstrate this intent.  For examples, see the tests for UTF-8 in wcsmbs/tst-mbrtowc.c.
Comment 5 Andreas Schwab 2020-03-30 17:36:02 UTC
`n' is what?
Comment 6 Tom Honermann 2020-03-30 17:38:00 UTC
(In reply to Andreas Schwab from comment #5)
> `n' is what?

The value for 'n' passed to mbrtowc() and that corresponds to the specification for the -2 return value:

> (size_t) (−2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).
Comment 7 Tom Honermann 2020-04-01 04:04:24 UTC
I debugged and identified the root cause for this issue.  The problem is that the Big5-HKSCS converter is failing to preserve the lowest 3 bits of the mbstate_t __count data member.

The relevant code is below.

In iconv/loop.c, lines 396-397 are responsible for unpacking previously cached bytes stored in 'state' into an internal buffer (bytebuf).  The number of previously cached bytes is stored in the three lowest bits of 'state->__count'.  (The Big5-HKSCS converter does not define UNPACK_BYTES).

Lines 485-490 are responsible for the packing of processed bytes and the count of them when insufficient bytes were available to process a complete character.  (The Big5-HKSCS converter does not define STORE_REST).

The line most relevant to this issue is line 477 where the number of bytes to advance the input pointer provided by the caller is determined.  The calculation is performed by determining how much of the internal buffer (bytebuf) was read by the converter, and then subtracting the number of bytes that were populated from the cache (lines 396-397).

The Big5-HKSCS converter runs at line 446 in the expansion of BODY.  Note that this occurs between the lines that unpack cached bytes (396-397) and the line that computes how far to advance the input pointer (477).  (more after the code break).

iconv/loop.c:
366 static inline int
367 __attribute ((always_inline))
368 SINGLE(LOOPFCT) (struct __gconv_step *step,
369                  struct __gconv_step_data *step_data,
370                  const unsigned char **inptrp, const unsigned char *inend,
371                  unsigned char **outptrp, unsigned char *outend,
372                  size_t *irreversible EXTRA_LOOP_DECLS)
373 {
...
391 #  ifdef UNPACK_BYTES
...
393 #  else
394   /* Add the bytes from the state to the input buffer.  */
395   assert ((state->__count & 7) <= sizeof (state->__value));
396   for (inlen = 0; inlen < (size_t) (state->__count & 7); ++inlen)
397     bytebuf[inlen] = state->__value.__wchb[inlen];
398 #  endif
...
441   inptr = bytebuf;
442   inend = &bytebuf[inlen];
...
446       BODY
...
477       *inptrp += inend - bytebuf - (state->__count & 7);
478 #  ifdef STORE_REST
...
482 #  else
483       /* We don't have enough input for another complete input
484          character.  */
485       assert (inend - inptr > (state->__count & ~7));
486       assert (inend - inptr <= sizeof (state->__value));
487       state->__count = (state->__count & ~7) | (inend - inptr);
488       inlen = 0;
489       while (inptr < inend)
490         state->__value.__wchb[inlen++] = *inptr++;
491 #  endif
492     }
493 
494   return result;
495 }

The Big5-HKSCS converter also stores information in 'state->__count'.  It does not use the lower most 3 bits for its own state (where the number of cached characters is stored; it appears to acknowledge that they are reserved).  The problem is that there are numerous places where the converter updates 'state->__count', but fails to preserve those bits (there are also cases where it fails to properly ignore them).  When these bits are not preserved, line 477 above miscalulates the number of bytes consumed from the input.

Cases where those bits are not properly handled are shown below.

iconvdata/big5hkscs.c:
17773 #define EMIT_SHIFT_TO_INIT \
17774   if (data->__statep->__count != 0)                                           \
.....
17783               data->__statep->__count = 0;     .....
.....
17797               data->__statep->__count = 0;                                    \
.....
17803     }
.....
17812 #define BODY \
17813   {                                                                           \
.....
17883                 *statep = ch2 << 3;                                           \
.....
17901   }
.....
17920 #define BODY \
17921   {                                                                           \
.....
17948         *statep = 0;                                                          \
.....
17961         *statep = 0;                                                          \
.....
17998                 *statep = ((cp[0] << 8) | cp[1]) << 3;                        \
.....
18019   }

A local update to preserve the lower 3 bits sufficed to resolve this issue.  I'm working on a patch.
Comment 8 Tom Honermann 2020-04-07 05:43:01 UTC
A patch for this issue has been submitted to the libc-alpha mailing list.
- https://sourceware.org/pipermail/libc-alpha/2020-April/112595.html
Comment 9 Sourceware Commits 2022-07-06 14:20:43 UTC
The master branch has been updated by Adhemerval Zanella <azanella@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=598f790fb17bcfff7fedde5209933a82d7748328

commit 598f790fb17bcfff7fedde5209933a82d7748328
Author: Tom Honermann <tom@honermann.net>
Date:   Thu Jun 30 08:52:13 2022 -0400

    gconv: Correct Big5-HKSCS conversion to preserve all state bits. [BZ #25744]
    
    This patch corrects the Big5-HKSCS converter to preserve the lowest 3 bits of
    the mbstate_t __count data member when the converter encounters an incomplete
    multibyte character.
    
    This fixes BZ #25744.
    
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Comment 10 Adhemerval Zanella 2022-07-06 14:25:54 UTC
Fixed on 2.36.