This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Bug in _mbtowc_r UTF-8 handling


Corinna Vinschen wrote:
Hi,

there's a bug in __utf8_mbtowc, which appears to be in that code for
ages.  If you give an UTF-8 multibyte character to this function one
at a time, it screws up counting and testing the value `n', the number
of bytes in the input stream.

Here's a testcase:

#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
  wchar_t wc;
  size_t ret;
  mbstate_t s = { 0 };
  puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
  printf("%i\n", mbrtowc(&wc, "\xe2", 1, &s));
  printf("%i\n", mbrtowc(&wc, "\x94", 1, &s));
  printf("%i\n", mbrtowc(&wc, "\x84", 1, &s));
  printf("%x\n", wc);
  return 0;
}

That's the expected output:

en_GB.UTF-8
-2
-2
1
2504

That's what happens with newlib's implementation:

en_GB.UTF-8
-2
-1
-1
<arbitrary value>

The sequence E2 94 84 should translate to U+2504. Instead, the second
and third calls to mbrtowc report encoding errors. It does work
correctly if the three bytes are passed to mbrtowc() in one go:

printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));

As I mentioned afore, the problem is how the argument `n' is counted
when `state' is already filled and subsequent calls encounter a
non-empty state.  As soon as the incoming byte sequence consists of
a multibyte followup sequence, and this sequence is in itself incomplete
again, the value of `n' is wrong and the tests which are supposed to
return -2 fail.  The code slips through, reading a char too much, and
either fails, or, worst case, returns an incorrect wchar_t value.
Given that, the problem never happens for two-byte sequences, only
for three- and four-byte sequences.

I created a patch to fix this.  The difference to the original code is a
more unified way to increment `n'.  The original code incremented `n' by
one too much in the above scenario.  Tested on Cygwin with 2, 3, and 4
byte sequences.

This way I found another problem in the UTF-16 surrogate handling which
only occurs with incomplete 4-byte sequences.  I have no fix for this
and I'm not sure yet if there's a satisfying way to fix this at all.

So the below patch only fixes the above problem, not the UTF-16
related problem.

Ok to apply?


Please go ahead.

-- Jeff J.
Thanks,
Corinna


* libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Fix incrementing n in case of handling incomplete sequences.


Index: libc/stdlib/mbtowc_r.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v
retrieving revision 1.13
diff -u -p -r1.13 mbtowc_r.c
--- libc/stdlib/mbtowc_r.c 9 Apr 2009 08:20:10 -0000 1.13
+++ libc/stdlib/mbtowc_r.c 28 Jul 2009 10:01:20 -0000
@@ -220,11 +220,7 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
if (state->__count == 0)
ch = t[i++];
else
- {
- if (n < (size_t)-1)
- ++n;
- ch = state->__value.__wchb[0];
- }
+ ch = state->__value.__wchb[0];
if (ch == '\0')
{
@@ -244,7 +240,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
{
/* two-byte sequence */
state->__value.__wchb[0] = ch;
- state->__count = 1;
+ if (state->__count == 0)
+ state->__count = 1;
+ else if (n < (size_t)-1)
+ ++n;
if (n < 2)
return -2;
ch = t[i++];
@@ -288,7 +287,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
return -1;
}
state->__value.__wchb[1] = ch;
- state->__count = 2;
+ if (state->__count == 1)
+ state->__count = 2;
+ else if (n < (size_t)-1)
+ ++n;
if (n < 3)
return -2;
ch = t[i++];
@@ -347,7 +349,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
return -1;
}
state->__value.__wchb[2] = ch;
- state->__count = 3;
+ if (state->__count == 2)
+ state->__count = 3;
+ else if (n < (size_t)-1)
+ ++n;
if (n < 4)
return -2;
ch = t[i++];





Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]