[PATCH] Bug in _mbtowc_r UTF-8 handling

Tue Jul 28 16:51:00 GMT 2009

Hi,

there's a bug in __utf8_mbtowc, which appears to be in that code for
ages.  If you give an UTF-8 multibyte character to this function one
at a time, it screws up counting and testing the value `n', the number
of bytes in the input stream.

Here's a testcase:

#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
  wchar_t wc;
  size_t ret;
  mbstate_t s = { 0 };
  puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
  printf("%i\n", mbrtowc(&wc, "\xe2", 1, &s));
  printf("%i\n", mbrtowc(&wc, "\x94", 1, &s));
  printf("%i\n", mbrtowc(&wc, "\x84", 1, &s));
  printf("%x\n", wc);
  return 0;
}

That's the expected output:

en_GB.UTF-8
-2
-2
1
2504

That's what happens with newlib's implementation:

en_GB.UTF-8
-2
-1
-1
<arbitrary value>

The sequence E2 94 84 should translate to U+2504. Instead, the second
and third calls to mbrtowc report encoding errors. It does work
correctly if the three bytes are passed to mbrtowc() in one go:

  printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));

As I mentioned afore, the problem is how the argument `n' is counted
when `state' is already filled and subsequent calls encounter a
non-empty state.  As soon as the incoming byte sequence consists of
a multibyte followup sequence, and this sequence is in itself incomplete
again, the value of `n' is wrong and the tests which are supposed to
return -2 fail.  The code slips through, reading a char too much, and
either fails, or, worst case, returns an incorrect wchar_t value.
Given that, the problem never happens for two-byte sequences, only
for three- and four-byte sequences.

I created a patch to fix this.  The difference to the original code is a
more unified way to increment `n'.  The original code incremented `n' by
one too much in the above scenario.  Tested on Cygwin with 2, 3, and 4
byte sequences.

This way I found another problem in the UTF-16 surrogate handling which
only occurs with incomplete 4-byte sequences.  I have no fix for this
and I'm not sure yet if there's a satisfying way to fix this at all.

So the below patch only fixes the above problem, not the UTF-16
related problem.

Ok to apply?


Thanks,
Corinna


	* libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Fix incrementing n
	in case of handling incomplete sequences.


Index: libc/stdlib/mbtowc_r.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v
retrieving revision 1.13
diff -u -p -r1.13 mbtowc_r.c

--- libc/stdlib/mbtowc_r.c	9 Apr 2009 08:20:10 -0000	1.13
+++ libc/stdlib/mbtowc_r.c	28 Jul 2009 10:01:20 -0000
@@ -220,11 +220,7 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
   if (state->__count == 0)
     ch = t[i++];
   else
-    {
-      if (n < (size_t)-1)
-	++n;
-      ch = state->__value.__wchb[0];
-    }
+    ch = state->__value.__wchb[0];
 
   if (ch == '\0')
     {
@@ -244,7 +240,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
     {
       /* two-byte sequence */
       state->__value.__wchb[0] = ch;
-      state->__count = 1;
+      if (state->__count == 0)
+	state->__count = 1;
+      else if (n < (size_t)-1)
+	++n;
       if (n < 2)
 	return -2;
       ch = t[i++];
@@ -288,7 +287,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
 	  return -1;
 	}
       state->__value.__wchb[1] = ch;
-      state->__count = 2;
+      if (state->__count == 1)
+	state->__count = 2;
+      else if (n < (size_t)-1)
+	++n;
       if (n < 3)
 	return -2;
       ch = t[i++];
@@ -347,7 +349,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
 	  return -1;
 	}
       state->__value.__wchb[2] = ch;
-      state->__count = 3;
+      if (state->__count == 2)
+	state->__count = 3;
+      else if (n < (size_t)-1)
+	++n;
       if (n < 4)
 	return -2;
       ch = t[i++];


-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat