This is the mail archive of the
newlib@sourceware.org
mailing list for the newlib project.
[PATCH] Bug in _mbtowc_r UTF-8 handling
- From: Corinna Vinschen <vinschen at redhat dot com>
- To: newlib at sourceware dot org
- Date: Tue, 28 Jul 2009 12:14:25 +0200
- Subject: [PATCH] Bug in _mbtowc_r UTF-8 handling
- Reply-to: newlib at sourceware dot org
Hi,
there's a bug in __utf8_mbtowc, which appears to be in that code for
ages. If you give an UTF-8 multibyte character to this function one
at a time, it screws up counting and testing the value `n', the number
of bytes in the input stream.
Here's a testcase:
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
int main(void) {
wchar_t wc;
size_t ret;
mbstate_t s = { 0 };
puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
printf("%i\n", mbrtowc(&wc, "\xe2", 1, &s));
printf("%i\n", mbrtowc(&wc, "\x94", 1, &s));
printf("%i\n", mbrtowc(&wc, "\x84", 1, &s));
printf("%x\n", wc);
return 0;
}
That's the expected output:
en_GB.UTF-8
-2
-2
1
2504
That's what happens with newlib's implementation:
en_GB.UTF-8
-2
-1
-1
<arbitrary value>
The sequence E2 94 84 should translate to U+2504. Instead, the second
and third calls to mbrtowc report encoding errors. It does work
correctly if the three bytes are passed to mbrtowc() in one go:
printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
As I mentioned afore, the problem is how the argument `n' is counted
when `state' is already filled and subsequent calls encounter a
non-empty state. As soon as the incoming byte sequence consists of
a multibyte followup sequence, and this sequence is in itself incomplete
again, the value of `n' is wrong and the tests which are supposed to
return -2 fail. The code slips through, reading a char too much, and
either fails, or, worst case, returns an incorrect wchar_t value.
Given that, the problem never happens for two-byte sequences, only
for three- and four-byte sequences.
I created a patch to fix this. The difference to the original code is a
more unified way to increment `n'. The original code incremented `n' by
one too much in the above scenario. Tested on Cygwin with 2, 3, and 4
byte sequences.
This way I found another problem in the UTF-16 surrogate handling which
only occurs with incomplete 4-byte sequences. I have no fix for this
and I'm not sure yet if there's a satisfying way to fix this at all.
So the below patch only fixes the above problem, not the UTF-16
related problem.
Ok to apply?
Thanks,
Corinna
* libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Fix incrementing n
in case of handling incomplete sequences.
Index: libc/stdlib/mbtowc_r.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v
retrieving revision 1.13
diff -u -p -r1.13 mbtowc_r.c
--- libc/stdlib/mbtowc_r.c 9 Apr 2009 08:20:10 -0000 1.13
+++ libc/stdlib/mbtowc_r.c 28 Jul 2009 10:01:20 -0000
@@ -220,11 +220,7 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
if (state->__count == 0)
ch = t[i++];
else
- {
- if (n < (size_t)-1)
- ++n;
- ch = state->__value.__wchb[0];
- }
+ ch = state->__value.__wchb[0];
if (ch == '\0')
{
@@ -244,7 +240,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
{
/* two-byte sequence */
state->__value.__wchb[0] = ch;
- state->__count = 1;
+ if (state->__count == 0)
+ state->__count = 1;
+ else if (n < (size_t)-1)
+ ++n;
if (n < 2)
return -2;
ch = t[i++];
@@ -288,7 +287,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
return -1;
}
state->__value.__wchb[1] = ch;
- state->__count = 2;
+ if (state->__count == 1)
+ state->__count = 2;
+ else if (n < (size_t)-1)
+ ++n;
if (n < 3)
return -2;
ch = t[i++];
@@ -347,7 +349,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
return -1;
}
state->__value.__wchb[2] = ch;
- state->__count = 3;
+ if (state->__count == 2)
+ state->__count = 3;
+ else if (n < (size_t)-1)
+ ++n;
if (n < 4)
return -2;
ch = t[i++];
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat