This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



Re: [PATCH] Handle surrogate pairs in _wctomb_r/_mbtowc_r


Please go ahead and commit.

-- Jeff J.

Corinna Vinschen wrote:
Ping?

On Feb 18 16:55, Corinna Vinschen wrote:
Hi,


below is a patch which adds handling of UTF-16 surrogate pairs on
systems which define wchar_t as a two-byte type.  Unfortunately the
POSIX functions wcrtomb and mbrtowc don't define surrogate handling at
all, because POSIX assumes that wchar_t is big enough to hold an entire
wide char under all circumstances.  The problem is that this assumption
actually breaks these functions for all Unicode chars beyond 0xffff,
which is quite a lot of chars.
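
For readers unfamiliar with surrogates, here is a minimal sketch of the
underlying arithmetic, not taken from the patch.  The function names
split_surrogates and join_surrogates are hypothetical and chosen only for
illustration.  It shows why a code point beyond 0xffff needs two 16-bit
units: 0x10000 is subtracted, and the remaining 20 bits are spread over
two units carrying 10 value bits each.

```c
#include <assert.h>
#include <stdint.h>

/* Split a code point in the range 0x10000..0x10ffff into a UTF-16
   surrogate pair.  Illustrative helper, not part of newlib.  */
static void
split_surrogates (uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  assert (cp >= 0x10000 && cp <= 0x10ffff);
  cp -= 0x10000;                               /* 20 value bits remain */
  *hi = (uint16_t) (0xd800 | (cp >> 10));      /* upper 10 bits */
  *lo = (uint16_t) (0xdc00 | (cp & 0x3ff));    /* lower 10 bits */
}

/* Recombine a surrogate pair into the original code point.  */
static uint32_t
join_surrogates (uint16_t hi, uint16_t lo)
{
  assert (hi >= 0xd800 && hi <= 0xdbff);
  assert (lo >= 0xdc00 && lo <= 0xdfff);
  return 0x10000 + (((uint32_t) (hi & 0x3ff) << 10) | (lo & 0x3ff));
}
```

For example, U+1F600 splits into 0xd83d and 0xdc00 | 0x200 = 0xde00, and
joining those two units yields 0x1f600 again.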

Given that Cygwin is based on Windows and Windows is a wchar_t == UTF-16
system, that breaks Cygwin in conjunction with a significant number of
languages.  That's why I created the below patch which is, quite
certainly, a hack made necessary by the limitations of the underlying
system.

How the patch works on UTF-16 systems:

- _wctomb_r: If a first half of a surrogate pair is detected in wchar,
  it creates a temporary wint_t value based on the 10 value bits in the
  surrogate wchar_t.  This value is then stored in state, and the first
  byte of the resulting UTF-8 char is returned.  If a second half of a
  surrogate pair is detected, _wctomb_r checks if it already detected a
  first half in the previous run.  If not, it's an invalid wchar value.
  Otherwise it creates the full Unicode value, resets the state, and
  returns the trailing 3 UTF-8 bytes in s.

- _mbtowc_r: If the detected UTF-8 char results in a Unicode char in the
  range 0x10000 <= unicode_char <= 0x10ffff, it stores the value in
  state and returns the first half of the UTF-16 surrogate pair.  In the
  next call, if the state indicates that we're in the middle of a
  surrogate pair, it resets the state and returns the second half of
  the surrogate pair.
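
The _wctomb_r side of the steps above can be sketched roughly as follows.
This is a simplified stand-alone illustration, not the patch itself:
struct conv_state and utf16_unit_to_utf8 are hypothetical names, and the
state is kept in a plain struct rather than in the mbstate_t fields the
patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the conversion state.  */
struct conv_state
{
  int pending;        /* nonzero after a first surrogate half was seen */
  uint32_t partial;   /* code point with the low 10 bits still missing */
};

/* Convert one UTF-16 unit to UTF-8.  Returns the number of bytes
   written to out (at most 3 per call), or -1 on invalid input.  */
static int
utf16_unit_to_utf8 (struct conv_state *st, uint16_t wc, unsigned char *out)
{
  assert (st != NULL && out != NULL);
  if (st->pending && (wc < 0xdc00 || wc > 0xdfff))
    return -1;        /* only a second half may follow a first half */
  if (wc >= 0xdc00 && wc <= 0xdfff)
    {
      uint32_t cp;
      if (!st->pending)
        return -1;    /* second half without a preceding first half */
      cp = st->partial | (wc & 0x3ff);
      st->pending = 0;
      /* Emit the trailing three bytes of the 4-byte UTF-8 sequence.  */
      out[0] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (wc >= 0xd800 && wc <= 0xdbff)
    {
      /* First half: remember the upper bits, emit only the lead byte.  */
      st->partial = ((uint32_t) (wc & 0x3ff) << 10) + 0x10000;
      st->pending = 1;
      out[0] = (unsigned char) (0xf0 | (st->partial >> 18));
      return 1;
    }
  if (wc <= 0x7f)
    {
      out[0] = (unsigned char) wc;
      return 1;
    }
  if (wc <= 0x7ff)
    {
      out[0] = (unsigned char) (0xc0 | (wc >> 6));
      out[1] = (unsigned char) (0x80 | (wc & 0x3f));
      return 2;
    }
  out[0] = (unsigned char) (0xe0 | (wc >> 12));
  out[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3f));
  out[2] = (unsigned char) (0x80 | (wc & 0x3f));
  return 3;
}
```

Feeding 0xd83d and then 0xde00 with a zero-initialized state produces one
byte (0xf0) on the first call and three bytes (0x9f 0x98 0x80) on the
second, which together form the UTF-8 encoding of U+1F600.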

This *might* break applications on UTF-16 systems which are ignorant of
the fact that wchar_t doesn't hold a complete Unicode char, *and* use
wcrtomb/mbrtowc directly.  However, most applications will use the
higher level string functions (wcstombs/mbstowcs), and these are using
_wctomb_r/_mbtowc_r transparently from the application's point of view.
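
To illustrate why a string-level caller does not notice the two-step
return, here is a rough sketch of an mbstowcs-style loop over a
simplified decoder.  Both functions and struct dec_state are hypothetical
and not newlib code.  The sketch assumes complete input sequences and
does not reject overlong or otherwise ill-formed encodings.  As in the
patch, a 4-byte sequence is reported as two calls that each return 2, so
a caller that simply advances by the return value stays in sync.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state: a second surrogate half waiting to be
   handed out on the next call.  */
struct dec_state
{
  int pending;
  uint16_t low;
};

static int
is_cont (unsigned char b)
{
  return (b & 0xc0) == 0x80;
}

/* Decode one UTF-16 unit from the UTF-8 bytes at s (n bytes available).
   Returns the number of bytes the caller should advance, or -1.  */
static int
utf8_to_utf16_unit (struct dec_state *st, uint16_t *pwc,
                    const unsigned char *s, size_t n)
{
  if (st->pending)
    {
      /* Second call for a 4-byte sequence: return the stored half.  */
      *pwc = st->low;
      st->pending = 0;
      return 2;
    }
  if (n >= 1 && s[0] < 0x80)
    {
      *pwc = s[0];
      return 1;
    }
  if (n >= 2 && (s[0] & 0xe0) == 0xc0 && is_cont (s[1]))
    {
      *pwc = (uint16_t) (((s[0] & 0x1f) << 6) | (s[1] & 0x3f));
      return 2;
    }
  if (n >= 3 && (s[0] & 0xf0) == 0xe0 && is_cont (s[1]) && is_cont (s[2]))
    {
      *pwc = (uint16_t) (((s[0] & 0x0f) << 12) | ((s[1] & 0x3f) << 6)
                         | (s[2] & 0x3f));
      return 3;
    }
  if (n >= 4 && (s[0] & 0xf8) == 0xf0
      && is_cont (s[1]) && is_cont (s[2]) && is_cont (s[3]))
    {
      uint32_t cp = ((uint32_t) (s[0] & 0x07) << 18)
                    | ((uint32_t) (s[1] & 0x3f) << 12)
                    | ((uint32_t) (s[2] & 0x3f) << 6)
                    | (uint32_t) (s[3] & 0x3f);
      if (cp < 0x10000 || cp > 0x10ffff)
        return -1;
      cp -= 0x10000;
      *pwc = (uint16_t) (0xd800 | (cp >> 10));       /* first half now */
      st->low = (uint16_t) (0xdc00 | (cp & 0x3ff));  /* second half later */
      st->pending = 1;
      return 2;       /* half of the four bytes; the rest on the next call */
    }
  return -1;
}

/* String-level loop in the style of mbstowcs.  Returns the number of
   UTF-16 units stored in dst, or -1 on error or lack of space.  */
static long
utf8_string_to_utf16 (uint16_t *dst, size_t cap,
                      const unsigned char *src, size_t len)
{
  struct dec_state st = { 0, 0 };
  size_t used = 0, pos = 0;

  assert (dst != NULL && src != NULL);
  while (pos < len || st.pending)
    {
      int r;
      if (used == cap)
        return -1;
      r = utf8_to_utf16_unit (&st, &dst[used], src + pos, len - pos);
      if (r < 0)
        return -1;
      pos += (size_t) r;
      used++;
    }
  return (long) used;
}
```

For the six input bytes 0x61 0xf0 0x9f 0x98 0x80 0x62, the loop stores
four units: 0x61, 0xd83d, 0xde00 and 0x62.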

So, the bottom line is, I'm not entirely sure if that's a good idea in
all cases, but IMHO the advantages outweigh the potential problems.

Btw., the patch for _mbtowc_r also fixes two compiler warnings.


Corinna



	* mbtowc_r.c (_mbtowc_r): Fix two compiler warnings.  Handle surrogate
	pairs in case of wchar_t == UTF-16.
	* wctomb_r.c (_wctomb_r): Handle surrogate pairs in case of
	wchar_t == UTF-16.


--- mbtowc_r.c-UNI 2009-02-18 10:02:35.000000000 +0100
+++ mbtowc_r.c 2009-02-18 16:22:41.000000000 +0100
@@ -65,8 +65,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
return -2;
#ifdef _MB_CAPABLE
- if (__lc_ctype == NULL ||
- (strlen (__lc_ctype) <= 1))
+ if ((strlen (__lc_ctype) <= 1))
{ /* fall-through */ }
else if (!strcmp (__lc_ctype, "C-UTF-8"))
{
@@ -76,6 +75,18 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
if (s == NULL)
return 0; /* UTF-8 character encodings are not state-dependent */
+ if (state->__count == 4)
+ {
+ /* Create the second half of the surrogate pair. For a description
+ see the comment below. */
+ wint_t tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18)
+ | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
+ | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6)
+ | (wint_t)(state->__value.__wchb[3] & 0x3f);
+ state->__count = 0;
+ *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
+ return 2;
+ }
if (state->__count == 0)
ch = t[i++];
else
@@ -153,8 +164,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
else if (ch >= 0xf0 && ch <= 0xf7)
{
/* four-byte sequence */
- if (sizeof(wchar_t) < 4)
- return -1; /* we can't store such a value */
+ wint_t tmp;
state->__value.__wchb[0] = ch;
if (state->__count == 0)
state->__count = 1;
@@ -185,11 +195,25 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
ch = t[i++];
if (ch < 0x80 || ch > 0xbf)
return -1;
- *pwc = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
- | (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
- | (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
- | (wchar_t)(ch & 0x3f);
-
+ tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18)
+ | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
+ | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6)
+ | (wint_t)(ch & 0x3f);
+ if (tmp > 0xffff && sizeof(wchar_t) == 2)
+ {
+ /* On systems where wchar_t holds UTF-16 values, the value
+ doesn't fit into a single wchar_t in this case. So what we
+ do here is to store the state with a special value of __count
+ and return the first half of a surrogate pair. As return
+ value we choose 2, half the length of the 4-byte UTF-8 char.
+ The second half is returned in case we recognize the special
+ __count value above. */
+ state->__value.__wchb[3] = ch;
+ state->__count = 4;
+ *pwc = 0xd800 | (((tmp - 0x10000) >> 10) & 0x3ff);
+ return 2;
+ }
+ *pwc = tmp;
state->__count = 0;
return i;
}
@@ -330,7 +354,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
*pwc = (((wchar_t)state->__value.__wchb[0]) << 8) + (wchar_t)(t[i]);
return (i + 1);
case MAKE_A:
- ptr = (char *)(t + i + 1);
+ ptr = (unsigned char *)(t + i + 1);
break;
case ERROR:
default:
--- wctomb_r.c-UNI 2009-02-18 10:02:26.000000000 +0100
+++ wctomb_r.c 2009-02-18 16:11:46.000000000 +0100
@@ -28,6 +28,11 @@ _DEFUN (_wctomb_r, (r, s, wchar, state),
if (s == NULL)
return 0; /* UTF-8 encoding is not state-dependent */
+ if (state->__count == -4 && (wchar < 0xdc00 || wchar > 0xdfff))
+ {
+ /* At this point only the second half of a surrogate pair is valid. */
+ return -1;
+ }
if (wchar <= 0x7f)
{
*s = wchar;
@@ -41,10 +46,39 @@ _DEFUN (_wctomb_r, (r, s, wchar, state),
}
else if (wchar >= 0x800 && wchar <= 0xffff)
{
- /* UTF-16 surrogates -- must not occur in normal UCS-4 data */
if (wchar >= 0xd800 && wchar <= 0xdfff)
- return -1;
-
+ {
+ wint_t tmp;
+ /* UTF-16 surrogates -- must not occur in normal UCS-4 data */
+ if (sizeof (wchar_t) != 2)
+ return -1;
+ if (wchar >= 0xdc00)
+ {
+ /* Second half of a surrogate pair. It's not valid unless
+ we have already read the first half of a surrogate pair
+ in the previous call. */
+ if (state->__count != -4)
+ return -1;
+ /* If it's valid, reconstruct the full Unicode value and
+ return the trailing three bytes of the UTF-8 char. */
+ tmp = (state->__value.__wchb[0] << 16)
+ | (state->__value.__wchb[1] << 8)
+ | (wchar & 0x3ff);
+ state->__count = 0;
+ *s++ = 0x80 | ((tmp & 0x3f000) >> 12);
+ *s++ = 0x80 | ((tmp & 0xfc0) >> 6);
+ *s = 0x80 | (tmp & 0x3f);
+ return 3;
+ }
+ /* First half of a surrogate pair. Store the state and return
+ the first byte of the UTF-8 char. */
+ tmp = ((wchar & 0x3ff) << 10) + 0x10000;
+ state->__value.__wchb[0] = (tmp >> 16) & 0xff;
+ state->__value.__wchb[1] = (tmp >> 8) & 0xff;
+ state->__count = -4;
+ *s = (0xf0 | ((tmp & 0x1c0000) >> 18));
+ return 1;
+ }
*s++ = 0xe0 | ((wchar & 0xf000) >> 12);
*s++ = 0x80 | ((wchar & 0xfc0) >> 6);
*s = 0x80 | (wchar & 0x3f);


--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat


