[PATCH] Handle surrogate pairs in _wctomb_r/_mbtowc_r
Corinna Vinschen
vinschen@redhat.com
Tue Feb 24 12:25:00 GMT 2009
Ping?
On Feb 18 16:55, Corinna Vinschen wrote:
> Hi,
>
>
> below is a patch which adds handling of UTF-16 surrogate pairs on
> systems which define wchar_t as two byte values. Unfortunately the
> POSIX functions wcrtomb and mbrtowc don't define surrogate handling at
> all because POSIX assumes that wchar_t is big enough to hold an entire
> wide char under all circumstances. The problem is that this assumption
> actually breaks these functions for all Unicode chars beyond 0xffff,
> which is quite a lot of chars.
>
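To make the 0xffff limit concrete, here is a minimal sketch of how a code point beyond 0xffff maps onto two UTF-16 units and back. It is not part of the patch; the helper names `split_surrogates` and `join_surrogates` are hypothetical and only illustrate the arithmetic.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not newlib code. */

/* Split a code point in the range 0x10000..0x10ffff into a UTF-16
   surrogate pair.  The caller must ensure cp is in that range. */
static void split_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    uint32_t v = cp - 0x10000;                /* 20 significant bits remain */
    *hi = (uint16_t)(0xd800 | (v >> 10));     /* upper 10 bits */
    *lo = (uint16_t)(0xdc00 | (v & 0x3ff));   /* lower 10 bits */
}

/* Recombine a surrogate pair into the original code point. */
static uint32_t join_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi & 0x3ff) << 10) | (uint32_t)(lo & 0x3ff));
}
```

For example, U+1F600 splits into 0xd83d and 0xde00, so it cannot be delivered through a single two-byte wchar_t.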
> Given that Cygwin is based on Windows and Windows is a wchar_t == UTF-16
> system, that breaks Cygwin in conjunction with a significant number of
> languages. That's why I created the below patch which is, quite
> certainly, a hack made necessary by the limitations of the underlying
> system.
>
> How the patch works on UTF-16 systems:
>
> - _wctomb_r: If a first half of a surrogate pair is detected in wchar,
> it creates a temporary wint_t value based on the 10 value bits in the
> surrogate wchar_t. This value is then stored in state, and the first
> byte of the resulting UTF-8 char is returned. If a second half of a
> surrogate pair is detected, _wctomb_r checks if it already detected a
> first half in the previous run. If not, it's an invalid wchar value.
> Otherwise it creates the full Unicode value, resets the state, and
> returns the trailing 3 UTF-8 bytes in s.
>
> - _mbtowc_r: If the detected UTF-8 char results in a Unicode char in the
> range 0x10000 <= unicode_char <= 0x10ffff, it stores the value in
> state and returns the first half of the UTF-16 surrogate pair. In the
> next call, if the state indicates that we're in the middle of a
> surrogate pair, it resets the state and returns the second half of
> the pair.
>
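The two-call protocol described in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patch itself: `sketch_state`, `sketch_wctomb` and `sketch_mbtowc` are hypothetical names, the state layout differs from newlib's mbstate_t, only the surrogate paths are covered, and the decoder assumes a complete, well-formed four-byte sequence.

```c
#include <stdint.h>

/* Hypothetical stand-in for the conversion state (not newlib's mbstate_t). */
typedef struct {
    int      pending;  /* nonzero once a first half has been handled */
    uint32_t partial;  /* value remembered between the two calls */
} sketch_state;

/* Encode one UTF-16 unit.  Returns the number of bytes written to out,
   or -1 on error.  Non-surrogate values are outside this sketch. */
static int sketch_wctomb(uint8_t *out, uint16_t wc, sketch_state *st)
{
    if (st->pending && (wc < 0xdc00 || wc > 0xdfff))
        return -1;                          /* only a second half is valid now */
    if (wc >= 0xd800 && wc <= 0xdbff) {     /* first half: emit lead byte */
        st->partial = (((uint32_t)wc & 0x3ff) << 10) + 0x10000;
        st->pending = 1;
        out[0] = (uint8_t)(0xf0 | (st->partial >> 18));
        return 1;
    }
    if (wc >= 0xdc00 && wc <= 0xdfff) {     /* second half: emit 3 trail bytes */
        uint32_t cp;
        if (!st->pending)
            return -1;                      /* no first half seen before */
        cp = st->partial | (uint32_t)(wc & 0x3ff);
        st->pending = 0;
        out[0] = (uint8_t)(0x80 | ((cp >> 12) & 0x3f));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (uint8_t)(0x80 | (cp & 0x3f));
        return 3;
    }
    return -1;
}

/* Decode a four-byte UTF-8 sequence in two calls, producing one UTF-16
   unit per call.  Each call reports 2 bytes, i.e. half of the sequence. */
static int sketch_mbtowc(uint16_t *pwc, const uint8_t *s, sketch_state *st)
{
    uint32_t cp;
    if (st->pending) {                      /* second call: second half */
        *pwc = (uint16_t)(0xdc00 | ((st->partial - 0x10000) & 0x3ff));
        st->pending = 0;
        return 2;
    }
    cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3f) << 12)
       | ((uint32_t)(s[2] & 0x3f) << 6) | (uint32_t)(s[3] & 0x3f);
    if (cp < 0x10000 || cp > 0x10ffff)
        return -1;                          /* not a surrogate-pair case */
    st->partial = cp;
    st->pending = 1;
    *pwc = (uint16_t)(0xd800 | (((cp - 0x10000) >> 10) & 0x3ff));
    return 2;
}
```

With U+1F600 as input, the encoder emits 0xf0 for the unit 0xd83d and then 0x9f 0x98 0x80 for 0xde00, while the decoder turns those four bytes back into 0xd83d followed by 0xde00, using a separate state object for each direction.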
> This *might* break applications on UTF-16 systems which are ignorant of
> the fact that wchar_t doesn't hold a complete Unicode char, *and* use
> wcrtomb/mbrtowc directly. However, most applications will use the
> higher level string functions (wcstombs/mbstowcs), and these are using
> _wctomb_r/_mbtowc_r transparently from the application's point of view.
>
> So, the bottom line is, I'm not entirely sure if that's a good idea in
> all cases, but IMHO the advantages outweigh the potential problems.
>
> Btw., the patch for _mbtowc_r also fixes two compiler warnings.
>
>
> Corinna
>
>
> * mbtowc_r.c (_mbtowc_r): Fix two compiler warnings.
> Handle surrogate pairs in case of wchar_t == UTF-16.
> * wctomb_r.c (_wctomb_r): Handle surrogate pairs in case of
> wchar_t == UTF-16.
>
>
> --- mbtowc_r.c-UNI 2009-02-18 10:02:35.000000000 +0100
> +++ mbtowc_r.c 2009-02-18 16:22:41.000000000 +0100
> @@ -65,8 +65,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
> return -2;
>
> #ifdef _MB_CAPABLE
> - if (__lc_ctype == NULL ||
> - (strlen (__lc_ctype) <= 1))
> + if ((strlen (__lc_ctype) <= 1))
> { /* fall-through */ }
> else if (!strcmp (__lc_ctype, "C-UTF-8"))
> {
> @@ -76,6 +75,18 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
> if (s == NULL)
> return 0; /* UTF-8 character encodings are not state-dependent */
>
> + if (state->__count == 4)
> + {
> + /* Create the second half of the surrogate pair. For a description
> + see the comment below. */
> + wint_t tmp = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
> + | (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
> + | (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
> + | (wchar_t)(state->__value.__wchb[3] & 0x3f);
> + state->__count = 0;
> + *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
> + return 2;
> + }
> if (state->__count == 0)
> ch = t[i++];
> else
> @@ -153,8 +164,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
> else if (ch >= 0xf0 && ch <= 0xf7)
> {
> /* four-byte sequence */
> - if (sizeof(wchar_t) < 4)
> - return -1; /* we can't store such a value */
> + wint_t tmp;
> state->__value.__wchb[0] = ch;
> if (state->__count == 0)
> state->__count = 1;
> @@ -185,11 +195,25 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
> ch = t[i++];
> if (ch < 0x80 || ch > 0xbf)
> return -1;
> - *pwc = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
> - | (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
> - | (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
> - | (wchar_t)(ch & 0x3f);
> -
> + tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18)
> + | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
> + | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6)
> + | (wint_t)(ch & 0x3f);
> + if (tmp > 0xffff && sizeof(wchar_t) == 2)
> + {
> + /* On systems which have wchar_t being UTF-16 values, the value
> + doesn't fit into a single wchar_t in this case. So what we
> + do here is to store the state with a special value of __count
> + and return the first half of a surrogate pair. As the return
> + value we report half the byte length of the UTF-8 char (2 of 4).
> + The second half is returned in case we recognize the special
> + __count value above. */
> + state->__value.__wchb[3] = ch;
> + state->__count = 4;
> + *pwc = 0xd800 | (((tmp - 0x10000) >> 10) & 0x3ff);
> + return 2;
> + }
> + *pwc = tmp;
> state->__count = 0;
> return i;
> }
> @@ -330,7 +354,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
> *pwc = (((wchar_t)state->__value.__wchb[0]) << 8) + (wchar_t)(t[i]);
> return (i + 1);
> case MAKE_A:
> - ptr = (char *)(t + i + 1);
> + ptr = (unsigned char *)(t + i + 1);
> break;
> case ERROR:
> default:
> --- wctomb_r.c-UNI 2009-02-18 10:02:26.000000000 +0100
> +++ wctomb_r.c 2009-02-18 16:11:46.000000000 +0100
> @@ -28,6 +28,11 @@ _DEFUN (_wctomb_r, (r, s, wchar, state),
> if (s == NULL)
> return 0; /* UTF-8 encoding is not state-dependent */
>
> + if (state->__count == -4 && (wchar < 0xdc00 || wchar > 0xdfff))
> + {
> + /* At this point only the second half of a surrogate pair is valid. */
> + return -1;
> + }
> if (wchar <= 0x7f)
> {
> *s = wchar;
> @@ -41,10 +46,39 @@ _DEFUN (_wctomb_r, (r, s, wchar, state),
> }
> else if (wchar >= 0x800 && wchar <= 0xffff)
> {
> - /* UTF-16 surrogates -- must not occur in normal UCS-4 data */
> if (wchar >= 0xd800 && wchar <= 0xdfff)
> - return -1;
> -
> + {
> + wint_t tmp;
> + /* UTF-16 surrogates -- must not occur in normal UCS-4 data */
> + if (sizeof (wchar_t) != 2)
> + return -1;
> + if (wchar >= 0xdc00)
> + {
> + /* Second half of a surrogate pair. It's only valid if
> + we have already read the first half of a surrogate pair
> + in the previous call. */
> + if (state->__count != -4)
> + return -1;
> + /* If it's valid, reconstruct the full Unicode value and
> + return the trailing three bytes of the UTF-8 char. */
> + tmp = (state->__value.__wchb[0] << 16)
> + | (state->__value.__wchb[1] << 8)
> + | (wchar & 0x3ff);
> + state->__count = 0;
> + *s++ = 0x80 | ((tmp & 0x3f000) >> 12);
> + *s++ = 0x80 | ((tmp & 0xfc0) >> 6);
> + *s = 0x80 | (tmp & 0x3f);
> + return 3;
> + }
> + /* First half of a surrogate pair. Store the state and return
> + the first byte of the UTF-8 char. */
> + tmp = ((wchar & 0x3ff) << 10) + 0x10000;
> + state->__value.__wchb[0] = (tmp >> 16) & 0xff;
> + state->__value.__wchb[1] = (tmp >> 8) & 0xff;
> + state->__count = -4;
> + *s = (0xf0 | ((tmp & 0x1c0000) >> 18));
> + return 1;
> + }
> *s++ = 0xe0 | ((wchar & 0xf000) >> 12);
> *s++ = 0x80 | ((wchar & 0xfc0) >> 6);
> *s = 0x80 | (wchar & 0x3f);
>
> --
> Corinna Vinschen
> Cygwin Project Co-Leader
> Red Hat
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat