Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
Andy Koppe
andy.koppe@gmail.com
Mon Sep 28 06:23:00 GMT 2009
2009/9/28 Andy Koppe:
> If the Unix filename contains the UTF-8 representation of U+F0xx, that
> will now roundtrip to just the xx byte. U+F000 is particularly
> problematic, as that roundtrips to a null byte.
>
> Solution: if f_mbtowc comes back with a U+F0xx, scratch that, and
> instead turn each of the original bytes into a U+F0xx, i.e.:
>
> \xEF\x80\x80 -> U+F0EF U+F080 U+F080
>
> One for later?
Actually, I think there's a very simple way to implement this: just
treat a U+F0xx result the same as an encoding error. For example:
--- strfuncs.cc.bak 2009-09-28 06:05:53.866000000 +0100
+++ strfuncs.cc 2009-09-28 07:08:36.909000000 +0100
@@ -602,9 +602,10 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
               *ptr = 0x18;
             }
         }
-      else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
-                                  charset, &ps)) < 0
-               && *pmbs >= 0x80)
+      else if (((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
+                                   charset, &ps)) < 0
+                && *pmbs >= 0x80)
+               || (*ptr & 0xff00) == 0xf000)
         {
           /* The technique is based on a discussion here:
              http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00080.html
@@ -615,7 +616,7 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
              to store them in a symmetric way. */
           bytes = 1;
           if (dst)
-            *ptr = L'\xf080' | *pmbs;
+            *ptr = L'\xf000' | *pmbs;
           memset (&ps, 0, sizeof ps);
         }
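To make the intended round trip concrete, here is a minimal standalone
sketch of the idea, not the Cygwin code itself. The names decode_one,
to_wide and to_bytes are hypothetical, the decoder is a simplified UTF-8
one that only handles sequences of up to three bytes (anything longer is
simply treated as invalid and escaped byte by byte), and std::u16string
stands in for the wchar_t buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence (1 to 3 bytes) at s[i].
// Returns the number of bytes consumed, or 0 if the bytes are not valid.
static std::size_t decode_one(const std::string &s, std::size_t i, char16_t &out)
{
  unsigned char b0 = static_cast<unsigned char>(s[i]);
  if (b0 < 0x80) { out = b0; return 1; }
  if (b0 >= 0xC2 && b0 <= 0xDF && i + 1 < s.size()) {
    unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    if ((b1 & 0xC0) == 0x80) {
      out = static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
      return 2;
    }
  }
  if ((b0 & 0xF0) == 0xE0 && i + 2 < s.size()) {
    unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
    if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80) {
      unsigned c = ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
      // Reject overlong forms and surrogates.
      if (c >= 0x800 && (c < 0xD800 || c > 0xDFFF)) {
        out = static_cast<char16_t>(c);
        return 3;
      }
    }
  }
  return 0;
}

// Bytes -> wide. A decoding failure and a successful decode that lands in
// U+F0xx are handled the same way: escape one original byte as U+F0xx.
std::u16string to_wide(const std::string &s)
{
  std::u16string w;
  for (std::size_t i = 0; i < s.size(); ) {
    char16_t c = 0;
    std::size_t n = decode_one(s, i, c);
    if (n == 0 || (c & 0xff00) == 0xf000) {
      c = static_cast<char16_t>(0xf000 | static_cast<unsigned char>(s[i]));
      n = 1;
    }
    w.push_back(c);
    i += n;
  }
  return w;
}

// Wide -> bytes. U+F0xx turns back into the single byte xx; everything
// else is encoded as ordinary UTF-8 (surrogates are not handled here).
std::string to_bytes(const std::u16string &w)
{
  std::string s;
  for (char16_t c : w) {
    if ((c & 0xff00) == 0xf000)
      s.push_back(static_cast<char>(c & 0xff));
    else if (c < 0x80)
      s.push_back(static_cast<char>(c));
    else if (c < 0x800) {
      s.push_back(static_cast<char>(0xC0 | (c >> 6)));
      s.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {
      s.push_back(static_cast<char>(0xE0 | (c >> 12)));
      s.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
      s.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return s;
}
```

With this, to_wide("\xEF\x80\x80") yields U+F0EF U+F080 U+F080 rather
than U+F000, and to_bytes() of that gives back the original three bytes
instead of a single null byte.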
Btw, is the '*pmbs >= 0x80' check necessary there? ASCII bytes should
pass through every encoding unchanged (at least at the start of a
multibyte character), and if one didn't, we'd probably still want to
encode it as U+F0xx.
Andy