Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Mon Sep 28 06:23:00 GMT 2009

2009/9/28 Andy Koppe:
> If the Unix filename contains the UTF-8 representation of U+F0xx, that
> will now roundtrip to just the xx byte. U+F000 is particularly
> problematic, as that roundtrips to a null byte.
>
> Solution: if f_mbtowc comes back with a U+F0xx, scratch that, and
> instead turn each of the original bytes into a U+F0xx, i.e.:
>
> \xEF\x80\x80 -> U+F0EF U+F080 U+F080
>
> One for later?

Actually, I think there's a very simple way to implement this: just
treat a U+F0xx result the same as an encoding error, so that the
existing fallback path turns each of the original bytes into its own
U+F0xx. For example:

--- strfuncs.cc.bak     2009-09-28 06:05:53.866000000 +0100
+++ strfuncs.cc 2009-09-28 07:08:36.909000000 +0100
@@ -602,9 +602,10 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
                *ptr = 0x18;
            }
        }
-      else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
-                                 charset, &ps)) < 0
-              && *pmbs >= 0x80)
+      else if (((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
+                                 charset, &ps)) < 0
+               && *pmbs >= 0x80)
+              || (*ptr & 0xff00) == 0xf000)
        {
          /* The technique is based on a discussion here:
             http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00080.html
@@ -615,7 +616,7 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
             to store them in a symmetric way. */
          bytes = 1;
          if (dst)
-           *ptr = L'\xf080' | *pmbs;
+           *ptr = L'\xf000' | *pmbs;
          memset (&ps, 0, sizeof ps);
        }
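
To illustrate the intended behaviour outside of Cygwin, here is a minimal
standalone sketch of the same idea. The names toy_mbtowc, toy_mbstowcs and
toy_wcstombs are made up for this example and are not Cygwin functions;
toy_mbtowc is a simplified, BMP-only UTF-8 decoder standing in for f_mbtowc,
and the sketch assumes the '*pmbs >= 0x80' check is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for f_mbtowc: decodes one UTF-8 sequence of up to
   three bytes (BMP only, surrogates not rejected).  Returns the number of
   bytes consumed, or -1 on invalid or truncated input. */
static int toy_mbtowc(uint16_t *wc, const unsigned char *s, size_t n)
{
    if (n == 0)
        return -1;
    if (s[0] < 0x80) {
        *wc = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0 && n >= 2 && (s[1] & 0xc0) == 0x80) {
        uint16_t c = (uint16_t)(((s[0] & 0x1f) << 6) | (s[1] & 0x3f));
        if (c < 0x80)
            return -1;              /* overlong encoding */
        *wc = c;
        return 2;
    }
    if ((s[0] & 0xf0) == 0xe0 && n >= 3
        && (s[1] & 0xc0) == 0x80 && (s[2] & 0xc0) == 0x80) {
        uint16_t c = (uint16_t)(((s[0] & 0x0f) << 12)
                                | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f));
        if (c < 0x800)
            return -1;              /* overlong encoding */
        *wc = c;
        return 3;
    }
    return -1;
}

/* Bytes -> wide chars.  A decoding error and a decoded U+F0xx are handled
   the same way: emit U+F000 | byte for one input byte and carry on. */
static size_t toy_mbstowcs(uint16_t *dst, const unsigned char *src, size_t n)
{
    size_t out = 0;
    while (n > 0) {
        uint16_t wc = 0;
        int bytes = toy_mbtowc(&wc, src, n);
        if (bytes < 0 || (wc & 0xff00) == 0xf000) {
            wc = (uint16_t)(0xf000 | *src);
            bytes = 1;
        }
        assert(bytes >= 1);         /* the loop always makes progress */
        dst[out++] = wc;
        src += bytes;
        n -= (size_t)bytes;
    }
    return out;
}

/* Wide chars -> bytes.  U+F0xx turns back into the single byte xx;
   everything else is encoded as ordinary UTF-8. */
static size_t toy_wcstombs(unsigned char *dst, const uint16_t *src, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t wc = src[i];
        if ((wc & 0xff00) == 0xf000) {
            dst[out++] = (unsigned char)(wc & 0xff);
        } else if (wc < 0x80) {
            dst[out++] = (unsigned char)wc;
        } else if (wc < 0x800) {
            dst[out++] = (unsigned char)(0xc0 | (wc >> 6));
            dst[out++] = (unsigned char)(0x80 | (wc & 0x3f));
        } else {
            dst[out++] = (unsigned char)(0xe0 | (wc >> 12));
            dst[out++] = (unsigned char)(0x80 | ((wc >> 6) & 0x3f));
            dst[out++] = (unsigned char)(0x80 | (wc & 0x3f));
        }
    }
    return out;
}
```

With this, the bytes \xEF\x80\x80 become U+F0EF U+F080 U+F080 and convert
back to the same three bytes, so the forward conversion never produces a
U+F000 that would turn into a null byte on the way back.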

Btw, is the '*pmbs >= 0x80' check necessary there? ASCII bytes should
pass unharmed through all encodings (well, at the start of a multibyte
character anyway), and if they didn't, we'd probably still want to
encode them as U+F0xx.

Andy