This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.


On Wed, Dec 02, 2015 at 01:29:40PM +0100, Florian Weimer wrote:
> On 12/02/2015 01:09 PM, Stefan Liebler wrote:
> 
> > What is the reason for reporting an error in the direction from UTF-8 to
> > UTF-32, but not in the direction from UTF-32 to UTF-8?
> > Or is it a bug?
> 
> It's a bug.  When processing UTF-* encodings, iconv needs to detect
> invalid source sequences and avoid creating invalid destination sequences.
> 
> There are various legacy encodings which treat surrogate code points as
> if they were regular characters: CESU-8 corresponding to UTF-8, and
> UCS-2 corresponding to UTF-16.  But if the user-visible identifier
> contains the string "UTF", it really should conform to the current (?)
> specification.
> 
> UTF-8, UTF-32 (and perhaps UCS-4, I do not have access to the ISO
> standard) were changed fairly recently to restrict valid code points to
> the first 17 planes (those that can be encoded in UTF-16).  This is
> another source of decoding and encoding failures (which are also
> required by the Unicode specification).
> 
> I wrote "current (?)" above because it's a bit annoying that the
> definition of UTF-32 was changed retroactively without changing its
> identifier.  But I don't think there is anything glibc can do except to
> adopt the new behavior for the old identifier.
> 
> glibc iconv seems to treat UCS-2 as UTF-16 (checking for surrogate
> characters, which looks like a bug), but UCS-4 as a superset of UTF-32
> (which could be correct, depending on what the last version of ISO 10646
> says).

The relevant term is "Unicode Scalar Values", and these are exactly
the integers 0-0xd7ff and 0xe000-0x10ffff. The UTFs assign a unique
encoding (in terms of code units) to each Unicode Scalar Value,
and are not defined for any other integers. Likewise, UCS (16 or 32)
does not include values which are not Unicode Scalar Values.
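
To make the ranges concrete, here is a minimal sketch of the check a
converter would need in both directions. The names
is_unicode_scalar_value and encode_utf8 are hypothetical helpers made up
for illustration; this is not glibc's iconv code, only the range test
described above plus a UTF-8 encoder that refuses anything outside it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True iff c is a Unicode Scalar Value:
   0..0xD7FF or 0xE000..0x10FFFF (surrogates and values past plane 16 excluded). */
static bool is_unicode_scalar_value(uint32_t c)
{
    return c <= 0xD7FF || (c >= 0xE000 && c <= 0x10FFFF);
}

/* Hypothetical helper: encode one UTF-32 code unit as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if c is not a
   Unicode Scalar Value, so no invalid output is ever produced. */
static size_t encode_utf8(uint32_t c, unsigned char out[4])
{
    if (!is_unicode_scalar_value(c))
        return 0;
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}
```

With this sketch, encode_utf8(0xD800, buf) returns 0 rather than
emitting a CESU-8 style three-byte sequence, which is the behavior the
UTF-32 to UTF-8 direction would need to match the UTF-8 to UTF-32 one.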

Rich

