This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
- From: Rich Felker <dalias at libc dot org>
- To: Florian Weimer <fweimer at redhat dot com>
- Cc: Stefan Liebler <stli at linux dot vnet dot ibm dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 3 Dec 2015 16:44:17 -0500
- Subject: Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
- References: <565EDF7C dot 9020808 at linux dot vnet dot ibm dot com> <565EE434 dot 4090205 at redhat dot com>
On Wed, Dec 02, 2015 at 01:29:40PM +0100, Florian Weimer wrote:
> On 12/02/2015 01:09 PM, Stefan Liebler wrote:
>
> > What is the reason for reporting an error in the direction from UTF-8 to
> > UTF-32, but not in the direction from UTF-32 to UTF-8?
> > Or is it a bug?
>
> It's a bug. When processing UTF-* encodings, iconv needs to detect
> invalid source sequences and avoid creating invalid destination sequences.
>
> There are various legacy encodings which treat surrogate code points as
> if they were regular characters: CESU-8 corresponding to UTF-8, and
> UCS-2 corresponding to UTF-16. But if the user-visible identifier
> contains the string "UTF", it really should conform to the current (?)
> specification.
>
> UTF-8, UTF-32 (and perhaps UCS-4, I do not have access to the ISO
> standard) were changed fairly recently to restrict valid code points to
> the first 17 planes (those that can be encoded in UTF-16). This is
> another source of decoding and encoding failures (which are also
> required by the Unicode specification).
>
> I wrote "current (?)" above because it's a bit annoying that the
> definition of UTF-32 was changed retroactively without changing its
> identifier. But I don't think there is anything glibc can do except to
> adopt the new behavior for the old identifier.
>
> glibc iconv seems to treat UCS-2 as UTF-16 (checking for surrogate
> characters, which looks like a bug), but UCS-4 as a superset of UTF-32
> (which could be correct, depending on what the last version of ISO 10646
> says).
The relevant term is "Unicode Scalar Values", and these are exactly
the integers 0-0xd7ff and 0xe000-0x10ffff. The UTFs assign a unique
encoding (in terms of code units) to each Unicode Scalar Value,
and are not defined for any other integers. Likewise, UCS (16 or 32)
does not include values which are not Unicode Scalar Values.
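
As a minimal sketch of the range check described above (the helper name
is_unicode_scalar_value is hypothetical and is not a glibc function), a
converter could reject anything outside the two ranges before encoding
or after decoding:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of glibc: returns true if V is a
   Unicode Scalar Value, i.e. lies in 0..0xD7FF or 0xE000..0x10FFFF.
   Surrogates (0xD800..0xDFFF) and values above 0x10FFFF are rejected.  */
static bool
is_unicode_scalar_value (uint32_t v)
{
  return v <= 0xD7FF || (v >= 0xE000 && v <= 0x10FFFF);
}
```

Under this assumption the same check applies in both directions, so a
UTF-32 input unit such as 0xd800 would be treated as invalid just like
its three-byte form in a UTF-8 input.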
Rich