This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.


On 12/02/2015 01:09 PM, Stefan Liebler wrote:

> What is the reason for reporting an error in the direction from UTF-8 to
> UTF-32, but not in the direction from UTF-32 to UTF-8?
> Or is it a bug?

It's a bug.  When processing UTF-* encodings, iconv needs to detect
invalid source sequences and avoid creating invalid destination sequences.

There are various legacy encodings which treat surrogate code points as
if they were regular characters: CESU-8 corresponding to UTF-8, and
UCS-2 corresponding to UTF-16.  But if the user-visible identifier
contains the string "UTF", it really should conform to the current (?)
specification.

UTF-8, UTF-32 (and perhaps UCS-4, I do not have access to the ISO
standard) were changed fairly recently to restrict valid code points to
the first 17 planes (those that can be encoded in UTF-16).  This is
another source of decoding and encoding failures (which are also
required by the Unicode specification).
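
To make the two checks concrete (no surrogates, nothing beyond the
first 17 planes), here is a minimal sketch of a UTF-32 to UTF-8
encoding step.  This is not the glibc converter; is_scalar_value and
encode_utf8 are hypothetical names made up for illustration, and the
"return 0" convention stands in for whatever error reporting the real
code would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a glibc function): nonzero if CP is a
   Unicode scalar value, i.e. not a surrogate code point and not
   beyond plane 16.  */
static int
is_scalar_value (uint32_t cp)
{
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;                   /* Surrogate code point.  */
  return cp <= 0x10ffff;        /* Only the first 17 planes.  */
}

/* Hypothetical helper: encode CP as UTF-8 into OUT, which must have
   room for 4 bytes.  Returns the number of bytes written, or 0 if CP
   must not be encoded at all.  */
static size_t
encode_utf8 (uint32_t cp, unsigned char *out)
{
  assert (out != NULL);

  /* Refuse to create an invalid destination sequence.  */
  if (!is_scalar_value (cp))
    return 0;

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  out[0] = (unsigned char) (0xf0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
  out[3] = (unsigned char) (0x80 | (cp & 0x3f));
  return 4;
}
```

With this sketch, encode_utf8 (0xd800, buf) and
encode_utf8 (0x110000, buf) both return 0, while U+10FFFF still
encodes to four bytes.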

I wrote "current (?)" above because it's a bit annoying that the
definition of UTF-32 was changed retroactively without changing its
identifier.  But I don't think there is anything glibc can do except to
adopt the new behavior for the old identifier.

glibc iconv seems to treat UCS-2 as UTF-16 (checking for surrogate
characters, which looks like a bug), but UCS-4 as a superset of UTF-32
(which could be correct, depending on what the last version of ISO 10646
says).

> There is a further issue in utf-16.c when converting from UTF-16 to
> internal. If a uint16_t value is in the range 0xd800 .. 0xdfff, the
> next uint16_t value is checked to see whether it is in the range of a
> low surrogate (0xdc00 .. 0xdfff). Afterwards these two uint16_t values
> are interpreted as a high/low surrogate pair.
> But there is no test whether the first uint16_t value is really in the
> range of a high surrogate (0xd800 .. 0xdbff).
> If both uint16_t values are in the range of a low surrogate, they are
> treated as a valid high/low surrogate pair.
> Should iconv() report the error "invalid multibyte sequence" in such a
> case?

Yes, this is a bug, and it should report an error.
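
As a rough sketch of the check that is missing, assuming a simplified
decoder that works on an array of native-endian uint16_t units (the
function name decode_utf16 and its return convention are made up for
this example and are not what utf-16.c uses):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from IN, which holds
   LEN UTF-16 code units.  Stores the result in *CP and returns the
   number of units consumed (1 or 2), or 0 if the input is invalid or
   truncated.  */
static size_t
decode_utf16 (const uint16_t *in, size_t len, uint32_t *cp)
{
  assert (in != NULL && cp != NULL);

  if (len == 0)
    return 0;

  uint16_t u1 = in[0];
  if (u1 < 0xd800 || u1 > 0xdfff)
    {
      /* Not a surrogate: the unit is the code point.  */
      *cp = u1;
      return 1;
    }

  /* The first unit must be a high surrogate.  A low surrogate in
     this position is invalid.  */
  if (u1 > 0xdbff)
    return 0;

  /* Truncated pair.  A real converter would distinguish this case
     (incomplete input) from an invalid sequence.  */
  if (len < 2)
    return 0;

  /* The second unit must be a low surrogate.  */
  uint16_t u2 = in[1];
  if (u2 < 0xdc00 || u2 > 0xdfff)
    return 0;

  *cp = 0x10000 + (((uint32_t) (u1 - 0xd800) << 10)
                   | (uint32_t) (u2 - 0xdc00));
  return 2;
}
```

With the high-surrogate test in place, the input 0xdc00 0xdc00 is
rejected instead of being combined into a code point.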

Florian

