This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.


Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.

From: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
To: libc-help at sourceware dot org
Cc: Florian Weimer <fweimer at redhat dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, Carlos O'Donell <carlos at redhat dot com>
Date: Wed, 2 Dec 2015 13:09:32 +0100
Subject: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
Authentication-results: sourceware.org; auth=none

Hi,

When converting characters from UTF-16 to UTF-32, if the byte sequence contains a single low UTF-16 surrogate (0xdc00 .. 0xdfff), iconv() reports the error "invalid multibyte sequence".

Due to this requirement, the s390 hardware instructions for converting from UTF-16 to UTF-8 / UTF-32 were disabled, because they do not report this error.
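
The error described above for a single low surrogate can be checked from user code with a small helper around iconv(). This is only a minimal sketch: the function name convert_utf16le_to_utf32 and the fixed 64-byte output buffer are assumptions made for illustration, and the errno value returned depends on the iconv implementation in use.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper for illustration, not part of glibc.
   Converts a short UTF-16LE buffer to UTF-32LE and returns 0 on success,
   otherwise the errno value set by iconv_open() or iconv().
   The output buffer is small, so long inputs would yield E2BIG. */
int convert_utf16le_to_utf32(const unsigned char *in, size_t in_len)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return errno;

    char out[64];
    char *inp = (char *)in;      /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = in_len;
    size_t outleft = sizeof out;
    int err = 0;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        err = errno;

    iconv_close(cd);
    return err;
}
```

Calling it with the bytes 0x00 0xdc 0x41 0x00 (a lone low surrogate 0xdc00 followed by 'A') is expected to give EILSEQ, which strerror() describes as an invalid multibyte sequence.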

When converting from UTF-32 to UTF-8 / UTF-16, the s390 hardware instructions do not report an error if a UTF-32 character is in the range of a UTF-16 low surrogate (0xdc00 .. 0xdfff).

Should iconv() report the error "invalid multibyte sequence" in such cases?
If yes, then these two hardware instructions have to be disabled, too!

For comparison, the common code does not report an error on such a low-surrogate character while converting from UTF-32 to INTERNAL and from INTERNAL to UTF-8.

In the other direction, from UTF-8 to INTERNAL, characters in the range of a UTF-16 surrogate are not accepted and iconv() returns the error "invalid multibyte sequence". The same happens when converting from INTERNAL to UTF-32.


According to the comment

"/* Surrogate characters in UCS-4 input are not valid. We must catch this. If we let surrogates pass through, attackers could make a security hole exploit by generating "irregular UTF-32" sequences. */"

in utf-32.c, this is a security issue.

What is the reason for reporting an error in the direction from UTF-8 to UTF-32, but not in the direction from UTF-32 to UTF-8?

Or is it a bug?

According to the latest Unicode Standard, an error should be reported in all cases:

See http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf
in chapter 3.9 Unicode Encoding Forms:
"D76    Unicode scalar value:
Any Unicode code point except high-surrogate and low-surrogate code points.

• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.

D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form.

• Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.


UTF-32: D90: ...

• Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0000D800₁₆..0000DFFF₁₆ are ill-formed.


UTF-16: D91: ...

• Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range D800₁₆..DFFF₁₆ are ill-formed.


UTF-8: D92: ...

• Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.


Encoding Form Conversion: D93: ...

• A conformant encoding form conversion will treat any ill-formed code unit sequence as an error condition. (See conformance clause C10.) This guarantees that it will neither interpret nor emit an ill-formed code unit sequence. Any implementation of encoding form conversion must take this requirement into account, because an encoding form conversion implicitly involves a verification that the Unicode strings being converted do, in fact, contain well-formed code unit sequences."
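
In code, the D76 definition quoted above boils down to a range check on the code point. A minimal sketch follows; the function name is_unicode_scalar_value is hypothetical and is not claimed to exist in glibc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper for illustration, not part of glibc.
   Returns true if cp is a Unicode scalar value as defined by D76:
   0 .. 0xD7FF or 0xE000 .. 0x10FFFF, i.e. any code point that is
   not a surrogate (0xD800 .. 0xDFFF). */
bool is_unicode_scalar_value(uint32_t cp)
{
    return cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0x10FFFF);
}
```

A converter that followed D90 through D93 would run a check of this kind on every UTF-32 input unit, and on every code point it is about to emit, and would report an error when it fails.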

There is a further issue in utf-16.c when converting from UTF-16 to INTERNAL. If a uint16_t value is in the range 0xd800 .. 0xdfff, the next uint16_t value is checked to see whether it is in the range of a low surrogate (0xdc00 .. 0xdfff). Afterwards these two uint16_t values are interpreted as a high- and low-surrogate pair.

But there is no test whether the first uint16_t value is really in the range of a high surrogate (0xd800 .. 0xdbff). If there are two uint16_t values in the range of a low surrogate, they are treated as a valid high- and low-surrogate pair.

Should iconv() report the error "invalid multibyte sequence" in such a case?
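
For illustration, a decoder that includes the missing high-surrogate check could look like the following sketch. It is not the code from utf-16.c; the function name utf16_decode_one and its return-value convention are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration, not the code in glibc's utf-16.c.
   Decodes one code point from units[0 .. n-1] into *out.
   Returns the number of code units consumed (1 or 2),
   0 if the input ends in the middle of a surrogate pair (or is empty),
   or -1 if the sequence is ill-formed. */
int utf16_decode_one(const uint16_t *units, size_t n, uint32_t *out)
{
    if (n == 0)
        return 0;

    uint16_t u1 = units[0];
    if (u1 < 0xD800 || u1 > 0xDFFF) {
        *out = u1;                      /* not a surrogate at all */
        return 1;
    }
    if (u1 >= 0xDC00)
        return -1;                      /* low surrogate in first position */
    if (n < 2)
        return 0;                       /* high surrogate, pair incomplete */

    uint16_t u2 = units[1];
    if (u2 < 0xDC00 || u2 > 0xDFFF)
        return -1;                      /* high surrogate not followed by low */

    *out = 0x10000 + (((uint32_t)(u1 - 0xD800) << 10)
                      | (uint32_t)(u2 - 0xDC00));
    return 2;
}
```

With this check in place, two consecutive values in the low-surrogate range are rejected rather than being combined into a code point.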

Bye
Stefan

Follow-Ups:
- Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
  - From: Florian Weimer
