This is the mail archive of the
mailing list for the glibc project.
Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
- From: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
- To: libc-help at sourceware dot org
- Cc: Florian Weimer <fweimer at redhat dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, Carlos O'Donell <carlos at redhat dot com>
- Date: Wed, 2 Dec 2015 13:09:32 +0100
- Subject: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
When converting characters from UTF-16 to UTF-32 and the byte-sequence
contains a single low-UTF-16-surrogate (0xdc00 .. 0xdfff), then iconv()
reports an error "invalid multibyte sequence".
Due to this requirement, the s390 hardware-instructions for converting
from UTF-16 to UTF-8 / UTF-32 were disabled, because they do not report
an error in this case.
When converting from UTF-32 to UTF-8 / UTF-16, the s390
hardware-instructions do not report an error if a UTF-32 character is
in the range of a UTF-16 low surrogate (0xdc00 .. 0xdfff).
Should iconv() report the error "invalid multibyte sequence" in such cases?
If yes, then these two hardware instructions have to be disabled, too!
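
The behaviour described above can be probed from user code. The following is only a minimal sketch: probe_conversion and the two test buffers are hypothetical names made up for this illustration, and the charset names "UTF-16LE", "UTF-32LE" and "UTF-8" are assumed to be available in the installed iconv.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert LEN bytes from charset FROM to charset TO.
   Returns 0 on success, the errno value set by iconv() on failure, or -1
   if the conversion descriptor could not be opened.  */
static int
probe_conversion (const char *from, const char *to,
                  const char *bytes, size_t len)
{
  assert (from != NULL && to != NULL && bytes != NULL);

  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return -1;

  char outbuf[64];
  char *in = (char *) bytes;   /* iconv() does not modify the input.  */
  char *out = outbuf;
  size_t inleft = len;
  size_t outleft = sizeof outbuf;

  errno = 0;
  size_t rc = iconv (cd, &in, &inleft, &out, &outleft);
  /* Save errno before iconv_close() has a chance to change it.  */
  int err = (rc == (size_t) -1) ? errno : 0;

  iconv_close (cd);
  return err;
}

/* Lone low surrogate 0xdc00 followed by 'A', as UTF-16LE code units.  */
static const char lone_low_utf16le[] = { '\x00', '\xdc', '\x41', '\x00' };

/* The value 0x0000dc00 stored as a single UTF-32LE code unit.  */
static const char surrogate_utf32le[] = { '\x00', '\xdc', '\x00', '\x00' };
```

Calling probe_conversion ("UTF-16LE", "UTF-32LE", lone_low_utf16le, sizeof lone_low_utf16le) is expected to return EILSEQ, matching the "invalid multibyte sequence" error described above. What probe_conversion ("UTF-32LE", "UTF-8", surrogate_utf32le, sizeof surrogate_utf32le) returns is exactly the open question here, so no result is assumed for it.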
For comparison, the common code does not report an error for such a
low-surrogate character while converting from UTF-32 to INTERNAL and
from INTERNAL to UTF-8.
In the other direction, from UTF-8 to INTERNAL, characters in the range
of a UTF-16 surrogate are not accepted and iconv returns the error
"invalid multibyte sequence". The same happens when converting from
INTERNAL to UTF-32.
According to the comment
"/* Surrogate characters in UCS-4 input are not valid. We must catch
this. If we let surrogates pass through, attackers could make a
security hole exploit by generating "irregular UTF-32" sequences. */"
in utf-32.c, this is a security issue.
What is the reason for reporting an error in the direction from UTF-8 to
UTF-32, but not in the direction from UTF-32 to UTF-8?
Or is it a bug?
According to the latest Unicode Standard, an error should be reported
in such cases; see chapter 3.9, Unicode Encoding Forms:
"D76 Unicode scalar value:
Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values
consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
D84 Ill-formed: A Unicode code unit sequence that purports to be in a
Unicode encoding form is called ill-formed if and only if it does not
follow the specification of that Unicode encoding form.
• Any code unit sequence that would correspond to a code point outside
the defined range of Unicode scalar values would, for example, be
ill-formed.
UTF-32: D90: ...
• Because surrogate code points are not included in the set of Unicode
scalar values, UTF-32 code units in the range 0000D800₁₆..0000DFFF₁₆
are ill-formed.
UTF-16: D91: ...
• Because surrogate code points are not Unicode scalar values,
isolated UTF-16 code units in the range D800₁₆..DFFF₁₆ are ill-formed.
UTF-8: D92: ...
• Because surrogate code points are not Unicode scalar values, any
UTF-8 byte sequence that would otherwise map to code points
U+D800..U+DFFF is ill-formed.
Encoding Form Conversion: D93: ...
• A conformant encoding form conversion will treat any ill-formed code
unit sequence as an error condition. (See conformance clause C10.) This
guarantees that it will neither interpret nor emit an ill-formed code
unit sequence. Any implementation of encoding form conversion must take
this requirement into account, because an encoding form conversion
implicitly involves a verification that the Unicode strings being
converted do, in fact, contain well-formed code unit sequences."
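
As an illustration of D76 and the UTF-32 remark quoted above, the range check can be sketched in a few lines of C. This is not code from glibc; is_unicode_scalar_value and utf32_first_ill_formed are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* D76: a Unicode scalar value is any code point except the surrogates,
   i.e. 0 .. 0xd7ff or 0xe000 .. 0x10ffff.  */
static bool
is_unicode_scalar_value (uint32_t c)
{
  return c <= 0xd7ff || (c >= 0xe000 && c <= 0x10ffff);
}

/* Return the index of the first ill-formed UTF-32 code unit in
   UNITS[0 .. N-1], or N if every code unit is a Unicode scalar value.  */
static size_t
utf32_first_ill_formed (const uint32_t *units, size_t n)
{
  assert (units != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    if (!is_unicode_scalar_value (units[i]))
      return i;
  return n;
}
```

A converter that follows the quoted text would stop at the returned index and report an error instead of passing the code unit through.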
There is a further issue in utf-16.c when converting from UTF-16 to
INTERNAL. If a uint16_t value is in the range 0xd800 .. 0xdfff,
the next uint16_t value is checked to see whether it is in the range of
a low surrogate (0xdc00 .. 0xdfff). Afterwards these two uint16_t values
are interpreted as a high/low surrogate pair.
But there is no test whether the first uint16_t value is really in the
range of a high surrogate (0xd800 .. 0xdbff).
If there are two uint16_t values in the range of a low surrogate,
then they are treated as a valid high/low surrogate pair.
Should iconv() report the error "invalid multibyte sequence" in such a case?
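
For illustration, a decoder that also tests the first value against the high-surrogate range could look like the sketch below. It is not the code in utf-16.c; utf16_decode_one is a hypothetical helper, and it does not distinguish an incomplete input from an invalid one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from UNITS[0 .. N-1].
   Returns the number of code units consumed (1 or 2) and stores the
   code point in *OUT, or returns 0 if the input is empty, truncated
   or ill-formed.  */
static size_t
utf16_decode_one (const uint16_t *units, size_t n, uint32_t *out)
{
  assert (units != NULL && out != NULL);

  if (n == 0)
    return 0;

  uint16_t u1 = units[0];
  if (u1 < 0xd800 || u1 > 0xdfff)
    {
      /* Not a surrogate: the code unit is the code point.  */
      *out = u1;
      return 1;
    }

  /* The first code unit must be a high surrogate (0xd800 .. 0xdbff).
     A low surrogate in this position is ill-formed.  */
  if (u1 > 0xdbff)
    return 0;

  if (n < 2)
    return 0;

  /* The second code unit must be a low surrogate (0xdc00 .. 0xdfff).  */
  uint16_t u2 = units[1];
  if (u2 < 0xdc00 || u2 > 0xdfff)
    return 0;

  *out = 0x10000 + (((uint32_t) (u1 - 0xd800) << 10)
                    | (uint32_t) (u2 - 0xdc00));
  return 2;
}
```

With this check, the input 0xdc00 0xdc01 is rejected, while 0xd83d 0xde00 decodes to U+1F600.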