This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug libc/10093] New: iconv accepts UTF-8-encoded UTF-16 surrogates
- From: "aurelien at aurel32 dot net" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: 23 Apr 2009 21:40:09 -0000
- Subject: [Bug libc/10093] New: iconv accepts UTF-8-encoded UTF-16 surrogates
- Reply-to: sourceware-bugzilla at sourceware dot org
According to 'man utf-8':
| The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
| and 0xffff (UCS non-characters) should not appear in conforming UTF-8
| streams.
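To make the forbidden range concrete, here is a minimal sketch that decodes a three-byte UTF-8 sequence by hand and checks whether the result falls in 0xd800-0xdfff. The function name decode3 is hypothetical (it is not part of glibc or iconv), and the sketch assumes its three arguments already form a well-formed three-byte sequence.

```shell
# decode3 B1 B2 B3: decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx
# 10xxxxxx) and report whether the code point is a UTF-16 surrogate.
# Hypothetical helper for illustration only; no validation of the byte layout.
decode3() {
    b1=$(( $1 )) b2=$(( $2 )) b3=$(( $3 ))
    # Keep 4 payload bits of the lead byte and 6 of each continuation byte.
    cp=$(( ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F) ))
    if [ "$cp" -ge $((0xD800)) ] && [ "$cp" -le $((0xDFFF)) ]; then
        printf 'U+%04X surrogate\n' "$cp"
    else
        printf 'U+%04X ok\n' "$cp"
    fi
}

decode3 0xed 0xa0 0x88   # prints "U+D808 surrogate"
decode3 0xed 0xbd 0x85   # prints "U+DF45 surrogate"
decode3 0xe2 0x82 0xac   # prints "U+20AC ok"
```

The first two calls use the byte sequences from the reproduction in this report; the third is an ordinary character for comparison.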
This is confirmed by RFC2279:
| The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
| obtained from the above, in principle, by simply extending each
| UCS-2 character with two zero-valued octets. However, pairs of
| UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
| parlance), being actually UCS-4 characters transformed through
| UTF-16, need special treatment: the UTF-16 transformation must be
| undone, yielding a UCS-4 character that is then transformed as
| above.
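The "special treatment" described in the RFC can be sketched with shell arithmetic. This is only an illustration of undoing the UTF-16 transformation, not code taken from glibc; surrogates_to_ucs4 is a hypothetical name.

```shell
# surrogates_to_ucs4 HIGH LOW: combine a UTF-16 surrogate pair into the
# UCS-4 code point it stands for. Hypothetical helper for illustration only.
surrogates_to_ucs4() {
    hi=$(( $1 )) lo=$(( $2 ))
    # Only a high surrogate (D800-DBFF) followed by a low one (DC00-DFFF)
    # forms a valid pair.
    if [ "$hi" -lt $((0xD800)) ] || [ "$hi" -gt $((0xDBFF)) ] ||
       [ "$lo" -lt $((0xDC00)) ] || [ "$lo" -gt $((0xDFFF)) ]; then
        return 1
    fi
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    printf 'U+%X\n' $(( 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00) ))
}

surrogates_to_ucs4 0xd808 0xdf45   # prints "U+12345"
```

The pair used in this report therefore stands for U+12345, whose conforming UTF-8 encoding is the four-byte sequence f0 92 8d 85 rather than two three-byte sequences.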
However, the following code shows that iconv accepts such invalid
characters:
$ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
$ for e in UTF-8 UTF-16 UTF-32 UCS-4 ; do printf "$e\t" ; printf $s | iconv -f UTF-8 -t $e > /dev/null && printf 'OK\n' ; done
UTF-8 OK
UTF-16 iconv: illegal input sequence at position 0
UTF-32 iconv: illegal input sequence at position 0
UCS-4 OK
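For comparison, a byte-level check for such sequences can be sketched as follows. This is not how iconv works internally, and has_utf8_surrogate is a hypothetical helper. It relies on the fact that every UTF-8-encoded surrogate starts with byte 0xed followed by a byte in the range 0xa0-0xbf, and it assumes od prints lowercase hex. Octal escapes are used so that the example does not depend on printf understanding \x.

```shell
# has_utf8_surrogate: read bytes on stdin, succeed if an encoded UTF-16
# surrogate (ed a0..bf xx) is present. Hypothetical helper for illustration.
has_utf8_surrogate() {
    # od dumps every byte as two hex digits; tr joins the dump into a single
    # space-separated line so a match can span od's 16-byte rows.
    od -An -v -tx1 | tr -s ' \n' ' ' | grep -q 'ed [ab][0-9a-f]'
}

# Same bytes as in the reproduction above: ed a0 88 ed bd 85.
if printf '\355\240\210\355\275\205' | has_utf8_surrogate; then
    echo 'surrogate found'
fi
```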
--
Summary: iconv accepts UTF-8-encoded UTF-16 surrogates
Product: glibc
Version: unspecified
Status: NEW
Severity: normal
Priority: P2
Component: libc
AssignedTo: drepper at redhat dot com
ReportedBy: aurelien at aurel32 dot net
CC: glibc-bugs at sources dot redhat dot com
GCC build triplet: x86_64-unknown-linux-gnu
GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu
http://sourceware.org/bugzilla/show_bug.cgi?id=10093