This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Alias for ISO-10646-UCS-2 charset
- From: Rich Felker <dalias at aerifal dot cx>
- To: libc-alpha at sourceware dot org
- Date: Fri, 7 Dec 2012 15:05:17 -0500
- Subject: Re: [PATCH] Alias for ISO-10646-UCS-2 charset
- References: <50C0221C.9070406@redhat.com>
On Wed, Dec 05, 2012 at 09:42:04PM -0700, Jeff Law wrote:
>
> Certain embedded devices use the ISO-10646-UCS-2 charset; it is
> currently not possible for glibc's iconv to translate messages from
> those devices.
>
> The ISO-10646-UCS-2 charset is an older character set that was
> superseded by UTF-16 of the Unicode standard in July 1996.
>
> UCS-2 and UTF-16 are identical for purposes of data exchange. Both
> are 16 bit formats and have exactly the same code unit
> representation.
>
> UCS-2 does not support supplementary characters and doesn't
> interpret pairs of surrogate code points as characters.
>
> Given they are identical for data exchange, the easiest way to
> support this charset is to create an alias.

UCS-2 is not the same as UTF-16. When processing UCS-2, code units in
the surrogate range must be rejected as invalid code units.
Interpreting them in pairs as UTF-16 would break the property of
fixed-width character encoding and would allow invalid UCS-2 to
validate, possibly allowing corrupt transmissions.
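
To make the decoding-side point concrete, here is a minimal sketch of what
strict UCS-2 validation looks like. The function names are hypothetical and
are not taken from glibc; the input is assumed to be already split into
host-order 16-bit code units.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the code unit lies in the surrogate range U+D800..U+DFFF. */
static bool is_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

/* Hypothetical strict UCS-2 check: every 16-bit unit is exactly one
   character, so any surrogate, paired or not, makes the input invalid. */
static bool ucs2_is_valid(const uint16_t *units, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (is_surrogate(units[i]))
            return false;
    return true;
}
```

A UTF-16 decoder would accept a well-formed surrogate pair such as
0xD83D 0xDE00 as one character; the check above rejects it, which is the
difference an alias would erase.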

What's worse, for conversions in the other direction (to UCS-2),
characters that cannot be represented in UCS-2 would wrongly be
converted to pairs of surrogates. In a program that tries "oldest"
encodings first with the goal of being conservative in what it
transmits, this will lead to the incorrect conclusion that the data can
be encoded as UCS-2, and will result in malformed data being received
by the recipient (surrogates are not legal in UCS-2).
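
The encoding direction can be sketched the same way. The helper below is
hypothetical (not glibc code) and assumes the input is an array of Unicode
code points; it reports failure instead of emitting surrogates, so a caller
probing encodings from oldest to newest would correctly move on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict UCS-2 encoder. Returns the number of code units
   written to out, or -1 if any code point cannot be represented:
   anything above U+FFFF, or a surrogate code point. The caller must
   provide room for n units in out. */
static long ucs2_encode(const uint32_t *cps, size_t n, uint16_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1; /* a UTF-16 encoder would emit a pair for cp > 0xFFFF */
        out[i] = (uint16_t)cp;
    }
    return (long)n;
}
```

With an alias to UTF-16, the failure branch would never be taken for
supplementary characters, and the caller would wrongly conclude that the
text fits in UCS-2.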

Isn't UCS-2 already supported anyway, just without the ISO-10646
prefix on the name?

Rich