This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
- From: Florian Weimer <fweimer at redhat dot com>
- To: Rich Felker <dalias at libc dot org>
- Cc: Stefan Liebler <stli at linux dot vnet dot ibm dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, "Carlos O'Donell" <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 3 Dec 2015 23:33:12 +0100
- Subject: Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
- References: <565EDF7C dot 9020808 at linux dot vnet dot ibm dot com> <565EE434 dot 4090205 at redhat dot com> <20151203214417 dot GZ3818 at brightrain dot aerifal dot cx>
On 12/03/2015 10:44 PM, Rich Felker wrote:
> The relevant term is "Unicode Scalar Values", and these are exactly
> the integers 0-0xd7ff and 0xe000-0x10ffff. UTFs assign a unique
> encoding (in terms of code units) to each Unicode Scalar Value,
> and are not defined for any other integers. Likewise, UCS (16 or 32)
> does not include values which are not Unicode Scalar Values.
The term Unicode Scalar Value did not exist when Unicode support was
added to glibc. For example, all the references I have readily at hand
(I can't find the 10646 CD right now) imply that UCS-4 in ISO/IEC
10646:2000 still had 31 bits and not the range restriction you gave.
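
To make the difference between the two definitions concrete, here is a
minimal sketch in C. The function names are hypothetical and are not
taken from glibc's iconv code; the ranges are the ones quoted above and
the 31-bit limit mentioned for the older UCS-4 definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Current definition: a Unicode Scalar Value is any integer in
   0..0xD7FF or 0xE000..0x10FFFF, so the UTF-16 surrogate range
   0xD800..0xDFFF is excluded.  (Hypothetical helper name.)  */
static bool
is_unicode_scalar_value (uint32_t c)
{
  return c <= 0xd7ff || (c >= 0xe000 && c <= 0x10ffff);
}

/* Historic definition as described above: UCS-4 as a 31-bit space
   with no surrogate or 0x10FFFF restriction.  (Hypothetical helper
   name.)  */
static bool
is_ucs4_31bit (uint32_t c)
{
  return c <= 0x7fffffff;
}
```

Under the first predicate 0xd800 is rejected; under the second it is
accepted, which is exactly the case where error reporting would differ.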
The question is what glibc should do: implement historic definitions,
preserving the meaning of charset names for backwards compatibility, or
tweak the implementations as the definitions evolve.
Florian