This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: CJK UTF-16 test
- To: xsl-list at lists dot mulberrytech dot com
- Subject: Re: [xsl] CJK UTF-16 test
- From: Mike Brown <mike at skew dot org>
- Date: Wed, 28 Mar 2001 21:35:34 -0700 (MST)
- Reply-To: xsl-list at lists dot mulberrytech dot com
Benjamin Franz wrote:
> XML does NOT support UTF-16 since UTF-16 includes the surrogates
Wow, strike that from the archives, because it's dead wrong.
XML is specified in terms of sequences of allowable ISO/IEC 10646-1
characters, not particular binary-encoded representations of those
characters.
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]
These are characters, not UTF-16 code units.
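To make that concrete, here is a minimal Python sketch of the Char production quoted above, applied to code points (abstract character numbers) rather than to anything encoded. The function name is_xml_char is hypothetical, chosen only for illustration.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point `cp` matches XML 1.0 production [2] Char.

    `cp` is a character number, not a UTF-16 code unit or a byte.
    (is_xml_char is an illustrative name, not part of any library.)
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

Under this check, 0x10000 is an allowed character and 0xD800 is not, which is exactly what the production says.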
In ISO/IEC 10646-1 and Unicode _there is no character_ at code point 0xD800.
And in a UTF-16 encoded document, the bit sequence that I would write in hex
as D800 (big endian) or 00D8 (little endian) is not a character. The *sequence*
D800 DC00 (big) represents character #x10000, which I write here using the
same notation as the EBNF excerpt you quoted from the XML spec.
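As a rough illustration of how one character above #xFFFF maps to two UTF-16 code units, the following Python sketch computes the surrogate pair by hand and compares it with what Python's built-in utf-16-be and utf-16-le codecs emit. The helper name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates.

    to_surrogate_pair is an illustrative helper, not a standard API.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF use surrogate pairs")
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)

# The same single character, serialized by Python's own codecs.
big_endian = chr(0x10000).encode("utf-16-be")
little_endian = chr(0x10000).encode("utf-16-le")

print(f"{high:04X} {low:04X}")           # code units of the pair
print(big_endian.hex(), little_endian.hex())
```

The pair comes out as D800 DC00. The big-endian bytes are d8 00 dc 00 and the little-endian bytes are 00 d8 00 dc: four bytes, two code units, one character.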
If you were to say that an XML document can contain a "character" #xD800 then
you would
a.) be in violation of the definition of a character, which comes
from ISO/IEC 10646-1 (which XML relies on), and
b.) have no way of representing that character in a UTF-16 encoded
document, because by definition, D800 in UTF-16 is the first half
of a surrogate pair, not a character...
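Point b.) can be observed in practice. This is a small sketch that assumes Python 3 and its default strict error handling, where the UTF-16 codec refuses to serialize a lone surrogate. The function name can_encode_utf16 is hypothetical.

```python
def can_encode_utf16(cp: int) -> bool:
    """Report whether code point `cp` can be serialized as UTF-16.

    Relies on Python 3's strict codec behavior; can_encode_utf16 is an
    illustrative name, not a library function.
    """
    try:
        chr(cp).encode("utf-16-be")
    except UnicodeEncodeError:
        # Python rejects unpaired surrogates: "surrogates not allowed".
        return False
    return True


lone_surrogate_ok = can_encode_utf16(0xD800)   # first half of a pair only
supplementary_ok = can_encode_utf16(0x10000)   # real character, needs a pair
```

Here lone_surrogate_ok is False and supplementary_ok is True. Other tools may be more lenient, but a conforming UTF-16 stream has no way to say "the character #xD800".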
- Mike
_____________________________________________________________________________
mike j. brown, software engineer at | xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA | personal: http://hyperreal.org/~mike/