This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Re: How to read the encoding of an XML document


At 14:18 25-10-2001, James Garriss wrote:
>I've been looking at a lot of European web pages, viewing source to see 
>what charset they define in the HTML META tag.  The majority use 
>iso-8859-1, but a few don't.  Most notably Turkey and Greece have 
>character sets that are quite different.  How do I determine if UTF-16 (or 
>UTF-8) will work for those languages?

Time for the primer again.

A character is an abstract notion, like "Latin capital letter A".

A character repertoire is a collection of characters - like "Latin 
upper-case letters".  Different languages require different character 
repertoires.

A character set is an ordered, numbered character repertoire.  ISO 8859-1 
is one such character set, assigning numbers 0-255 to 256 characters.  Its 
repertoire covers nearly all of the characters needed for western European 
languages like French, Spanish, German, and Italian, as well as English, 
Icelandic, Swedish, Norwegian, and Dutch.  There are other ISO 8859 
character sets that cover characters needed by other languages like 
Turkish, Polish, Greek, Russian, Hebrew, and Arabic.
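
To make the point about the different ISO 8859 sets concrete, here is a
small Python sketch. The helper name decode_in_each is hypothetical; the
codec names are the ones Python's standard library ships. It decodes one
byte value under three ISO 8859 parts and shows that the same number names
a different character in each.

```python
# Sketch only: decode_in_each is an illustrative helper, not a library call.
def decode_in_each(byte_value,
                   charsets=("iso-8859-1", "iso-8859-7", "iso-8859-9")):
    """Decode a single byte under each character set and return the results."""
    return {cs: bytes([byte_value]).decode(cs) for cs in charsets}

if __name__ == "__main__":
    for charset, ch in decode_in_each(0xF0).items():
        # Report each decoded character together with its Unicode number.
        print(f"{charset}: {ch!r} (U+{ord(ch):04X})")
    # iso-8859-1 gives U+00F0 (eth), iso-8859-7 gives U+03C0 (Greek pi),
    # iso-8859-9 gives U+011F (Turkish g with breve).
```

This is why a Greek or Turkish page labelled iso-8859-1 turns to garbage:
the bytes are fine, but they are being looked up in the wrong table.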

Unicode is also a character set.  It assigns numbers in the range 0 through 
0x10FFFF (a little over 1.1 million code points, most of them still 
unassigned) to a whole lot of characters.  Its repertoire includes the 
characters covered by the major national and international standards, 
including all of the ISO 8859 sets.

An encoding is a mapping between byte sequences and the numbers of a 
character set.  UTF-8 and UTF-16 are encodings of Unicode.  In a sense, 
ISO 8859-1 and its kin are also encodings of Unicode, but ones that cannot 
represent all of the characters.
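
A rough illustration of the difference between a character's number and
its encoded bytes, again in Python. The helper name encodable is
hypothetical; str.encode and UnicodeEncodeError are standard Python.

```python
# Sketch only: encodable is an illustrative helper.
def encodable(ch, encoding):
    """Return True if the encoding has a byte sequence for this character."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

if __name__ == "__main__":
    for ch in "A\u011f\u03c0":  # Latin A, Turkish g with breve, Greek pi
        print(f"U+{ord(ch):04X}",
              "utf-8:", ch.encode("utf-8").hex(),
              "utf-16-be:", ch.encode("utf-16-be").hex(),
              "fits iso-8859-1:", encodable(ch, "iso-8859-1"))
```

The number stays the same whichever encoding you pick; only the bytes on
the wire change.  ISO 8859-1 simply has no byte for the Turkish or Greek
letter, while UTF-8 and UTF-16 have bytes for all three.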

In short: Unless you are working in Klingon, Minbari, or Silvestri, Unicode 
covers the characters you need in its repertoire.  UTF-8 and UTF-16 are 
both capable of representing all of the characters in Unicode.  All XML 
parsers are required to read UTF-8 and UTF-16 data.
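
If you want to check this for Turkish and Greek yourself, something like
the following Python sketch would do.  The sample strings and the helper
names round_trips and parse_text are made up for illustration;
xml.etree.ElementTree is the parser bundled with Python, standing in here
for whatever XML parser you actually use.

```python
import xml.etree.ElementTree as ET

# Illustrative sample text, not taken from any real page.
SAMPLES = {
    "Turkish": "\u0131\u011f\u00fc\u015f\u00f6\u00e7 \u0130\u011e\u00dc\u015e\u00d6\u00c7",
    "Greek": "\u03b1\u03b2\u03b3\u03b4 \u0391\u0392\u0393\u0394",
}

def round_trips(text, encoding):
    """True if encoding and then decoding gives back the same text."""
    return text.encode(encoding).decode(encoding) == text

def parse_text(text, encoding):
    """Wrap text in a tiny XML document, encode it, and parse it back."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><p>{text}</p>'
    return ET.fromstring(doc.encode(encoding)).text

if __name__ == "__main__":
    for language, text in SAMPLES.items():
        for encoding in ("UTF-8", "UTF-16"):
            print(language, encoding,
                  "round-trips:", round_trips(text, encoding),
                  "parses:", parse_text(text, encoding) == text)
```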

Use them.  Know them.  Love them.

-Chris
-- 
Christopher R. Maden, Principal Consultant, HMM Consulting Int'l, Inc.
DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
<URL: http://www.hmmci.com/ > <URL: http://crism.maden.org/consulting/ >
PGP Fingerprint: BBA6 4085 DED0 E176 D6D4  5DFC AC52 F825 AFEC 58DA


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

