This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list.
Re: Bad Continuation of Multi-Byte UTF-8 Sequence
To Walsh's comment:
> >Encoding can be specified by this way for external parsed entities,
> >version pseudoattribute is optional - moreover some XML processors are
> >unable to process external entity if it contains version information in
> >its declaration.
Pawson-san wrote:
> Surely this is a weakness in the XML spec then? I'm stuffed if I need
> an external parsed entity in a different encoding?
While the encoding is part of the specification, it's optional to support
multiple encodings. Saxon, for example, only supports UTF-8, US-ASCII, and
ISO-8859-1 (whose character repertoires are all subsets of Unicode, although
only US-ASCII is byte-for-byte compatible with UTF-8).
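
To illustrate how the error in this thread's subject line can come about,
here is a minimal Python sketch (not anything from Saxon or Xerces). It
assumes a short sample string and a hypothetical helper, decodes_as_utf8,
and checks whether bytes written in one encoding can be read back as UTF-8.

```python
# Illustration only: the sample text and helper name are made up.
text = "caf\u00e9 au lait"  # contains one non-ASCII character (e with acute)

ascii_bytes = "cafe au lait".encode("ascii")
latin1_bytes = text.encode("iso-8859-1")  # the accented letter is 1 byte
utf8_bytes = text.encode("utf-8")         # the accented letter is 2 bytes


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(ascii_bytes))   # pure ASCII is also valid UTF-8
print(decodes_as_utf8(latin1_bytes))  # Latin-1 high bytes are not
print(decodes_as_utf8(utf8_bytes))
```

A parser that assumes UTF-8 when the file is really ISO-8859-1 hits the same
kind of failure, which is why the encoding declaration matters.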
You must not deal with languages that have multiple encodings. The reason I
prefer to use Xalan/Xerces over Saxon is this very issue: the Apache XML/XSL
tools allow the encoding to be specified on a per-document basis. The loss
in speed is made up for in versatility.
What this feature allows me to do is take a document that one engineer
produced in Shift_JIS on a Windows box, then process it with an XSL(T)
stylesheet on my FreeBSD box that is encoded in EUC-JP. (For HTML, I often
set the output encoding in the XSL to ISO-2022-JP.)
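
As a rough sketch of what is going on underneath, the Python below encodes
the same Japanese text in the three encodings mentioned above and converts
between them. It is not how Xalan/Xerces do it internally; the sample string
and the transcode helper are my own assumptions, and the codec names are the
ones in Python's standard library.

```python
# Illustration only: sample text and helper name are hypothetical.
text = "\u65e5\u672c\u8a9e"  # the word "Nihongo" (Japanese) in kanji

sjis = text.encode("shift_jis")    # typical on Windows
eucjp = text.encode("euc_jp")      # typical on Unix systems such as FreeBSD
jis = text.encode("iso2022_jp")    # 7-bit encoding, common for mail and HTML


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the source encoding, re-encode in the target."""
    return data.decode(src).encode(dst)


# Same characters, three different byte sequences.
for name, data in (("Shift_JIS", sjis), ("EUC-JP", eucjp), ("ISO-2022-JP", jis)):
    print(name, data.hex())

# Converting the Windows document to the Unix encoding loses nothing.
print(transcode(sjis, "shift_jis", "euc_jp") == eucjp)
```

A processor that honours each document's declared encoding does the
equivalent of the decode step on input and the encode step on output.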
I was recently told (but didn't confirm) that Danish has a number of
different encodings as well depending on platform.
Where i18n and l10n are concerned, this is a strength in the XML spec, not a
weakness.
--
Michael Westbay
Work: Beacon-IT http://www.beacon-it.co.jp/
Home: http://www.seaple.icc.ne.jp/~westbay
Commentary: http://www.japanesebaseball.com/