This is the mail archive of the mailing list .

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: Bad Continuation of Multi-Byte UTF-8 Sequence

To Walsh's comment:

> >Encoding can be specified by this way for external parsed entities,
> >version pseudoattribute is optional - moreover some XML processors are
> >unable to process external entity if it contains version information in
> >its declaration.

Pawson-san wrote:

> Surely this is a weakness in the XML spec then? I'm stuffed if I need
> an external parsed entity in a different encoding?

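While the encoding declaration is part of the specification, supporting 
encodings beyond UTF-8 and UTF-16 is optional.  Saxon, for example, only 
supports UTF-8, US-ASCII, and ISO-8859-1 (US-ASCII is a byte-for-byte subset 
of UTF-8; ISO-8859-1 covers only the first 256 Unicode code points and 
encodes those above 127 differently from UTF-8).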
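
How these three encodings relate at the byte level can be checked with a 
short Python sketch.  It is only an illustration using Python's built-in 
codecs, not anything taken from Saxon; the helper name encode_all is made up 
for this example.

```python
# Encodings named in the message; the codec names are Python's own aliases.
ENCODINGS = ("us-ascii", "iso-8859-1", "utf-8")


def encode_all(text):
    """Encode text in each encoding; None where the text is not representable."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            # US-ASCII cannot represent characters above code point 127.
            result[name] = None
    return result


# Pure ASCII text yields identical bytes in all three encodings.
plain = encode_all("abc")

# U+00E9 (e with acute accent) is one byte in ISO-8859-1, two in UTF-8.
accented = encode_all("\u00e9")

for name in ENCODINGS:
    print(name, plain[name], accented[name])
```

Plain ASCII text comes out the same in every case, while the accented 
character is where the three encodings part ways.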

You evidently don't deal with languages that have multiple encodings.  This 
very issue is the reason I prefer Xalan/Xerces over Saxon: the Apache 
XML/XSL tools allow the encoding to be specified on a per-document basis.  
The loss in speed is made up for in versatility.
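
To illustrate what a per-document encoding declaration buys you, here is a 
minimal sketch.  It uses Python's standard xml.etree.ElementTree rather than 
Xerces (a stand-in chosen only to keep the example self-contained), and the 
helper make_doc and the sample element are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_doc(encoding):
    """Build the same small document, declared and serialized in `encoding`."""
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding
    body = "<word>caf\u00e9</word>"
    return (declaration + body).encode(encoding)


# Two byte streams that differ on disk but declare their own encoding.
latin1_doc = make_doc("ISO-8859-1")
utf8_doc = make_doc("UTF-8")

# The parser reads each declaration and decodes accordingly, so the
# application sees the same Unicode text either way.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_doc != utf8_doc, latin1_text == utf8_text)
```

The point is that the document itself carries the information the parser 
needs, so nothing has to be configured globally.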

What this feature allows me to do is take a document produced by one 
engineer on a Windows box in Shift_JIS, then process it on my FreeBSD box 
with an XSL(T) stylesheet that is itself encoded in EUC-JP.  (For HTML, I 
often have the output encoding set in the XSL to be ISO-2022-JP.)
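
The round trip described above boils down to decoding each input by its own 
declared encoding and re-encoding on output.  A rough sketch of that idea, 
using Python's built-in Japanese codecs instead of an XSLT processor (the 
transcode helper and the sample text are assumptions for illustration):

```python
def transcode(data, source, target):
    """Decode bytes from `source`, then re-encode the text as `target`."""
    return data.decode(source).encode(target)


# Sample text standing in for a document written on the Windows box.
text = "\u65e5\u672c\u8a9e"  # the word "Nihongo" (Japanese language)
sjis_bytes = text.encode("shift_jis")

# The same content re-encoded for the FreeBSD side and for HTML output.
euc_bytes = transcode(sjis_bytes, "shift_jis", "euc_jp")
jis_bytes = transcode(sjis_bytes, "shift_jis", "iso2022_jp")

# All three byte streams differ, yet each decodes back to the same text.
print(euc_bytes.decode("euc_jp") == text)
print(jis_bytes.decode("iso2022_jp") == text)
```

ISO-2022-JP output stays within 7 bits and switches character sets with 
escape sequences, which is one reason it is handy for mail and older HTML.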

I was recently told (but didn't confirm) that Danish has a number of 
different encodings as well, depending on platform.

Where i18n and l10n are concerned, this is a strength in the XML spec, not a 
weakness.

Michael Westbay
Work: Beacon-IT
