Re: Bad Continuation of Multi-Byte UTF-8 Sequence

At 01:56 PM 6/24/01, Michael Westbay wrote:
>To Walsh's comment:
>> >Encoding can be specified by this way for external parsed entities,
>> >version pseudoattribute is optional - moreover some XML processors are
>> >unable to process external entity if it contains version information in
>> >its declaration.
>Pawson-san wrote:
>> Surely this is a weakness in the XML spec then? I'm stuffed if I need
>> an external parsed entity in a different encoding?
>While the encoding is part of the specification, it's optional to support 
>multiple encodings.  Saxon, for example, only supports UTF-8, US-ASCII, and 
>ISO-8859-1 (all of whose characters are a subset of Unicode).

Agreed, though i18n suggests the net is moving away from speaking only
Western encodings. I was referring to just such a case as yours, i.e. taking
file b with encoding X into file a with encoding Y.
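
To make that failure mode concrete, here is a minimal Python sketch (my own
illustration, not taken from Saxon or any other processor). It assumes a
hypothetical entity body saved as ISO-8859-1 and a hypothetical helper,
decode_entity, that falls back to UTF-8 when no encoding is declared, much
as a processor does when an entity has no text declaration or byte order
mark. Reading the bytes as UTF-8 fails; reading them with the declared
encoding works.

```python
# Hypothetical content of an external parsed entity saved as ISO-8859-1.
entity_bytes = "café au lait".encode("iso-8859-1")


def decode_entity(data: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode entity bytes with the declared encoding (UTF-8 if none given)."""
    return data.decode(declared_encoding)


# No encoding declared: 0xE9 looks like the start of a multi-byte UTF-8
# sequence, but the byte after it is not a continuation byte.
try:
    decode_entity(entity_bytes)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Encoding declared, as <?xml encoding="ISO-8859-1"?> would do in the entity.
text = decode_entity(entity_bytes, "iso-8859-1")
```

Only the pure ASCII range is byte-for-byte identical between ISO-8859-1 and
UTF-8, so the declaration matters as soon as an accented character appears.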

>You must not deal with languages that have multiple encodings.

Rather a sweeping statement? You state a reasonable use case below, using
individual files, but I can see the day when the encoding will need to change
within a single file.

>  The reason I 
>prefer to use Xalan/Xerces over Saxon is this very issue: the Apache XML/XSL 
>tools allow the encoding to be specified on a per-document basis.  The loss 
>in speed is made up for in versatility.
>What this function allows me to do is take a document produced by one 
>engineer on a Windows box in Shift_JIS, then process it with an XSL(T) on my 
>FreeBSD box that is encoded in EUC-JP.  (For HTML, I often have the output 
>encoding set in the XSL to be ISO-2022-JP.)

Fair judgement, given the case you state. I'm presuming that fragments in
multiple encodings will become the norm rather than the exception. I guess
processors will gradually align as code becomes more available.
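
As a rough sketch of the round trip you describe, the Python below takes
text encoded as Shift_JIS, decodes it to Unicode, and re-encodes it as
EUC-JP and ISO-2022-JP. The sample string and variable names are
hypothetical, and this only shows the transcoding step, not how Xalan or
Xerces implement it.

```python
# Hypothetical sample text standing in for the engineer's document.
source_text = "日本語のテキスト"

# The document as it arrives from the Windows box, in Shift_JIS.
sjis_bytes = source_text.encode("shift_jis")

# A processor decodes the input to Unicode using the declared encoding...
decoded = sjis_bytes.decode("shift_jis")

# ...and serialises the result in whatever output encoding is requested.
eucjp_bytes = decoded.encode("euc_jp")
jis_bytes = decoded.encode("iso2022_jp")
```

The characters are the same in all three; only the bytes on disk differ,
which is why each file has to say which encoding it uses.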

>Where i18n and l10n is concerned, this is a strength in the XML spec, not a 

I referred to the statement that Norm corrected.

Regards, DaveP

