Re: Bad Continuation of Multi-Byte UTF-8 Sequence
At 01:56 PM 6/24/01, Michael Westbay wrote:
>To Walsh's comment:
>> >Encoding can be specified by this way for external parsed entities,
>> >version pseudoattribute is optional - moreover some XML processors are
>> >unable to process external entity if it contains version information in
>> >its declaration.
>> Surely this is a weakness in the XML spec then? I'm stuffed if I need
>> an external parsed entity in a different encoding?
>While the encoding is part of the specification, it's optional to support
>multiple encodings. Saxon, for example, only supports UTF-8, US-ASCII, and
>ISO-8859-1 (US-ASCII is a byte-for-byte subset of UTF-8; ISO-8859-1 is a
>subset of Unicode, though not of UTF-8 at the byte level).
Agreed, though i18n suggests the net is moving away from only speaking
Western encodings. I referred to just such a case as yours, i.e. taking
in file b with encoding X into a file a with encoding Y.
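
For what it's worth, here is a minimal sketch of how the error in the
subject line can arise when bytes written in one encoding are read as
another. Python is used purely for illustration (it is not the tooling
discussed in this thread), and the sample strings and the helper name
decodes_as_utf8 are my own assumptions. US-ASCII bytes are valid UTF-8
as they stand, whereas ISO-8859-1 bytes above 0x7F are not:

```python
# Illustration only: the sample text and helper name are assumptions,
# not taken from any processor mentioned in the thread.
text = "café"

ascii_bytes = "cafe".encode("ascii")       # every byte is below 0x80
latin1_bytes = text.encode("iso-8859-1")   # 'é' becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")          # 'é' becomes the two bytes 0xC3 0xA9


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# In UTF-8, 0xE9 is the lead byte of a three-byte sequence, so a decoder
# expects continuation bytes after it and rejects the ISO-8859-1 data.
print(decodes_as_utf8(ascii_bytes))
print(decodes_as_utf8(latin1_bytes))
print(decodes_as_utf8(utf8_bytes))
```

The same mismatch is what a parser runs into when an entity is saved in
one encoding but declared (or defaulted) as UTF-8.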
>You must not deal with languages that have multiple encodings.
Rather a sweeping statement? You state a reasonable use case below, using
individual files; I can see the day when the encoding will need to change within a
>The reason I
>prefer to use Xalan/Xerces over Saxon is this very issue: the Apache XML/XSL
>tools allow the encoding to be specified on a per-document basis. The loss
>in speed is made up for in versatility.
>What this function allows me to do is take a document produced by one
>engineer on a Windows box in Shift_JIS, then process it with an XSL(T) on my
>FreeBSD box that is encoded in EUC-JP. (For HTML, I often have the output
>encoding set in the XSL to be ISO-2022-JP.)
Fair judgement, with the case you state. I'm presuming that multiple
encoding fragments will become a norm rather than an exception. I guess
processors will gradually align as code becomes more available.
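
To make the per-document case concrete, a rough sketch of the
transcoding step itself, again in Python and again only as an
illustration: the sample text and the transcode helper are my own
assumptions, not anything Xalan, Xerces or Saxon expose. It takes bytes
in Shift_JIS and re-encodes the same text as EUC-JP and ISO-2022-JP:

```python
# Illustration only: sample text and names are assumptions.
sample = "日本語のテキスト"

# Bytes as they might arrive from a machine that saves in Shift_JIS.
shift_jis_bytes = sample.encode("shift_jis")


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


euc_jp_bytes = transcode(shift_jis_bytes, "shift_jis", "euc_jp")
iso2022_bytes = transcode(shift_jis_bytes, "shift_jis", "iso2022_jp")

# The text is the same in every case; only the byte representation differs.
print(euc_jp_bytes.decode("euc_jp") == sample)
print(iso2022_bytes.decode("iso2022_jp") == sample)
print(shift_jis_bytes != euc_jp_bytes)
```

The point is only that the source encoding has to be known per input;
guess it wrong and the decode step fails or silently produces garbage.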
>Where i18n and l10n is concerned, this is a strength in the XML spec, not a
I referred to the statement that Norm corrected.