This is the mail archive of the mailing list .


Re: Bad Continuation of Multi-Byte UTF-8 Sequence

First of all, apologies for my misunderstanding.  But in a way I'm glad 
because it let you expand your ideas to state:

> [...] I can see the day when the encoding will need to change
> within a single file.

I have such a file.  Its name is mbox.  It isn't XML, but the biggest problem
I can see with having a file in multiple encodings is not being able to grep
and/or edit it easily.  If such a day comes as you suggest, tools will have to
be revised to deal with it better.  (Yes, mail clients do deal with this
particular issue very well.)
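
As a rough sketch of why mail clients cope: each message declares its own
charset, so the reader decodes message by message.  The snippet below uses
Python's standard email package; the two sample messages, the helper names
make_message and read_body, and the choice of base64 transfer encoding are my
own assumptions for illustration, not anything taken from a real mbox.

```python
import base64
from email import message_from_bytes


def make_message(text: str, charset: str) -> bytes:
    """Build a minimal raw message whose body is stored in `charset`.

    Hypothetical helper: stands in for messages already sitting in an mbox.
    """
    body = base64.b64encode(text.encode(charset))
    header = (
        f"Content-Type: text/plain; charset={charset}\n"
        "Content-Transfer-Encoding: base64\n\n"
    ).encode("ascii")
    return header + body + b"\n"


def read_body(raw: bytes) -> str:
    """Decode a message body using the charset the message itself declares."""
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset()
    return msg.get_payload(decode=True).decode(charset)


# Same text, two different legacy encodings, as might coexist in one mbox.
messages = [make_message("日本語", "shift_jis"), make_message("日本語", "euc-jp")]
bodies = [read_body(m) for m in messages]
```

grep, by contrast, sees only the raw bytes and has no per-message charset to
go on, which is the problem described above.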

I've considered using DocBook for multiple languages, where one document 
contains various languages.  I can easily see this as being a case where 
multiple encodings would be necessary.  (No, I haven't gotten it to work as 
getting the right combination of tags with lang="xx" with what the DTD allows 
for children isn't easy.  SmartDoc was designed to handle this case better.)

Nonetheless, isn't this where Unicode comes in to save the day?  (I know
about the faults in Unicode, as some friends have to use more common glyphs
for their names when registering with Unicode-based software.)  If one has
the glyphs, typing in multiple languages (each normally with multiple
encodings) becomes possible in a single file.  I'm curious as to why you
would prefer to use multiple encodings in a single file over UTF-8.  Or am I
misinterpreting your statement again?

By the way, before I started to "get" Unicode, I also wanted to be able to
specify multiple encodings in a given file.  I don't like that some friends
have glyphless names, but when everything is converted to and processed in
Unicode anyway, why fight it on the input file side?
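
To make the point concrete, here is a small sketch in Python (my choice of
language; the sample text and the helper name can_encode are made up for
illustration).  A line mixing Japanese and Korean fits in UTF-8, while
Shift_JIS has no way to represent the Hangul.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample: Japanese and Korean on the same line.
mixed = "日本語 한국어"

utf8_ok = can_encode(mixed, "utf-8")
sjis_ok = can_encode(mixed, "shift_jis")

# UTF-8 also round-trips the text without loss.
round_trip = mixed.encode("utf-8").decode("utf-8")
```

Under these assumptions utf8_ok comes out true and sjis_ok false, so a single
legacy encoding forces a switch mid-file where UTF-8 does not.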

>> [Case of Shift_JIS encoded XML with EUC-JP encoded XSL(T) snipped]
> Fair judgement, with the case you state. I'm presuming that multiple
> encoding fragments will become a norm rather than an exception. I guess
> processors will gradually align as code becomes more available.

Actually, while it is reasonable to have, for example, a Japanese-based XSL
set for dealing with the DocBook DTD in one of the major encodings, it makes
more sense to have one encoding decided on for a given project, and use that
encoding throughout the project.  And I think that for projects developed in
environments with a single language and multiple possible encodings, deciding
on a single encoding to use is more the norm.

(That reminds me, I need to file a bug report to have a font
specification for the bullets.  TM, Circle-R, Circle-C, and a few others
cause errors using a Japanese Font-Family [the glyphs don't exist in them]
with FOP.  I'll try to do that today.)

Where the fragmentation is more likely to take place is in database storage.
Accessing multiple data sources may very well produce XML trees in different
encodings.  But there, too, a single encoding (UTF-8?) will most likely
become the standard.  (Gee, I say "most likely" a lot.  Am I that unsure? ;-)
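
A minimal sketch of that normalization, assuming each data source tells us
its encoding (the sample fragments and the helper name to_utf8 are
hypothetical): decode each fragment from its declared encoding, then
re-encode everything as UTF-8 before merging.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical fragments, as two different data sources might return them.
text = "日本語"
sjis_bytes = text.encode("shift_jis")
eucjp_bytes = text.encode("euc_jp")

# The raw bytes differ, but after normalization the fragments are identical.
normalized = [
    to_utf8(sjis_bytes, "shift_jis"),
    to_utf8(eucjp_bytes, "euc_jp"),
]
```

The catch is that the source encoding has to be known or declared; guessing
it from the bytes alone is where things usually go wrong.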

Thank you for the interesting ideas for a Monday morning to get the grey 
cells working.

Michael Westbay
Work: Beacon-IT
