This is the mail archive of the cygwin mailing list for the Cygwin project.



OT: RE: filesystem encoding


> Hmm...interesting.  Not entirely sure what the implications 
> of what you are saying are (as I don't really understand codepages).
> 
> Does a codepage represent a character with 16 bits? or 8?  
> Could you recommend a book or a URL on the subject?  Maybe I 
> should look at this when I have more time (I'm in the middle 
> of a move).

A "codepage" isn't a Unicode thing; it's a horrific hack that was, and still
is, used to allow a computer to "speak" almost all of the world's languages
(the ones that aren't made of thousands of pictographs, anyway).

A codepage is essentially a mapping of 7- or 8-bit numbers to the glyphs of
a particular language.  So for example, Russian might have a codepage that
says the number 0x01 is the backwards-"R" letter, 0x02 is the "X" with a
vertical line through it, etc.  So a guy in Russia sets up his computer
to use this codepage, and he gets his Cyrillic characters popping up when he
types, and everything is great, right?
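
To make the "mapping" idea concrete, here is a minimal sketch in Python,
using its built-in "cp1251" codec as a stand-in for one real Russian
codepage.  The helper name glyph_for is made up for illustration.  Note that
the 0x01 and 0x02 above are made-up numbers; in cp1251 these two letters
live in the upper half of the 8-bit range.

```python
def glyph_for(byte_value: int, codepage: str) -> str:
    """Look up which character `codepage` assigns to one 8-bit number."""
    return bytes([byte_value]).decode(codepage)

# In cp1251 (Windows Cyrillic), 0xDF is the backwards-"R" and 0xC6 is the
# "X" with a vertical line through it.
backwards_r = glyph_for(0xDF, "cp1251")
x_with_line = glyph_for(0xC6, "cp1251")
print(backwards_r, x_with_line)

# The bottom half (0x00-0x7F) is plain ASCII in this codepage, which is why
# English text survives a codepage mix-up and everything else does not.
print(glyph_for(0x41, "cp1251"))
```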

Wrong:

- Ever get an "ASCII" text email or file that had some goofy graphic
characters in it that clearly weren't what the other guy had typed?  You're
not using the same codepage as the guy who wrote the text.  His codepage has
a "starting quote" character at the same number where yours has a goofy
graphics character.
- Some languages have more than one codepage.  Russian IIRC has like five or
six.  The mappings may or may not be related to each other in any way.  So
even if you speak the same language as the guy you're sending a text file
to, it may be completely unintelligible to him.
- And heaven help you if you're an American and need to look at a Russian
text file.  Which ASCII character is "backwards R" going to map to?  Let me
field that one: trick question.  ASCII has no such character, so it'll map
to some control character or whatever your own codepage has at that number.
If you're lucky it'll be rendered by your text editor as "?"; if you're
not....
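
The three failure modes above can be sketched in a few lines of Python.
This is only an illustration under some assumptions: the codec names are
Python's own, the helper decode_with_each is hypothetical, and the sample
word and the choice of codepages are mine, not anything from a real mail.

```python
def decode_with_each(raw: bytes, codepages: list[str]) -> dict[str, str]:
    """Decode the same bytes under several codepages, one result per name."""
    return {cp: raw.decode(cp, errors="replace") for cp in codepages}

# One Russian word, written by someone whose machine uses cp1251.
text = "Привет"
raw = text.encode("cp1251")

# Same bytes, different codepages: only the writer's own gives the word back.
views = decode_with_each(raw, ["cp1251", "koi8_r", "cp866", "latin_1"])
for codepage, rendering in views.items():
    print(f"{codepage:8} {rendering}")

# The "starting quote" case: byte 0x93 is a curly opening quote in cp1252,
# but some other character entirely in the old DOS codepage cp437.
quote_byte = bytes([0x93])
as_cp1252 = quote_byte.decode("cp1252")
as_cp437 = quote_byte.decode("cp437")

# Strict 7-bit ASCII has nothing at these numbers at all, so a decoder can
# only substitute a replacement character for every byte.
as_ascii = raw.decode("ascii", errors="replace")
print(as_ascii)
```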

Email tries to get around these problems by having a header (the "charset"
parameter of the MIME Content-Type header) telling you what codepage the
email was composed in, but if the mutt ML is any indication it seems to be
spottily implemented.  With your garden variety text file, you're just SOL.
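
For what it's worth, here is roughly what honoring that header looks like,
sketched with Python's standard email package.  The message itself, the
example.invalid address, and the choice of koi8-r are invented for the
example; falling back to ASCII when no charset is declared is my assumption
about a sensible default, not something a mail client is guaranteed to do.

```python
from email import message_from_bytes

# A made-up message whose body is written in the KOI8-R codepage.
body = "Привет".encode("koi8-r")
raw_message = (
    b"From: sender@example.invalid\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
) + body

msg = message_from_bytes(raw_message)

# The header is the only thing that says how to read the body bytes.
charset = msg.get_content_charset()      # None if the sender left it out
payload = msg.get_payload(decode=True)   # raw body bytes, still undecoded
decoded = payload.decode(charset or "ascii", errors="replace")
print(charset, decoded)
```

A plain text file has no equivalent of that Content-Type line, so there is
nothing for the reader's software to look up.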

Welcome to the 21st century, where computers can't even unambiguously
represent written text.

-- 
Gary R. Van Sickle



