
Re: i18n; wide characters; Guile


Jim Blandy <jimb@red-bean.com> writes:

> [...]

I agree that multibyte encodings, as well as the MULE encoding, are
the wrong way to go.

> 
> Thus, my current inclinations:
> - Use 16-bit characters in strings throughout.
> - Prescribe the use of Unicode throughout.
> - Provide functions to convert between Unicode character strings
>   and all other widely-used formats: UTF-8, UTF-7, Latin-1, and the
>   JIS variants, as well as anything else people would like to contribute.
> - Provide a separate "byte array" type, for applications which
>   genuinely want this.

A few comments:

- The Unicode consortium wants everybody to think that UCS2 is the
right way.  But it's a pain in the same way as using a multibyte
encoding is.  It was obvious right from the beginning that 16 bits are
not enough.  It's similar to ASCII: Americans thought 7 bits were
enough, and now the users of alphabetic languages want to make us
believe 16 bits are enough.  The final admission is that Unicode 2.0
now contains an extension mechanism (surrogate pairs) which
effectively makes UCS2 a multibyte encoding.
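
To make this concrete: with surrogates, reading one character from a
UCS2 string again means consuming a variable number of units, which
is exactly the multibyte problem.  A rough sketch in C (the function
name is made up, and it assumes a well-formed buffer):

  #include <stddef.h>
  #include <stdint.h>

  /* Decode one character from a UCS2 string with Unicode 2.0
     surrogate pairs; the "fixed width" encoding needs
     variable-length decoding after all.  */
  static uint32_t
  ucs2_decode (const uint16_t *s, size_t *consumed)
  {
    if (s[0] >= 0xd800 && s[0] <= 0xdbff
        && s[1] >= 0xdc00 && s[1] <= 0xdfff)
      {
        *consumed = 2;
        return (0x10000
                + (((uint32_t) (s[0] - 0xd800) << 10)
                   | (uint32_t) (s[1] - 0xdc00)));
      }
    *consumed = 1;
    return s[0];
  }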

The answer can only be UCS4.  It's no surprise that all reasonable
i18n developers (this excludes those at IBM) use a 32-bit type for
wchar_t.

This may sound like a big waste of space, but if used correctly it
isn't.  Normally strings are not meant to contain whole text books;
they are rather short, so there is not that much redundancy.  If you
need to store large texts you can still fall back on a multibyte
encoding, and perhaps offer several of them so that the most efficient
one can be chosen.
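
For instance, the classic UCS4-to-UTF-8 transformation is only a few
lines.  A sketch (purely illustrative, not taken from recode; error
handling omitted, characters above U+1FFFFF not handled):

  #include <stddef.h>
  #include <stdint.h>

  /* Encode one UCS4 character as UTF-8 and return the number of
     bytes written (1 to 4 here).  Latin text costs about one byte
     per character, so large texts need not stay in 32-bit form.  */
  static size_t
  ucs4_to_utf8 (uint32_t c, unsigned char *buf)
  {
    if (c < 0x80)
      {
        buf[0] = c;
        return 1;
      }
    if (c < 0x800)
      {
        buf[0] = 0xc0 | (c >> 6);
        buf[1] = 0x80 | (c & 0x3f);
        return 2;
      }
    if (c < 0x10000)
      {
        buf[0] = 0xe0 | (c >> 12);
        buf[1] = 0x80 | ((c >> 6) & 0x3f);
        buf[2] = 0x80 | (c & 0x3f);
        return 3;
      }
    buf[0] = 0xf0 | (c >> 18);
    buf[1] = 0x80 | ((c >> 12) & 0x3f);
    buf[2] = 0x80 | ((c >> 6) & 0x3f);
    buf[3] = 0x80 | (c & 0x3f);
    return 4;
  }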

This is closely related to your conversion functions.  François
Pinard (and, in part, I) are currently extending GNU recode to work as
a library.  The result will be the new recode program, and I'll also
use it in GNU libc to implement iconv() and the wide character I/O
streams.  Since you have the same problem, you would be the next
client.
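
To give an idea of the interface, a conversion through iconv() would
look roughly like the sketch below.  The encoding names "UCS-4" and
"UTF-8" are assumptions; the set of supported names differs between
implementations.

  #include <iconv.h>
  #include <stddef.h>

  /* Sketch: convert a buffer of UCS4 characters to UTF-8.
     Returns the number of bytes produced, or (size_t) -1.  */
  static size_t
  ucs4_buf_to_utf8 (char *in, size_t inbytes,
                    char *out, size_t outbytes)
  {
    iconv_t cd = iconv_open ("UTF-8", "UCS-4");
    size_t outleft = outbytes;

    if (cd == (iconv_t) -1)
      return (size_t) -1;
    if (iconv (cd, &in, &inbytes, &out, &outleft) == (size_t) -1)
      {
        iconv_close (cd);
        return (size_t) -1;
      }
    iconv_close (cd);
    return outbytes - outleft;
  }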

The functions in the recode library include some to convert from UCS4
to, say, UTF-7 or KOI-8.  The former is efficient if the text consists
mainly of characters from the Latin alphabets (which are encoded
first); special encodings like KOI-8 can be used if the text is known
to contain only characters which can naturally be represented in that
charset.  By offering the user an interface to the recode library that
converts UCS4 strings to multibyte strings in one of the provided
encodings, you don't have to fear the memory consumption of UCS4.

And the recoding library will also be needed on systems not supporting
the wide character I/O streams from ISO C Amendment 1.  The port
implementation will have to be able to print UCS4 strings in whatever
external representation is currently wanted.
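
Where the wide streams do exist, the port code can be as simple as
this sketch (the function name is made up; it assumes wchar_t is a
32-bit type holding UCS4, which ISO C does not guarantee, and that
setlocale() has been called so the conversion follows the user's
locale):

  #include <stdio.h>
  #include <wchar.h>

  /* Write a UCS4 string through an Amendment 1 wide stream;
     fputwc() converts to the locale's external encoding.  On
     systems without these functions the recoding library has to
     supply the conversion instead.  */
  static int
  write_ucs4 (const wchar_t *s, size_t n, FILE *fp)
  {
    size_t i;

    for (i = 0; i < n; ++i)
      if (fputwc (s[i], fp) == WEOF)
        return -1;
    return 0;
  }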

-- Uli
---------------.      drepper@gnu.org     ,-.   Rubensstrasse 5
Ulrich Drepper  \    ,-------------------'   \  76149 Karlsruhe/Germany
Cygnus Solutions `--' drepper@cygnus.com      `------------------------