representing charsets

Andy Koppe
Tue Mar 30 11:49:00 GMT 2010

Corinna Vinschen:
> On Mar 29 19:19, Andy Koppe wrote:
>> Corinna Vinschen:
>> > Anyway, feel free to send a patch to change the charset name parameter
>> > to an array index parameter.
>> Attached is a newlib patch for that. The core of it is the removal of
>> the calls to __cp_index and __iso_8859_index from the singlebyte
>> charset conversion functions and adding a __charset_index global
>> variable, but quite a lot of function definitions and calls needed to
>> be changed accordingly to take an index argument instead of a string.
> Two problems:
> - The usage of the __charset_index variable should be changed to a call
>  to a function __locale_charset_idx (), analogous to the __locale_charset ()
>  function.  The reason is that, in the long run, we will implement
>  the _l family of functions plus the newlocale/uselocale stuff from
>  POSIX.1-2008.  The global information will be replaced by locale_t
>  structures, basically.  For that we need access wrapper functions which
>  allow using the right locale_t for the current thread.

Ah, I had wondered why those wrappers were there. Will do.
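
To make the wrapper idea concrete, here is a minimal sketch of such an
accessor. Everything in it is hypothetical: struct locale_sketch stands in
for a future locale_t, and the *_sketch names are made up for illustration
rather than taken from newlib. The point is only that callers ask a function
for the index, so the storage behind it can later move from a global into a
per-thread locale without touching the callers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a future locale_t; not newlib's layout. */
struct locale_sketch
{
  const char *charset;   /* charset name, e.g. "ISO-8859-2" */
  int charset_idx;       /* index into the conversion tables */
};

/* The process-wide locale that setlocale() would update. */
static struct locale_sketch global_locale = { "ASCII", 0 };

/* NULL means "use the global locale".  A real implementation would keep
   this pointer per thread; a plain static is used here for brevity. */
static struct locale_sketch *thread_locale = NULL;

/* Rough analogue of uselocale(): select a locale, or NULL for global. */
void
use_locale_sketch (struct locale_sketch *loc)
{
  thread_locale = loc;
}

/* Pick whichever locale is in effect for the caller. */
static const struct locale_sketch *
current_locale (void)
{
  return thread_locale != NULL ? thread_locale : &global_locale;
}

/* Accessor for the charset name. */
const char *
locale_charset_sketch (void)
{
  return current_locale ()->charset;
}

/* Accessor for the charset index. */
int
locale_charset_idx_sketch (void)
{
  return current_locale ()->charset_idx;
}
```

The conversion functions would then call locale_charset_idx_sketch ()
instead of reading a global variable directly.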

> - The __monetary_load_locale function and friends, as well as the
>  subsequently called Cygwin functions should still get the charset
>  name.  In a later incarnation(*) they will store the charset names
>  in the locale information.

I see. How shall I tackle these?

1) Pass the charset string as well as the index into these functions.
2) Go back to passing the string only, and introduce an 'int
__charset_index(const char *charset)' function that converts it to an
index where needed.

Of those two, I prefer the second for its cleaner API. It would be
slightly slower, but only in setlocale(), which isn't critical.
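
For option 2, the lookup could be as simple as a linear search over a name
table. This is a rough sketch only: the function name charset_index_sketch
and the four-entry table are invented for illustration, and the real tables
are larger and ordered differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name table.  The entries and their positions are
   illustrative and do not mirror newlib's conversion tables. */
static const char *const charset_names[] = {
  "ISO-8859-1",
  "ISO-8859-2",
  "CP437",
  "CP720",
};

/* Return the table index for a charset name, or -1 if it is unknown. */
int
charset_index_sketch (const char *charset)
{
  size_t i;

  assert (charset != NULL);
  for (i = 0; i < sizeof charset_names / sizeof charset_names[0]; i++)
    if (strcmp (charset, charset_names[i]) == 0)
      return (int) i;
  return -1;
}
```

Since this would run once per setlocale() call rather than once per
converted character, the linear search should not matter.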

But actually what I'd really like to do is this:

3) Represent charsets as enum constants (or #defines) rather than
strings throughout, with the singlebyte charsets ordered in such a way
that they correspond to their order in the conversion tables, along
these lines:

enum {
  CS_UTF8 = 0,

  /* ISO singlebyte codepages */
  CS_ISO8859_1 = 1,
  CS_ISO8859_2 = 2,
  CS_ISO8859_11 = 11,
  /* ISO-8859-12 doesn't exist */
  CS_ISO8859_13 = 12,
  CS_ISO8859_16 = 15,

  /* Windows singlebyte codepages */
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,

  /* Multibyte codepages */
  CS_SJIS = 200,
  CS_GBK = 201,
  /* ... */
};

Obviously, this would require quite a bit of additional work, but I do
think it would be cleaner and a bit more efficient than the current
model. Do you think this is worth pursuing (on the newlib list)?
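
To show what the ordering would buy: with constants numbered like that, the
table index falls out of a subtraction, with no string comparison. The sketch
below assumes the range bases 1, 100 and 200 from the numbering above. The
helper names iso_table_index and cp_table_index and the CS_*_BASE constants
are hypothetical, and the assumption that each table starts at the first
constant of its range is mine.

```c
/* A cut-down copy of the proposed numbering, for illustration only. */
enum charset_sketch
{
  CS_UTF8 = 0,
  CS_ISO8859_1 = 1,
  CS_ISO8859_2 = 2,
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_SJIS = 200
};

/* Assumed start of each range, taken from the numbering above. */
enum { CS_ISO_BASE = 1, CS_CP_BASE = 100, CS_MB_BASE = 200 };

/* Index into a hypothetical ISO conversion table, or -1 if cs is not
   an ISO singlebyte charset. */
int
iso_table_index (int cs)
{
  return (cs >= CS_ISO_BASE && cs < CS_CP_BASE) ? cs - CS_ISO_BASE : -1;
}

/* Index into a hypothetical Windows codepage conversion table, or -1
   if cs is not a Windows singlebyte charset. */
int
cp_table_index (int cs)
{
  return (cs >= CS_CP_BASE && cs < CS_MB_BASE) ? cs - CS_CP_BASE : -1;
}
```

The range checks also give a cheap way to tell singlebyte from multibyte
charsets.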

>> I was concerned I might forget to change a prototype or call
>> somewhere, but actually Cygwin does use all the functions in question,
>> I think.
> Yes, except for __jis_mbtowc/__jis_wctomb.

Good point, I'll pay extra attention to those.

