LC_MESSAGES implementation

Mon Feb 8 10:30:00 GMT 2010

Hi guys,

I have finally found a method to implement the locale-specific
LC_MESSAGES info.

Basically, I have two applications which generate the yesexpr, noexpr,
yesstr and nostr strings from foreign locale data.  The first application
is finished and generates the data from GLIBC locale data.  The second
application is almost finished and generates the data from the CLDR
project (http://cldr.unicode.org) locale data.

In both cases the data is generated as a header file containing a big
array like this:

struct lc_msg_t
{
  const char    *locale;
  const wchar_t *yesexpr;
  const wchar_t *noexpr;
  const wchar_t *yesstr;
  const wchar_t *nostr;
};

static struct lc_msg_t lc_msg[] =
{
  { "aa_DJ", L"\x005e\x005b\x006f\x004f\x0079\x0059\x005d\x002e\x002a", L"\x005e\x005b\x006d\x006e\x004d\x004e\x005d\x002e\x002a", L"", L"" },
  [...]
};

The subsequent code called from newlib's loadlocale() function fetches
the locale data from this array using bsearch() with the key being the
locale, and converts it into the correct charset.
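
To illustrate the lookup described above, here is a minimal sketch, not
the actual Cygwin code.  The names lc_msg_cmp, lc_msg_lookup and
lc_msg_to_mb are made up for this example, and so is the "xx_XX" entry.
The "aa_DJ" entry holds the same strings as in the array above, just
written as plain literals.  The sketch assumes the array is sorted by
locale name in strcmp() order, which bsearch() requires, and it uses
wcstombs() as a stand-in for the charset conversion, which converts
into the charset of the current locale only.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

struct lc_msg_t
{
  const char    *locale;
  const wchar_t *yesexpr;
  const wchar_t *noexpr;
  const wchar_t *yesstr;
  const wchar_t *nostr;
};

/* Must be sorted by locale name in strcmp() order for bsearch().
   "xx_XX" is a hypothetical entry, present only for illustration. */
static const struct lc_msg_t lc_msg[] =
{
  { "aa_DJ", L"^[oOyY].*", L"^[mnMN].*", L"", L"" },
  { "xx_XX", L"^[yY]", L"^[nN]", L"yes", L"no" },
};

/* bsearch() comparator: the key is a locale name, the element an entry. */
static int
lc_msg_cmp (const void *key, const void *elem)
{
  return strcmp ((const char *) key,
                 ((const struct lc_msg_t *) elem)->locale);
}

/* Return the entry for the given locale, or NULL if there is none. */
static const struct lc_msg_t *
lc_msg_lookup (const char *locale)
{
  return bsearch (locale, lc_msg, sizeof lc_msg / sizeof lc_msg[0],
                  sizeof lc_msg[0], lc_msg_cmp);
}

/* Convert a wide string into the charset of the current locale.
   Returns the number of bytes written, not counting the trailing NUL,
   or (size_t) -1 if the conversion fails or the buffer is too small. */
static size_t
lc_msg_to_mb (const wchar_t *ws, char *buf, size_t size)
{
  size_t n = wcstombs (buf, ws, size);

  if (n == (size_t) -1 || n >= size)
    return (size_t) -1;
  return n;
}
```

A caller would look up the entry first and fall back to the C/POSIX
defaults if lc_msg_lookup() returns NULL, then convert each of the four
strings with lc_msg_to_mb().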

Here are two questions:

- First of all, I'm not sure if I should use the GLIBC or the CLDR data.

  What speaks for GLIBC:

  - The GLIBC data contains the more relaxed and simpler yesexpr and
    noexpr strings.

  - Quite often the CLDR entries are just placeholders using the default
    C/POSIX strings.  This almost never happens in GLIBC.

  - The locale names match our locale names exactly, while CLDR uses the
    RFC 4646 strings just like Windows.  This simplifies generating the
    locale data, whereas the CLDR names require conversion.

  What speaks for CLDR:

  - The number of supported locales is bigger than in GLIBC, and they
    match more of the locales supported by Windows.

  - In GLIBC the (deprecated) yesstr and nostr strings are quite often
    not available at all.  This never happens in CLDR.

  Which one would you prefer?  Of course I could generate two arrays
  and mix the data, but that's not easy to automate.

- Second, in my current implementation the data is stored within Cygwin,
  as a big array of about 24K (in the GLIBC case, the CLDR case should
  be comparable).

  Since all of the other locale classes, LC_COLLATE, LC_CTYPE,
  LC_MONETARY, LC_NUMERIC, and LC_TIME, are implemented internally,
  mostly using data already available in Windows, I find it hard to
  justify implementing a file-based solution just for the single
  LC_MESSAGES case.  It just doesn't seem right, and 24K isn't *that*
  big, is it?

  So, here's the question:

  - Do you think it's ok to keep the data internally and regenerate it
    from time to time with a new Cygwin version when a new GLIBC version
    or a new CLDR version has been released?

  - Or, would you prefer a file-based solution using a precompiled
    single file containing the data, which could be memory-mapped into
    Cygwin when necessary?

  - Or, would you prefer a file-based solution using single locale-specific
    LC_MESSAGES files?

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat