codeset problems in wprintf and wcsftime

Andy Koppe andy.koppe@gmail.com
Sat Feb 20 16:31:00 GMT 2010


Corinna Vinschen:
> while working on finalizing locale support for Cygwin it suddenly
> occurred to me that we have a problem in wprintf and wcsftime.
>
> Let's assume a funny combination of localization variables in the user's
> environment:
>
>  LANG=de_DE.utf8
>  LC_TIME=ja_JP.eucjp
>  LC_NUMERIC=en_US.iso88591
>
> Yes, it's pretty unlikely, but nevertheless possible and valid.
>
> So, at setlocale time we read and store the localized strings in the
> codeset specified by the localization variable:
>
>  - __locale_charset()             returns UTF-8
>  - __get_current_time_locale()    returns data stored in EUC-JP
>  - __get_current_numeric_locale() returns data stored in ISO-8859-1
>  - localeconv()                   returns with decimal_point and
>                                   thousands_sep stored in ISO-8859-1,
>                                   and all other strings from the
>                                   LC_MONETARY category in UTF-8.
>  - nl_langinfo()                  CODESET is UTF-8,
>                                   strings from the LC_TIME category are
>                                   returned in EUC-JP,
>                                   strings from LC_MESSAGES are returned
>                                   in UTF-8
>                                   RADIXCHAR and THOUSEP are returned in
>                                   ISO-8859-1.
>
> This is no problem at all as long as you call the multibyte variations
> printf and strftime: the user gets what she asked for, and who are we
> to ask the user for the reason behind this choice.

Have you verified that the user does indeed get a mix of charsets when
doing this on glibc?
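
A quick way to check might be something like the sketch below. It prints
the codeset and the raw bytes of a few per-category strings, so that a
mix of encodings would show up in the hex dump. The helper names
hex_bytes and dump_locale_strings are made up for this example, and it
assumes the locales named in the environment are actually installed.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <time.h>

/* Write the bytes of s into out as space-separated hex; returns out.
   Stops early if out is too small for the next byte. */
char *hex_bytes(const char *s, char *out, size_t outsz)
{
    size_t pos = 0;

    if (outsz == 0)
        return out;
    out[0] = '\0';
    /* Each byte needs at most 3 characters (" xx") plus the NUL. */
    for (; *s != '\0' && pos + 4 <= outsz; s++)
        pos += (size_t)snprintf(out + pos, outsz - pos, "%s%02x",
                                pos ? " " : "", (unsigned char)*s);
    return out;
}

/* Adopt the locale from the environment (LANG, LC_TIME, LC_NUMERIC, ...)
   and print the raw bytes of strings from different categories.
   Returns -1 if setlocale() rejects the environment's settings. */
int dump_locale_strings(void)
{
    char buf[128], hex[512];
    time_t now = time(NULL);
    struct tm *tm = localtime(&now);

    if (setlocale(LC_ALL, "") == NULL)
        return -1;

    printf("CODESET:   %s\n", nl_langinfo(CODESET));
    printf("RADIXCHAR: %s\n",
           hex_bytes(nl_langinfo(RADIXCHAR), hex, sizeof hex));
    printf("DAY_1:     %s\n",
           hex_bytes(nl_langinfo(DAY_1), hex, sizeof hex));
    if (tm != NULL && strftime(buf, sizeof buf, "%A %B", tm) > 0)
        printf("strftime:  %s\n", hex_bytes(buf, hex, sizeof hex));
    return 0;
}
```

Calling dump_locale_strings() from main with the three variables from
your example set should show whether the DAY_1 and strftime bytes are
EUC-JP while CODESET reports UTF-8.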

I'm asking because another alternative to the solutions you outlined
might be to store those strings as wchar versions only, to be used
directly in wprintf and converted to the LC_CTYPE character set when
needed in printf. That way, the user would always get readable output.
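
Roughly what I have in mind, as a sketch only: struct wide_numeric and
wide_to_ctype are hypothetical names, not existing newlib interfaces,
and the conversion simply goes through the standard wcstombs, which
uses the charset of the current LC_CTYPE.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wide-only storage for the LC_NUMERIC strings. */
struct wide_numeric {
    const wchar_t *decimal_point;
    const wchar_t *thousands_sep;
};

/* Convert a stored wide string to the current LC_CTYPE charset, as
   printf would need to. Returns the number of bytes written (excluding
   the NUL), or (size_t)-1 if the string cannot be represented or does
   not fit into out. */
size_t wide_to_ctype(const wchar_t *ws, char *out, size_t outsz)
{
    size_t n;

    if (outsz == 0)
        return (size_t)-1;
    n = wcstombs(out, ws, outsz);
    if (n == (size_t)-1 || n == outsz) {
        /* Unconvertible character, or no room left for the NUL. */
        out[0] = '\0';
        return (size_t)-1;
    }
    return n;
}
```

wprintf would use decimal_point and thousands_sep as they are, and only
printf would pay for the conversion.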


> - Store the charset not only for LC_CTYPE, but for each localization
>  category, and provide a function to request the charset.
>  This also requires storing the associated multibyte to widechar
>  conversion functions, obviously, and calling the correct functions
>  from wprintf and wcsftime.
>
> - Redefine the locale data structs so that they contain multibyte and
>  widechar representations of all strings.  Use the multibyte strings
>  in the multibyte functions, the widechar strings in the widechar
>  functions.
>
> Personally I'd prefer the second approach.

Agreed. Sounds like less overhead.
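
For illustration, the second approach could look something like the
following. This is only a sketch: struct dual_string, DUAL_MAX and
dual_string_init are invented for this mail, and mbstowcs stands in for
whatever converter belongs to the category's own charset at the time
the data is loaded.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define DUAL_MAX 32  /* arbitrary limit for this sketch */

/* Hypothetical: one locale string kept in both representations. */
struct dual_string {
    char    mb[DUAL_MAX];  /* used by printf, strftime */
    wchar_t wc[DUAL_MAX];  /* used by wprintf, wcsftime */
};

/* Fill both forms once, at load time. mbstowcs() decodes with the
   current LC_CTYPE charset, so this assumes that charset matches the
   one the string is stored in. Returns 0 on success, -1 on failure. */
int dual_string_init(struct dual_string *d, const char *mb)
{
    size_t n;

    if (strlen(mb) >= DUAL_MAX)
        return -1;
    strcpy(d->mb, mb);
    n = mbstowcs(d->wc, mb, DUAL_MAX);
    if (n == (size_t)-1 || n == DUAL_MAX)
        return -1;  /* invalid sequence or no room for the terminator */
    return 0;
}
```

After that, neither the multibyte nor the widechar functions need to
convert anything at call time.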

Andy


