codeset problems in wprintf and wcsftime
Jeff Johnston
jjohnstn@redhat.com
Tue Feb 23 21:14:00 GMT 2010
On 20/02/10 10:59 AM, Corinna Vinschen wrote:
> Hi,
>
> while working on finalizing locale support for Cygwin it suddenly
> occurred to me that we have a problem in wprintf and wcsftime.
>
> Let's assume a funny combination of localization variables in the user's
> environment:
>
> LANG=de_DE.utf8
> LC_TIME=ja_JP.eucjp
> LC_NUMERIC=en_US.iso88591
>
> Yes, it's pretty unlikely, but nevertheless possible and valid.
>
> So, at setlocale time we read and store the localized strings in the
> codeset specified by the localization variable:
>
> - __locale_charset() returns UTF-8
> - __get_current_time_locale() returns data stored in EUC-JP
> - __get_current_numeric_locale() returns data stored in ISO-8859-1
> - localeconv() returns with decimal_point and thousands_sep stored in
>   ISO-8859-1, and all other strings from the LC_MONETARY category in
>   UTF-8.
> - nl_langinfo(): CODESET is UTF-8, strings from the LC_TIME category
>   are returned in EUC-JP, strings from LC_MESSAGES are returned in
>   UTF-8, and RADIXCHAR and THOUSEP are returned in ISO-8859-1.
>
> This is no problem at all as long as you call the multibyte variants
> printf and strftime: the user gets what she asked for, and who are we
> to ask the user for the reason behind this choice?
>
> However, it is a problem in the wprintf and wcsftime functions. The
> problem is that we have decimal_point, thousands_sep and all the LC_TIME
> variables stored in some arbitrary multibyte codeset. Since we need the
> widechar representation, wprintf and wcsftime have to convert the
> strings using some mbtowc function. But the mbtowc functions always
> assume the multibyte charset defined by __locale_charset().
>
> Consequently, the conversion results in invalid strings.
>
> AFAICS, there are two possible approaches to fix this problem:
>
> - Store the charset not only for LC_CTYPE, but for each localization
>   category, and provide a function to request the charset.
>   This also requires storing the associated multibyte to widechar
>   conversion functions, obviously, and calling the correct functions
>   from wprintf and wcsftime.
>
> - Redefine the locale data structs so that they contain multibyte and
>   widechar representations of all strings. Use the multibyte strings
>   in the multibyte functions, the widechar strings in the widechar
>   functions.
>
This assumes that widechar representations from separate mbtowc
converters can be concatenated and be decoded by a single wctomb
converter. Without this ability, the resulting concatenated widechar
string is of no use to anybody unless they know where the charset
changes occur.
IMO, this is "undefined behaviour".
I think one could optionally flag an error either in the setlocale
routine or the wprintf routines themselves.
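As a rough illustration of the setlocale-side check, here is a minimal
sketch. It is not newlib code: the helper names (codeset_of,
same_codeset, categories_share_codeset) are made up for this example,
and it compares the codeset suffix of the locale names textually rather
than consulting newlib's internal charset tables. A name without an
explicit codeset (e.g. "C") is treated as having an empty codeset, which
a real implementation would have to resolve to its default.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: return the codeset part of a locale name such as
   "ja_JP.eucjp" (the text after the '.'), or "" if none is given. */
static const char *codeset_of(const char *locale_name)
{
    const char *dot = strchr(locale_name, '.');
    return dot ? dot + 1 : "";
}

/* Skip '-' and '_' so that "UTF-8" and "utf8" compare equal. */
static const char *skip_separators(const char *s)
{
    while (*s == '-' || *s == '_')
        s++;
    return s;
}

/* Hypothetical helper: compare two codeset names, ignoring case and
   '-'/'_' separators, and stopping at an '@' modifier.
   Returns 1 if they name the same codeset, 0 otherwise. */
static int same_codeset(const char *a, const char *b)
{
    for (;;) {
        a = skip_separators(a);
        b = skip_separators(b);
        int end_a = (*a == '\0' || *a == '@');
        int end_b = (*b == '\0' || *b == '@');
        if (end_a || end_b)
            return end_a && end_b;
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
}

/* Hypothetical check: returns 1 if LC_TIME and LC_NUMERIC use the same
   codeset as LC_CTYPE, 0 if the wide-character functions would have to
   convert strings stored in a different codeset. */
static int categories_share_codeset(const char *lc_ctype,
                                    const char *lc_time,
                                    const char *lc_numeric)
{
    const char *ctype_codeset = codeset_of(lc_ctype);
    return same_codeset(ctype_codeset, codeset_of(lc_time))
        && same_codeset(ctype_codeset, codeset_of(lc_numeric));
}
```

With the environment from the example above,
categories_share_codeset("de_DE.utf8", "ja_JP.eucjp", "en_US.iso88591")
returns 0, which is the point at which setlocale or the wprintf routines
could report the error.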
-- Jeff J.