codeset problems in wprintf and wcsftime

Jeff Johnston jjohnstn@redhat.com
Tue Feb 23 21:14:00 GMT 2010


On 20/02/10 10:59 AM, Corinna Vinschen wrote:
> Hi,
>
> while working on finalizing locale support for Cygwin it suddenly
> occurred to me that we have a problem in wprintf and wcsftime.
>
> Let's assume a funny combination of localization variables in the user's
> environment:
>
>    LANG=de_DE.utf8
>    LC_TIME=ja_JP.eucjp
>    LC_NUMERIC=en_US.iso88591
>
> Yes, it's pretty unlikely, but nevertheless possible and valid.
>
> So, at setlocale time we read and store the localized strings in the
> codeset specified by the localization variable:
>
>    - __locale_charset()             returns UTF-8
>    - __get_current_time_locale()    returns data stored in EUC-JP
>    - __get_current_numeric_locale() returns data stored in ISO-8859-1
>    - localeconv()                   returns with decimal_point and
>                                     thousands_sep stored in ISO-8859-1,
>                                     and all other strings from the
>                                     LC_MONETARY category in UTF-8.
>    - nl_langinfo()                  CODESET is UTF-8,
>                                     strings from the LC_TIME category are
>                                     returned in EUC-JP,
>                                     strings from LC_MESSAGES are returned
>                                     in UTF-8,
>                                     RADIXCHAR and THOUSEP are returned in
>                                     ISO-8859-1.
>
> This is no problem at all as long as you call the multibyte variants
> printf and strftime: the user gets what she asked for, and who are we
> to ask the user for the reason behind this choice.
>
> However, it is a problem in the wprintf and wcsftime functions.  The
> problem is that we have decimal_point, thousands_sep and all the LC_TIME
> variables stored in some arbitrary multibyte codeset.  Since we need the
> widechar representation, wprintf and wcsftime have to convert the
> strings using some mbtowc function.  But the mbtowc functions always
> assume the multibyte charset defined by __locale_charset().
>
> Consequently the conversion results in invalid strings.
>
> AFAICS, there are two possible approaches to fix this problem:
>
> - Store the charset not only for LC_CTYPE, but for each localization
>    category, and provide a function to request the charset.
>    This also requires storing the associated multibyte to widechar
>    conversion functions, obviously, and calling the correct functions
>    from wprintf and wcsftime.
>
> - Redefine the locale data structs so that they contain multibyte and
>    widechar representations of all strings.  Use the multibyte strings
>    in the multibyte functions, the widechar strings in the widechar
>    functions.
>

This assumes that widechar representations from separate mbtowc
converters can be concatenated and then converted back by a single
wctomb converter.  Without this ability, the resulting concatenated
widechar string is of no use to anybody unless they know where the
charset changes occur.

IMO, this is "undefined behaviour".

I think one could optionally flag an error either in the setlocale 
routine or the wprintf routines themselves.
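
As a rough illustration of the setlocale-side check, here is a minimal
sketch.  It is not newlib code: codeset_of, same_codeset and
check_codesets are hypothetical helpers, and the sketch assumes the
codeset can be read from the locale name (the text after '.' and before
any '@' modifier).  It does not normalize aliases such as "utf8" versus
"UTF-8", and a name without a codeset (e.g. "C") yields an empty string.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the codeset part of a locale name such as
   "ja_JP.eucjp" or "de_DE.utf8@euro" into buf (buflen must be > 0).
   The result is "" when the name carries no codeset. */
static void
codeset_of (const char *locale_name, char *buf, size_t buflen)
{
  const char *dot = strchr (locale_name, '.');
  size_t n = 0;

  if (dot != NULL)
    {
      ++dot;
      n = strcspn (dot, "@");   /* stop before an "@modifier" */
      if (n >= buflen)
        n = buflen - 1;         /* truncate overly long names */
      memcpy (buf, dot, n);
    }
  buf[n] = '\0';
}

/* Case-insensitive string equality, so "UTF8" matches "utf8". */
static int
same_codeset (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      ++a;
      ++b;
    }
  return *a == *b;
}

/* Return 0 if every listed category locale uses the same codeset as
   the LC_CTYPE locale, -1 otherwise.  A setlocale implementation
   could use a check like this to flag the mixed-codeset case. */
static int
check_codesets (const char *ctype_locale,
                const char *const *other_locales, size_t count)
{
  char ctype_cs[32];
  char cs[32];
  size_t i;

  codeset_of (ctype_locale, ctype_cs, sizeof ctype_cs);
  for (i = 0; i < count; ++i)
    {
      codeset_of (other_locales[i], cs, sizeof cs);
      if (!same_codeset (ctype_cs, cs))
        return -1;
    }
  return 0;
}
```

With the environment from the quoted message, calling check_codesets
with "de_DE.utf8" as the LC_CTYPE locale and the list { "ja_JP.eucjp",
"en_US.iso88591" } returns -1, which is the case where an error could
be flagged.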

-- Jeff J.