This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



Re: codeset problems in wprintf and wcsftime


On 20/02/10 10:59 AM, Corinna Vinschen wrote:
Hi,

while working on finalizing locale support for Cygwin it suddenly
occurred to me that we have a problem in wprintf and wcsftime.

Let's assume a funny combination of localization variables in the user's
environment:

   LANG=de_DE.utf8
   LC_TIME=ja_JP.eucjp
   LC_NUMERIC=en_US.iso88591

Yes, it's pretty unlikely, but nevertheless possible and valid.

So, at setlocale time we read and store the localized strings in the
codeset specified by the localization variable:

   - __locale_charset()             returns UTF-8
   - __get_current_time_locale()    returns data stored in EUC-JP
   - __get_current_numeric_locale() returns data stored in ISO-8859-1
   - localeconv()                   returns with decimal_point and
                                    thousands_sep stored in ISO-8859-1,
                                    and all other strings from the
                                    LC_MONETARY category in UTF-8.
   - nl_langinfo()                  CODESET is UTF-8,
                                    strings from the LC_TIME category are
                                    returned in EUC-JP,
                                    strings from LC_MESSAGES are returned
                                    in UTF-8,
                                    RADIXCHAR and THOUSEP are returned in
                                    ISO-8859-1.

This is no problem at all as long as you call the multibyte variants
printf and strftime: the user gets what she asked for, and who are we
to question the reason behind this choice?

However, it is a problem in the wprintf and wcsftime functions.  The
problem is that we have decimal_point, thousands_sep and all the LC_TIME
variables stored in some arbitrary multibyte codeset.  Since we need the
widechar representation, wprintf and wcsftime have to convert the
strings using some mbtowc function.  But the mbtowc functions always
assume the multibyte charset defined by __locale_charset().

Consequently, the conversion results in invalid strings.
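
To make the failure concrete, here is a minimal sketch in plain C. It does
not use newlib's real conversion routines; demo_utf8_mbtowc,
demo_latin1_mbtowc and demo_convert are hypothetical stand-ins written for
this illustration, and the UTF-8 decoder is deliberately simplified (1- to
3-byte sequences only, no overlong checks). The point is only that bytes
stored in one codeset are rejected or misread by a converter hard-wired to
another.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical converter signature: returns bytes consumed, or -1. */
typedef int (*demo_mbtowc_fn)(wchar_t *pwc, const unsigned char *s, size_t n);

/* Simplified UTF-8 decoder (assumption: BMP only, no overlong checks). */
int demo_utf8_mbtowc(wchar_t *pwc, const unsigned char *s, size_t n)
{
    int len, i;
    wchar_t wc;

    assert(pwc != NULL && s != NULL);
    if (n == 0)
        return -1;
    if (s[0] < 0x80) {                      /* plain ASCII */
        *pwc = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            /* 2-byte lead */
        len = 2; wc = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte lead */
        len = 3; wc = s[0] & 0x0F;
    } else {
        return -1;                          /* stray or unsupported byte */
    }
    if ((size_t)len > n)
        return -1;                          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;                      /* not a continuation byte */
        wc = (wc << 6) | (s[i] & 0x3F);
    }
    *pwc = wc;
    return len;
}

/* ISO-8859-1: every byte maps to the code point of the same value. */
int demo_latin1_mbtowc(wchar_t *pwc, const unsigned char *s, size_t n)
{
    assert(pwc != NULL && s != NULL);
    if (n == 0)
        return -1;
    *pwc = s[0];
    return 1;
}

/* Convert len bytes of src with conv; returns wide chars written or -1. */
int demo_convert(demo_mbtowc_fn conv, const unsigned char *src, size_t len,
                 wchar_t *dst, size_t dstlen)
{
    size_t out = 0;

    assert(conv != NULL && src != NULL && dst != NULL);
    while (len > 0) {
        int used;

        if (out >= dstlen)
            return -1;                      /* destination too small */
        used = conv(&dst[out], src, len);
        if (used <= 0)
            return -1;                      /* invalid in this codeset */
        src += used;
        len -= (size_t)used;
        out++;
    }
    return (int)out;
}
```

Feeding the ISO-8859-1 bytes 'M', 0xE4, 'r', 'z' through demo_convert with
demo_utf8_mbtowc fails at the second byte, because 0xE4 looks like a UTF-8
lead byte with no continuation bytes after it. The same bytes convert
cleanly with demo_latin1_mbtowc.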

AFAICS, there are two possible approaches to fix this problem:

- Store the charset not only for LC_CTYPE, but for each localization
   category, and provide a function to request the charset.
   This also requires storing the associated multibyte to widechar
   conversion functions, obviously, and calling the correct functions
   from wprintf and wcsftime.

- Redefine the locale data structs so that they contain multibyte and
   widechar representations of all strings.  Use the multibyte strings
   in the multibyte functions, the widechar strings in the widechar
   functions.
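
As a rough illustration of the second approach, the sketch below keeps a
multibyte and a widechar copy of each string side by side and fills the
widechar copy once, at load time, using the converter that belongs to the
category's own codeset. Everything here is hypothetical: struct
demo_lc_numeric, demo_conv_latin1, demo_fill_wide and demo_load_numeric
are names invented for this example and are not newlib's actual locale
structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define DEMO_WMAX 16    /* arbitrary buffer size for this sketch */

/* Hypothetical converter signature: returns bytes consumed, or -1. */
typedef int (*demo_conv_fn)(wchar_t *pwc, const unsigned char *s, size_t n);

/* Trivial ISO-8859-1 converter: byte value equals code point. */
int demo_conv_latin1(wchar_t *pwc, const unsigned char *s, size_t n)
{
    assert(pwc != NULL && s != NULL);
    if (n == 0)
        return -1;
    *pwc = s[0];
    return 1;
}

/* Hypothetical LC_NUMERIC data holding both representations. */
struct demo_lc_numeric {
    const char *decimal_point;              /* multibyte, category codeset */
    const char *thousands_sep;
    wchar_t wdecimal_point[DEMO_WMAX];      /* converted once at load time */
    wchar_t wthousands_sep[DEMO_WMAX];
};

/* Convert NUL-terminated src into dst using conv; 0 on success, -1 on error. */
int demo_fill_wide(wchar_t *dst, size_t dstlen, const char *src,
                   demo_conv_fn conv)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t remaining, out = 0;

    assert(dst != NULL && src != NULL && conv != NULL);
    if (dstlen == 0)
        return -1;
    remaining = strlen(src);
    while (remaining > 0) {
        int used;

        if (out + 1 >= dstlen)
            return -1;                      /* keep room for the terminator */
        used = conv(&dst[out], p, remaining);
        if (used <= 0)
            return -1;                      /* invalid in this codeset */
        p += used;
        remaining -= (size_t)used;
        out++;
    }
    dst[out] = L'\0';
    return 0;
}

/* Load both representations; conv must match the codeset of dp and ts. */
int demo_load_numeric(struct demo_lc_numeric *lc, const char *dp,
                      const char *ts, demo_conv_fn conv)
{
    assert(lc != NULL && dp != NULL && ts != NULL);
    lc->decimal_point = dp;
    lc->thousands_sep = ts;
    if (demo_fill_wide(lc->wdecimal_point, DEMO_WMAX, dp, conv) != 0)
        return -1;
    return demo_fill_wide(lc->wthousands_sep, DEMO_WMAX, ts, conv);
}
```

With a layout like this, a widechar function would read wdecimal_point
directly and never call an mbtowc function at format time, so the LC_CTYPE
codeset would no longer matter for these strings. The cost is extra
storage per category and a conversion step whenever the locale changes.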


This assumes that widechar representations produced by separate mbtowc converters can be concatenated and then decoded by a single wctomb converter. Without that property, the concatenated widechar string is of no use to anybody unless they know where the charset changes occur.


IMO, this is "undefined behaviour".

I think one could optionally flag an error either in the setlocale routine or the wprintf routines themselves.
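
One possible shape for such a check, as a sketch only: compare the codeset
part of each category's locale name and report a mismatch. The helpers
demo_codeset_of and demo_codesets_agree are hypothetical and not part of
newlib. The sketch assumes locale names of the form lang_TERRITORY.codeset
with an optional @modifier, compares codeset names byte for byte (so
"utf8" and "UTF-8" would count as different), and treats a name without a
codeset as a mismatch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the codeset part of a locale name into buf, e.g.
   "de_DE.utf8@euro" -> "utf8".  Returns 0 on success, -1 if the name
   has no codeset part or buf is too small. */
int demo_codeset_of(const char *locale, char *buf, size_t buflen)
{
    const char *dot, *start;
    size_t len;

    assert(locale != NULL && buf != NULL);
    dot = strchr(locale, '.');
    if (dot == NULL)
        return -1;                          /* no ".codeset" part */
    start = dot + 1;
    len = strcspn(start, "@");              /* stop before any @modifier */
    if (len == 0 || len >= buflen)
        return -1;
    memcpy(buf, start, len);
    buf[len] = '\0';
    return 0;
}

/* Returns 1 if every locale name carries the same codeset, 0 otherwise.
   Simplification: a name without a codeset counts as a mismatch. */
int demo_codesets_agree(const char *const *locales, size_t count)
{
    char first[32], other[32];
    size_t i;

    assert(locales != NULL || count == 0);
    if (count == 0)
        return 1;
    if (demo_codeset_of(locales[0], first, sizeof first) != 0)
        return 0;
    for (i = 1; i < count; i++) {
        if (demo_codeset_of(locales[i], other, sizeof other) != 0)
            return 0;
        if (strcmp(first, other) != 0)
            return 0;                       /* mixed codesets detected */
    }
    return 1;
}
```

For the environment in the example (de_DE.utf8, ja_JP.eucjp,
en_US.iso88591) demo_codesets_agree returns 0, which a setlocale
implementation or the wprintf family could turn into an error of its
choosing.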

-- Jeff J.

