codeset problems in wprintf and wcsftime
Sat Feb 20 15:59:00 GMT 2010
While working on finalizing locale support for Cygwin, it suddenly
occurred to me that we have a problem in wprintf and wcsftime.
Let's assume a funny combination of localization variables in the user's
Yes, it's pretty unlikely, but nevertheless possible and valid.
So, at setlocale time we read and store the localized strings in the
codeset specified by the localization variable:
- __locale_charset() returns UTF-8
- __get_current_time_locale() returns data stored in EUC-JP
- __get_current_numeric_locale() returns data stored in ISO-8859-1
- localeconv() returns decimal_point and thousands_sep stored in
ISO-8859-1, and all other strings, from the LC_MONETARY category,
in UTF-8.
- nl_langinfo() CODESET is UTF-8,
strings from the LC_TIME category are
returned in EUC-JP,
strings from LC_MESSAGES are returned
RADIXCHAR and THOUSEP are returned in
This is no problem at all as long as you call the multibyte variants,
printf and strftime. The user gets what she asked for, and who are we
to ask the user for the reason behind this choice?
However, it is a problem in the wprintf and wcsftime functions. The
problem is that we have decimal_point, thousands_sep and all the LC_TIME
variables stored in some arbitrary multibyte codeset. Since we need the
widechar representation, wprintf and wcsftime have to convert the
strings using some mbtowc function. But the mbtowc functions always
assume the multibyte charset defined by __locale_charset().
Consequently, the conversion results in invalid strings.
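To make the failure concrete, here is a minimal sketch. It assumes the thousands separator is NO-BREAK SPACE, stored as the single byte 0xA0 in ISO-8859-1. The two decoders are hypothetical stand-ins written for this example, not newlib's actual conversion functions, and the UTF-8 one is deliberately minimal (it does not reject every overlong form).

```c
#include <stddef.h>

/* Hypothetical stand-in for the LC_CTYPE conversion when
   __locale_charset() is UTF-8.  Decodes one sequence from s (at most
   n bytes) into *cp.  Returns the bytes consumed, or -1 if the input
   is not a well-formed UTF-8 sequence. */
int utf8_to_wc(unsigned long *cp, const unsigned char *s, size_t n)
{
    int len, i;
    unsigned long c;

    if (n == 0)
        return -1;
    if (s[0] < 0x80) {                       /* plain ASCII */
        *cp = s[0];
        return 1;
    }
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2; c = s[0] & 0x1F;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3; c = s[0] & 0x0F;
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4; c = s[0] & 0x07;
    } else {
        return -1;    /* stray continuation byte or invalid lead byte */
    }
    if (n < (size_t)len)
        return -1;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)           /* not a continuation byte */
            return -1;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical converter for ISO-8859-1: every byte maps to the
   code point with the same value. */
int latin1_to_wc(unsigned long *cp, const unsigned char *s, size_t n)
{
    if (n == 0)
        return -1;
    *cp = s[0];
    return 1;
}
```

Feeding the byte 0xA0 to utf8_to_wc returns -1, because a lone 0xA0 is a continuation byte in UTF-8, while latin1_to_wc returns 1 and yields U+00A0. That mismatch is what wprintf runs into when LC_NUMERIC and LC_CTYPE use different codesets.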
AFAICS, there are two possible approaches to fix this problem:
- Store the charset not only for LC_CTYPE, but for each localization
category, and provide a function to request the charset.
This also requires storing the associated multibyte to widechar
conversion functions, obviously, and calling the correct functions
from wprintf and wcsftime.
- Redefine the locale data structs so that they contain multibyte and
widechar representations of all strings. Use the multibyte strings
in the multibyte functions, the widechar strings in the widechar
functions.
Personally I'd prefer the second approach. The requirement to convert
the strings at runtime is rather unfortunate.
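A rough sketch of what the second approach could look like for the LC_NUMERIC strings. Everything here is an assumption made for illustration: the struct layout, the names numeric_locale_sketch, fill_wide and load_numeric_sketch, the fixed buffer size, and the converter signature are all hypothetical and do not reflect newlib's actual locale structs.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define WSTR_CAP 8   /* hypothetical fixed capacity, terminator included */

/* Hypothetical per-category converter: decodes one character from s
   (at most n bytes) into *wc; returns bytes consumed, or -1 on error. */
typedef int (*mb_to_wc_fn)(wchar_t *wc, const unsigned char *s, size_t n);

/* Hypothetical layout: every string is kept in both representations. */
struct numeric_locale_sketch {
    const char *decimal_point;
    const char *thousands_sep;
    wchar_t wdecimal_point[WSTR_CAP];
    wchar_t wthousands_sep[WSTR_CAP];
};

/* Converter for ISO-8859-1: each byte is its own code point. */
int iso8859_1_to_wc(wchar_t *wc, const unsigned char *s, size_t n)
{
    if (n == 0)
        return -1;
    *wc = (wchar_t)s[0];
    return 1;
}

/* Convert src into dst using the category's own converter.
   Returns 0 on success, -1 on a conversion error or overflow. */
int fill_wide(wchar_t *dst, size_t cap, const char *src, mb_to_wc_fn conv)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t left = strlen(src);
    size_t i = 0;

    if (cap == 0)
        return -1;
    while (left > 0) {
        int used;

        if (i + 1 >= cap)            /* keep room for the terminator */
            return -1;
        used = conv(&dst[i], s, left);
        if (used <= 0)
            return -1;
        s += used;
        left -= (size_t)used;
        i++;
    }
    dst[i] = L'\0';
    return 0;
}

/* Run once when the category is loaded, so that the widechar
   functions never have to convert at call time. */
int load_numeric_sketch(struct numeric_locale_sketch *loc,
                        const char *dp, const char *ts, mb_to_wc_fn conv)
{
    loc->decimal_point = dp;
    loc->thousands_sep = ts;
    if (fill_wide(loc->wdecimal_point, WSTR_CAP, dp, conv) != 0)
        return -1;
    return fill_wide(loc->wthousands_sep, WSTR_CAP, ts, conv);
}
```

With this shape, the multibyte functions would read decimal_point and the widechar functions would read wdecimal_point directly. The converter matching the category's codeset is needed only at load time, so it does not have to be stored alongside the data.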
What do you think?
Btw., would it be ok to add more possible arguments to the nl_langinfo()
function, for internal use only? This approach is used on BSD and
Linux, for instance, to access locale data for which no official POSIX
API exists. The groundwork already exists in langinfo.h, it's just not
used so far.
Cygwin Project Co-Leader