codeset problems in wprintf and wcsftime

Wed Feb 24 22:10:00 GMT 2010

On 24/02/10 04:17 PM, Jeff Johnston wrote:
> On 24/02/10 04:17 AM, Corinna Vinschen wrote:
>> On Feb 23 16:14, Jeff Johnston wrote:
>>> On 20/02/10 10:59 AM, Corinna Vinschen wrote:
>>>> AFAICS, there are two possible approaches to fix this problem:
>>>>
>>>> - Store the charset not only for LC_CTYPE, but for each localization
>>>> category, and provide a function to request the charset.
>>>> This also requires storing the associated multibyte to widechar
>>>> conversion functions, obviously, and calling the correct functions
>>>> from wprintf and wcsftime.
>>>>
>>>> - Redefine the locale data structs so that they contain multibyte and
>>>> widechar representations of all strings. Use the multibyte strings
>>>> in the multibyte functions, the widechar strings in the widechar
>>>> functions.
>>>>
>>>
>>> This assumes that widechar representations from separate mbtowc
>>> converters can be concatenated and be decoded by a single wctomb
>>> converter. Without this ability, the concatenated widechar string
>>> derived is of no use to anybody unless they know where the charset
>>> changes occur.
>>>
>>> IMO, this is "undefined behaviour".
>>
>> I don't understand. The wide char representation is Unicode. Why
>> should it be a problem to use Unicode strings together, just because
>> they are from different sources? Even if wchar_t is UTF-16, as on
>> Cygwin, the strings are complete. There's no such thing as just one
>> half of a surrogate.
>>
>
> So, you are saying if I use the mbtowc for EUC-JP in current newlib and
> concatenate that to UTF-16 widechar output and add mbtowc output for
> SJIS, a user can simply call wctomb() in newlib and have it pull it all
> apart again? This obviously won't work for the old eucjp and sjis
> versions of mbtowc/wctomb that Cygwin doesn't currently use, but even
> so, I still see 3 versions of wctomb (utf8, iso, and cp) that apply to
> Cygwin inside wctomb_r. Am I missing something? How can one of these
> functions handle all types of wchar input?
>
> If one cannot take the concatenated string and pass it to a single
> internal version of the wctomb() function (i.e. the user has to call 3
> versions of wctomb for different charsets), then the user has to know
> where each section begins in the full string which makes the end-result
> of little use and thus not worth supporting.
>

Never mind.  Let me retract that.  I get it now.
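
A minimal sketch of why the concatenation works, assuming every converter produces Unicode code points as Corinna describes. The helpers below (latin1_to_ucs, utf8_to_ucs, ucs_to_utf8, demo_concat) are hypothetical and are not newlib's mbtowc/wctomb implementations. Code points are held in uint32_t so the example does not depend on the width of wchar_t, and only the BMP is handled, so surrogate pairs on a UTF-16 wchar_t are out of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1: every byte value is already the Unicode code point. */
static size_t latin1_to_ucs(const unsigned char *src, uint32_t *dst, size_t max)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (src[n] != 0 && n < max) {
        dst[n] = src[n];
        n++;
    }
    return n;
}

/* UTF-8 decoder for well-formed 1 to 3 byte sequences (BMP only). */
static size_t utf8_to_ucs(const unsigned char *src, uint32_t *dst, size_t max)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*src != 0 && n < max) {
        if (src[0] < 0x80) {
            dst[n++] = src[0];
            src += 1;
        } else if ((src[0] & 0xE0) == 0xC0) {
            dst[n++] = ((uint32_t)(src[0] & 0x1F) << 6) | (src[1] & 0x3F);
            src += 2;
        } else if ((src[0] & 0xF0) == 0xE0) {
            dst[n++] = ((uint32_t)(src[0] & 0x0F) << 12)
                     | ((uint32_t)(src[1] & 0x3F) << 6)
                     | (src[2] & 0x3F);
            src += 3;
        } else {
            break; /* 4-byte or malformed input: not covered by this sketch */
        }
    }
    return n;
}

/* One encoder for the whole string, whatever charset each part came from. */
static size_t ucs_to_utf8(const uint32_t *src, size_t len,
                          unsigned char *dst, size_t max)
{
    size_t o = 0;
    assert(src != NULL && dst != NULL && max > 0);
    for (size_t i = 0; i < len; i++) {
        uint32_t c = src[i];
        if (c < 0x80) {
            if (o + 1 >= max) break;
            dst[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            if (o + 2 >= max) break;
            dst[o++] = (unsigned char)(0xC0 | (c >> 6));
            dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            if (o + 3 >= max) break;
            dst[o++] = (unsigned char)(0xE0 | (c >> 12));
            dst[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[o] = 0;
    return o;
}

/* Decode a Latin-1 part and a UTF-8 part, concatenate, re-encode once. */
static size_t demo_concat(unsigned char *out, size_t max)
{
    static const unsigned char latin1_part[] = { 0xE9, 't', 0xE9, 0 }; /* "ete" with accents */
    static const unsigned char utf8_part[]   = { 0xE6, 0x97, 0xA5, 0 }; /* U+65E5 */
    uint32_t wide[16];
    size_t n = latin1_to_ucs(latin1_part, wide, 8);
    n += utf8_to_ucs(utf8_part, wide + n, 8);
    return ucs_to_utf8(wide, n, out, max);
}
```

Once both parts are Unicode, the encoder never needs to know where one source charset ended and the next began.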

>
>> The advantage of having the strings available in wchar_t representation
>> would be that the wcsftime and wprintf functions don't have to worry
>> about charsets at all. The current solution, in contrast, requires a
>> conversion from multibyte, which means you have to *know*
>> which source charset was being used when creating these strings. Right
>> now they only have information about one charset, which is the LC_CTYPE
>> charset.
>>
>> In Glibc, as well as on Windows, the localization strings are originally
>> stored in Unicode on disk, and Glibc stores the strings internally in
>> multibyte and wchar_t representation. When Cygwin fetches the strings
>> from Windows
>> it has to convert them to multibyte since there is no wchar_t slot for
>> the data, and following POSIX, it has to store them in the charset given
>> for the locale category, LC_TIME, LC_MESSAGES, etc.
>>

Under those circumstances, it seems a reasonable strategy for Cygwin,
regardless of the multiple-charset support.
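
A minimal sketch of what storing both representations might look like, assuming a simplified table with only the abbreviated day names. The struct time_names, the table c_time_names, and the accessors abday_mb and abday_wc are hypothetical and do not reflect newlib's actual locale structures.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical layout: each localized string is kept twice. */
struct time_names {
    const char    *mb_abday[7]; /* multibyte form, in the LC_TIME charset */
    const wchar_t *wc_abday[7]; /* wide form, already Unicode */
};

/* C locale data, where both forms are plain ASCII. */
static const struct time_names c_time_names = {
    { "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" },
    { L"Sun", L"Mon", L"Tue", L"Wed", L"Thu", L"Fri", L"Sat" }
};

/* What a multibyte function such as strftime would read. */
static const char *abday_mb(const struct time_names *t, int wday)
{
    assert(t != NULL && wday >= 0 && wday < 7);
    return t->mb_abday[wday];
}

/* What a wide function such as wcsftime would read: no conversion step,
   so the charset of the LC_TIME category never has to be consulted. */
static const wchar_t *abday_wc(const struct time_names *t, int wday)
{
    assert(t != NULL && wday >= 0 && wday < 7);
    return t->wc_abday[wday];
}
```

The cost is roughly doubled storage for locale strings; the benefit is that the wide functions no longer depend on knowing the source charset.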

>>> I think one could optionally flag an error either in the setlocale
>>> routine or the wprintf routines themselves.
>>
>> Well, if the conversion doesn't work, vfwprintf just falls back to the
>> defaults for the C locale and switches off grouping. That's probably
>> the sanest thing to do.
>> If wcsftime fails to convert the format string it returns 0, which is
>> the defined error behaviour. In case of the new era and alt_digits
>> strings (http://sourceware.org/ml/newlib/2010/msg00153.html), it will
>> fail to store the era and alt_digits information and fall back to the
>> default behaviour: %EC -> %C, %EY -> %Y, %OH -> %H, etc.
>>
>> That's probably ok, given the POSIX.1-2008 quote given by Andy in
>> http://sourceware.org/ml/newlib/2010/msg00146.html
>> I just hoped we could do better.
>>
>>
>> Corinna
>>
>
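
The wcsftime fallback described above (%EC -> %C, %EY -> %Y, %OH -> %H) can be illustrated with a small format rewriter. This is only a sketch under stated assumptions: strip_eo_modifiers is a hypothetical helper, not the code in newlib, it drops the E and O modifiers unconditionally rather than only when the era or alt_digits data is missing, and it does not handle flags or field widths.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Copy fmt to out, turning %E<c> and %O<c> into %<c>.
   A literal "%%" is copied unchanged. Returns the output length. */
static size_t strip_eo_modifiers(const wchar_t *fmt, wchar_t *out, size_t max)
{
    size_t o = 0;
    assert(fmt != NULL && out != NULL && max > 0);
    /* Keep room for up to two characters plus the terminator. */
    while (*fmt != L'\0' && o + 2 < max) {
        if (fmt[0] == L'%' && fmt[1] == L'%') {
            out[o++] = *fmt++;
            out[o++] = *fmt++;
        } else if (fmt[0] == L'%' &&
                   (fmt[1] == L'E' || fmt[1] == L'O') &&
                   fmt[2] != L'\0') {
            out[o++] = L'%';
            out[o++] = fmt[2]; /* keep the conversion, drop the modifier */
            fmt += 3;
        } else {
            out[o++] = *fmt++;
        }
    }
    out[o] = L'\0';
    return o;
}
```

With this helper, a format of L"%EC %EY %OH" becomes L"%C %Y %H", which is the default behaviour the message refers to.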