codeset problems in wprintf and wcsftime

Corinna Vinschen
Thu Feb 25 09:25:00 GMT 2010

On Feb 24 17:10, Jeff Johnston wrote:
> On 24/02/10 04:17 PM, Jeff Johnston wrote:
> >On 24/02/10 04:17 AM, Corinna Vinschen wrote:
> >>On Feb 23 16:14, Jeff Johnston wrote:
> >>>On 20/02/10 10:59 AM, Corinna Vinschen wrote:
> >>>>- Redefine the locale data structs so that they contain multibyte and
> >>>>widechar representations of all strings. Use the multibyte strings
> >>>>in the multibyte functions, the widechar strings in the widechar
> >>>>functions.
> >>>>
> >>>This assumes that widechar representations from separate mbtowc
> >>>converters can be concatenated and be decoded by a single wctomb
> >>>converter.
> >>
> >>I don't understand. The wide char representation is Unicode.
> >
> >So, you are saying if I use the mbtowc for EUC-JP in current newlib and
> >concatenate that to UTF-16 widechar output and add mbtowc output for
> >SJIS, a user can simply call wctomb() in newlib and have it pull it all
> >apart again? This obviously won't work for the old eucjp and sjis
> >versions of mbtowc/wctomb that Cygwin doesn't currently use, but even
> >so, I still see 3 versions of wctomb (utf8, iso, and cp) that apply to
> >Cygwin inside wctomb_r. Am I missing something? How can one of these
> >functions handle all types of wchar input?
> >
> >If one cannot take the concatenated string and pass it to a single
> >internal version of the wctomb() function (i.e. the user has to call 3
> >versions of wctomb for different charsets), then the user has to know
> >where each section begins in the full string which makes the end-result
> >of little use and thus not worth supporting.
> >
> Never mind.  Let me retract that.  I get it now.

Hang on.  I forgot that the wide char representation of EUC-JP, SJIS
and JIS is *not* Unicode on targets other than Cygwin.  UTF-8,
ISO-8859-x and CPxxx, plus all eastern charsets supported by Cygwin,
are converted to Unicode, so there it's no problem.

However, since the eastern codesets are not Unicode on other targets,
we can't change the code in a generic way which would just drop the
mbtowc conversion in vfwprintf/wcsftime.
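
To illustrate the mismatch, here is a rough sketch.  It assumes the
non-Unicode EUC-JP converter simply packs the lead and trail byte of a
two-byte character into one wide value; the real converter is stateful
and handles more cases.  The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a non-Unicode conversion of a two-byte
   EUC-JP character: lead and trail byte are packed into one value.
   Assumption for illustration only, not the actual newlib code.  */
static uint32_t
example_eucjp_pack (unsigned char lead, unsigned char trail)
{
  /* Two-byte EUC-JP characters use bytes in the range 0xA1..0xFE.  */
  assert (lead >= 0xA1 && lead <= 0xFE);
  assert (trail >= 0xA1 && trail <= 0xFE);
  return ((uint32_t) lead << 8) | trail;
}
```

HIRAGANA LETTER A is 0xA4 0xA2 in EUC-JP, so the packed value is 0xA4A2,
while its Unicode code point is U+3042.  Wide strings stored as Unicode
in the locale data would not match what such a converter produces.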

Back to the drawing board...

> Under those circumstances, it seems a reasonable strategy for Cygwin
> regardless of the multiple charset support.

Yeah, for Cygwin.  And there it is again, the unwelcome #ifdef __CYGWIN__.

We have two obvious choices.

- Convert the wide char representation of EUC-JP, SJIS, and JIS for
  all targets to Unicode.

- Add another flag __HAVE_WCHAR_LOCALE_INFO__ or something.  vfwprintf
  and wcsftime would then use different code on targets which only
  have the multibyte locale info and on targets which also have the
  wide char representation.
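
A minimal sketch of what the second choice could look like.  The struct,
its field names and the helper function are hypothetical, only the flag
name is the one suggested above.  With the flag defined, the wide string
is taken from the locale data directly; without it, the multibyte string
is converted as before (shown here with mbstowcs for simplicity).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical locale record; field names are illustrative only.  */
struct example_numeric
{
  const char *decimal_point;       /* multibyte form, always present */
#ifdef __HAVE_WCHAR_LOCALE_INFO__
  const wchar_t *wdecimal_point;   /* wide form, only on some targets */
#endif
};

/* Store the decimal point as a wide string in dst (room for n wchar_t).
   Returns the number of wide chars written, not counting the
   terminating null, or (size_t) -1 on error.  */
static size_t
example_wide_decimal_point (const struct example_numeric *num,
                            wchar_t *dst, size_t n)
{
  assert (num != NULL && dst != NULL && n > 0);
#ifdef __HAVE_WCHAR_LOCALE_INFO__
  /* Target carries the wide representation: just copy it.  */
  size_t len = wcslen (num->wdecimal_point);
  if (len >= n)
    return (size_t) -1;
  wmemcpy (dst, num->wdecimal_point, len + 1);
  return len;
#else
  /* Multibyte-only target: convert using the current locale.  */
  size_t len = mbstowcs (dst, num->decimal_point, n);
  if (len == (size_t) -1 || len >= n)
    return (size_t) -1;
  return len;
#endif
}
```

The caller doesn't see the difference, so the #ifdef stays confined to
the place where the locale strings are fetched.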

Converting the wide char representation is probably the most satisfying
solution in the long run.  The __wctomb and __mbtowc functions probably
won't be a lot bigger.  OTOH, the conversion function __jp2uc together
with the conversion tables in jp2uc.h result in a 17K object file on
i386 and very likely it's not much smaller on other targets.

Unfortunately it's easier said than done.  So I guess we should opt
for the second solution for now.

Nevertheless, using Unicode throughout would be a win in my eyes, so
we should keep that in mind.


Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
