This is the mail archive of the
newlib@sources.redhat.com
mailing list for the newlib project.
Re: Adding mbstate_t, mbsinit(), mbrtowc(), mbrlen() etc.
- From: "KJK::Hyperion" <noog at libero dot it>
- To: egor duda <newlib at sources dot redhat dot com>
- Cc: newlib at sources dot redhat dot com
- Date: Fri, 23 Aug 2002 00:20:34 +0200
At 19.50 22/08/2002, egor duda wrote:
I'm preparing a patch to add restartable versions of multibyte conversion
functions to newlib. As long as all state information is already handled
by the *_r() versions, these functions are just simple wrappers around the
foo() or foo_r() functions, depending on MB_CAPABLE.
just a question: do you catch encoded nulls (for example the overlong
sequence 0xC0 0x80) in conversions from multibyte strings? They could open
security holes if unhandled. There's a recommendation about this in the
UTF-8 RFC.
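
For illustration, a minimal sketch of the kind of check meant here. The helper name utf8_is_overlong() is hypothetical and is not part of newlib; it only looks at the first two bytes of a sequence and flags the overlong forms, which include every multibyte spelling of NUL:

```c
#include <stddef.h>

/* Hypothetical helper, not a newlib function: returns 1 if the UTF-8
   sequence starting at s (n bytes available) is an overlong form. */
static int utf8_is_overlong(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    /* Lead bytes 0xC0 and 0xC1 can only encode values below 0x80;
       0xC0 0x80 is the classic two-byte "encoded null". */
    if (s[0] == 0xC0 || s[0] == 0xC1)
        return 1;
    /* Three-byte form: 0xE0 followed by 0x80..0x9F encodes a value
       below 0x800. */
    if (n >= 2 && s[0] == 0xE0 && (s[1] & 0xE0) == 0x80)
        return 1;
    /* Four-byte form: 0xF0 followed by 0x80..0x8F encodes a value
       below 0x10000. */
    if (n >= 2 && s[0] == 0xF0 && (s[1] & 0xF0) == 0x80)
        return 1;
    return 0;
}
```

A decoder could call something like this before accepting a sequence and fail with EILSEQ instead of producing a wide character.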
I have a couple of questions, though. First, SUSv2 states that multibyte
handling functions are declared in wchar.h, whereas newlib currently
declares them in stdlib.h. Should we create wchar.h and move all
appropriate stuff there?
if that's a problem, put these in a separate header, let's say mbstr.h, and
include it conditionally from stdlib.h and wchar.h:
/* stdlib.h */
#if !defined(_XOPEN_SOURCE) || (_XOPEN_SOURCE != 500 && _XOPEN_SOURCE != 600)
#include <mbstr.h>
#endif
/* wchar.h */
#if defined(_XOPEN_SOURCE) && (_XOPEN_SOURCE == 500 || _XOPEN_SOURCE == 600)
#include <mbstr.h>
#endif
See System Interfaces->Intro->The Compilation Environment in the SUSv2 for
more information on _XOPEN_SOURCE
But if in the future someone wants to add new encodings which require
more sophisticated state information (I don't know if such encodings
actually exist), we'll be forced to change the definition of mbstate_t, thus
breaking backward compatibility. GLIBC defines mbstate_t as struct { int;
union { wchar_t; char[4] }},
the width of wchar_t, AFAIK, isn't specified by the standard. Late adopters
of Unicode will use 4 bytes (UCS-4), while early adopters, like Microsoft,
are using 2 bytes (UCS-2 and, recently, UTF-16). The width should be
controlled by a macro, because a fixed choice could break a lot of software
(if you care about Windows, that is).
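
A rough sketch of how code could adapt to either width at compile time. The macro name WIDE_CHAR_IS_UCS4 is made up for this example, and it assumes WCHAR_MAX from <wchar.h> is usable in #if, as C99 requires:

```c
#include <wchar.h>   /* WCHAR_MAX */

/* WIDE_CHAR_IS_UCS4 is a hypothetical name, not an existing newlib macro.
   It is 1 when wchar_t can hold every Unicode code point (up to 0x10FFFF)
   and 0 when it is a 16-bit type, as on Windows. */
#if WCHAR_MAX >= 0x10FFFF
#define WIDE_CHAR_IS_UCS4 1
#else
#define WIDE_CHAR_IS_UCS4 0
#endif

/* Number of wchar_t units needed to store one code point: a 16-bit
   wchar_t needs a surrogate pair for anything above 0xFFFF. */
static int wide_units_for(unsigned long code_point)
{
    if (!WIDE_CHAR_IS_UCS4 && code_point > 0xFFFFUL)
        return 2;
    return 1;
}
```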
while Microsoft's C runtime defines it as int. Would 'int' be enough for
everything?
see above. Probably Microsoft uses the lower word for the character and the
upper word for the flags.
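
To make the two layouts concrete, here is a small sketch. All names are hypothetical: sketch_mbstate_t only mirrors the struct shape quoted above, and the packing helpers illustrate my guess about an int-sized state, not Microsoft's actual implementation:

```c
#include <stdint.h>
#include <wchar.h>

/* Struct-style state, following the shape quoted above for GLIBC.
   Field names are invented for this sketch. */
typedef struct {
    int count;            /* e.g. bytes still expected */
    union {
        wchar_t wch;      /* partially assembled character */
        char    wchb[4];  /* or the raw bytes seen so far */
    } value;
} sketch_mbstate_t;

/* Int-style state (a guess): 16-bit character in the lower word,
   16 bits of flags in the upper word. */
static uint32_t pack_state(uint16_t ch, uint16_t flags)
{
    return ((uint32_t)flags << 16) | ch;
}

static uint16_t state_char(uint32_t st)
{
    return (uint16_t)(st & 0xFFFFu);
}

static uint16_t state_flags(uint32_t st)
{
    return (uint16_t)(st >> 16);
}
```

The packed form only works while a pending character fits in 16 bits, which is why the width of wchar_t matters for the choice.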