This is the mail archive of the newlib@sources.redhat.com mailing list for the newlib project.


Re: Adding mbstate_t, mbsinit(), mbrtowc(), mbrlen() etc.

From: "KJK::Hyperion" <noog at libero dot it>
To: egor duda <newlib at sources dot redhat dot com>
Cc: newlib at sources dot redhat dot com
Date: Fri, 23 Aug 2002 00:20:34 +0200
Subject: Re: Adding mbstate_t, mbsinit(), mbrtowc(), mbrlen() etc.

At 19.50 22/08/2002, egor duda wrote:

I'm preparing a patch to add restartable versions of the multibyte conversion functions to newlib. Since all state information is already handled by the *_r() versions, these functions are just simple wrappers around the foo() or foo_r() functions, depending on MB_CAPABLE.
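To make the "simple wrapper" idea concrete, here is a minimal sketch of how a restartable function can be layered on another one. It relies on the ISO C rule that mbrlen() behaves like mbrtowc() called with a null output pointer and a private fallback state. The names sketch_mbrlen and sketch_mbrlen_demo are hypothetical, and the sketch calls the host C library's mbrtowc() rather than newlib's internal *_r() functions, which are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper, not newlib code: mbrlen() expressed via mbrtowc(). */
size_t sketch_mbrlen(const char *s, size_t n, mbstate_t *ps)
{
    /* Private conversion state, used only when the caller passes ps == NULL.
       Static storage is zero-initialized, which is the initial state. */
    static mbstate_t internal_state;

    return mbrtowc(NULL, s, n, ps != NULL ? ps : &internal_state);
}

/* Usage example, assuming the default "C" locale at program startup. */
void sketch_mbrlen_demo(void)
{
    mbstate_t st;

    memset(&st, 0, sizeof st);               /* start in the initial state */
    assert(sketch_mbrlen("a", 1, &st) == 1);  /* one byte, one character */
    assert(sketch_mbrlen("", 1, &st) == 0);   /* the terminating null */
    assert(mbsinit(&st));                     /* back in the initial state */
}
```

The explicit state argument is what makes the function restartable: a caller can feed a multibyte sequence in pieces and keep the partial state between calls.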

Just a question: do you catch encoded nulls (for example, an overlong multibyte encoding of the NUL character) in conversions from multibyte strings? If unhandled, they could open security holes. There's a recommendation about this in the UTF-8 RFC.
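As an illustration of the concern, and assuming "encoded nulls" refers to overlong UTF-8 forms of U+0000 (such as the byte pair C0 80), a check could look like the sketch below. The function name sketch_is_overlong_nul is hypothetical. A real decoder would normally reject every overlong sequence, not only the ones that encode NUL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s begin with an overlong
   (non-shortest-form) UTF-8 encoding of U+0000, and 0 otherwise. */
int sketch_is_overlong_nul(const unsigned char *s, size_t n)
{
    /* Lead bytes of 2- to 6-byte sequences whose payload bits are all zero.
       The 5- and 6-byte forms come from the original UTF-8 definition. */
    static const unsigned char lead[] = { 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t i, k;

    assert(s != NULL || n == 0);

    for (i = 0; i < sizeof lead; i++) {
        size_t len = i + 2;   /* total sequence length for this lead byte */

        if (n < len || s[0] != lead[i])
            continue;
        /* Every continuation byte must be 0x80, i.e. carry zero payload. */
        for (k = 1; k < len && s[k] == 0x80; k++)
            ;
        if (k == len)
            return 1;
    }
    return 0;
}
```

A conversion routine could call a check like this before accepting a sequence, so that code which scans the original bytes for a terminating 0x00 and code which looks at the decoded characters cannot disagree about where the string ends.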

I have a couple of questions, though. First, SUSv2 states that multibyte handling functions are declared in wchar.h, whereas newlib currently declares them in stdlib.h. Should we create wchar.h and move all appropriate stuff there?

if that's a problem, put these in a separate header, let's say mbstr.h, and include it conditionally from stdlib.h and wchar.h:

/* stdlib.h */
#if !defined(_XOPEN_SOURCE) || (_XOPEN_SOURCE != 500 && _XOPEN_SOURCE != 600)
#include <mbstr.h>
#endif

/* wchar.h */
#if defined(_XOPEN_SOURCE) && (_XOPEN_SOURCE == 500 || _XOPEN_SOURCE == 600)
#include <mbstr.h>
#endif

See System Interfaces->Intro->The Compilation Environment in the SUSv2 for more information on _XOPEN_SOURCE.

But if in the future someone wants to add new encodings which require more sophisticated state information (I don't know if such encodings actually exist), we'll be forced to change the definition of mbstate_t, thus breaking backward compatibility. GLIBC defines mbstate_t as struct { int; union { wchar_t; char[4] }},

The width of wchar_t, AFAIK, isn't specified by the standard. Late adopters of Unicode will use 4 bytes (UCS-4), while early adopters, like Microsoft, use 2 bytes (UCS-2 and, more recently, UTF-16). The width should be controlled by a macro, because hard-coding either size could break a lot of software (if you care about Windows, that is).
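A rough sketch of what "controlled by a macro" might look like follows. SKETCH_WCHAR_BITS and both function names are hypothetical. The macro is derived from the standard WCHAR_MAX limit, and the helpers show the practical difference between the two widths: with 16-bit units, a code point above U+FFFF needs a UTF-16 surrogate pair.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical configuration macro derived from the platform's wchar_t. */
#if WCHAR_MAX <= 0xFFFF
#define SKETCH_WCHAR_BITS 16
#else
#define SKETCH_WCHAR_BITS 32
#endif

/* Number of wide units needed to hold code point cp for a given unit width. */
int sketch_units_needed(uint32_t cp, int wchar_bits)
{
    assert(wchar_bits == 16 || wchar_bits == 32);
    assert(cp <= 0x10FFFF);

    if (wchar_bits == 16 && cp > 0xFFFF)
        return 2;   /* UTF-16 surrogate pair */
    return 1;
}

/* Splits a code point above U+FFFF into its UTF-16 surrogate pair. */
void sketch_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    assert(hi != NULL && lo != NULL);

    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low 10 bits */
}
```

Code that assumes one wide unit per character is only safe when sketch_units_needed() is always 1, which is why the width matters for mbstate_t as well: a 16-bit implementation may have to remember a pending surrogate between calls.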

while Microsoft's C runtime defines it as int. Would 'int' be enough for everything?

See above. Microsoft probably uses the lower word for the character and the upper word for the flags.
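Purely to illustrate that guess, and not as a description of Microsoft's actual layout, an int-sized state could pack a pending 16-bit unit and 16 bits of flags as in the sketch below. The type sketch_mbstate and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an int-sized mbstate_t. */
typedef uint32_t sketch_mbstate;

/* Lower word: pending character unit. Upper word: flags. */
sketch_mbstate sketch_pack(uint16_t pending, uint16_t flags)
{
    return ((uint32_t)flags << 16) | pending;
}

uint16_t sketch_pending(sketch_mbstate st)
{
    return (uint16_t)(st & 0xFFFF);
}

uint16_t sketch_flags(sketch_mbstate st)
{
    return (uint16_t)(st >> 16);
}

/* All-zero means "initial conversion state", as mbsinit() would report. */
int sketch_is_initial(sketch_mbstate st)
{
    return st == 0;
}

/* Usage example: a value survives a pack/unpack round trip. */
void sketch_roundtrip_check(void)
{
    sketch_mbstate st = sketch_pack(0xD83D, 0x0001);

    assert(sketch_pending(st) == 0xD83D);
    assert(sketch_flags(st) == 0x0001);
    assert(!sketch_is_initial(st));
}
```

Such a layout leaves room for only one pending 16-bit unit, so whether 'int' is enough depends on how much partial input an encoding needs to carry between calls.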

References:
- Adding mbstate_t, mbsinit(), mbrtowc(), mbrlen() etc.
  - From: egor duda
