This is the mail archive of the gdb-patches@sources.redhat.com mailing list for the GDB project.



Re: [PATCH RFC] Character set support


On Thu, Sep 12, 2002 at 10:11:10PM -0500, Jim Blandy wrote:
> 
> > Two comments:
> >   There's a lot of passing integers around to refer to a character. 
> > That doesn't make a lot of sense to me; we should either be passing
> > char *, so that we can decode multibyte sequences, or using wchar_t
> > explicitly and autoconfing for it.
> > 
> >   I see hardcoded support for a couple of simplistic charsets; would it
> > be worthwhile to add (minimal!) support for UTF-8 in case iconv is not
> > available?  Gcj is natively UTF-8, and I have some open Debian bug
> > reports about this.
> 
> Absolutely --- as I say in the comments to charset.c:
> 
>    At the moment, GDB only supports single-byte, stateless character
>    sets.  This includes the ISO-8859 family (ASCII extended with
>    accented characters, and (I think) Cyrillic, for European
>    languages), and the EBCDIC family (used on IBM's mainframes).
>    Unfortunately, it excludes many Asian scripts, the fixed- and
>    variable-width Unicode encodings, and other desirable things.
>    Patches are welcome!  (For example, it would be nice if the Java
>    string support could simply get absorbed into some more general
>    multi-byte encoding support.)
> 
> But it seemed to me that supporting stateless variable-width encodings
> was going to be a *lot* of work.  Specifically, how the printing code
> should change was a bit beyond me.

+      /* These all suggest that the input or output character sets
+         have multi-byte encodings of some characters, which means
+         it's unsuitable for use as a GDB character set.  We should
+         never have selected it.  */

Sigh - OK, I see that this can't even use iconv for UTF-8->ASCII. 
That's a real shame.  I have some code which does this so if I get a
chance I can try to improve it in GDB; or someone who (unlike me)
actually groks iconv can try it...
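
For reference, the kind of minimal fallback I have in mind is roughly
the following.  This is only a sketch and not code from the patch or
from GDB: the name utf8_decode and its interface are made up for
illustration.  It takes a byte pointer rather than an int, decodes one
sequence, and rejects malformed, truncated, overlong, and surrogate
encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GDB: decode one UTF-8 sequence
   starting at S, reading at most LEN bytes.  On success store the
   code point in *CP and return the number of bytes consumed.  Return
   0 for a malformed, truncated, overlong, or surrogate sequence.  */
static size_t
utf8_decode (const unsigned char *s, size_t len, unsigned long *cp)
{
  size_t n, i;
  unsigned long c, min;

  assert (s != NULL && cp != NULL);
  if (len == 0)
    return 0;

  /* The lead byte determines the sequence length and the smallest
     code point that legitimately needs that many bytes.  */
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xe0) == 0xc0)
    { n = 2; c = s[0] & 0x1f; min = 0x80; }
  else if ((s[0] & 0xf0) == 0xe0)
    { n = 3; c = s[0] & 0x0f; min = 0x800; }
  else if ((s[0] & 0xf8) == 0xf0)
    { n = 4; c = s[0] & 0x07; min = 0x10000; }
  else
    return 0;			/* Stray continuation or invalid lead byte.  */

  if (len < n)
    return 0;			/* Truncated sequence.  */

  /* Each continuation byte must look like 10xxxxxx and contributes
     six bits.  */
  for (i = 1; i < n; i++)
    {
      if ((s[i] & 0xc0) != 0x80)
	return 0;
      c = (c << 6) | (s[i] & 0x3f);
    }

  if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
    return 0;			/* Overlong, out of range, or surrogate.  */

  *cp = c;
  return n;
}
```

A caller would walk a buffer by advancing by the returned length and
treating 0 as "print this byte as an escape and move on by one".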

> Regarding `int' vs. `wchar_t': the wchar_t we could detect with
> autoconf is a host type.  It has no necessary relationship to the
> `wchar_t' on the target.  LONGEST might be a better choice than `int',
> but `wchar_t' is worse.

The first part is accurate but not relevant.  I'm not suggesting
reading wchar_t's from the target; that's not a terribly useful thing
to do.  You _want_ the host wchar_t.  It is a host type capable of
holding a wide character; the type changes based on platform and on
whether or not the platform actually has wide character support. 
There's not much you can do if it doesn't, is there?  Rather than using
iconv, which is meant for converting strings of text, it seemed to me
when I wrote the above comments that we should be using mbrtowc/wctomb
functions.  However, unlike iconv, they appear to operate based on the
current locale rather than a specified charset.  I suppose they are
unsuitable and we'll have to figure out how to use iconv appropriately.
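
To illustrate the difference: mbrtowc takes no charset argument and
decodes according to the LC_CTYPE locale, while iconv_open names both
charsets explicitly.  A rough sketch of the iconv side follows.  The
helper name convert_charset is made up, this is not code from the
patch, and it assumes the glibc/POSIX prototype for iconv (some older
systems declare the input buffer as const char **).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of GDB: convert the INLEN bytes at IN
   from charset FROM to charset TO, writing at most OUTSIZE bytes to
   OUT.  Return the number of bytes written, or (size_t) -1 if either
   charset is unknown, the input is invalid or incomplete, or OUT is
   too small.  */
static size_t
convert_charset (const char *from, const char *to,
		 const char *in, size_t inlen,
		 char *out, size_t outsize)
{
  iconv_t cd;
  char *inp = (char *) in;	/* iconv does not modify the input bytes.  */
  char *outp = out;
  size_t inleft = inlen;
  size_t outleft = outsize;
  size_t r;

  assert (from != NULL && to != NULL && in != NULL && out != NULL);

  /* Note the argument order: destination charset first.  */
  cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  r = iconv (cd, &inp, &inleft, &outp, &outleft);
  if (r != (size_t) -1)
    /* Flush any pending shift sequence for stateful encodings.  */
    r = iconv (cd, NULL, NULL, &outp, &outleft);

  iconv_close (cd);

  if (r == (size_t) -1)
    return (size_t) -1;
  return outsize - outleft;
}
```

Nothing here depends on setlocale, which is the property we need when
the charset in question differs from the one the user's locale
selects.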

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer

