This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Re: printing wchar_t*

From: Vladimir Prus <ghost at cs dot msu dot su>
To: Eli Zaretskii <eliz at gnu dot org>
Cc: "Jim Blandy" <jimb at red-bean dot com>, gdb at sources dot redhat dot com
Date: Mon, 17 Apr 2006 10:36:47 +0400
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <8f2776cb0604141216m216ba87ch529180cd079ce971@mail.gmail.com> <u64lb25zv.fsf@gnu.org>

On Saturday 15 April 2006 01:37, Eli Zaretskii wrote:

> > My point is, MI consumers are already parsing ISO C strings.  They
> > just need to parse more of them.
>
> This ``more parsing'' is not magic.  It's a lot of work, in general.

I don't quite get it. Say that the frontend and gdb somehow agree on the 8-bit
encoding used by gdb to print the strings. Then the frontend can look at the
string and:

  - If it sees \x, look at the following hex digits and convert them to either
    a code point or a code unit
  - If it sees anything else, convert it from the local 8-bit encoding to Unicode
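
The two rules above can be sketched in a few lines. This is only an illustration in Python, not code from gdb or any frontend: the name `decode_escaped` and the `host_encoding` parameter are hypothetical. It assumes that \x is followed by as many hex digits as belong to the value (the ISO C rule), that each value is a code point, and that the agreed encoding is single-byte. Other escapes (\\, \n, octal) are ignored here.

```python
import re

# A backslash, 'x', then one or more hex digits (greedy, as in ISO C).
_HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]+)")


def decode_escaped(raw: bytes, host_encoding: str = "latin-1") -> str:
    """Turn literal 8-bit text mixed with \\x escapes into a Unicode string.

    Assumption: every escape value is a Unicode code point and
    host_encoding is the single-byte encoding agreed with the debugger.
    """
    pieces = []
    pos = 0
    for match in _HEX_ESCAPE.finditer(raw):
        # Rule 2: anything that is not an escape is local 8-bit text.
        pieces.append(raw[pos:match.start()].decode(host_encoding))
        # Rule 1: the hex digits after \x give the character value.
        pieces.append(chr(int(match.group(1), 16)))
        pos = match.end()
    pieces.append(raw[pos:].decode(host_encoding))
    return "".join(pieces)
```

For example, `decode_escaped(rb"123\x0f04\x0fccxyz")` yields the six ASCII characters with U+0F04 and U+0FCC between them.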

The only question here is whether \x encodes a code unit or a code point. If it
encodes a code unit, the frontend needs extra processing (for me, that's easy).
If it encodes a code point, then further changes in gdb are needed.
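
To make the "extra processing" concrete: if the values are code units and the target's wchar_t is 16 bits wide (an assumption; the width is not fixed anywhere in this thread), the frontend has to pair up UTF-16 surrogates itself. A rough Python sketch, with the hypothetical name `code_units_to_text`:

```python
def code_units_to_text(units):
    """Combine 16-bit code units into a string, pairing UTF-16 surrogates.

    Assumption: `units` is a sequence of integers in the range 0..0xFFFF.
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one code point.
            low = units[i + 1]
            out.append(chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through unchanged.
            out.append(chr(unit))
            i += 1
    return "".join(out)
```

If \x already carried code points, this step would disappear from the frontend and the pairing would have to happen inside gdb instead.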

Note that because the charset function interface uses 'int', you can't use
UTF-8 as the encoding passed to the frontend, but ASCII + \x is still feasible.

There's one nice thing about this approach. If there's a new "print array until
XX" syntax, I would indeed need to special-case processing of values in several
contexts -- most notably arguments in a stack trace. With "\x" escapes I'd need
to write code to handle them once. In fact, I can add this code right to the MI
parser (which already operates using the Unicode-enabled QString class). That
will be more convenient than invoking 'print array' for every wchar_t* I ever
see.

> > > For the interactive user, understanding non-ASCII strings in the
> > > suggested ASCII encoding might not be easy at all.  For example, for
> > > all my knowledge of Hebrew, if someone shows me \x05D2, I will have
> > > hard time recognizing the letter Gimel.
> >
> > If the host character set includes Gimel, then GDB won't print it with
> > a hex escape.
>
> The host character set has nothing to do, in general, with what
> characters can be displayed.  The same host character set can be
> displayed on an appropriately localized xterm, but not on a bare-bones
> character terminal.  Not every system that runs in the Hebrew locale
> has Hebrew-enabled xterm.  Some characters may be missing from a
> particular font, especially a Unicode-based font (because there so
> many Unicode characters).  Etc., etc.
>
> Even if I do have a Hebrew enabled xterm, chances are that it cannot
> display characters sent in 16-bit Unicode codepoints, it will want
> some single-byte encoding, like UTF-8 or maybe ISO 8859-8.
>
> GDB will generally know nothing about these complications, unless we
> teach it.  For example, to display Hebrew letters on a UTF-8 enabled
> xterm, we (i.e. the user, through appropriate GDB commands) will have
> to tell GDB that wchar_t strings should be encoded in UTF-8 by the CLI
> output routines.  Sometimes these settings can be gleaned from the
> environment variables, but Emacs's experience shows how very
> unreliable and error-prone this is.

I don't quite get it. First you say you want \x05D2 displayed using a Unicode
font on the console, now you say it's very hard. Now, if you want Unicode
display for \x05D2, there should be some method to tell gdb that your console
can display Unicode, and if the user has said that Unicode is supported, what
is the problem?

> how many glyphs will it produce 
> on the screen, where it can be broken into several lines if it is too
> long, etc.  This is all trivial with 7-bit ASCII (every byte produces
> a single glyph, except a few non-printables, whitespace characters
> signal possible locations to break the line, etc.), but can get very
> complex with other character sets.

Isn't this completely outside of GDB? In fact, this is also outside of the
frontend -- the GUI toolkit will handle this transparently (and if it doesn't,
it's broken).

> GDB cannot be asked to know about all of those complications, but I
> think it should at least provide a few simple translation services so
> that a front end will not have to work too hard to handle and display
> strings as mostly readable text.  Passing the characters as fixed-size
> codepoints expressed as ASCII hex strings leaves the front-end with
> only very simple job.  What's more, it uses an existing feature: array
> printing.

Using \x escapes, provided they encode *code units*, leaves the frontend with
the same simple job. Really, using strings with \x escapes differs from array
printing in just one point: some characters are printed not as hex values,
but as characters in the local 8-bit encoding. Why do you think this is a
problem? I can't see what's wrong with that.

> > > What you are suggesting is simple for GDB, but IMHo leaves too much
> > > complexity to the FE.  I think GDB could do better.  In particular, if
> > > I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
> > > show me Unicode characters in their normal glyphs, which would require
> > > GDB to output the characters in their UTF-8 encoding (which the
> > > terminal will then display in human-readable form).  Your suggestion
> > > doesn't allow such a feature, AFAICS, at least not for CLI users.
> >
> > When the host character set contains a character, there's no need for
> > GDB to use an escape to show it.
>
> Whose host character set? GDB's?  But GDB is not displaying the
> strings, the front end is.  And as I wrote above, there's no
> guarantees that the host character set can be transparently displayed
> on the screen.  This only works for ASCII and some simple single-byte
> encodings, mostly Latin ones.  But it doesn't work in general.
>
> And why are you talking about host character set?  The
> L"123\x0f04\x0fccxyz" string came from the target, GDB simply
> converted it to 7-bit ASCII.  These are characters from the target
> character set.  And the target doesn't necessarily talk in the host
> locale's character set and language, you could be debugging a program
> which talks Farsi with GDB that runs in a German locale.

So, characters that happen to exist in the German locale are printed as literal
chars. Other characters are printed using \x. The FE reads the string, and when
it sees a literal char, it converts it from the German locale's encoding to the
Unicode it uses internally. Where's the problem?
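
The producing side of this scheme can be sketched as well. This is not gdb's code, only an illustration in Python; `escape_for_host` and its `host_encoding` parameter are hypothetical names, and Latin-1 stands in for "the German locale". One detail the sketch has to deal with: a literal hex digit directly after an escape would be swallowed by an ISO C style reader, so it is escaped too, as is the backslash itself.

```python
import string


def escape_for_host(text: str, host_encoding: str = "latin-1") -> bytes:
    """Emit characters the host charset has as literals, the rest as \\x escapes.

    Assumption: host_encoding is a single-byte charset and the reader
    consumes every hex digit that follows \\x.
    """
    out = bytearray()
    prev_escaped = False
    for ch in text:
        try:
            encoded = ch.encode(host_encoding)
            # A hex digit right after an escape, or a backslash, would be
            # misread as part of an escape, so those are never literal.
            literal_ok = ch != "\\" and not (prev_escaped and ch in string.hexdigits)
        except UnicodeEncodeError:
            literal_ok = False
        if literal_ok:
            out += encoded
        else:
            out += b"\\x%04x" % ord(ch)
        prev_escaped = not literal_ok
    return bytes(out)
```

With Latin-1 as the host charset, a German word comes out as plain 8-bit bytes, while a Hebrew or Farsi letter in the same string comes out as a \x escape for the frontend to decode.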

- Volodya

Follow-Ups:
- Re: printing wchar_t*
  - From: Eli Zaretskii

References:
- printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Jim Blandy
- Re: printing wchar_t*
  - From: Eli Zaretskii
