This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Re: printing wchar_t*

From: Eli Zaretskii <eliz at gnu dot org>
To: "Jim Blandy" <jimb at red-bean dot com>
Cc: ghost at cs dot msu dot su, gdb at sources dot redhat dot com
Date: Fri, 14 Apr 2006 21:27:31 +0300
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <200604141257.41690.ghost@cs.msu.su> <uu08w1cnf.fsf@gnu.org> <200604141837.26618.ghost@cs.msu.su> <uirpc19u8.fsf@gnu.org> <8f2776cb0604141053v73e512e3o2d1c9086312316bd@mail.gmail.com>
Reply-to: Eli Zaretskii <eliz at gnu dot org>

> Date: Fri, 14 Apr 2006 10:53:44 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: "Vladimir Prus" <ghost@cs.msu.su>, gdb@sources.redhat.com
> 
> I think folks are seeing difficult problems where there aren't any. 

What difficulties? There _are_ no difficulties ;-)

> Suppose we have a wide string where wchar_t values are Unicode code
> points.  Suppose our host character set is plain ASCII.  Suppose the
> user's program has a string containing the digits '123', followed by
> some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
> 'xyz'.  When asked to print that string, GDB should print the
> following twenty-one ASCII characters:
> 
> L"123\x0f04\x0fccxyz"

This will work, if we accept your assumptions (which are by no means
universally correct, e.g. parts of our discussion were around whether
the string contains U+XXXX Unicode codepoints or their UTF-16
encodings).  But all you did is invent an encoding (and a
variable-size encoding at that).  Something in the GUI FE still has to
interpret that encoding, i.e. convert it back to binary representation
of the characters, because your encoding cannot be displayed by any
known GUI API.

Compare this with the facility that we already have today:

 (gdb) print *warray@8
  {0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}

Except for using up 60-odd characters where you used 21, this is IMHO
better, since it doesn't require any code on the FE side: just convert
the strings to integers, and you've got Unicode, ready to be used for
whatever purposes.
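
As a rough illustration of how little the FE side needs in this case, here
is a minimal Python sketch. It assumes the array elements really are Unicode
code points (not UTF-16 code units) and that GDB printed them in hex; the
helper name parse_gdb_array is hypothetical and not part of GDB or any FE.

```python
import re

def parse_gdb_array(text):
    """Hypothetical helper: turn '{0x0031, 0x0032, ...}' into a list of ints."""
    return [int(tok, 16) for tok in re.findall(r"0x[0-9A-Fa-f]+", text)]

# The array output shown above, copied verbatim.
gdb_output = "{0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}"

code_points = parse_gdb_array(gdb_output)

# Assumption: each element is one Unicode code point.
text = "".join(chr(cp) for cp in code_points)
print(len(text), [hex(cp) for cp in code_points])
```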

> Since this is a valid way to write that string in a source program, a
> user at the GDB command line should understand it.  Since consumers of
> MI information must contain parsers for C values already, they can
> reliably find the contents of the string.

I only partly agree with the first sentence, and not at all with the
second.

For the interactive user, understanding non-ASCII strings in the
suggested ASCII encoding might not be easy at all.  For example, for
all my knowledge of Hebrew, if someone shows me \x05D2, I will have a
hard time recognizing the letter Gimel.

As for the second sentence, ``reliably find the contents of the
string'' there obviously doesn't consider the complexities of handling
wide characters.  In my experience, for any non-trivial string
processing, working with variable-size encoding is much harder than
with fixed-size wchar_t arrays, because you need to interpret the
bytes as you go, even if all you need is to find the n-th character.
Even the simple task of computing the number of characters in the
string becomes complicated.
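
To make the contrast concrete, here is a small Python sketch that counts
characters both ways. The scanner count_escaped_chars is a hypothetical
helper; it assumes the C rule that a \x escape consumes every hex digit that
follows it, and it treats any other backslash escape as a two-character
sequence. It is an illustration under those assumptions, not a full parser
for C string literals.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def count_escaped_chars(body):
    """Count characters in the body of an escaped literal (quotes removed)."""
    count = 0
    i = 0
    while i < len(body):
        if body.startswith("\\x", i):
            # Hex escape: skip '\x', then every hex digit that follows.
            i += 2
            while i < len(body) and body[i] in HEX_DIGITS:
                i += 1
        elif body[i] == "\\":
            # Assumption: any other escape is backslash plus one character.
            i += 2
        else:
            i += 1
        count += 1
    return count

escaped_body = r"123\x0f04\x0fccxyz"
fixed_array = [0x31, 0x32, 0x33, 0x0F04, 0x0FCC, 0x78, 0x79, 0x7A]

# The escaped form needs a scan; the fixed-size array just needs len().
print(count_escaped_chars(escaped_body), len(fixed_array))
```

Finding the n-th character has the same shape: an index into the array, but
a scan from the start of the escaped form.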

> Note that this gets a GUI the contents of the string in the *target*
> character set.  The GUI itself should be responsible for converting
> target characters to whatever character set it wants to use to present
> data to its user.  Here, GDB's 'host' character set is just the
> character set used to carry information from GDB to the GUI; it should
> probably be set to ASCII, just to avoid needless variation.  But
> either way, it's just acting as a medium for values in C source code
> syntax, and has no bearing on either the character set the target
> program is using, or the character set the GUI will use to present
> data to its user.

What you are suggesting is simple for GDB, but IMHO leaves too much
complexity to the FE.  I think GDB could do better.  In particular, if
I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
show me Unicode characters in their normal glyphs, which would require
GDB to output the characters in their UTF-8 encoding (which the
terminal will then display in human-readable form).  Your suggestion
doesn't allow such a feature, AFAICS, at least not for CLI users.
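
For reference, the conversion from code points to UTF-8 bytes is small. The
Python sketch below is only an illustration of the encoding step, assuming
the values are Unicode code points; the function utf8_encode is hypothetical,
does not reject surrogates or values above U+10FFFF, and says nothing about
how GDB would actually implement such output.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes (no validity checks)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# The example string from this thread, as code points.
wide = [0x31, 0x32, 0x33, 0x0F04, 0x0FCC, 0x78, 0x79, 0x7A]
utf8_bytes = b"".join(utf8_encode(cp) for cp in wide)

# Written to a UTF-8 terminal, these bytes would show the Tibetan glyphs.
print(utf8_bytes)
```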

That said, if someone volunteers to do the job of adding your
suggestions to GDB, I won't object to accepting the patches, because
whoever does the job gets to choose the tools.

> Unicode technical report #17 lays out the terminology the Unicode
> folks use for all this stuff, with good explanations:
> http://www.unicode.org/reports/tr17/

Yes, that's good background reading for related stuff.

> According to the ISO C standard, the coding character set used by
> wchar_t must be a superset of that used by char for members of the
> basic character set.  See ISO/IEC 9899:1999 (E) section 7.17,
> paragraph 2.  So I think it's sufficient for the user to specify the
> coding character set used by wide characters; that fixes the ccs used
> for char values.

If wchar_t uses fixed-size characters, not their variable-size
encodings, then specifying the CCS will do.  Encodings are another
matter; as I wrote earlier, there could be many different encodings of
the same CCS, and I suppose some weirdo software somewhere could stuff
such an encoding into a wchar_t.
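
One common instance of that distinction is UTF-16 stored in 16-bit wchar_t
elements, where a character outside the BMP occupies two elements. The
Python sketch below shows the extra step a consumer would then need; it
assumes well-formed input and passes unpaired surrogates through unchanged.
The function name decode_utf16_units is hypothetical.

```python
def decode_utf16_units(units):
    """Combine UTF-16 surrogate pairs into code points; pass the rest through."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

# '1', U+1D11E (encoded as the pair D834 DD1E), '2'
units = [0x0031, 0xD834, 0xDD1E, 0x0032]
decoded = decode_utf16_units(units)
print([hex(cp) for cp in decoded])
```

With fixed-size code points, the element count equals the character count;
here four elements decode to three characters.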

Follow-Ups:
- Re: printing wchar_t*
  - From: Jim Blandy

References:
- printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Eli Zaretskii
- Re: printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Eli Zaretskii
- Re: printing wchar_t*
  - From: Jim Blandy
