This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Re: printing wchar_t*

From: "Jim Blandy" <jimb at red-bean dot com>
To: "Vladimir Prus" <ghost at cs dot msu dot su>
Cc: gdb at sources dot redhat dot com
Date: Fri, 14 Apr 2006 00:29:37 -0700
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <8f2776cb0604131031g370d6fa9p9361421bd21d178@mail.gmail.com> <e1necb$gen$1@sea.gmane.org>

On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> Jim Blandy wrote:
>
> > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> >> I have a user-defined command that can produce the output I want, but is
> >> defining a custom command the right approach?
> >
> > Well, you'd like wide strings to be printed properly when they appear
> > in structures, as arguments to functions, and so on, right?  So a
> > user-defined command isn't ideal.
>
> I think I'll still need to do some processing for wchar_t* on the frontend
> side. The problem is that I don't see any way for gdb to print wchar_t
> that does not require post-processing. It can print it as UTF8, but then
> for printing char* gdb should use the local 8-bit encoding, which is likely
> *not* to be UTF8. Gdb could probably use some extra markers for values, like:
>
>    "foo"  for string in local 8-bit encoding
>    L"foo" for string in UTF8 encoding.
>
> It's also possible to use "\u" escapes.
>
> But then there's a problem:
>
>    - Do we assume that wchar_t is always UTF-16 or UTF-32?
>    - If not:
>      - how can the user select this?
>      - how will the user-specified encoding be handled?

You can't hard-code assumptions about the character set into GDB.  Nor
can you hard-code the assumption that the host and target character
sets are the same.  GDB needs to do explicit conversions between the
two as needed, and handle mismatches in some reasonable way.

GDB already has the commands 'set host-charset' and 'set
target-charset', so you can assume that you have accurate information
about the character sets at hand.  They fall back to ASCII.

> > The best approach would be to extend charset.[ch] to handle wide
> > character sets as well, and then add code to the language-specific
> > printing routines to use the charset functions.  (This is fortunately
> > much simpler than adding support for multibyte characters.)
>
> So, for each wchar_t element, language-specific code will call
> 'target_wchar_t_to_host', which will output a specific representation of that
> wchar_t. Hmm, the interface there seems to assume there's a 1<->1 mapping
> between target and host characters.  This makes the L"UTF8" format and the
> ASCII string with \u escapes format impossible, it seems.

Not at all.  The current character and string printing code uses those
routines, and it handles unprintable and invalid characters just fine.
See, for example, host_print_char_literally and
c_target_char_has_backslash_escape.

GDB tries to print characters and strings as they would appear in
source code.  C doesn't assume that the source and execution character
sets are the same; by using numeric escapes, you can write programs
for any execution character set in any source character set.  You just
need enough information to manage the overlap.
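
To make the numeric-escape point concrete, here is a small C sketch.
It is only an illustration, not GDB code, and the array names are made
up.  It assumes an execution character set of ISO-8859-1 for the narrow
string and a wchar_t of at least 16 bits for the wide one; the source
text itself is plain ASCII.

```c
#include <stddef.h>
#include <wchar.h>

/* "cafe" with an e-acute at the end, spelled with an octal escape.
   The fourth element has the value 0351 (0xe9) no matter which
   character set this source file is stored in.  */
static const char narrow_cafe[] = "caf\351";

/* A one-character wide string whose element has the value 0x20ac
   (the euro sign, if the target's wchar_t happens to be Unicode).  */
static const wchar_t wide_euro[] = L"\x20ac";
```

The compiler never has to know what 0351 or 0x20ac mean as characters;
it just stores the numbers.  That is the form GDB can fall back on when
it prints a value it has no host character for.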

As far as 1-to-1 mappings are concerned, the only necessary property
is that host_char_to_target and target_char_to_host be inverses, and
return zero for characters that can't make a round trip.  The existing
string-printing code will automatically use numeric escapes for
characters that target_char_to_host won't translate.
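
A rough sketch of that contract, in C.  None of this is GDB's actual
code: the toy_* names are hypothetical, and the "charset" is a toy
assumption in which host and target share only printable ASCII.  The
point is just to show the round-trip rule and the fallback to numeric
escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-ins for the conversion routines.  Each stores the
   converted character and returns non-zero on success, or returns
   zero if the character cannot make a round trip.  */
static int
toy_target_char_to_host (int target_char, int *host_char)
{
  if (target_char >= 0x20 && target_char <= 0x7e)
    {
      *host_char = target_char;
      return 1;
    }
  return 0;
}

static int
toy_host_char_to_target (int host_char, int *target_char)
{
  if (host_char >= 0x20 && host_char <= 0x7e)
    {
      *target_char = host_char;
      return 1;
    }
  return 0;
}

/* Format LEN target characters from S into OUT as the body of a C
   string literal.  Characters that do not translate are written as
   three-digit octal escapes.  Stops early if OUT would overflow.
   Returns the number of characters written, not counting the NUL.  */
static size_t
toy_format_string (const unsigned char *s, size_t len,
                   char *out, size_t out_size)
{
  size_t used = 0;
  size_t i;

  assert (out != NULL && out_size > 0);
  out[0] = '\0';

  for (i = 0; i < len; i++)
    {
      char piece[8];
      size_t n;
      int host;

      if (toy_target_char_to_host (s[i], &host))
        {
          if (host == '"' || host == '\\')
            snprintf (piece, sizeof piece, "\\%c", host);
          else
            snprintf (piece, sizeof piece, "%c", host);
        }
      else
        snprintf (piece, sizeof piece, "\\%03o", (unsigned int) s[i]);

      n = strlen (piece);
      if (used + n + 1 > out_size)
        break;
      memcpy (out + used, piece, n + 1);
      used += n;
    }

  return used;
}
```

Given the four target bytes 'a', 0xe9, 'b', '"', toy_format_string
produces a\351b\" in the output buffer.  Nothing in the formatter needs
a 1-to-1 mapping for every character; it only needs the conversion
routine to say "no" honestly.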

Follow-Ups:
- Re: printing wchar_t*
  - From: Vladimir Prus

References:
- printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Jim Blandy
- Re: printing wchar_t*
  - From: Vladimir Prus
