This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.

[PATCH/WIP] C/C++ wchar_t/Unicode printing support

From: Julian Brown <julian at codesourcery dot com>
To: gdb-patches at sourceware dot org
Cc: tromey at redhat dot com
Date: Thu, 15 Jan 2009 20:24:11 +0000
Subject: [PATCH/WIP] C/C++ wchar_t/Unicode printing support

Hi,

This patch contains (at least the start of) support for printing
wchar_t strings from a debugged program within GDB. This is the subject
of GDB bug 9103 (and its duplicates 9369 and 9268), and possibly 7821.

Notes on the implementation:

1. I've added a new configuration variable, similar to "host-charset"
and "target-charset". The latter can't be used for printing wide
characters, because regular C strings and wide strings aren't
necessarily (or in fact ever) encoded using the same encoding. The new
variable is set like:

(gdb) set target-wide-charset UTF-32

I considered adding "set target-wide-charset auto" to detect the
charset used for wchar_t strings from the width of the type (i.e.
probably 4 bytes -> UCS-4, 2 bytes -> UTF-16), but that's not done
presently.
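
For illustration only, the width-based guess described above could look
roughly like the sketch below. This is not part of the patch:
guess_wide_charset is a hypothetical name, and the mapping is just the
"probably" guess from the paragraph above, so a real "auto" mode would
likely need more information (endianness, target OS) than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: guess an iconv charset
   name from the size in bytes of the target's wchar_t.  Returns NULL
   when the width gives no usable hint.  */
static const char *
guess_wide_charset (size_t wchar_size)
{
  /* A zero-sized wchar_t would indicate a caller bug.  */
  assert (wchar_size > 0);

  switch (wchar_size)
    {
    case 2:
      return "UTF-16";
    case 4:
      return "UCS-4";
    default:
      return NULL;
    }
}
```

A caller would presumably fall back to the configured default whenever
this returns NULL.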

2. The host terminal may be able to print Unicode characters, by
feeding it UTF-8 encoded characters. There are some limitations: I
don't think Unix terminals support combining character sequences --
I've ignored that for now. GDB currently defaults "host-charset" to
ISO-8859-1, although a given terminal may not print
top-bit-set characters correctly.

I've added a new way of setting the host character set from the
host terminal (using nl_langinfo (CODESET)), like so:

(gdb) set host-charset auto

If the terminal supports UTF-8 (e.g. LC_ALL is set to en_US.UTF-8), we
will then see:

(gdb) show host-charset
The host character set is "UTF-8" (auto).

If the terminal only supports ASCII (e.g. LC_ALL is set to C), we will
instead see:

(gdb) show host-charset
The host character set is "ANSI_X3.4-1968" (auto).
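
As a rough standalone sketch of the lookup behind "auto" (not the code
in the patch): nl_langinfo (CODESET) reports the codeset of the current
locale, so a program must have called setlocale first, otherwise it
sees the "C" locale. The helper name host_codeset_or and its fallback
argument are assumptions made for this example.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: return the codeset name of the locale selected
   by the environment (LC_ALL, LC_CTYPE, LANG), or FALLBACK if the
   locale cannot be set or reports an empty name.  */
static const char *
host_codeset_or (const char *fallback)
{
  const char *codeset;

  assert (fallback != NULL);

  /* Adopt the user's locale for character classification.  */
  if (setlocale (LC_CTYPE, "") == NULL)
    return fallback;

  codeset = nl_langinfo (CODESET);
  return (codeset != NULL && codeset[0] != '\0') ? codeset : fallback;
}
```

The string returned by nl_langinfo may be overwritten by later calls,
so a caller that keeps it around should copy it.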

3. Types which are literally called "wchar_t" are assumed to be wide
characters. So we can do:

wchar_t *msg = L"Hello world";

and then:

(gdb) p msg
$1 = (wchar_t *) 0x85c4 "Hello world"

If the message contains funny characters, and the user has typed "set
host-charset auto" on a UTF-8 capable terminal, they will be printed
nicely:

(gdb) p msg
$2 = (wchar_t *) 0x85c4 "Schöne Grüße"

With the caveat that there's no way for GDB to know if you have a font
with the right glyphs in it: if not, you can fall back to ASCII:

(gdb) set host-charset ASCII
(gdb) p msg
$3 = (wchar_t *) 0x85c4 "Sch\x00f6ne Gr\x00fc\x00dfe"
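
The ASCII fallback shown above can be sketched as follows. This is a
simplified standalone illustration, not the patch's code path (which
goes through iconv and only escapes when conversion fails):
escape_to_ascii is a hypothetical name, and the rule "printable ASCII
is copied, everything else becomes \x plus at least four hex digits" is
an assumption chosen to match the sample output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format the code points CP[0..N-1] into OUT
   (OUTSIZE bytes).  Printable ASCII is copied as-is; anything else is
   written as \x followed by at least four hex digits.  Returns the
   number of characters written, or -1 if OUT is too small.  */
static int
escape_to_ascii (const unsigned long *cp, size_t n,
		 char *out, size_t outsize)
{
  size_t used = 0;
  size_t i;

  assert (cp != NULL && out != NULL);

  if (outsize == 0)
    return -1;
  out[0] = '\0';

  for (i = 0; i < n; i++)
    {
      int len;

      if (cp[i] >= 0x20 && cp[i] < 0x7f)
	len = snprintf (out + used, outsize - used, "%c", (int) cp[i]);
      else
	len = snprintf (out + used, outsize - used, "\\x%.4lx", cp[i]);

      /* snprintf reports the length it wanted; detect truncation.  */
      if (len < 0 || (size_t) len >= outsize - used)
	return -1;
      used += (size_t) len;
    }

  return (int) used;
}
```

Note that this sketch escapes control characters as \x sequences too,
whereas the patch prints those as C backslash or octal escapes.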

4. If you want to print an integer array type which isn't literally
called "wchar_t" but nevertheless contains a wchar_t string, you can
override using "/s", just like with regular strings, e.g.:

(gdb) p/s intmsg
$2 = (int *) 0x85c4 "Schöne Grüße"

5. The existing string-printing code is careful about not printing out
lots of repeating characters. For wchar_t strings (taking into account
the differences between what they represent on various platforms
mentioned above), there is generally an X-Y correspondence between the
number of input bytes and the number of output bytes for each
character: to detect repeats, we convert an arbitrary number of X's to
UCS-4, detect repeated UCS-4 values, then translate each to Y output
characters.
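
To make point 5 concrete, here is a minimal sketch of repeat detection
on a buffer that has already been converted to UCS-4 code points. It is
a simplified batch version, not the incremental loop the patch uses in
c_printwidestr; run_length and count_elided_runs are hypothetical
names, and the threshold plays the role of GDB's repeat threshold
setting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the run of identical code points
   beginning at CP[START], within CP[0..N-1].  */
static size_t
run_length (const unsigned long *cp, size_t n, size_t start)
{
  size_t end = start;

  assert (cp != NULL && start < n);

  while (end < n && cp[end] == cp[start])
    end++;
  return end - start;
}

/* Hypothetical helper: count the runs in CP[0..N-1] that are longer
   than THRESHOLD, i.e. those that would be printed in the
   'x' <repeats N times> form rather than literally.  */
static size_t
count_elided_runs (const unsigned long *cp, size_t n, size_t threshold)
{
  size_t i = 0, elided = 0;

  while (i < n)
    {
      size_t len = run_length (cp, n, i);

      if (len > threshold)
	elided++;
      i += len;
    }
  return elided;
}
```

Comparing whole code points rather than bytes is the point of the
exercise: a repeated multi-byte character is one repeat, regardless of
how many bytes it occupies on the target or the host.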

Current shortcomings:

1. There's no support for non-C-like languages.

2. I've probably broken building with iconv disabled (actually I
couldn't figure out how to build without iconv() support -- even for
e.g. a mingw32 host which shouldn't support it).

3. Currently wrong-endian wide characters from the target will confuse
things (but you can explicitly set target-wide-charset to UCS-4LE or
UCS-4BE for example).

4. I've not written documentation or altered test cases yet
(charset.exp shows some regressions).
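
Regarding shortcoming 3, the sketch below shows why byte order matters
when four target bytes are read as one UCS-4 code point. It is a
standalone illustration with a hypothetical helper name (decode_ucs4);
the patch itself leaves byte-order handling to iconv via the selected
charset name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: assemble one UCS-4 code point from four bytes.
   BIG_ENDIAN_P selects UCS-4BE ordering; otherwise UCS-4LE is
   assumed.  */
static unsigned long
decode_ucs4 (const unsigned char *bytes, int big_endian_p)
{
  unsigned long cp = 0;
  int i;

  assert (bytes != NULL);

  for (i = 0; i < 4; i++)
    {
      int shift = big_endian_p ? (3 - i) * 8 : i * 8;

      cp |= (unsigned long) bytes[i] << shift;
    }
  return cp;
}
```

The same four bytes f6 00 00 00 decode to U+00F6 when read as
little-endian, but to 0xf6000000 (not a valid code point) when read as
big-endian, which is the kind of confusion described above.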

Tom Tromey is working on a patch related to this. Some of his comments
are incorporated in this patch relative to an earlier version sent to
him privately (thanks!).

Regression tested on x86-64 Linux, and spot-checked with an ARM Linux
cross debugger (from x86 build/host). As mentioned above, there are
some regressions so far.

OK to apply, or any comments?

Cheers,

Julian

ChangeLog

    gdb/
    * c-valprint.c (textual_element_type): Alter TYPE to be the type of
    the element before looking through typedefs, and update comment. Add
    wide-character support.
    (c_val_print): Pass type before typedef resolution to
    textual_element_type calls.
    * charset.c (langinfo.h): Include, if HAVE_LANGINFO_CODESET.
    (GDB_DEFAULT_TARGET_WIDE_CHARSET, GDB_INTERNAL_CODESET): New macros.
    (host_charset_auto): New.
    (show_host_charset_name): Indicate automatically-selected charset.
    (target_wide_charset_name, show_target_wide_charset_name): New.
    (host_charset_enum): Add "auto".
    (target_wide_charset_enum): New. Support a limited number of
    wchar_t character sets.
    (iconv_char_print_literally): New.
    (iconv_to_control): New.
    (lookup_and_register_iconv_charset): New.
    (default_c_internal_char_has_backslash_escape): New.
    (current_target_wide_charset, internal_charset): New.
    (set_host_charset): Add support for "auto" host charset.
    (show_charset): Show target wide charset.
    (set_target_wide_charset, set_target_wide_charset_sfunc)
    (target_wide_charset, cached_iconv_target_to_internal)
    (cached_iconv_internal_to_host, target_to_internal_iconv_t)
    (internal_to_host_iconv_t, reset_host_char_state)
    (target_char_to_internal, internal_char_host_emit): New.
    (_initialize_charset): Add wide-character support.
    * charset.h (target_wide_charset, reset_host_char_state)
    (target_char_to_internal) (internal_char_host_emit): Add prototypes.
    * c-lang.c (c_internal_char_host_emit, c_printwidestr): New.
    (c_printstr): Call c_printwidestr when appropriate.
    * printcmd.c (print_formatted): Add wide-character support.
    * configure.ac (AM_LANGINFO_CODESET): Add.
    * acinclude.m4 (../config/codeset.m4): Include.
    * config.in: Regenerate.
    * configure: Regenerate.

Index: gdb/c-valprint.c
===================================================================
RCS file: /cvs/src/src/gdb/c-valprint.c,v
retrieving revision 1.55
diff -c -p -r1.55 c-valprint.c
*** gdb/c-valprint.c	3 Jan 2009 05:57:51 -0000	1.55
--- gdb/c-valprint.c	15 Jan 2009 20:10:38 -0000
*************** print_function_pointer_address (CORE_ADD
*** 59,70 ****
     to TYPE should be printed as a textual string.  Return non-zero if
     it should, or zero if it should be treated as an array of integers
     or pointer to integers.  FORMAT is the current format letter,
!    or 0 if none.
  
     We guess that "char" is a character.  Explicitly signed and
     unsigned character types are also characters.  Integer data from
     vector types is not.  The user can override this by using the /s
!    format letter.  */
  
  static int
  textual_element_type (struct type *type, char format)
--- 59,76 ----
     to TYPE should be printed as a textual string.  Return non-zero if
     it should, or zero if it should be treated as an array of integers
     or pointer to integers.  FORMAT is the current format letter,
!    or 0 if none.  So that we can detect wchar_t strings, TYPE should
!    *not* have been resolved using check_typedef before calling this
!    function (in C, wchar_t would then appear to be a plain integer).
  
     We guess that "char" is a character.  Explicitly signed and
     unsigned character types are also characters.  Integer data from
     vector types is not.  The user can override this by using the /s
!    format letter.  The /s format letter can also be used to print arrays
!    of 2- or 4-byte integers as wide character strings.
!    
!    If TYPE is named "wchar_t" (before looking through typedefs), and elements
!    are of 2 or 4-byte integer type, detect as a wide-character string.  */
  
  static int
  textual_element_type (struct type *type, char format)
*************** textual_element_type (struct type *type,
*** 80,89 ****
  
    if (format == 's')
      {
!       /* Print this as a string if we can manage it.  For now, no
! 	 wide character support.  */
        if (TYPE_CODE (true_type) == TYPE_CODE_INT
! 	  && TYPE_LENGTH (true_type) == 1)
  	return 1;
      }
    else
--- 86,96 ----
  
    if (format == 's')
      {
!       /* Print this as a string if we can manage it.  */
        if (TYPE_CODE (true_type) == TYPE_CODE_INT
! 	  && (TYPE_LENGTH (true_type) == 1
! 	      || TYPE_LENGTH (true_type) == 2
! 	      || TYPE_LENGTH (true_type) == 4))
  	return 1;
      }
    else
*************** textual_element_type (struct type *type,
*** 97,102 ****
--- 104,116 ----
  	return 1;
      }
  
+   if (TYPE_NAME (type) && strcmp (TYPE_NAME (type), "wchar_t") == 0
+       && TYPE_CODE (true_type) == TYPE_CODE_INT
+       && (TYPE_LENGTH (true_type) == 2
+ 	  || TYPE_LENGTH (true_type) == 4)
+       && !TYPE_NOTTEXT (true_type))
+     return 1;
+ 
    return 0;
  }
  
*************** c_val_print (struct type *type, const gd
*** 115,121 ****
  {
    unsigned int i = 0;	/* Number of characters printed */
    unsigned len;
!   struct type *elttype;
    unsigned eltlen;
    LONGEST val;
    CORE_ADDR addr;
--- 129,136 ----
  {
    unsigned int i = 0;	/* Number of characters printed */
    unsigned len;
!   struct type *elttype, *unresolved_elttype;
!   struct type *unresolved_type = type;
    unsigned eltlen;
    LONGEST val;
    CORE_ADDR addr;
*************** c_val_print (struct type *type, const gd
*** 124,131 ****
    switch (TYPE_CODE (type))
      {
      case TYPE_CODE_ARRAY:
!       elttype = check_typedef (TYPE_TARGET_TYPE (type));
!       if (TYPE_LENGTH (type) > 0 && TYPE_LENGTH (TYPE_TARGET_TYPE (type)) > 0)
  	{
  	  eltlen = TYPE_LENGTH (elttype);
  	  len = TYPE_LENGTH (type) / eltlen;
--- 139,147 ----
    switch (TYPE_CODE (type))
      {
      case TYPE_CODE_ARRAY:
!       unresolved_elttype = TYPE_TARGET_TYPE (type);
!       elttype = check_typedef (unresolved_elttype);
!       if (TYPE_LENGTH (type) > 0 && TYPE_LENGTH (unresolved_elttype) > 0)
  	{
  	  eltlen = TYPE_LENGTH (elttype);
  	  len = TYPE_LENGTH (type) / eltlen;
*************** c_val_print (struct type *type, const gd
*** 135,141 ****
  	    }
  
  	  /* Print arrays of textual chars with a string syntax.  */
!           if (textual_element_type (elttype, options->format))
  	    {
  	      /* If requested, look for the first null char and only print
  	         elements up to it.  */
--- 151,157 ----
  	    }
  
  	  /* Print arrays of textual chars with a string syntax.  */
!           if (textual_element_type (unresolved_elttype, options->format))
  	    {
  	      /* If requested, look for the first null char and only print
  	         elements up to it.  */
*************** c_val_print (struct type *type, const gd
*** 145,153 ****
  
  		  /* Look for a NULL char. */
  		  for (temp_len = 0;
! 		       (valaddr + embedded_offset)[temp_len]
! 		       && temp_len < len && temp_len < options->print_max;
! 		       temp_len++);
  		  len = temp_len;
  		}
  
--- 161,173 ----
  
  		  /* Look for a NULL char. */
  		  for (temp_len = 0;
! 		       (temp_len < len
! 			&& temp_len < options->print_max
! 			&& extract_unsigned_integer (valaddr + embedded_offset
! 						     + temp_len * eltlen,
! 						     eltlen) != 0);
! 		       temp_len++)
! 		    ;
  		  len = temp_len;
  		}
  
*************** c_val_print (struct type *type, const gd
*** 209,215 ****
  	  print_function_pointer_address (addr, stream, options->addressprint);
  	  break;
  	}
!       elttype = check_typedef (TYPE_TARGET_TYPE (type));
  	{
  	  addr = unpack_pointer (type, valaddr + embedded_offset);
  	print_unpacked_pointer:
--- 229,236 ----
  	  print_function_pointer_address (addr, stream, options->addressprint);
  	  break;
  	}
!       unresolved_elttype = TYPE_TARGET_TYPE (type);
!       elttype = check_typedef (unresolved_elttype);
  	{
  	  addr = unpack_pointer (type, valaddr + embedded_offset);
  	print_unpacked_pointer:
*************** c_val_print (struct type *type, const gd
*** 228,236 ****
  
  	  /* For a pointer to a textual type, also print the string
  	     pointed to, unless pointer is null.  */
- 	  /* FIXME: need to handle wchar_t here... */
  
! 	  if (textual_element_type (elttype, options->format)
  	      && addr != 0)
  	    {
  	      i = val_print_string (addr, -1, TYPE_LENGTH (elttype), stream,
--- 249,256 ----
  
  	  /* For a pointer to a textual type, also print the string
  	     pointed to, unless pointer is null.  */
  
! 	  if (textual_element_type (unresolved_elttype, options->format)
  	      && addr != 0)
  	    {
  	      i = val_print_string (addr, -1, TYPE_LENGTH (elttype), stream,
*************** c_val_print (struct type *type, const gd
*** 268,274 ****
  		    }
  		  else
  		    {
! 		      wtype = TYPE_TARGET_TYPE (type);
  		    }
  		  vt_val = value_at (wtype, vt_address);
  		  common_val_print (vt_val, stream, recurse + 1, options,
--- 288,294 ----
  		    }
  		  else
  		    {
! 		      wtype = unresolved_elttype;
  		    }
  		  vt_val = value_at (wtype, vt_address);
  		  common_val_print (vt_val, stream, recurse + 1, options,
*************** c_val_print (struct type *type, const gd
*** 442,448 ****
  	     Since we don't know whether the value is really intended to
  	     be used as an integer or a character, print the character
  	     equivalent as well.  */
! 	  if (textual_element_type (type, options->format))
  	    {
  	      fputs_filtered (" ", stream);
  	      LA_PRINT_CHAR ((unsigned char) unpack_long (type, valaddr + embedded_offset),
--- 462,468 ----
  	     Since we don't know whether the value is really intended to
  	     be used as an integer or a character, print the character
  	     equivalent as well.  */
! 	  if (textual_element_type (unresolved_type, options->format))
  	    {
  	      fputs_filtered (" ", stream);
  	      LA_PRINT_CHAR ((unsigned char) unpack_long (type, valaddr + embedded_offset),
Index: gdb/charset.c
===================================================================
RCS file: /cvs/src/src/gdb/charset.c,v
retrieving revision 1.16
diff -c -p -r1.16 charset.c
*** gdb/charset.c	3 Jan 2009 05:57:51 -0000	1.16
--- gdb/charset.c	15 Jan 2009 20:10:38 -0000
***************
*** 30,35 ****
--- 30,39 ----
  #include <iconv.h>
  #endif
  
+ #ifdef HAVE_LANGINFO_CODESET
+ #include <langinfo.h>
+ #endif
+ 
  
  /* How GDB's character set support works
  
*************** struct translation {
*** 162,174 ****
  #define GDB_DEFAULT_TARGET_CHARSET "ISO-8859-1"
  #endif
  
  static const char *host_charset_name = GDB_DEFAULT_HOST_CHARSET;
  static void
  show_host_charset_name (struct ui_file *file, int from_tty,
  			struct cmd_list_element *c,
  			const char *value)
  {
!   fprintf_filtered (file, _("The host character set is \"%s\".\n"), value);
  }
  
  static const char *target_charset_name = GDB_DEFAULT_TARGET_CHARSET;
--- 166,192 ----
  #define GDB_DEFAULT_TARGET_CHARSET "ISO-8859-1"
  #endif
  
+ #ifndef GDB_DEFAULT_TARGET_WIDE_CHARSET
+ #define GDB_DEFAULT_TARGET_WIDE_CHARSET "UTF-32"
+ #endif
+ 
+ #ifndef GDB_INTERNAL_CODESET
+ #define GDB_INTERNAL_CODESET "UCS-4LE"
+ #endif
+ 
  static const char *host_charset_name = GDB_DEFAULT_HOST_CHARSET;
+ static int host_charset_auto = 1;
  static void
  show_host_charset_name (struct ui_file *file, int from_tty,
  			struct cmd_list_element *c,
  			const char *value)
  {
!   fprintf_filtered (file, _("The host character set is \"%s\""), value);
! 
!   if (host_charset_auto)
!     fprintf_filtered (file, _(" (auto).\n"));
!   else
!     fputs_filtered (".\n", file);
  }
  
  static const char *target_charset_name = GDB_DEFAULT_TARGET_CHARSET;
*************** show_target_charset_name (struct ui_file
*** 180,190 ****
--- 198,217 ----
  		    value);
  }
  
+ static const char *target_wide_charset_name = GDB_DEFAULT_TARGET_WIDE_CHARSET;
+ static void
+ show_target_wide_charset_name (struct ui_file *file, int from_tty,
+ 			       struct cmd_list_element *c, const char *value)
+ {
+   fprintf_filtered (file, _("The target wide character set is \"%s\".\n"),
+ 		    value);
+ }
  
  static const char *host_charset_enum[] = 
  {
    "ASCII",
    "ISO-8859-1",
+   "auto",
    0
  };
  
*************** static const char *target_charset_enum[]
*** 197,202 ****
--- 224,246 ----
    0
  };
  
+ static const char *target_wide_charset_enum[] =
+ {
+   "UCS-2",
+   "UCS-2LE",
+   "UCS-2BE",
+   "UCS-4",
+   "UCS-4LE",
+   "UCS-4BE",
+   "UTF-16",
+   "UTF-16LE",
+   "UTF-16BE",
+   "UTF-32",
+   "UTF-32LE",
+   "UTF-32BE",
+   0
+ };
+ 
  /* The global list of all the charsets GDB knows about.  */
  static struct charset *all_charsets;
  
*************** ebcdic_family_charset (const char *name)
*** 376,381 ****
--- 420,474 ----
  
  #if defined(HAVE_ICONV)
  
+ /* Note: this is a stub.  */
+ 
+ static int
+ iconv_char_print_literally (void *baton, int c)
+ {
+   return 1;
+ }
+ 
+ /* Note: this is a stub.  */
+ 
+ static int
+ iconv_to_control (void *baton, int c, int *ctrl_char)
+ {
+   return 0;
+ }
+ 
+ /* Check charset is permitted by iconv, and return a "struct charset *"
+    representing it if so.  Return NULL on failure.  */
+ static struct charset *
+ lookup_and_register_iconv_charset (const char *name)
+ {
+   struct charset **ptr, *cs;
+   iconv_t probe;
+   
+   /* On Solaris, identity conversions are apparently not permitted.  Try two
+      probes: the first from GDB_INTERNAL_CODESET, the second to ASCII.  If one
+      of these succeeds, we know that iconv supports charset NAME.  */
+   probe = iconv_open (name, GDB_INTERNAL_CODESET);
+   if (probe == (iconv_t) -1)
+     probe = iconv_open ("ASCII", name);
+   
+   if (probe == (iconv_t) -1)
+     {
+       warning (_("Invalid iconv character set `%s'."), name);
+       
+       return NULL;
+     }
+     
+   iconv_close (probe);
+   
+   for (ptr = &all_charsets; *ptr; ptr = &(*ptr)->next)
+     if (! strcmp (name, (*ptr)->name))
+       return *ptr;
+   
+   /* Warning: valid_host_charset == 1 isn't necessarily true.  */
+   return simple_charset (xstrdup (name), 1, iconv_char_print_literally, NULL,
+ 			 iconv_to_control, NULL);
+ }
+ 
  struct cached_iconv {
    struct charset *from, *to;
    iconv_t i;
*************** default_c_parse_backslash (void *baton, 
*** 575,580 ****
--- 668,688 ----
  }
  
  
+ /* Similar to default_c_target_char_has_backslash_escape, but works on an
+    internal char in UCS-4.  */
+ static const char *
+ default_c_internal_char_has_backslash_escape (unsigned long internal_char)
+ {
+   const char *ix;
+   
+   ix = strchr (represented, internal_char);
+   if (ix)
+     return backslashed[ix - represented];
+   else
+     return NULL;
+ }
+ 
+ 
  /* Convert using a cached iconv descriptor.  */
  static int
  iconv_convert (void *baton, int from_char, int *to_char)
*************** simple_table_translation (const char *fr
*** 898,904 ****
  
  
  /* The current host and target character sets.  */
! static struct charset *current_host_charset, *current_target_charset;
  
  /* The current functions and batons we should use for the functions in
     charset.h.  */
--- 1006,1013 ----
  
  
  /* The current host and target character sets.  */
! static struct charset *current_host_charset, *current_target_charset,
! 		      *current_target_wide_charset, *internal_charset;
  
  /* The current functions and batons we should use for the functions in
     charset.h.  */
*************** set_host_and_target_charsets (struct cha
*** 1041,1048 ****
  static void
  set_host_charset (const char *charset)
  {
!   struct charset *cs = lookup_charset_or_error (charset);
!   check_valid_host_charset (cs);
    set_host_and_target_charsets (cs, current_target_charset);
  }
  
--- 1150,1183 ----
  static void
  set_host_charset (const char *charset)
  {
!   struct charset *cs;
!   
!   if (strcmp (charset, "auto") == 0)
!     {
!       const char *old_charset_name = host_charset_name;
!       struct charset *old_charset = current_host_charset;
! #ifdef HAVE_LANGINFO_CODESET
!       charset = nl_langinfo (CODESET);
! #else
!       /* No nl_langinfo (CODESET).  Fall back to default.  */
!       charset = GDB_DEFAULT_HOST_CHARSET;
! #endif
!       host_charset_auto = 1;
!       host_charset_name = charset;
!       cs = lookup_and_register_iconv_charset (charset);
!       if (!cs)
!         {
! 	  host_charset_auto = 0;
! 	  host_charset_name = old_charset_name;
! 	  cs = old_charset;
! 	}
!     }
!   else
!     {  
!       cs = lookup_charset_or_error (charset);
!       host_charset_auto = 0;
!       check_valid_host_charset (cs);
!     }
    set_host_and_target_charsets (cs, current_target_charset);
  }
  
*************** set_target_charset (const char *charset)
*** 1055,1060 ****
--- 1190,1203 ----
    set_host_and_target_charsets (current_host_charset, cs);
  }
  
+ static void
+ set_target_wide_charset (const char *charset)
+ {
+   struct charset *cs = lookup_and_register_iconv_charset (charset);
+   
+   current_target_wide_charset = cs;
+ }
+ 
  
  /* 'Set charset', 'set host-charset', 'set target-charset', 'show
     charset' sfunc's.  */
*************** set_target_charset_sfunc (char *charset,
*** 1087,1092 ****
--- 1230,1243 ----
    set_target_charset (target_charset_name);
  }
  
+ /* Wrapper for the 'set target-wide-charset' command.  */
+ static void
+ set_target_wide_charset_sfunc (char *charset, int from_tty,
+ 			       struct cmd_list_element *c)
+ {
+   set_target_wide_charset (target_wide_charset_name);
+ }
+ 
  /* sfunc for the 'show charset' command.  */
  static void
  show_charset (struct ui_file *file, int from_tty, struct cmd_list_element *c,
*************** show_charset (struct ui_file *file, int 
*** 1103,1108 ****
--- 1254,1261 ----
        fprintf_filtered (file, _("The current target character set is `%s'.\n"),
  			target_charset ());
      }
+   fprintf_filtered (file, _("The current target wide character set is `%s'.\n"),
+ 		    target_wide_charset ());
  }
  
  
*************** target_charset (void)
*** 1120,1125 ****
--- 1273,1284 ----
    return current_target_charset->name;
  }
  
+ const char *
+ target_wide_charset (void)
+ {
+   return current_target_wide_charset->name;
+ }
+ 
  
  
  /* Public character management functions.  */
*************** target_char_to_host (int target_char, in
*** 1174,1179 ****
--- 1333,1516 ----
            (target_char_to_host_baton, target_char, host_char));
  }
  
+ /* Wide character support, via iconv.  */
+ 
+ static struct cached_iconv cached_iconv_target_to_internal;
+ static struct cached_iconv cached_iconv_internal_to_host;
+ 
+ static iconv_t
+ target_to_internal_iconv_t (void)
+ {
+   check_iconv_cache (&cached_iconv_target_to_internal,
+ 		     current_target_wide_charset,
+ 		     internal_charset);
+   
+   return cached_iconv_target_to_internal.i;
+ }
+ 
+ static iconv_t
+ internal_to_host_iconv_t (void)
+ {
+   check_iconv_cache (&cached_iconv_internal_to_host,
+ 		     internal_charset,
+ 		     current_host_charset);
+   
+   return cached_iconv_internal_to_host.i;
+ }
+ 
+ void
+ reset_host_char_state (struct ui_file *stream)
+ {
+   char resetcode[200];  /* FIXME: Yuck, fixed-size buffer.  */
+   size_t output_to_go = sizeof (resetcode), ret;
+   char *op = &resetcode[0];
+   iconv_t cd = internal_to_host_iconv_t ();
+   
+   ret = iconv (cd, NULL, NULL, &op, &output_to_go);
+   
+   if (ret != -1)
+     {
+       int i, reset_seq_length = sizeof (resetcode) - output_to_go;
+       
+       for (i = 0; i < reset_seq_length; i++)
+         fputc_filtered (resetcode[i], stream);
+     }
+ }
+ 
+ /* Convert target bytes at *CP until we've read one code point in internal form
+    (UCS-4).  Move *CP to the next input (multibyte) character.  Returns the
+    converted character in *INTERN.  Returns 0 on success, 1 on error.  */
+ 
+ int
+ target_char_to_internal (unsigned long *intern, gdb_byte **cp)
+ {
+   char *ip = *cp;
+   char outbuf[4], *op;
+   size_t outbytesleft = sizeof (outbuf), ret, inbytes, probe_inbytes;
+   unsigned long internal = 0;
+   int i;
+   iconv_t cd = target_to_internal_iconv_t ();
+ 
+   probe_inbytes = 1;
+ 
+   *intern = 0;
+ 
+   while (outbytesleft != 0)
+     {
+       inbytes = probe_inbytes;
+       memset (outbuf, '\0', sizeof (outbuf));
+       ip = *cp;
+       op = &outbuf[0];
+       outbytesleft = sizeof (outbuf);
+ 
+       /* Reset conversion state.  */
+       iconv (cd, NULL, NULL, NULL, NULL);
+       /* And do conversion.  */
+       ret = iconv (cd, (ICONV_CONST char **) &ip, &inbytes, &op, &outbytesleft);
+ 
+       if (ret == (size_t) -1)
+         {
+ 	  switch (errno)
+ 	    {
+ 	    case EILSEQ:
+ 	      /* Illegal multibyte sequence -- give up.  */
+ 	      (*cp) += probe_inbytes;
+ 	      return 1;
+ 	    
+ 	    case EINVAL:
+ 	      /* Incomplete multibyte sequence.  Try converting a longer
+ 		 one.  */
+ 	      probe_inbytes++;
+ 	      break;
+ 	    
+ 	    default:
+ 	      /* Something else went wrong.  */
+ 	      error (_("GDB encountered unexpected `iconv' error."));
+ 	      return 1;
+ 	    }
+ 	}
+     }
+   
+   /* Note: We explicitly use little-endian UCS-4 for our internal
+      representation, so that this gets the codepoint right.  */
+   for (i = 0; i < 4; i++)
+     internal |= (unsigned char) outbuf[i] << (i * 8);
+   
+   /* Move to next input char.  */
+   *cp = ip;
+   *intern = internal;
+   
+   return 0;
+ }
+ 
+ /* Return 0 on success, 1 on error.  */
+ 
+ int
+ internal_char_host_emit (struct ui_file *stream, unsigned long codept)
+ {
+   char inbuf[4], *outbuf, *ip, *op;
+   static size_t outbufsize = 4;
+   size_t inbytesleft, rc, outbytesleft;
+   int i, converted;
+   iconv_t cd = internal_to_host_iconv_t ();
+   
+   /* Handle control characters, etc. specially.  Hm, this is C-specific.  */
+   if (codept < 32 || codept == 127)
+     {
+       const char *esc = default_c_internal_char_has_backslash_escape (codept);
+       
+       if (esc)
+         fprintf_filtered (stream, "\\%s", esc);
+       else
+         fprintf_filtered (stream, "\\%.3lo", codept);
+ 
+       return 0;
+     }
+   
+   for (i = 0; i < 4; i++)
+     {
+       inbuf[i] = codept & 255;
+       codept >>= 8;
+     }
+   
+   outbuf = xmalloc (outbufsize);
+   
+   while (1)
+     {
+       ip = &inbuf[0];
+       op = outbuf;
+       inbytesleft = 4;
+       outbytesleft = outbufsize;
+       /* Reset conversion state.  */
+       iconv (cd, NULL, NULL, NULL, NULL);
+       /* Attempt conversion.  */
+       rc = iconv (cd, (ICONV_CONST char **) &ip, &inbytesleft, &op,
+ 		  &outbytesleft);
+       
+       if (rc != (size_t) -1)
+         break;
+ 
+       if (errno == E2BIG)
+         {
+ 	  outbufsize *= 2;
+ 	  outbuf = xrealloc (outbuf, outbufsize);
+ 	}
+       else
+         break;
+     }
+   
+   converted = outbufsize - outbytesleft;
+   
+   if (inbytesleft != 0 || converted == 0 || rc > 0)
+     return 1;
+   
+   for (i = 0; i < converted; i++)
+     fputc_filtered (outbuf[i], stream);
+   
+   free (outbuf);
+   
+   return 0;
+ }
  
  
  /* The charset.c module initialization function.  */
*************** _initialize_charset (void)
*** 1231,1236 ****
--- 1568,1576 ----
  
    set_host_charset (host_charset_name);
    set_target_charset (target_charset_name);
+   set_target_wide_charset (target_wide_charset_name);
+ 
+   internal_charset = lookup_and_register_iconv_charset (GDB_INTERNAL_CODESET);
  
    add_setshow_enum_cmd ("charset", class_support,
  			host_charset_enum, &host_charset_name, _("\
*************** To see a list of the character sets GDB 
*** 1271,1274 ****
--- 1611,1628 ----
  			set_target_charset_sfunc,
  			show_target_charset_name,
  			&setlist, &showlist);
+ 
+   target_wide_charset_name = xstrdup (GDB_DEFAULT_TARGET_WIDE_CHARSET);
+ 
+   add_setshow_enum_cmd ("target-wide-charset", class_support,
+ 			target_wide_charset_enum, &target_wide_charset_name,
+ 			_("\
+ Set the target wide character (wchar_t) character set."), _("\
+ Show the target wide character (wchar_t) character set."), _("\
+ The `target wide character set' is the one used by the program being\n\
+ debugged for wide characters, e.g. literal wchar_t strings."),
+ 			set_target_wide_charset_sfunc,
+ 			show_target_wide_charset_name,
+ 			&setlist, &showlist);
+ 
  }
Index: gdb/charset.h
===================================================================
RCS file: /cvs/src/src/gdb/charset.h,v
retrieving revision 1.7
diff -c -p -r1.7 charset.h
*** gdb/charset.h	3 Jan 2009 05:57:51 -0000	1.7
--- gdb/charset.h	15 Jan 2009 20:10:38 -0000
***************
*** 49,54 ****
--- 49,55 ----
     it.  */
  const char *host_charset (void);
  const char *target_charset (void);
+ const char *target_wide_charset (void);
  
  /* In general, the set of C backslash escapes (\n, \f) is specific to
     the character set.  Not all character sets will have form feed
*************** int target_char_to_host (int target_char
*** 103,107 ****
--- 104,117 ----
     zero.  */
  int target_char_to_control_char (int target_char, int *target_ctrl_char);
  
+ /* Wide character support: reset terminal state.  */
+ void reset_host_char_state (struct ui_file *stream);
+ 
+ /* Wide character support: convert target character to internal form.  */
+ int target_char_to_internal (unsigned long *, gdb_byte **cp);
+ 
+ /* Wide character support: emit character in internal form to host output
+    stream.  */
+ int internal_char_host_emit (struct ui_file *stream, unsigned long codept);
  
  #endif /* CHARSET_H */
Index: gdb/c-lang.c
===================================================================
RCS file: /cvs/src/src/gdb/c-lang.c,v
retrieving revision 1.60
diff -c -p -r1.60 c-lang.c
*** gdb/c-lang.c	3 Jan 2009 05:57:51 -0000	1.60
--- gdb/c-lang.c	15 Jan 2009 20:10:38 -0000
*************** c_printchar (int c, struct ui_file *stre
*** 78,83 ****
--- 78,213 ----
    fputc_filtered ('\'', stream);
  }
  
+ void
+ c_internal_char_host_emit (struct ui_file *stream, unsigned long codept)
+ {
+   int err;
+   
+   err = internal_char_host_emit (stream, codept);
+   
+   /* Some error occurred before printing anything.  NOTE: This can cause
+      ambiguity in the displayed output.  Not sure what to do about that.  */
+   if (err)
+     fprintf_filtered (stream, "\\x%.4lx", codept);
+ }
+ 
+ /* Convert wchar_t elements (of WIDTH bytes each) from target memory to
+    internal form (a buffer of PRINT_MAX such elements) -- UCS-4 code points in
+    host endianness. Perform repeated character detection on this buffer --
+    allowing extension in case more characters are repeated.  If a break in
+    repetition is detected, emit elements (in internal form) to the output
+    stream, in the host charset.
+    Don't print more than LENGTH target elements.
+    Note: WIDTH is currently ignored.  */
+ 
+ void
+ c_printwidestr (struct ui_file *stream, const gdb_byte *string,
+ 		unsigned int length, int width, int force_ellipses,
+ 		const struct value_print_options *options)
+ {
+   unsigned long *buffer;
+   int buf_read_idx = 0, buf_write_idx = 0, repeat_starts_at = 0;
+   gdb_byte *sp = (gdb_byte *) string;
+   unsigned long repeating_char = -1u;
+   int repeat_count = 0, endpoint;
+   int in_quotes = 0, need_comma = 0, found_terminator = 0, any_errs = 0;
+   unsigned int buf_length = options->print_max + 1, things_printed = 0;
+   
+   buffer = xmalloc (sizeof (long) * buf_length);
+   
+   /* Most likely this is not necessary.  */
+   reset_host_char_state (stream);
+   
+   while (!found_terminator || buf_read_idx != buf_write_idx)
+     {
+       int err = target_char_to_internal (&buffer[buf_write_idx], &sp);
+ 
+       any_errs |= err;
+ 
+       if (need_comma)
+         {
+ 	  fputs_filtered (", ", stream);
+ 	  need_comma = 0;
+ 	}
+ 
+       if (buffer[buf_write_idx] == repeating_char && !found_terminator)
+         repeat_count++;
+       else
+         {
+ 	  int repeating_tail = repeat_count > options->repeat_count_threshold;
+ 	  int nonrepeating_end = repeating_tail ? repeat_starts_at
+ 						: buf_write_idx;
+ 	  int nonrepeating_head = nonrepeating_end > buf_read_idx;
+ 
+ 	  if (!in_quotes && nonrepeating_head)
+ 	    {
+ 	      if (options->inspect_it)
+ 		fputs_filtered ("\\\"", stream);
+ 	      else
+ 		fputs_filtered ("\"", stream);
+ 	      in_quotes = 1;
+ 	    }
+ 
+ 	  while (buf_read_idx < nonrepeating_end)
+ 	    {
+ 	      c_internal_char_host_emit (stream, buffer[buf_read_idx++]);
+ 	      things_printed++;
+ 	    }
+ 
+ 	  if (repeating_tail)
+ 	    {
+ 	      if (in_quotes)
+ 	        {
+ 		  if (options->inspect_it)
+ 		    fputs_filtered ("\\\", ", stream);
+ 		  else
+ 		    fputs_filtered ("\", ", stream);
+ 		  in_quotes = 0;
+ 		}	    
+ 
+ 	      fputc_filtered ('\'', stream);
+ 	      c_internal_char_host_emit (stream, repeating_char);
+ 	      fputc_filtered ('\'', stream);
+ 
+ 	      fprintf_filtered (stream, _(" <repeats %u times>"), repeat_count);
+ 	      buf_read_idx = buf_write_idx;
+ 	      
+ 	      things_printed += repeat_count;
+ 	      
+ 	      need_comma = 1;
+ 	    }
+ 	
+ 	  repeating_char = buffer[buf_write_idx];
+ 	  repeat_starts_at = buf_write_idx;
+ 	  repeat_count = 1;
+ 	}
+ 
+       if (buf_write_idx < length && things_printed < options->print_max && !err)
+ 	buf_write_idx++;
+       else
+ 	found_terminator = 1;
+     }
+ 
+   if (in_quotes)
+     {
+       if (options->inspect_it)
+ 	fputs_filtered ("\\\"", stream);
+       else
+ 	fputs_filtered ("\"", stream);
+     }
+ 
+   /* Most likely this is not necessary.  */
+   reset_host_char_state (stream);
+ 
+   if (any_errs)
+     fputs_filtered ("<character conversion error>", stream);
+ 
+   if (force_ellipses || buf_write_idx < length)
+     fputs_filtered ("...", stream);
+ 
+   xfree (buffer);
+ }
+ 
  /* Print the character string STRING, printing at most LENGTH characters.
     LENGTH is -1 if the string is nul terminated.  Each character is WIDTH bytes
     long.  Printing stops early if the number hits print_max; repeat counts are
*************** c_printstr (struct ui_file *stream, cons
*** 109,114 ****
--- 239,250 ----
        return;
      }
  
+   if (width > 1)
+     {
+       c_printwidestr (stream, string, length, width, force_ellipses, options);
+       return;
+     }
+ 
    for (i = 0; i < length && things_printed < options->print_max; ++i)
      {
        /* Position of the character we are examining
Index: gdb/printcmd.c
===================================================================
RCS file: /cvs/src/src/gdb/printcmd.c,v
retrieving revision 1.141
diff -c -p -r1.141 printcmd.c
*** gdb/printcmd.c	3 Jan 2009 05:57:53 -0000	1.141
--- gdb/printcmd.c	15 Jan 2009 20:10:41 -0000
*************** print_formatted (struct value *val, int 
*** 269,279 ****
        switch (options->format)
  	{
  	case 's':
! 	  /* FIXME: Need to handle wchar_t's here... */
! 	  next_address = VALUE_ADDRESS (val)
! 	    + val_print_string (VALUE_ADDRESS (val), -1, 1, stream,
! 				options);
! 	  return;
  
  	case 'i':
  	  /* We often wrap here if there are long symbolic names.  */
--- 269,293 ----
        switch (options->format)
  	{
  	case 's':
! 	  {
! 	    struct type *elttype = TYPE_TARGET_TYPE (type)
! 				     ? check_typedef (TYPE_TARGET_TYPE (type))
! 				     : NULL;
! 	    unsigned eltlen = 1;
! 
! 	    /* If this is a plausible string of wide characters, try to print
! 	       it as such.  */
! 	    if (TYPE_CODE (type) == TYPE_CODE_PTR
! 		&& elttype
! 		&& TYPE_CODE (elttype) == TYPE_CODE_INT
! 		&& (TYPE_LENGTH (elttype) == 2 || TYPE_LENGTH (elttype) == 4))
! 	      eltlen = TYPE_LENGTH (elttype);
! 
! 	    next_address = VALUE_ADDRESS (val)
! 	      + val_print_string (VALUE_ADDRESS (val), -1, eltlen, stream,
! 				  options);
! 	    return;
! 	  }
  
  	case 'i':
  	  /* We often wrap here if there are long symbolic names.  */
Index: gdb/configure.ac
===================================================================
RCS file: /cvs/src/src/gdb/configure.ac,v
retrieving revision 1.84
diff -c -p -r1.84 configure.ac
*** gdb/configure.ac	12 Jan 2009 01:10:27 -0000	1.84
--- gdb/configure.ac	15 Jan 2009 20:10:42 -0000
*************** AC_DEFINE(GDB_DEFAULT_HOST_CHARSET, "ISO
*** 1913,1918 ****
--- 1913,1920 ----
  
  AM_ICONV
  
+ AM_LANGINFO_CODESET
+ 
  AC_OUTPUT(Makefile .gdbinit:gdbinit.in gnulib/Makefile,
  [
  dnl Autoconf doesn't provide a mechanism for modifying definitions 
Index: gdb/acinclude.m4
===================================================================
RCS file: /cvs/src/src/gdb/acinclude.m4,v
retrieving revision 1.24
diff -c -p -r1.24 acinclude.m4
*** gdb/acinclude.m4	3 Jan 2009 05:57:50 -0000	1.24
--- gdb/acinclude.m4	15 Jan 2009 20:10:42 -0000
*************** sinclude(../config/acx.m4)
*** 23,28 ****
--- 23,31 ----
  dnl for TCL definitions
  sinclude(../config/tcl.m4)
  
+ dnl for langinfo check
+ sinclude(../config/codeset.m4)
+ 
  dnl For dependency tracking macros.
  sinclude([../config/depstand.m4])
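
For readers who want to experiment with the quoting and repeat-elision behaviour of c_printwidestr outside GDB, here is a minimal standalone sketch. It is not the patch's algorithm: the patch streams characters through a bounded buffer, whereas this version simply looks ahead over an array that is already in memory. The names format_runs, append_text and emit_codepoint are hypothetical, and non-ASCII code points are always shown with the "\x%.4lx" fallback instead of being converted to the host charset.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not in GDB): append formatted text to OUT, which
   holds CAP bytes and currently contains *LEN characters.  Output that
   does not fit is silently truncated.  */
static void
append_text (char *out, size_t cap, size_t *len, const char *fmt, ...)
{
  va_list ap;
  int n;

  if (*len >= cap)
    return;
  va_start (ap, fmt);
  n = vsnprintf (out + *len, cap - *len, fmt, ap);
  va_end (ap);
  if (n < 0)
    return;
  if ((size_t) n >= cap - *len)
    *len = cap - 1;
  else
    *len += (size_t) n;
}

/* Printable ASCII is copied through unchanged; anything else uses the
   same "\x" fallback format as c_internal_char_host_emit.  Quote and
   backslash escaping is deliberately left out of this sketch.  */
static void
emit_codepoint (char *out, size_t cap, size_t *len, unsigned long cp)
{
  if (cp >= 0x20 && cp < 0x7f)
    append_text (out, cap, len, "%c", (int) cp);
  else
    append_text (out, cap, len, "\\x%.4lx", cp);
}

/* Format N code points from CP into OUT.  Runs longer than THRESHOLD are
   shown as 'c' <repeats N times>; everything else goes inside double
   quotes.  Returns the number of characters written.  */
size_t
format_runs (const unsigned long *cp, size_t n, unsigned threshold,
	     char *out, size_t cap)
{
  size_t len = 0, i = 0, k;
  int in_quotes = 0, need_comma = 0;

  if (cap == 0)
    return 0;
  out[0] = '\0';

  while (i < n)
    {
      size_t run = 1;

      /* Measure the run of identical code points starting at I.  */
      while (i + run < n && cp[i + run] == cp[i])
	run++;

      if (run > threshold)
	{
	  if (in_quotes)
	    {
	      append_text (out, cap, &len, "\", ");
	      in_quotes = 0;
	    }
	  else if (need_comma)
	    append_text (out, cap, &len, ", ");

	  append_text (out, cap, &len, "'");
	  emit_codepoint (out, cap, &len, cp[i]);
	  append_text (out, cap, &len, "' <repeats %zu times>", run);
	  need_comma = 1;
	}
      else
	{
	  if (!in_quotes)
	    {
	      if (need_comma)
		append_text (out, cap, &len, ", ");
	      append_text (out, cap, &len, "\"");
	      in_quotes = 1;
	      need_comma = 0;
	    }
	  for (k = 0; k < run; k++)
	    emit_codepoint (out, cap, &len, cp[i]);
	}
      i += run;
    }

  if (in_quotes)
    append_text (out, cap, &len, "\"");
  return len;
}
```

Calling format_runs on the eight code points of "abaaaaac" with a threshold of 3 gives the three-part form GDB users will recognise: a quoted "ab", then the repeated 'a' with its count, then a quoted "c".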

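The AM_LANGINFO_CODESET check added above exists so that "set host-charset auto" can ask the C library which codeset the current locale uses. A minimal sketch of that query follows, using only the standard setlocale and nl_langinfo calls. The function name detect_host_charset is hypothetical, and falling back to ISO-8859-1 (GDB's current default host charset) is an assumption of this sketch rather than something taken from the patch.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not in GDB): return the codeset name of the
   locale selected by the environment, e.g. "UTF-8" when LC_ALL is
   en_US.UTF-8.  The exact name reported for the "C" locale varies
   between C libraries.  */
const char *
detect_host_charset (void)
{
  const char *codeset;

  /* Adopt the character-type locale from the environment; without this
     call the process stays in the "C" locale.  */
  setlocale (LC_CTYPE, "");

  codeset = nl_langinfo (CODESET);

  /* Assumed fallback: keep the existing default if nothing useful is
     reported.  */
  if (codeset == NULL || codeset[0] == '\0')
    return "ISO-8859-1";
  return codeset;
}
```

The returned pointer may refer to static storage that later setlocale or nl_langinfo calls overwrite, so a caller that keeps the name around should copy it.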
Follow-Ups:
- Re: [PATCH/WIP] C/C++ wchar_t/Unicode printing support
  - From: Tom Tromey
- Re: [PATCH/WIP] C/C++ wchar_t/Unicode printing support
  - From: Eli Zaretskii
- Re: [PATCH/WIP] C/C++ wchar_t/Unicode printing support
  - From: Tom Tromey
