This is the mail archive of the
gdb@sourceware.org
mailing list for the GDB project.
Re: printing wchar_t*
On Friday 14 April 2006 17:59, Eli Zaretskii wrote:
> > In an original post, I've asked if gdb can print wchar_t just as a raw
> > sequence of values, like this:
> >
> > 0x56, 0x1456
>
> The answer is YES. Use array notation, and add a feature to report
> the length of a wchar_t array.
Ok.
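To make the requested output concrete, here is a minimal sketch of the formatting only. It is not GDB code: the helper name format_code_units is hypothetical, and it assumes the wchar_t values have already been read out of the array as plain integers.

```python
def format_code_units(units):
    """Render integer wchar_t values as a comma-separated list of hex numbers."""
    # No interpretation of the values is attempted: each element is printed
    # as-is, whatever encoding the program happens to use for wchar_t.
    return ", ".join(f"0x{u:X}" for u in units)


if __name__ == "__main__":
    print(format_code_units([0x56, 0x1456]))
```

Because nothing is decoded, this kind of output stays correct regardless of whether the elements are code points or code units.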
> Now, the same letter ``small a'' can be encoded in several other ways:
> for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
> 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
> etc. It should be obvious that, of all the encodings, only the
> fixed-length ones can be used in a wchar_t array (because wchar_t
> arrays are stateless,
I don't think this statement is backed up by anything.
> This is why I said that wchar_t is not used for an encoding (such as
> ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints. It is
> nowadays almost universally accepted that wchar_t is a Unicode
> codepoint,
Again, can you provide any specific pointers to support that view?
> the only difference between applications being whether only
> the first 64K characters (the so-called BMP) are supported by 16-bit
> wchar_t, or the entire 23-bit range is supported by a 32-bit wchar_t.
I believe that on Windows:
- wchar_t is 16-bit
- wchar_t* values are supposed to be in UTF-16 encoding
(see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp)
Do you disagree with any of the above statements? If not, then it directly
follows that a given wchar_t is not a Unicode code point, but a code unit in
a specific representation (UTF-16), and a given code point takes either one or
two code units, that is, either one or two wchar_t values. This is contrary to
your statement that wchar_t is a single code point.
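The one-or-two code units point can be sketched as follows. This is only an illustration of the standard UTF-16 encoding rule, assuming a 16-bit wchar_t as on Windows; the function name utf16_code_units is hypothetical.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (16-bit values) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not code points to encode")
    if code_point < 0x10000:
        # Inside the BMP: one code unit, numerically equal to the code point.
        return [code_point]
    # Outside the BMP: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


if __name__ == "__main__":
    for cp in (0x0430, 0x1D11E):
        units = utf16_code_units(cp)
        print(f"U+{cp:04X} ->", ", ".join(f"0x{u:04X}" for u in units))
```

With these inputs, U+0430 fits in a single 16-bit unit, while U+1D11E needs the pair 0xD834, 0xDD1E, so one element of such an array is not always one code point.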
Anyway, this is quickly getting off-topic for the gdb list, so maybe we should
take this somewhere else.
- Volodya