This is the mail archive of the
gdb@sourceware.org
mailing list for the GDB project.
[RFC] string handling in python
- From: Thiago Jung Bauermann <bauerman at br dot ibm dot com>
- To: gdb ml <gdb at sourceware dot org>
- Date: Mon, 07 Jul 2008 02:25:02 -0300
- Subject: [RFC] string handling in python
Hi folks,
I've been thinking about how we should handle strings and charsets in
Python, and I have some ideas. I'd like to ask your opinion about them.
First, some explanation about strings in Python, and how it deals with
different character sets (warning, I just learned about this stuff today
and I may be wrong about it...):
By default it is quite simple: Python doesn't deal with the problem. The
regular string type is just a byte array which has some convenience
methods to treat it as a string of 8-bit characters. If Python ever
needs to assume a charset, it will assume ASCII (it is possible to
change the default charset in Python, but it is highly discouraged).
Because of this, you can easily run into trouble if you use non-ASCII
characters (even Latin 1) in regular Python strings.
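
To make the ASCII assumption concrete, here is a minimal sketch. It is written for Python 3, which is an assumption on my part: there the byte-array string described above is called bytes and the decode must be requested explicitly, whereas the Python discussed in this mail performs it implicitly. The helper try_ascii is a hypothetical name used only for illustration.

```python
# Sketch only: Python 3 "bytes" stands in for the regular (byte) string
# type described above. try_ascii is a hypothetical helper.

def try_ascii(data):
    """Return data decoded as ASCII, or None if it holds non-ASCII bytes."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None

# "caf\u00e9" is "cafe" with an acute accent on the e; in Latin 1 the
# accented letter is the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")

ascii_result = try_ascii(b"hello")        # pure ASCII, decodes fine
latin1_result = try_ascii(latin1_bytes)   # 0xE9 is not ASCII, so None
```

The byte array itself is perfectly valid; the trouble only starts when something has to guess which charset it is in.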
There's another string type, the Unicode string. You get one by
prefixing a string literal with u, as in u"hello, world!". I believe
the internal representation is UTF-32 or UCS-4, but I'm not sure and in
fact it doesn't matter. Python can convert back and forth between
Unicode and several charsets, and from what I read you can mix Unicode
strings with regular strings and things will work (as long as the
regular strings are ASCII-only or you explicitly convert them to Unicode
using string.decode("some_charset")).
There's some more info about this in
http://effbot.org/zone/unicode-objects.htm
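
A small sketch of the explicit conversion mentioned above, again assuming Python 3 (where the Unicode string type is plain str and the byte string is bytes). The charsets and sample data are arbitrary choices for illustration.

```python
# Sketch only: explicit decode of a byte string into a Unicode string,
# then mixing it with other Unicode text.

latin1_bytes = b"caf\xe9"                  # 4 bytes, Latin 1 encoded

# Decoding: bytes in a known charset -> Unicode string.
text = latin1_bytes.decode("latin-1")

# Once decoded, it combines freely with other Unicode strings.
greeting = "hello, " + text

# Encoding: Unicode string -> bytes in whatever charset is wanted.
utf8_bytes = text.encode("utf-8")
```

Note that the character count stays at 4 after decoding, while the UTF-8 encoding needs 5 bytes, since the accented letter takes two bytes there.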
So, in my opinion for GDB's Python bindings we should always use Unicode
strings, and convert to/from desired encodings as necessary. Strings
provided by the user would be assumed to have host_charset () encoding,
and strings coming from/going to the inferior would be assumed to have
target_charset () encoding.
So for example, to create a value object of char * type using a string
provided by the user or coming from Python code, GDB would first convert
the Python string object (assumed to be in the host charset) to a
Unicode object (this process is called "decoding", in Python parlance),
and then convert it from Unicode to a string in the target charset. This
is what is implemented at the moment in gdbpy_make_value in the git
repo, BTW.
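
The two-step conversion can be sketched in pure Python as below. This is not the gdbpy_make_value code, only an illustration of the idea: host_to_target is a hypothetical helper, the charset names are passed in as plain strings standing in for whatever host_charset () and target_charset () would report, and Python 3 is assumed.

```python
# Sketch only: host charset -> Unicode -> target charset.
# host_to_target is a hypothetical name, not part of GDB.

def host_to_target(data, host_charset, target_charset):
    """Re-encode a byte string from the host charset to the target charset."""
    text = data.decode(host_charset)        # "decoding": bytes -> Unicode
    return text.encode(target_charset)      # "encoding": Unicode -> bytes

# Example: the user types on a UTF-8 host, the inferior uses Latin 1.
host_bytes = "na\u00efve".encode("utf-8")   # i with diaeresis is 2 bytes here
target_bytes = host_to_target(host_bytes, "utf-8", "latin-1")
```

If the string contains a character the target charset cannot represent, the encode step raises UnicodeEncodeError, so a real implementation would need to decide how to report that to the user.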
What do you think?
--
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center