Summary: | C strings, gdb.Value.__str__, and Python 3 | ||
---|---|---|---|
Product: | gdb | Reporter: | Samuel Bronson <naesten> |
Component: | python | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | b.r.longbons, tromey |
Priority: | P2 | ||
Version: | 7.7 | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: |
Description
Samuel Bronson
2014-07-10 02:46:21 UTC
Having done a lot, there are only a couple of solutions for dealing with strings:

1. Avoid unicode strings entirely, use 'bytes', avoid operations that differ (particularly, indexing, which returns an integer in python3). Disadvantage: lots of operations are not available on 'bytes' in python3.
2. Use unicode everywhere, with errors='surrogateescape'. Disadvantages: have to put up with lots of whining from the Python community about how you don't understand strings; has no C implementation in python2 and you have to bundle the python version.
3. Use unicode in python3, bytes in python2. Advantage: avoids most of the language-feature problems. Disadvantage: *lots* of opportunities for subtle bugs, such as the ones mentioned in this bug report.

You'll note that all the real difficulties occur only in Python3, since it *insists* that you have absolute knowledge about and control over your users (this has caused no end of pain for people writing webservers).

3 is what programs do if you don't pay any attention. 2 is feasible if you are developing new code with python3 as your primary target (from __future__ import unicode_literals). 1 is the most correct for the kind of work gdb is doing, but can be painful without a. But that leads us to:

4. Invent an entire new string type that just DTRT in both python2 and python3. This *should* be possible as long as everyone duck types. It probably *is* safe to assume that any unicode string you get (mostly, from python string literals) is safe to treat as utf-8 (most of them will be ascii anyway), but for the vast majority of your code, you can just deal with byte strings in whatever encoding the inferior wants.

In approach 4, all functions that take strings, just need to feed them through the new string factory, so there's not a lot of pain on callers. There is, however, a problem that you can't use *builtin* functions on strings, particularly you can't write: '%s %s' % (a, b), you have to write bstring('%s %s') % (a, b).
The best we can do for this case is try to see if it's possible to make that always throw an exception, so at least they will fail quickly. Unless maybe we hook in an AST rewriter like py.test does ...

If all this seems too complicated:

5. Just stick with python2 forever, and achieve ultimate success by simply ignoring python3 and unicode.

I think this area has been cleaned up somewhat and now this bug is obsolete.

Python 2 is no longer supported.

Value.string does try to make a Python string and will fail if the encoding is wrong. It works like Python decoders, though.

Value.lazy_string can be used from pretty-printers to defer decisions to gdb's internal decoding code. When printing this can handle possibly-incorrect contents.

Memory can be read directly now and returned as a memoryview object. This avoids all encoding problems and lets code deal with just bytes.
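
For readers less familiar with the trade-offs between options 1 and 2 above, here is a minimal plain-Python 3 sketch that runs outside gdb. The byte string `raw` is a made-up stand-in for data read from an inferior, not output from any real gdb call. It shows integer indexing on bytes, a strict decode failing on invalid input, a lossless round trip with errors='surrogateescape', and a memoryview over the same bytes.

```python
# Hypothetical data: "caf" followed by a lone 0xE9 byte (Latin-1 "e acute"),
# which is not valid UTF-8. It stands in for memory read from an inferior.
raw = b"caf\xe9"

# Option 1: stay in bytes. In Python 3, indexing yields an int,
# while slicing still yields bytes.
first = raw[0]
first_slice = raw[0:1]

# A strict decode rejects the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Option 2: decode with surrogateescape. The bad byte becomes a lone
# surrogate code point, and encoding with the same handler restores it.
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

# A memoryview exposes the same bytes without any decoding step.
view = memoryview(raw)
```

The surrogateescape round trip is lossless only when the same codec and error handler are used in both directions. The lone surrogate in `text` cannot be encoded with the default strict handler, which is one of the ways such strings cause trouble when they leak into other code.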