This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
[Bug runtime/14487] New: need better UTF-8 handling
- From: "jistone at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: systemtap at sourceware dot org
- Date: Sat, 18 Aug 2012 01:00:59 +0000
- Subject: [Bug runtime/14487] New: need better UTF-8 handling
- Auto-submitted: auto-generated
http://sourceware.org/bugzilla/show_bug.cgi?id=14487
Bug #: 14487
Summary: need better UTF-8 handling
Product: systemtap
Version: unspecified
Status: NEW
Severity: enhancement
Priority: P2
Component: runtime
AssignedTo: systemtap@sourceware.org
ReportedBy: jistone@redhat.com
Classification: Unclassified
We generally take the blissful approach that all strings are merely
0-terminated byte sequences, and we don't care much about the meaning of those
bytes.
This breaks down wherever we start splitting up those bytes, though. The most
obvious case is truncation at MAXSTRINGLEN, which can leave an incomplete
UTF-8 sequence at the tail. (Fortunately, UTF-8 is robust enough that this
corrupts only one Unicode character in the output.) We also have functions
like substr() that count by bytes rather than by characters.
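
To illustrate the byte-versus-character gap, here is a minimal sketch in C. The
helper name utf8_strlen_sketch is hypothetical and not part of the systemtap
runtime, and it assumes the input is already valid UTF-8. It counts code points
by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a systemtap function): count the code points in
 * a NUL-terminated string that is ASSUMED to be valid UTF-8. Every byte
 * that is not a continuation byte (10xxxxxx) starts a new code point. */
static size_t utf8_strlen_sketch(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Small self-check: "h", U+00E9 (two bytes), "l", "l", "o". */
static void utf8_strlen_demo(void)
{
    const char *s = "h\xC3\xA9llo";

    assert(strlen(s) == 6);             /* bytes */
    assert(utf8_strlen_sketch(s) == 5); /* code points */
}
```

A character-aware substr() would need to index by this kind of count rather
than by byte offset.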
It's not clear that we can solve this completely, but if we commit to the
worldview that all strings are UTF-8, then we could write and use our own
runtime strlcpy8, strlcat8, etc. functions that preserve character boundaries.
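
As a rough sketch of what such a function might look like (the names
strlcpy8_sketch and utf8_prefix_len are hypothetical, and the input is assumed
to be valid UTF-8), a boundary-preserving copy can truncate as strlcpy does and
then back up while the first dropped byte is a continuation byte:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: longest prefix of src, at most max bytes, that does
 * not end in the middle of a UTF-8 sequence. ASSUMES src is valid UTF-8. */
static size_t utf8_prefix_len(const char *src, size_t len, size_t max)
{
    size_t n = len < max ? len : max;

    if (n == len)
        return n; /* nothing is being cut off */

    /* src[n] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), the cut falls inside a sequence,
     * so back up until src[n] is the lead byte and drop that too. */
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
        n--;
    return n;
}

/* Sketch of a strlcpy-like copy that never splits a code point.
 * Like strlcpy, it returns strlen(src) so callers can detect truncation. */
static size_t strlcpy8_sketch(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        size_t n = utf8_prefix_len(src, len, size - 1);

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Small self-check: a 3-byte buffer cannot hold "h" plus the 2-byte
 * U+00E9 plus the terminator, so only "h" is kept. */
static void strlcpy8_demo(void)
{
    char buf[3];

    assert(strlcpy8_sketch(buf, "h\xC3\xA9llo", sizeof(buf)) == 6);
    assert(strcmp(buf, "h") == 0);
}
```

The cost of this approach is that the result may be up to three bytes shorter
than a plain byte truncation would give.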
Even then, this is preserving only *code points*, whereas one may really have
composite characters with combining diacritical marks and such. I believe
combining characters are in specific ranges (though new Unicode versions can
expand this), so really fancy runtime functions might preserve these
connections too.
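
For illustration only, a range check of that kind might look like the sketch
below. The function name is hypothetical, and the list covers only the Unicode
blocks explicitly named "Combining ..."; many other combining characters live
outside these blocks, so a real implementation would need a fuller table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: is this decoded code point in one of the
 * dedicated combining-mark blocks? DELIBERATELY INCOMPLETE: scripts
 * outside these blocks also contain combining characters. */
static bool is_combining_mark_sketch(uint32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F)   /* Combining Diacritical Marks */
        || (cp >= 0x1AB0 && cp <= 0x1AFF)   /* ... Extended */
        || (cp >= 0x1DC0 && cp <= 0x1DFF)   /* ... Supplement */
        || (cp >= 0x20D0 && cp <= 0x20FF)   /* ... for Symbols */
        || (cp >= 0xFE20 && cp <= 0xFE2F);  /* Combining Half Marks */
}

/* Small self-check: U+0301 (combining acute accent) attaches to the
 * preceding base character; plain "e" (U+0065) does not. */
static void combining_mark_demo(void)
{
    assert(is_combining_mark_sketch(0x0301));
    assert(!is_combining_mark_sketch(0x0065));
}
```

A truncation routine could use such a predicate to back up past trailing
combining marks to their base character before cutting.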