This is the mail archive of the
mailing list for the Kawa project.
Re: Escaping of non-ASCII characters in XML
- From: Per Bothner <per at bothner dot com>
- To: ÐÐÐÑÑÐÐ <dmymd at yandex dot ru>
- Cc: kawa at sources dot redhat dot com
- Date: Mon, 23 Jul 2012 13:21:02 -0700
- Subject: Re: Escaping of non-ASCII characters in XML
- References: <firstname.lastname@example.org>
On 07/23/2012 04:57 AM, ÐÐÐÑÑÐÐ wrote:
I believe the current XML functions for creating XML and found XML in Kawa
practically unusable for languages with a non-Latin script.
E.g. <p>ÐÐÑÐÐÑÑÐÐ</p> is automatically escaped to
All non-ASCII characters are escaped.
This shouldn't really matter in principle. Humans normally wouldn't be
looking at computer-generated XML/HTML. However, it does make the output
bulkier, and it makes "View Source" (or the equivalent) uglier, so it's
certainly not ideal.
Does anyone really need this kind of escaping? Kawa's internal HTTP server
escapes strings after this anyway, so in this case it's a mere duplication.
(The server escaping is also not quite adequate for Ukrainian and Russian,
but this is a different issue.)
Can you remind me where Kawa's internal HTTP server does the escaping?
Is it possible to add "xp.escapeNonAscii = false;" somewhere in the
gnu.kawa.xml.KNode:toString function (gnu\kawa\xml\KNode.java, after line 32).
[I believe this should turn the escaping off, but I don't have JDK at hand to
check.] xp.escapeNonAscii shouldn't affect control characters (these are
encoded anyway), only characters outside ASCII.
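For illustration, the effect of such a flag can be sketched in plain Java. This is not Kawa's printer code: the class NonAsciiEscapeSketch and its escape method are hypothetical, and the hexadecimal character-reference form is an assumption made for the example.

```java
public class NonAsciiEscapeSketch {
    /**
     * Hypothetical helper, not part of Kawa: escapes XML text content.
     * When escapeNonAscii is true, every code point above 127 is written
     * as a numeric character reference; otherwise it is copied through.
     */
    public static String escape(String text, boolean escapeNonAscii) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp > 127 && escapeNonAscii) {
                // Assumed output form: hexadecimal character reference.
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cyrillic = "\u0414\u0430"; // two Cyrillic letters
        System.out.println(escape(cyrillic, true));  // references only
        System.out.println(escape(cyrillic, false)); // letters kept as-is
    }
}
```

With the flag off, only the markup characters are escaped and the text stays readable, at the cost of depending on the output encoding.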
It might make sense, but I'm a little uncomfortable with the idea that
toString output is different from printing to a file. What if you then print
the toString return value to an ASCII or Latin-1-only file or terminal?
Of course you have the same problem printing strings in general.
If this escaping is desirable for some reason (though I can't think of any),
is it possible to add some variable like *xml-escape-string* to turn this off?
It has the big advantage that the output is portable, regardless of the
environment's character encoding.
Now if most of the world is using Unicode, perhaps it isn't as much of
an issue as it used to be. But I'm guessing ISO/IEC 8859-5 might still be fairly
common in your part of the world - and then what happens if I write out
(say) Ã ?
W3C in http://www.w3.org/TR/xslt-xquery-serialization-30/#HTML_CHARDATA says:
"Entity references and character references SHOULD be used only where
the character is not present in the selected encoding"
A problem is that getting at the encoding and then figuring out if a
character is present are both non-trivial.
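As a rough sketch of the W3C recommendation, assuming the target encoding is already known as a java.nio.charset.Charset (which is the hard part in practice), the standard CharsetEncoder.canEncode method can answer the "is this character present" question. The class CharsetAwareEscapeSketch and its escape method are hypothetical and not part of Kawa.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetAwareEscapeSketch {
    /**
     * Hypothetical helper, not part of Kawa: escapes XML text content,
     * using a character reference only when the target charset cannot
     * represent the character directly.
     */
    public static String escape(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            String s = new String(Character.toChars(cp));
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (encoder.canEncode(s)) {
                out.append(s); // representable in the target encoding
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One Cyrillic letter followed by one Latin-1 accented letter.
        String mixed = "\u0414\u00E9";
        System.out.println(escape(mixed, StandardCharsets.UTF_8));
        System.out.println(escape(mixed, StandardCharsets.ISO_8859_1));
        System.out.println(escape(mixed, StandardCharsets.US_ASCII));
    }
}
```

Passing Charset.forName("ISO-8859-5") instead would keep the Cyrillic letter and escape the accented one, provided the JVM supports that charset. Note that this only decides per character; it does not solve finding out which encoding the eventual reader will use.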
You would probably prefer Cyrillic letters to be non-escaped, but you might want
Ã to be escaped. (And you might prefer this even if you or your server runs
in a Unicode locale, since your clients might not.) So ideally you'd
use a charset (http://www.gnu.org/software/kawa/Character-sets.html) to
control which characters are escaped. But there is a layering problem:
the XML output code should not depend on the Kawa-Scheme language, but charsets are implemented in
pure Scheme. (The solution to that may be to move the actual data type
and the core primitive methods to gnu/kawa/util.)