This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Re: [xsl] RE: [xsl] RE: [xsl] RE: [xsl] RE: [xsl] Re: [xsl] RE: [xsl] Re: [xsl] &nbsp; is being displayed as Á


Hi Theo,

> Ok then, since I *have* consulted the FAQ (and seem to be missing
> this). Could somebody explain to me WHY '&amp;' translates to '&'
> but '&#160;' doesn't change at all?

Let's consider this simple stylesheet:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
  <html>
    <head><title>Test</title></head>
    <body>
      <p>Non-breaking&amp;nbsp;space</p>
      <p>Non-breaking&#160;space</p>
    </body>
  </html>
</xsl:template>
                
</xsl:stylesheet>

This stylesheet is stored on the hard disk as a series of bytes. The
bytes match characters according to the ISO-8859-1 encoding (see the
encoding pseudo-attribute on the XML declaration?).

When the XML parser reads in this as an XML document, it decodes the
bytes into Unicode characters. It also parses the document,
recognising things like start tags (e.g. <p>), built-in entity
references (e.g. &amp;) and character references (e.g. &#160;).

The parser knows that &amp; stands for an & character (because it
knows XML) and knows that &#160; stands for a non-breaking space
character (because it knows XML and Unicode).

The parser reports to the XSLT processor when elements occur and what
characters text is made up of, but doesn't report whether a particular
character was originally serialized as the plain character (an actual
space character), an entity reference or a character reference.

As far as an XSLT processor is concerned, therefore, the following
elements in the stylesheet (or in an XML source document) would all be
reported as *exactly* the same (a p element containing a text node
whose string value is a double-quote character):

  <p>"</p>
  <p>&#34;</p>
  <p>&#x22;</p>
  <p>&quot;</p>
  <p><![CDATA["]]></p>
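
As a quick check of this, here is a minimal sketch in Python (a
language chosen purely for illustration; nothing in the thread depends
on it), using the standard xml.etree.ElementTree parser. It parses each
of the five serializations and collects the text that the parser hands
to the application. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Five different serializations of a p element holding one double quote.
serializations = [
    '<p>"</p>',
    '<p>&#34;</p>',
    '<p>&#x22;</p>',
    '<p>&quot;</p>',
    '<p><![CDATA["]]></p>',
]

# The parser reports only the resulting characters, not how they were written.
texts = [ET.fromstring(s).text for s in serializations]
print(texts)
```

All five entries in the list are the same one-character string, which
is all that an application sitting behind the parser gets to see.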

The two p elements, as serialized in the stylesheet, look like:

  <p>Non-breaking&amp;nbsp;space</p>
  <p>Non-breaking&#160;space</p>

For the first p element, the XML parser reports the string (here
containing no escaping of any kind - every character is a literal
character):

  Non-breaking&nbsp;space

For the second p element, the XML parser reports the string (here
containing an underscore character as a stand-in for a non-breaking
space, since you can't see non-breaking spaces in emails):

  Non-breaking_space
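
The same thing can be sketched in Python (again just an illustration,
with made-up variable names, and with only the body of the stylesheet
to keep it short). The document is stored as ISO-8859-1 bytes, the way
the stylesheet is stored on disk, and the parser decodes it into
Unicode strings.

```python
import xml.etree.ElementTree as ET

# The two p elements, stored as ISO-8859-1 bytes as they would be on disk.
document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<body>'
    '<p>Non-breaking&amp;nbsp;space</p>'
    '<p>Non-breaking&#160;space</p>'
    '</body>'
).encode("iso-8859-1")

first, second = (p.text for p in ET.fromstring(document))

print(repr(first))   # a literal & followed by the letters n, b, s, p and ;
print(repr(second))  # contains the single character U+00A0
```

The first string is five characters longer than the second, because
"&nbsp;" here is six ordinary characters rather than one non-breaking
space.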

The XSLT processor builds a result tree from the stylesheet, which
contains these text nodes and looks something like:

  /
  +- html
     +- head
     |  +- title
     |     +- text: "Test"
     +- body
        +- p
        |  +- text: "Non-breaking&nbsp;space"
        +- p
           +- text: "Non-breaking_space"

This tree exists in memory. All the characters are Unicode characters.

Once the XSLT processor has finished its transformation, it serializes
this result tree. There are three methods that it could use to
serialize the result tree: xml, html and text. The choice is controlled
by the method attribute of xsl:output. It could also use any encoding -
any mapping of characters to bytes - which is controlled by the
encoding attribute of xsl:output.

The most straightforward output method is the XML output method. In
the XML output method, element nodes are serialized as a start tag,
followed by content, followed by an end tag. Any characters in the
element content that have to be escaped due to XML rules are escaped.
So if you have a less-than sign in your text node, then it is
automatically escaped to &lt;. If you have an ampersand in your text
node then it is automatically escaped to &amp;. If you have a
character that can't be represented by the encoding that you're using,
then it is escaped using character references (e.g. &#160;).

Let's use a really really basic encoding, ASCII, which only covers 128
characters (and doesn't include non-breaking spaces). You can usually
make your stylesheet generate ASCII with:

<xsl:output encoding="ASCII" />

The non-breaking space character isn't covered by ASCII, so the
non-breaking space character has to be escaped in the serialization
using a character reference. So the serialization of the output tree
will look like:

<html>
  <head><title>Test</title></head>
  <body>
    <p>Non-breaking&amp;nbsp;space</p>
    <p>Non-breaking&#160;space</p>
  </body>
</html>

If you used an encoding that covers the non-breaking space character,
such as ISO-8859-1 or UTF-8 or UTF-16, then the non-breaking space
character would be output as a literal non-breaking space character,
and you'd get (substituting _ for non-breaking space characters
again):

<html>
  <head><title>Test</title></head>
  <body>
    <p>Non-breaking&amp;nbsp;space</p>
    <p>Non-breaking_space</p>
  </body>
</html>
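
The effect of the encoding on a single text node can be sketched in
Python. This is only a rough imitation of what the xml output method
does, not how any particular XSLT processor is implemented, and
serialize_text is a hypothetical helper invented for the example.

```python
from xml.sax.saxutils import escape

def serialize_text(text: str, encoding: str) -> bytes:
    """Hypothetical helper: serialize a text node roughly as the xml method would."""
    # Escape the characters that XML requires to be escaped (&, < and >) ...
    escaped = escape(text)
    # ... then encode, falling back to character references for anything
    # the chosen encoding cannot represent.
    return escaped.encode(encoding, errors="xmlcharrefreplace")

first = "Non-breaking&nbsp;space"
second = "Non-breaking\u00a0space"

print(serialize_text(first, "ascii"))
print(serialize_text(second, "ascii"))
print(serialize_text(second, "iso-8859-1"))
print(serialize_text(second, "utf-8"))
```

With ASCII the non-breaking space comes out as &#160;, with ISO-8859-1
it is the single byte A0, and with UTF-8 it is the two bytes C2 A0.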

Trouble arises, however, when you try to view a document that's been
saved using UTF-8 in an editor or browser that doesn't support UTF-8,
or that assumes a different encoding. The program interprets the
sequence of bytes that it reads from the file as ISO-8859-1
characters. It's a bit like taking an English document and trying to
read it as if it were written in German. Some of the words might make
sense, but most of the time you get gobbledy-gook.

Specifically, UTF-8 uses two bytes for every character between U+0080
and U+07FF, whereas ISO-8859-1 uses one byte for every character. When
you read a UTF-8 document as if it were ISO-8859-1, you therefore see
two characters for each one of those characters. For the non-breaking
space, the first UTF-8 byte is C2, which ISO-8859-1 reads as the
accented capital A character Â, while the second byte is A0, which
ISO-8859-1 reads as the non-breaking space itself. So you tend to see
Â_ rather than just _, for example.
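
Here is a rough byte-level check in Python (an illustration only, with
made-up variable names). It encodes one non-breaking space in three
encodings and then decodes the bytes as ISO-8859-1, which is what a
program does when it assumes the wrong encoding. In this sketch it is
the UTF-8 bytes that come back as an accented A (Â) followed by a
non-breaking space; the UTF-16 bytes come back as a NUL character
followed by a non-breaking space.

```python
nbsp = "\u00a0"

# What a serializer writes for one non-breaking space.
utf8_bytes = nbsp.encode("utf-8")          # two bytes: C2 A0
utf16_bytes = nbsp.encode("utf-16-be")     # two bytes: 00 A0
latin1_bytes = nbsp.encode("iso-8859-1")   # one byte:  A0

# What a reader sees if it assumes ISO-8859-1 regardless.
seen_from_utf8 = utf8_bytes.decode("iso-8859-1")
seen_from_utf16 = utf16_bytes.decode("iso-8859-1")
seen_from_latin1 = latin1_bytes.decode("iso-8859-1")

print(repr(seen_from_utf8), repr(seen_from_utf16), repr(seen_from_latin1))
```

Only the last case, where the bytes really are ISO-8859-1, gives back
the single character that was written.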

Let's return to looking at the possible serializations of the result
tree. The next possible serialization is HTML. HTML is serialized
more-or-less the same as XML, with a few differences. The difference
that is pertinent here is that when you use the html output method,
XSLT processors are allowed to output a character using the entities
defined in HTML, rather than as a native character (if the character
can be represented in the encoding) or a character reference (if it
can't). In our case, XSLT processors are allowed to serialize the
non-breaking space character as the HTML character entity reference
&nbsp;. So serializing as HTML, you may get:

<html>
  <head><title>Test</title></head>
  <body>
    <p>Non-breaking&amp;nbsp;space</p>
    <p>Non-breaking&nbsp;space</p>
  </body>
</html>
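
A sketch of that behaviour in Python, using the entity table from the
standard html.entities module. The function serialize_html_text is a
hypothetical helper written for this example. It imitates a processor
that always prefers named entities, which processors are allowed but
not required to do.

```python
from html.entities import codepoint2name

def serialize_html_text(text: str) -> str:
    """Hypothetical helper: write a text node using HTML entity names."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "&<>" or code > 127:
            # Prefer a named HTML entity, else fall back to a character reference.
            name = codepoint2name.get(code)
            out.append(f"&{name};" if name else f"&#{code};")
        else:
            out.append(ch)
    return "".join(out)

print(serialize_html_text("Non-breaking&nbsp;space"))
print(serialize_html_text("Non-breaking\u00a0space"))
```

The literal ampersand in the first string is escaped to &amp;, while
the non-breaking space character in the second becomes &nbsp;.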

Finally, let's consider the text output method. In the text output
method, everything aside from text nodes is ignored, and the text is
output without any automatic escaping. If a character can be
represented in the encoding that you use, then it will be serialized
as a native character. If it can't be, then the XSLT processor gives
you an error. In our case, assuming that we're using an encoding that
supports the non-breaking space character, we'd get something like
(again with _ representing the non-breaking space):

TestNon-breaking&nbsp;spaceNon-breaking_space
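
The text method can be imitated in Python by concatenating the text
nodes of a stand-in for the result tree (a sketch only; the variable
names are made up). Note that the text of the title element is a text
node too, so it is part of the output.

```python
import xml.etree.ElementTree as ET

# A stand-in for the result tree, written compactly so that it holds
# no whitespace-only text nodes.
result_tree = ET.fromstring(
    "<html><head><title>Test</title></head><body>"
    "<p>Non-breaking&amp;nbsp;space</p>"
    "<p>Non-breaking&#160;space</p>"
    "</body></html>"
)

# The text method: concatenate the text nodes, with no escaping.
text_output = "".join(result_tree.itertext())
print(repr(text_output))

# An encoding that lacks the character gives an error, not a character reference.
try:
    text_output.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

The ampersand is output as a plain & and the non-breaking space as
itself, and encoding the result as ASCII fails rather than falling
back to &#160;.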

> And, how would you suggest someone actually get '&nbsp;' into the
> output in order to avoid the issue which started this thread in the
> first place? (browsers assuming a different encoding type than is
> sent, and therefore mistranslating character 160 as 'Á' instead of
> ' ')? I have yet to see a browser which misunderstands '&nbsp;'.

Hopefully, what I've explained above makes it clear that a browser
that displays a non-breaking space character as an accented A followed
by a non-breaking space character is making that error because it is
reading the result of the transformation as if it is in one encoding
(e.g. ISO-8859-1) when in fact it is in another encoding (e.g.
UTF-8).

There are several solutions:

 - change the browser so that it auto-detects the actual encoding
   that's being used in the HTML/XML document (and make sure that
   you're reporting the correct encoding in the HTTP headers)

 - change the serialization process so that you use an encoding that
   the browser is expecting, by adding encoding="ISO-8859-1" to the
   xsl:output element

 - change the serialization process so that you use an encoding that
   doesn't include the non-breaking space character, so that the
   processor uses a character reference for it, for example using
   ASCII as the encoding

 - use the HTML output method with an XSLT processor that serializes
   non-breaking spaces as &nbsp;
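
To illustrate the first option, here is a small Python sketch of why
the encoding reported in the HTTP headers matters. The helper
decode_as_declared is hypothetical, and the fallback to ISO-8859-1 is
an assumption made for this sketch; real browsers use more elaborate
rules to pick an encoding.

```python
from email.message import Message

def decode_as_declared(body: bytes, content_type: str) -> str:
    """Hypothetical helper: decode a response body using its declared charset."""
    header = Message()
    header["Content-Type"] = content_type
    # Assumption for this sketch: fall back to ISO-8859-1 when the
    # header declares no charset.
    charset = header.get_content_charset() or "iso-8859-1"
    return body.decode(charset)

# The serialized result, written out as UTF-8 bytes.
body = "<p>Non-breaking\u00a0space</p>".encode("utf-8")

right = decode_as_declared(body, "text/html; charset=UTF-8")
wrong = decode_as_declared(body, "text/html")

print(repr(right))
print(repr(wrong))
```

With the charset declared, the non-breaking space survives. Without
it, the same bytes are read as ISO-8859-1 and an extra accented A
appears in front of the non-breaking space.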

Cheers,

Jeni

P.S. There is another solution that will work with some processors,
but not all - disabling output escaping for the text node that
contains the relevant characters. But since you can solve the problem
a lot more elegantly with one of the methods above, there's no reason
to use it.

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

