This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: Splitting an XML file based on size

To: xsl-list at lists dot mulberrytech dot com
Subject: Re: [xsl] Splitting an XML file based on size
From: dan mason <dmason at wso dot williams dot edu>
Date: Wed, 4 Apr 2001 10:30:24 -0400
Reply-To: xsl-list at lists dot mulberrytech dot com
> Date: Tue, 3 Apr 2001 15:50:04 -0700
> From: Adam Van Den Hoven <Adam.Hoven@bluezone.net>
> Subject: [xsl] Splitting an XML file based on size
>
> Hey guys,
>
> I'm processing an NITF file into HTML. NITF is very much like HTML in
> that it has a body with paragraph tags that have mixed content. The HTML
> that I am creating from my transforms can quickly become several tens of
> KB in size. Since I'm transferring this over a wireless modem to a
> PocketPC at a maximum of 14.4 kbps, an HTML file that is 15 KB is
> entirely too big.
>
> I need some way to keep track of the number of characters I've
> processed and stop when I reach a specific size, stopping at the end of
> the paragraph. I understand that counting characters is not very precise
> but I am only interested in getting the transfer size to be less than 2K
> or so.
>

I used to work on the development of a mobile applications platform
(NetMorf SiteMorfer) that had to deal with byte size pagination (that's
what we called this problem) in a flexible, automagic way for n
applications and n devices, all of which had different digest sizes
(some mandatory, others suggested, as for the Pocket PC, Palm, RIM,
etc.), numbers of rows, numbers of accesskeys, and so on.  The short
answer is that it's not easy in general, and especially not in XSLT.
Before I get flamed, let me try to explain why :) and invite people to
produce a pure XSLT solution, because I know it's possible, but I also
know that it's a royal pain in the behind (at least, the way I was
trying to do it).

Solution 1 would be the pure XSLT solution.  Like I said, I think it's
possible, and your code snippet down below is a start.  But I think it's
going to be extremely hard to make a solution like that extensible (you
may end up writing nearly the same code for <p>, <table> and every other
tag, each just slightly different).  Also, I'll go out on a limb here
and make a blanket statement:  XSLT (this version, anyway) is not
supposed to be the end point of a delivery architecture.  XSLT is
designed for document transformation, so going from unpaginated NITF to
unpaginated HTML is almost trivial, as you know.  But it has no clue
what device it's talking to, and that is exactly what a delivery
architecture has to know and take into account.  You could make your
stylesheet aware of the device and its capabilities, although the
colossal pain of keeping variables for byte size, number of rows, number
of accesskeys (for phones), and linking to the data you didn't have room
for will keep you up nights.

You could probably use extension functions or calls out to Java classes
to give you more power and a cleaner stylesheet, but it's still a pain
(and I have no idea what the performance implications are).  I don't
know much about that stuff; it's possible that a few extension functions
would be able to keep track of where you are and short circuit the
transformation when you overflow, but I don't remember whether they can
be stateful.  If not, Java calls would work:  I ended up writing a Java
class to catch and paginate tags as I wrote them, with varying levels of
success.
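
To make the "stateful helper" idea concrete, here is a minimal sketch.  It
is written in Python rather than Java purely for brevity, and it is not
tied to any XSLT processor's extension mechanism.  The names ByteBudget,
write and emit_until_full are hypothetical and invented for this
illustration; the only point is that something outside the stylesheet
remembers how many bytes have been emitted and refuses the first fragment
that would overflow.

```python
from __future__ import annotations


class ByteBudget:
    """Hypothetical stateful helper: tracks bytes emitted against a limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit  # maximum number of bytes allowed on one page
        self.used = 0       # bytes accepted so far

    def write(self, fragment: str, out: list[str]) -> bool:
        """Append fragment to out if it still fits.

        Returns False, and writes nothing, when the fragment would push
        the running total past the limit.
        """
        size = len(fragment.encode("utf-8"))
        if self.used + size > self.limit:
            return False
        out.append(fragment)
        self.used += size
        return True


def emit_until_full(fragments: list[str], limit: int) -> str:
    """Write whole fragments in order, stopping at the first overflow."""
    budget = ByteBudget(limit)
    out: list[str] = []
    for fragment in fragments:
        if not budget.write(fragment, out):
            break  # short-circuit: the rest belongs on a later page
    return "".join(out)
```

Because each fragment is accepted or rejected as a whole, a paragraph or
table is never cut in the middle; deciding what to do with the rejected
remainder is still up to the caller.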

Solution 2 would be to use XSLT for the transformation and build a
separate pagination engine that takes in the output and chops it down to
size.  This makes a lot more sense to me:  all you have to do is make
sure you're spitting out XHTML, parse it, and go through and count
bytes.  You still have to decide what to do with the data you chop off,
and you have to make sure you never chop off an end tag and leave the
markup malformed, things like that, but it's doable.  I worked on a
prototype of a system like this, but for n devices; instead of spitting
out XHTML, we used our own XML to preserve structure, and then embedded
markup inside it (WML, HDML, HTML, whatever).  So, based on universal
rules for how to paginate our XML (in your case, NITF), we could chop
markup for any device down to size using one component.  It was spiffy.
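
A rough sketch of what such a post-processing step might look like, in
Python for brevity (the prototypes mentioned here were C++ and Java, and
this is not their code).  It assumes the stylesheet output is well-formed
XHTML without a namespace declaration, that the pageable units are the
direct children of <body>, and that there is no whitespace-only text
between those children that would need special handling.  The function
name paginate and the max_bytes parameter are made up for the example.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def paginate(xhtml: str, max_bytes: int) -> list[str]:
    """Group the children of <body> into pages of at most max_bytes each.

    Blocks are never split, so every page stays well-formed.  A single
    block larger than max_bytes still gets a page of its own, which will
    exceed the budget.
    """
    root = ET.fromstring(xhtml)
    body = root.find("body")
    if body is None:
        raise ValueError("expected a <body> element")

    pages: list[str] = []
    current: list[str] = []
    used = 0
    for block in body:
        # Serialize the whole block (start tag, content, end tag).
        chunk = ET.tostring(block, encoding="unicode")
        size = len(chunk.encode("utf-8"))
        if current and used + size > max_bytes:
            # The next block does not fit: close the current page.
            pages.append("".join(current))
            current, used = [], 0
        current.append(chunk)
        used += size
    if current:
        pages.append("".join(current))
    return pages
```

With a 25-byte budget, two 11-byte paragraphs followed by a 34-byte table
come out as two pages: both paragraphs on the first and the table alone
on the second.  A real engine would also have to add the "next page"
links and subtract their size from the budget.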

If you can pull off solution 2, it has a bunch of advantages: 1) you can
reuse your pagination engine for multiple apps, and not have to write it
all into each stylesheet (I know you can simplify this by inheriting
XSLT templates, but I dare anyone to do it :), 2) the stylesheet author
(if it's not you) doesn't have to know how to paginate anything, they
can just write XSLT and not worry about it, and 3) your stylesheets are
cleaner, and probably don't take as long to execute (though splitting
the job like this has performance implications too, since we have to
reparse the XHTML, etc.).  I did all this in C++, and a coworker did the
same thing in Java; I don't know how easy it would be to do in a
scripting environment.

Good luck, I hope this is useful, and more than that, I would love to 
hear about experiences other people have had with paginating in XSLT.  I 
know that at least for mobile apps, this was concern #1, and everybody 
had a story on how to do it.  Not being an XSLT guru, I didn't know the 
answer, but I figure somebody on this list might...

-d

>> I can't be so coarse as counting paragraphs since I might also have a
>> table (essentially an HTML table) or lists or something. Some
>> paragraphs will be as short as a single sentence, others will be much
>> longer.
>>
>> I also need to do some additional processing after I reach the end of
>> the NITF text (but the size of those will be much more rigid and simply
>> subtracted from the target filesize).
>>
>> I had thought about doing something approximately like:
>>
>> <xsl:template match="p" mode="block">
>> 	<xsl:param name="cursize" select="0"/>
>> 	<xsl:variable name="size" select="$cursize"/>
>> 	<p>
>> 		<xsl:apply-templates select="child::node()" mode="inline">
>> 			<xsl:with-param name="cursize" select="$size + 7"/>
>> 			<!-- +7 characters for the tags -->
>> 		</xsl:apply-templates>
>> 	</p>
>> 	<xsl:if test="$size &lt;= 400">
>> 		<xsl:apply-templates select="following-sibling::p[1]" mode="block">
>> 			<xsl:with-param name="cursize" select="$size"/>
>> 		</xsl:apply-templates>
>> 	</xsl:if>
>> </xsl:template>
>>
>> but clearly that isn't going to work. I also assume that making a
>> global variable called $size wouldn't work either.
>>
>> I am getting the feeling that this isn't strictly possible with XSL. I
>> am using MSXML 3 so scripting might be a solution but I am loath to use
>> it unless I have to.
>>
>> Adam van den Hoven
>> Internet Application Developer
>> Blue Zone
>> tel. 604.685.4310
>> fax. 604.685.4391
>> Blue Zone makes you interactive.(tm) http://www.bluezone.net/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list