This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list.



Re: [docbook-apps] Dynamic web serving of large Docbook


Hi Frans,
I've been dealing with this issue of modular doc processing for some time
during my publishing tools career.  I agree that you should not have to
divide up your source files just to facilitate modular processing.  Without
more details on your requirements I can't give you a complete solution, but
I can give you my thoughts on the matter.

It is quite possible to select content for processing without having to load
an entire document. Your server could construct a skeleton document that has
a single XInclude that references the content you want to process.  The
DocBook XSL stylesheets will handle most of the hierarchical elements as a
document root element, although you should check the 'root.elements'
variable in fo/docbook.xsl to make sure all the ones you want to process
will generate a page sequence in FO output.
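
For example, the skeleton document could be as small as this (the file
name and ID here are made up for illustration):

  <?xml version="1.0"?>
  <!-- After XInclude resolution, the document root element becomes
       the selected chapter, which the stylesheets then process. -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
              href="book.xml"
              xpointer="element(ch.install)"/>

The element() scheme selects the element in book.xml whose ID is
'ch.install', so your server only has to generate the xpointer value for
whatever piece was requested.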

One difficulty you will encounter is simulating the chunking behavior of
chunk.xsl.  When you process a chapter as chunks, it generates one chunk
for the chapter content before the first section, and then chunks sections
according to the parameter settings.  I'm not sure an XInclude xpointer can
select just the first part of a chapter.

Of course, when you select a chapter and process it by itself, it will
always be numbered 1.  You can get around that problem by generating an
olink database for your whole document, which will include the number
information for all numbered elements.  Your stylesheet customization would
have to change the templates in label.markup mode to look up the number in
the olink database instead of counting chapters in the document.  The same
would apply to number labels on figures, tables, sections, etc.  In that way
you are simulating the context of the selected content within the document.
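
As a rough, untested sketch (assuming the target data keeps its usual
shape, with each div carrying 'element', 'targetptr', and 'number'
attributes), the chapter customization might look something like:

  <xsl:template match="chapter" mode="label.markup">
    <!-- Sketch: read the chapter number from the olink target
         database instead of counting chapters in the current
         (partial) document. -->
    <xsl:variable name="target.db"
                  select="document($target.database.document)"/>
    <xsl:value-of
        select="$target.db//div[@element = 'chapter'
                and @targetptr = current()/@id]/@number"/>
  </xsl:template>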

You could also use the sequencing in the olink database to compute the Next,
Previous, and Up navigational links for a chunk.
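
Because the div entries in the target data appear in document order,
something like this sketch (building on the $target.db variable above)
could locate the next chunk's target:

  <!-- Sketch: the current chunk's entry, then the entry that follows
       it in document order (its first descendant, if any, otherwise
       the next entry in the database) -->
  <xsl:variable name="self"
                select="$target.db//div[@targetptr = current()/@id]"/>
  <xsl:variable name="next"
                select="($self/descendant::div | $self/following::div)[1]"/>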

For cross references ...




Bob Stayton
Sagehill Enterprises
DocBook Consulting
bobs@sagehill.net


----- Original Message ----- 
From: "Frans Englich" <frans.englich@telia.com>
To: <docbook-apps@lists.oasis-open.org>
Cc: "Michael Smith" <smith@xml-doc.org>
Sent: Saturday, October 16, 2004 6:11 PM
Subject: Re: [docbook-apps] Dynamic web serving of large Docbook


>
> Michael, thanks for your extensive replies. I have been looking into this
> relatively extensively, and it sure is tricky. DocBook is a very
> attractive format to have underneath, and being able to use it swiftly in
> large web projects would make it even more powerful. I think this applies
> to many people, so a clean, thorough solution that is pushed upstream
> (into a CMS or the stylesheets) would benefit many people.
>
> It should be noted that financing or proprietary solutions are not options
> for me, for several reasons; one is that this is for an open source
> project. Also, sorry about the late reply :|
>
> On Wednesday 13 October 2004 13:29, Michael Smith wrote:
> > Frans,
> >
> > Reading through your message a little more...
> >
> > [...]
> >
> > > The perfect solution, AFAICT, would be dynamic, cached generation.
> > > When a certain section is requested, only that part is transformed,
> > > and cached for future deliveries. It sounds nice, and sounds like it
> > > would be fast.
> > >
> > > I looked at Cocoon (cocoon.apache.org) to help me with this, and it
> > > does many things well; it caches XSLT sheets, the source files, and
> > > even CIncludes (basically the same as XIncludes).
> > >
> > > However, AFAICT, DocBook does not make this easy:
> > >
> > > * If one section is to be transformed, the sheets must parse /all/
> > > sources, in order to resolve references and so forth. There's no way
> > > to work around this, right?
> >
> > It seems like your main requirement as far as HTML output goes is to
> > be able to preserve stable cross-references among your rendered
> > pages. And you would like to be able to dynamically regenerate
> > just a certain HTML page without regenerating every HTML page that
> > it needs to cross-reference.
> >
> > And, if I understand you right, your requirement for PDF output is
> > to be able to generate a PDF file with the same content as each
> > HTML chunk, without regenerating the whole set/book it belongs to.
> > (At least that's what I take your mention of "chunked PDF" in your
> > original message to mean.)
>
> Yes, correct interpretation.
>
> >
> > (But -- this is just an incidental question -- in the case of the
> > PDF chunks, you're not able to preserve cross-references between
> > individual PDF files, right? There's no easy way to do that. Not
> > that I know of at least.)
>
> Nope, the PDF would simply contain the content of the viewed page without
> any web specifics such as navigation; it would be used for printing.
> Example (upper right corner):
> http://xml.apache.org/
>
> >
> > If the above is all an accurate description of your requirements,
> > then I think a partial solution is
> >
> >   - set up the relationship between your source files and HTML
> >     output such that the DocBook XML source for your parts is
> >     stored as separate physical files that correspond one-to-one
> >     with the HTML files in your chunked output
> >
> >   - use olinks for cross-references (instead of using xref or link)
> >
> >       http://www.sagehill.net/docbookxsl/Olinking.html
> >
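> > For example, a cross-reference would then be written something like
> > this (the targetdoc and targetptr values here are made up):
> >
> >     <olink targetdoc="userguide" targetptr="ch.install">the
> >     installation chapter</olink>
> >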
> > If you were to do those two things, then maybe:
> >
> >  1. You could do an initial "transform everything" step of your
> >     set/book file, with the individual XML files brought together
> >     using XInclude or entities; that would generate your TOC &
> >     index and one big PDF file for the whole set/book
> >
> >  2. You would then need to generate a target data file for each
> >     of your individual XML files, using a unique filename value for
> >     the targets.filename parameter for each one, and then
> >     regenerate the HTML page for each individual XML file, and
> >     also the corresponding PDF output file.
> >
> >  3. After doing that initial setup once, then each time an
> >     individual part is requested (HTML page or individual PDF
> >     file), you could regenerate just that from its corresponding
> >     XML source file.
> >
> >     The cross-references in your HTML output will then be
> >     preserved (as long as the relationship between files hasn't
> >     changed and you use the target.database.document and
> >     current.docid parameters when calling your XSLT engine).
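> >
> >     As a sketch of step 2 (file names and IDs made up), the target
> >     database document would XInclude the per-file target data, and
> >     each per-file transform would point back at it:
> >
> >       <!-- olinkdb.xml: the master target database -->
> >       <targetset xmlns:xi="http://www.w3.org/2001/XInclude">
> >         <document targetdoc="ch.install" baseuri="ch.install.html">
> >           <xi:include href="target-ch.install.xml"/>
> >         </document>
> >         <!-- ...one document element per source file... -->
> >       </targetset>
> >
> >       <!-- in the customization layer for each per-file transform -->
> >       <xsl:param name="target.database.document">olinkdb.xml</xsl:param>
> >       <xsl:param name="current.docid">ch.install</xsl:param>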
> >
> > I _think_ that all would work. But Bob Stayton would know best.
> > (He's the one who developed the olink implementation in the
> > DocBook XSL stylesheets.)
> >
> > A limitation of it all is that, if a writer adds a new section to
> > a document, you're still going to need to re-generate the whole
> > set/book to get that new section to show up in the master TOC.
> > Same thing if a writer adds an index marker, in order to get that
> > marker to show up in the index.
> >
> > But one way to deal with that is, you could just do step 3 above
> > on-demand, and have steps 1 and 2 re-run, via a cron job or
> > equivalent, at some regular interval -- once a day or once an hour
> > or at whatever the minimum interval is that you figure would be
> > appropriate given how often writers are likely to add new sections
> > or index markers.
> >
> > And during that interval, of course there would be some
> > possibility of an end user not being aware of a certain newly
> > added section because the TOC hasn't been regenerated yet, and
> > similarly, not finding anything about that section in the index
> > because it hasn't been regenerated yet.
> >
> > > * Cocoon specific: It cannot cache "a part" of a transformation, which
> > > means the point above isn't worked around. Right? This would otherwise
> > > mean the transformation of all non-changed sources would be cached.
> >
> > Caching is something that you could do with or without Cocoon, and
> > something that's entirely separate from the transformation phase. You
> > wouldn't necessarily need Cocoon or anything Cocoon-like if you
> > used the solution above (and if it would actually work as I
> > think). And using Cocoon just to handle caching would probably be
> > overkill. I think there are probably some lighter-weight ways to
> > handle caching.
> >
> > Anyway, I think the solution I described would be some work to set
> > up -- but you could hire some outside expertise to help you do
> > that (Bob Stayton comes to mind for some reason...).
>
>
> I looked at the solution of using an olink database, but perhaps I
> discarded it too quickly. Perhaps I'm setting the threshold too high
> (I am...), but I find it hackish; it isn't transparent, and most of all
> it disturbs the creation of content: one can't use standard DocBook, and
> authors have to bother with technical problems. It's messy.
>
> One thing worth remembering is that the source document need not be split
> in proportion to the pieces that are rendered; it only has to be kept in
> pieces small enough that performance is acceptable (a small detail, but
> from an editing perspective it can be practical to have a document larger
> than what is to be viewed), /assuming/ the CMS (or whatever content
> generation mechanism is used) can map the generated output to a certain
> part of the source file (like XInclude).
>
> To recapitulate, the problem is the initial transformation of the
> requested content -- that the XSLs must traverse "all" the sources -- and
> that the performance hit is the same regardless of whether the output is
> PDF or HTML, and regardless of how small the requested content is. Once
> it's generated all is cool, since it's cached for later deliveries.
> That's the key problem -- everything depends on it.
>
> Here are some possible solutions:
>
>
> 1. The olink way you described. It works, but it's complex, constraining,
> and intrusive on content creation.
>
> 2. True static content (cron-driven). Not intrusive on content creation,
> but it's perhaps too simple (too dumb), and it can actually become a
> performance issue too; generating PDFs for each section -- that's a lot
> of megabytes to write to disk each time the cron job runs.
>
> 3. Actually go for the long transformation we are trying to avoid; that
> is, all the sources are transformed for each requested section. This long
> transformation happens only for the first request -- the first user --
> and then the result is cached. How long does it take, then? Cocoon caches
> the includes and the files, so when the cache becomes invalidated only
> one source file is reloaded (the one that has changed) while all the
> others and the DocBook XSLs (they're huge) are kept in memory (as DOM, I
> presume) -- perhaps that's enough to reduce that first transformation to
> a reasonable speed. I'm only speculating; no doubt it's the
> transformation that takes the longest time (perhaps someone knows whether
> I'm being unrealistic, but otherwise real testing gives the definitive
> answer). If this worked, it would be the best solution.
>
> These approaches can also be combined; the HTML output could be static
> (cron), while the PDFs are dynamic. In this way the performance trouble
> of 2) is gone (writing tons of PDF files), and perhaps the delay is
> acceptable for PDF. From my shallow reading about Forrest, I understand
> it's good at combining dynamic serving with static generation; perhaps it
> could be a way to pull it all together under one technical framework.
>
>
> ***
>
> Another trouble with flexible website integration, or at least something
> which requires action, is navigation. As I see it, DocBook is tricky on
> that front -- the XSLs are quite focused on static content generation,
> the chunked output for example. Since dynamic generation basically takes
> a node and transforms it with docbook.xsl, navigation must be
> hand-written, for example if one wants the TOC as a sidebar that changes
> depending on what is viewed (flexible integration). I bet this is
> relatively easy to do, considering how the XSLs are written, and it could
> be good to have in a generic way somewhere (Forrest, the DocBook XSLs,
> perhaps...).
>
>
>
> Yes, speculations. When I write something, have actual numbers or a proof
> of concept, or know what I'm actually talking about, I will definitely
> share it on this list.
>
> Hm... That's as far as I can see.
>
>
> Cheers,
>
> Frans
>
>
>


