This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: Converting poorly formed HTML into well-formed XML
>
> The HTML has been written by various web developers over a period of time,
> so it is very inconsistent in formatting, use of quotation marks in
> attributes, etc.
>
But most of all, is the HTML correct, or at least conformant?
> Does XSLT have the facilities to directly read in the poorly formed HTML?
> And if so, what needs to be done.
>
Nope, unless it is well-formed XML (that would be XHTML).
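To see why: an XSLT processor reads its source through an XML parser, and
an XML parser rejects typical legacy HTML outright. Here is a minimal sketch
in Python, where the standard xml.etree module stands in for whatever parser
your XSLT processor uses. The helper name and the sample strings are made up
for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical legacy HTML: unquoted attribute value and an unclosed <BR>.
legacy_html = "<P align=center>Hello<BR>world</P>"
# The same content written as XHTML.
xhtml = '<p align="center">Hello<br/>world</p>'

print(is_well_formed(legacy_html))  # False
print(is_well_formed(xhtml))        # True
```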
> I've already begun developing the latter (custom) solution, but thought I'd
> double check to see if there are any HTML -> XHTML converters available.
>
Check out HTML Tidy, from the W3C (www.w3.org).
It's a C application that cleans up messy (and incorrect) HTML and
has an option to generate XHTML.
The main problem with developing your own converter is that either you are
sure your HTML is correct (so you only need to fix letter case, quote
attribute values, fix entities and close the few empty HTML tags) or you will
go crazy trying to cope with all the possible errors that the "official" web
browsers accept but that would kill any simple parser.
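For the first case (HTML that is already correct), those fix-ups can be
sketched in a few lines. This is a rough Python sketch and not a Tidy
replacement. It leans on the standard html.parser module; the names
EMPTY_TAGS, XhtmlWriter and html_to_xhtml are made up here. It silently
drops comments and the DOCTYPE, and it does nothing about misnested or
unclosed tags, which is exactly the second case.

```python
from html import escape
from html.parser import HTMLParser

# HTML elements that never have content; XHTML writes them as <br />.
EMPTY_TAGS = {"area", "base", "br", "col", "hr", "img",
              "input", "link", "meta", "param"}


class XhtmlWriter(HTMLParser):
    """Re-emit already-correct HTML using XHTML syntax (illustrative only)."""

    def __init__(self):
        # convert_charrefs=True hands entities to handle_data already decoded.
        super().__init__(convert_charrefs=True)
        self.out = []

    def _start(self, tag, attrs, closer):
        # The parser has already lower-cased tag and attribute names.
        parts = [tag]
        for name, value in attrs:
            # A minimized attribute such as <input checked> has value None;
            # XHTML wants checked="checked".
            value = name if value is None else value
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, " />" if tag in EMPTY_TAGS else ">")

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " />")

    def handle_endtag(self, tag):
        if tag not in EMPTY_TAGS:  # empty tags were closed when opened
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Re-escape &, < and > so the text is safe inside XML.
        self.out.append(escape(data, quote=False))


def html_to_xhtml(source):
    writer = XhtmlWriter()
    writer.feed(source)
    writer.close()
    return "".join(writer.out)


sample = '<P ALIGN=center>Caf&eacute; &amp; bar<BR><INPUT TYPE=checkbox CHECKED></P>'
print(html_to_xhtml(sample))
# <p align="center">Café &amp; bar<br /><input type="checkbox" checked="checked" /></p>
```

Named entities such as &eacute; come out as literal characters, so the
output needs no HTML entity declarations to be read by an XML parser.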
Anyway, I would be interested in knowing whether there is a similar
application/package in Java. I would like to convert some pages (where I
pretty much know the format) into XHTML and from there output the content
as XML.
The only other package I found is in Perl (HTML::TreeBuilder). It has a smart
input parser, and the author explains how he had to add a lot of hardcoded
stuff to cover a lot of weird cases. I wrote a few lines of Perl that read in
an HTML file and output XHTML, if anyone is interested.
-- Raffaele
-----------------------------------------------------
raff@aromatic.org (::) http://www.aromatic.org/~raff/
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list