This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Converting poorly formed HTML into well-formed XML
- To: "'XSL-List at mulberrytech dot com'" <XSL-List at mulberrytech dot com>
- Subject: Converting poorly formed HTML into well-formed XML
- From: Joseph Fourness <josephf at avanade dot com>
- Date: Tue, 26 Sep 2000 15:56:20 -0700
- Cc: Joseph Fourness <josephf at avanade dot com>
- Reply-To: xsl-list at mulberrytech dot com
Hello,
I am currently developing a system that converts arbitrary, poorly formed
HTML into well-formed XML (or XHTML).
Example of HTML:
<TD valign=TOP width="100">
<br>
<A href="http://www.mulberrytech.com" target=_top>Link</a>
The HTML has been written by various web developers over a period of time,
so it is very inconsistent in formatting, use of quotation marks in
attributes, etc.
I need to convert these files (approx. 120,000) into XHTML so that they can
be processed by an XSLT processor.
Desired output:
<td valign="top" width="100">
<br/>
<a href="http://www.mulberrytech.com" target="_top">Link</a>
Does XSLT have the facilities to read the poorly formed HTML directly?
If so, what needs to be done?
Or would designing a custom parser, one that builds a DOM from the poorly
formed HTML and then either writes it out as an XML file or passes it
directly to an XSLT stylesheet, be the best solution?
I've already begun developing the latter (custom) solution, but thought I'd
double-check whether any HTML -> XHTML converters are already available.
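For illustration only, the custom-parser route described above might be
sketched roughly as follows. This is a minimal sketch, assuming Python and its
standard html.parser module are acceptable; the names XhtmlWriter,
html_to_xhtml, VOID_ELEMENTS and ENUM_ATTRS are hypothetical, and the choice
to lowercase only a few enumerated attribute values (so that URLs keep their
case) is an assumption, not a requirement stated here.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML 4 elements that never have content or a closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                 "input", "link", "meta", "param"}

# Assumption: only these enumerated attributes get lowercased values.
ENUM_ATTRS = {"align", "valign"}


class XhtmlWriter(HTMLParser):
    """Re-emit tolerant-parsed HTML as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of elements awaiting a close tag

    def _format_attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if value is None:          # valueless attribute, e.g. "noshade"
                value = name
            elif name in ENUM_ATTRS:
                value = value.lower()
            parts.append(" %s=%s" % (name, quoteattr(value)))
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        # The parser already lowercases tag and attribute names.
        if tag in VOID_ELEMENTS:
            self.out.append("<%s%s/>" % (tag, self._format_attrs(attrs)))
        else:
            self.out.append("<%s%s>" % (tag, self._format_attrs(attrs)))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore close tags that match nothing currently open.
        if tag in VOID_ELEMENTS or tag not in self.open_tags:
            return
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        # Entities were decoded by the parser; re-escape &, < and > for XML.
        self.out.append(escape(data))

    def close(self):
        super().close()
        # Close whatever the source never closed.
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())


def html_to_xhtml(html):
    writer = XhtmlWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.out)


if __name__ == "__main__":
    sample = ('<TD valign=TOP width="100">\n<br>\n'
              '<A href="http://www.mulberrytech.com" target=_top>Link</a>')
    print(html_to_xhtml(sample))
```

Run on the example fragment, this produces lowercase tag names, quoted
attribute values, <br/> in place of <br>, and a closing </td> appended at the
end. The sketch drops comments and DOCTYPE declarations and does not apply
HTML's implied-end-tag rules (for example, an unclosed <p> followed by another
<p> ends up nested), so a real converter for 120,000 files would need more
than this.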
Thanks in advance for your help,
Joe Fourness
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list