This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Re: Converting poorly formed HTML into well-formed XML


| Does XSLT have the facilities to directly 
| read in the poorly formed HTML?

No, XSLT has no built-in features to do this.

I'd recommend JTidy, Andy Quick's excellent open source Java
implementation of Dave Raggett's HTML "Tidy" utility.

http://www3.sympatico.ca/ac.quick/jtidy.html

It can expose a DOM API to the "tidied-up" (that is, well-formed)
XML tree for any ill-formed HTML document. You can then pass
the DOM Document into your XSLT engine for transformation.

In my about-to-be-released book "Building Oracle XML Applications"
from O'Reilly, I use the JTidy library to show readers how to take
ill-formed HTML from dynamic web pages, such as stock quote services
or other online sources of information, and use XSLT to "scrape"
interesting data out of the "tidied-up" XML result.
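
The hand-off described above (a tidied DOM Document passed to the
XSLT engine, with a stylesheet that "scrapes" out the interesting
data) might look roughly like the sketch below. Several things here
are assumptions and not from this message: the javax.xml.transform
(JAXP) API as the way to invoke the XSLT engine, the ScrapeSketch
class, and the sample markup and stylesheet, which are hypothetical.
To keep the sketch runnable with only the JDK, the Document is parsed
from an already well-formed string standing in for JTidy's output; in
a real run it would come from JTidy instead (its Tidy class has a
parseDOM method for this, but check the signature against the JTidy
version you use).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: transform an already-tidied DOM with XSLT.
class ScrapeSketch {

    // Hypothetical sample of what a quote page might look like AFTER tidying.
    public static final String TIDIED_HTML =
        "<html><body><table>"
      + "<tr><td class=\"symbol\">ORCL</td><td class=\"price\">42.50</td></tr>"
      + "</table></body></html>";

    // Hypothetical stylesheet: one <quote> element per matching table row.
    public static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<quotes><xsl:apply-templates select=\"//tr[td/@class='symbol']\"/></quotes>"
      + "</xsl:template>"
      + "<xsl:template match=\"tr\">"
      + "<quote symbol=\"{td[@class='symbol']}\">"
      + "<xsl:value-of select=\"td[@class='price']\"/>"
      + "</quote>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for the tidy step: parses a well-formed string into a DOM.
    // With JTidy, the Document would come from the tidied HTML instead.
    public static Document parseTidied(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Passes the DOM Document to the XSLT engine and returns the result text.
    public static String scrape(Document tidied) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(tidied), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(scrape(parseTidied(TIDIED_HTML)));
    }
}
```

Because the stylesheet only sees a DOM, it does not care how badly
formed the original page was; only the tidy step has to cope with that.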

______________________________________________________________
Steve Muench, Lead XML Evangelist & Consulting Product Manager
BC4J & XSQL Servlet Development Teams, Oracle Rep to XSL WG
Author "Building Oracle XML Applications", O'Reilly
http://www.oreilly.com/catalog/orxmlapp/


| Does XSLT have the facilities to directly read in the poorly formed HTML?
| And if so, what needs to be done?
| 
| Or,
| 
| will designing a custom parser that builds a DOM from the poorly formed HTML,
| to then be output to an XML file or directly processed by an XSLT document,
| be the best solution?
| 
| I've already begun developing the latter (custom) solution, but thought I'd
| double check to see if there are any HTML -> XHTML converters available.
| 
| Thanks in advance for your help,
| 
| Joe Fourness
| 
| 
|  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
| 

