Cleaning Up Bad HTML in Perl, Take 2
Posted February 9, 2009 – 11:10 pm by Yakov Shafranovich in Programming
(A follow-up to an earlier post)
Here is another way to clean up bad HTML with Perl and convert it to XML:
use strict;
use warnings;
use Encode qw(encode);
use HTML::DOMbo;
use HTML::TreeBuilder;
use XML::LibXML;
my $html_code = '';
# Parse HTML
my $builder = HTML::TreeBuilder->new();
$builder->parse($html_code);
$builder->eof();  # signal end of input so the tree is finalized
# Convert to XML DOM
my $xml_source1 = $builder->to_XML_DOM;
# Extract XML (serialize the DOM to a string) and encode UTF-8
my $xml_source2 = encode("utf-8", $xml_source1->toString);
This approach relies on HTML::TreeBuilder to parse the HTML into a tree, and on the HTML::DOMbo module to do the actual conversion of that tree into an XML DOM.
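The final step, encoding to UTF-8, can be tried on its own with the core Encode module. This is a minimal sketch, assuming a hard-coded XML string stands in for the serialized DOM output; the helper name to_utf8_bytes and the sample string are illustrative and not part of the original snippet.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a Perl character string into UTF-8 bytes.
sub to_utf8_bytes {
    my ($text) = @_;
    return encode("UTF-8", $text);
}

# Stand-in for serialized XML containing one non-ASCII character (e-acute).
my $xml_text  = "<p>caf\x{e9}</p>";
my $xml_bytes = to_utf8_bytes($xml_text);

# The non-ASCII character takes two bytes once encoded.
printf "%d characters, %d bytes\n", length($xml_text), length($xml_bytes);
```

The character count and byte count differ as soon as the markup contains anything outside ASCII, which is why the encoding step matters before writing the XML to a file or socket.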