Cleaning Up Bad HTML in Perl, Take 2

Posted February 9, 2009 – 11:10 pm by Yakov Shafranovich in Programming

(A followup on an earlier post)

Here is another way to clean up bad HTML with Perl and convert it to XML:

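use strict;
use warnings;
use Encode qw(encode);
use HTML::DOMbo;
use HTML::TreeBuilder;
use XML::LibXML;

my $html_code = '';

# Parse HTML
my $builder = HTML::TreeBuilder->new();
my $xml_source = $builder->parse($html_code);
$builder->eof();    # tell the parser the input is complete

# Convert to XML DOM
my $xml_source1 = $xml_source->to_XML_DOM;

# Extract XML (serialize the DOM to a string) and encode UTF-8
my $xml_source2 = encode("utf-8", $xml_source1->toString);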

This approach relies on HTML::TreeBuilder to parse the HTML into a tree, and on the HTML::DOMbo module to do the actual conversion of that tree into an XML DOM.
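
As a rough sketch of how these pieces might fit together, the steps can be wrapped in a small helper and the result re-parsed with XML::LibXML as a sanity check. The helper name html_to_xml and the sample markup are made up for illustration, and the sketch assumes that to_XML_DOM returns an XML::DOM document that can be serialized with toString. Treat it as a starting point, not a tested recipe.

```perl
use strict;
use warnings;
use HTML::DOMbo;          # adds to_XML_DOM to HTML::Element trees
use HTML::TreeBuilder;
use XML::LibXML;

# Hypothetical helper: turn sloppy HTML into an XML string.
sub html_to_xml {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;                     # signal end of input

    my $dom = $tree->to_XML_DOM;    # assumed to be an XML::DOM document
    my $xml = $dom->toString;       # serialize the DOM to a string

    $tree->delete;                  # free the HTML tree
    $dom->dispose;                  # break XML::DOM circular references
    return $xml;
}

# Illustrative input with unclosed tags.
my $xml = html_to_xml('<p>Unclosed paragraph<br><b>bold text');

# Re-parse with XML::LibXML; this dies if the output is not well-formed.
my $doc = XML::LibXML->load_xml(string => $xml);
print $doc->documentElement->nodeName, "\n";
```

Re-parsing the output is optional, but it is a cheap way to confirm that the cleaned-up markup really is well-formed XML before passing it on.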
