Cleaning Up Bad HTML in Perl

Posted October 24, 2008 – 10:18 am by Yakov Shafranovich in Programming

Here is a short way to clean up bad HTML input and convert it to XML with Perl:

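use strict;
use warnings;
use HTML::TreeBuilder;
use XML::LibXML;

# Put the (possibly malformed) HTML to clean up here.
my $html_code = '';

# Parse the HTML leniently, then serialize the resulting tree as XML.
my $builder = HTML::TreeBuilder->new();
$builder->parse($html_code);
$builder->eof();          # signal end of input so the tree is finalized
$builder->elementify();   # $builder is now a plain HTML::Element tree
my $xml_source1 = $builder->as_XML();
$builder->delete();       # free the tree once it has been serialized

# Re-parse with libxml2 in recovery mode and serialize the document again.
my $parser = XML::LibXML->new();
$parser->recover(1);
my $doc = $parser->parse_string($xml_source1);
my $xml_source2 = $doc->toString();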

2 Responses to “Cleaning Up Bad HTML in Perl”

  1. Kinda ugly. Also, “use strict;”.

    By L. Wall on Oct 24, 2008

  2. Here is a followup post:

    http://www.shaftek.org/blog/2009/02/09/cleaning-up-bad-html-in-perl-take-2/

    By Yakov Shafranovich on Feb 9, 2009
