Cleaning Up Bad HTML in Perl
Posted October 24, 2008 – 10:18 am by Yakov Shafranovich in Programming
Here is a short way to clean up bad HTML input and convert it to XML with Perl:
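    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use XML::LibXML;

    my $html_code = '';    # the bad HTML input goes here

    # Parse the HTML leniently and serialize the resulting tree as XML.
    my $builder = HTML::TreeBuilder->new();
    $builder->parse($html_code);
    $builder->eof();       # signal end of input so the tree is finalized
    my $xml_source  = $builder->elementify();
    my $xml_source1 = $xml_source->as_XML();

    # Re-parse with libxml2 in recovery mode, then serialize the cleaned document.
    my $parser = XML::LibXML->new();
    $parser->recover(1);
    my $doc = $parser->parse_string($xml_source1);
    my $xml_source2 = $doc->toString();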
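To try the same two steps on real input, they can be wrapped in a small function. This is only a sketch: the name html_to_xml and the sample fragment are made up for illustration and are not part of the original snippet, and it assumes HTML::TreeBuilder and XML::LibXML are installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::LibXML;

# Hypothetical helper (not from the original post): HTML string in, XML string out.
sub html_to_xml {
    my ($html) = @_;

    # Step 1: let HTML::TreeBuilder build a tree from the tag soup.
    my $tree = HTML::TreeBuilder->new();
    $tree->parse($html);
    $tree->eof();                 # finalize the tree before reading it
    my $xml = $tree->as_XML();
    $tree->delete();              # release the tree explicitly

    # Step 2: run the result through libxml2 in recovery mode.
    my $parser = XML::LibXML->new();
    $parser->recover(1);
    return $parser->parse_string($xml)->toString();
}

# Made-up sample input: unclosed <p> and <b>, no <html> or <body> wrapper.
my $bad_html = '<p>First paragraph<p>Second with <b>bold text';
print html_to_xml($bad_html);
```

The output should be an XML document rooted at an html element, with the missing wrapper elements and closing tags supplied by HTML::TreeBuilder.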
2 Responses to “Cleaning Up Bad HTML in Perl”
Kinda ugly. Also, “use strict;”.
By L. Wall on Oct 24, 2008
Here is a followup post:
http://www.shaftek.org/blog/2009/02/09/cleaning-up-bad-html-in-perl-take-2/
By Yakov Shafranovich on Feb 9, 2009