
Fixing Malformed UTF-8 via Regex

Posted June 21, 2006 – 12:39 pm by Yakov Shafranovich in Programming

I have been struggling with a weird problem on one of my sites that prevented the site from functioning. One of the XML files used by the site is supposed to arrive in UTF-8, but unfortunately it contained some extra characters that were not encoded properly. After looking at this site, I came up with a short regular expression of my own that converts any malformed UTF-8 characters to XML/HTML numeric entities:

s/([\x80-\xFF])/'&#' . ord($1) . ';'/gse;
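For anyone who wants to try this outside of a one-liner, here is a minimal sketch. The sub names `encode_high_bytes` and `encode_stray_bytes` are made up for illustration. It assumes the input is a raw byte string and that the stray bytes are Latin-1, since `ord` on a single byte only equals the Unicode code point under that encoding. Keep in mind that the plain substitution rewrites every byte from 0x80 to 0xFF, including the bytes of valid multi-byte UTF-8 sequences; the second sub is one possible variant (not the regex from this post) that leaves well-formed sequences alone and converts only the leftover bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every byte in 0x80-0xFF with a numeric entity.
# Assumes $bytes is a raw byte string and that the high bytes are Latin-1.
sub encode_high_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/ge;
    return $bytes;
}

# Well-formed UTF-8 multi-byte sequences, following the byte ranges
# listed in the Unicode Standard.
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical variant: keep well-formed sequences untouched and turn
# only the remaining stray high bytes into numeric entities.
sub encode_stray_bytes {
    my ($bytes) = @_;
    $bytes =~ s/($utf8_seq)|([\x80-\xFF])/defined $1 ? $1 : '&#' . ord($2) . ';'/ge;
    return $bytes;
}

print encode_high_bytes("caf\xE9"), "\n";   # prints caf&#233;
```

With the second sub, a string such as "caf\xC3\xA9 caf\xE9" keeps the properly encoded é as it is, and only the lone 0xE9 byte becomes `&#233;`.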

On a related note, another issue that came up a while back is the use of an ampersand without encoding it as “&amp;”. Here is another regex to solve that issue (I don’t remember which site I got it from):

s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
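Here is a small sketch of how the ampersand substitution behaves on mixed input. The sub name `escape_bare_ampersands` is made up for illustration. The negative lookahead skips any `&` that already looks like the start of an entity (decimal, hex, or a name of up to eight word characters followed by `;`) and escapes the rest.

```perl
use strict;
use warnings;

# Hypothetical helper: escape ampersands that do not already begin
# something shaped like an entity (&name; &#123; &#x7B;).
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
    return $text;
}

# The bare "&" is escaped; the three existing entities are left alone.
print escape_bare_ampersands('Tom & Jerry &amp; friends &#38; &#x26; co'), "\n";
# prints Tom &amp; Jerry &amp; friends &#38; &#x26; co
```

The pattern only checks the shape of what follows the ampersand, so anything that looks like an entity (for example `&foo;`) passes through unchanged, while an entity name longer than eight characters would have its ampersand escaped.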
