
Fixing Malformed UTF-8 via Regex

June 21, 2006 – 12:39 pm

I have been struggling with a weird problem on one of my sites that prevented the site from functioning. One of the XML files used by the site is supposed to arrive in UTF-8, but unfortunately it contained some extra characters that were not encoded properly. After looking at this site, I came up with a short regular expression of my own that converts any malformed UTF-8 characters to XML/HTML numeric entities:

s/([\x80-\xFF])/'&#' . ord($1) . ';'/gse;
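For comparison, here is a rough JavaScript sketch of the same idea. It is not a port of the Perl one-liner: JavaScript strings are already decoded, so this version works on characters rather than raw bytes and simply replaces every non-ASCII character with a decimal numeric entity. The function name toNumericEntities is made up for this illustration.

```javascript
// Hypothetical helper (not from the original post): replace each
// non-ASCII character with a decimal numeric character reference.
function toNumericEntities(text) {
	// The "u" flag makes the character class match whole code points,
	// so a character outside the BMP produces a single entity.
	return text.replace(/[^\x00-\x7F]/gu, function (ch) {
		return '&#' + ch.codePointAt(0) + ';';
	});
}

// Usage: plain ASCII is left alone, everything else becomes an entity.
console.log(toNumericEntities('caf\u00E9')); // caf&#233;
```

Unlike the byte-oriented Perl substitution, this assumes the input was decoded correctly in the first place, so it is only useful once the text is already a valid string.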

On a related note, another issue that came up a while back is the use of an ampersand without encoding it as “&amp;”. Here is another regex to solve that issue (I don’t remember which site I got it from):

s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
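The same pattern can be tried out in JavaScript, which supports the negative lookahead used here. This is only a sketch, and the function name escapeBareAmpersands is an assumption for the example. The lookahead skips any ampersand that already starts something shaped like an entity (named, decimal, or hexadecimal) and encodes the rest.

```javascript
// Hypothetical helper (not from the original post): encode ampersands
// that do not already begin an entity such as &amp;, &#169; or &#xA9;.
function escapeBareAmpersands(text) {
	return text.replace(/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/g, '&amp;');
}

// Usage: the bare ampersand is encoded, the existing entity is kept.
console.log(escapeBareAmpersands('Tom & Jerry &copy; 2006'));
// Tom &amp; Jerry &copy; 2006
```

One caveat: the pattern only checks the shape of what follows, so text that merely looks like an entity (an ampersand, a short word, then a semicolon) is left untouched as well.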

Removing Vowels from Hebrew Unicode Text

June 3, 2005 – 4:28 pm

One of the questions that recently came up is how to remove vowels from Hebrew text in Unicode (or text in any similar language). A quick look at the Hebrew Unicode chart shows that the vowels are all located between 0x0591 (1425) and 0x05C7 (1479). With this and JavaScript’s charCodeAt function, it is trivial to strip them out:

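function stripVowels(rawString)
{
	var newString = '';
	// Declare j locally so the loop does not leak a global variable.
	for (var j = 0; j < rawString.length; j++) {
		var code = rawString.charCodeAt(j);
		// Keep only characters outside 0x0591 (1425) to 0x05C7 (1479).
		if (code < 1425 || code > 1479) {
			newString += rawString.charAt(j);
		}
	}
	return newString;
}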
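The same range check can also be written as a single regular expression. The sketch below is an alternative, not a replacement for the loop, and the name stripVowelsRegex is made up for this example. Note that the range U+0591 to U+05C7 also covers cantillation marks and a few punctuation characters such as the maqaf (U+05BE), so those are removed along with the vowel points.

```javascript
// Hypothetical regex-based variant (not from the original post):
// delete every character in the range U+0591 to U+05C7.
function stripVowelsRegex(rawString) {
	return rawString.replace(/[\u0591-\u05C7]/g, '');
}

// Usage: "shalom" written with vowel points, then stripped.
var pointed = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
var stripped = stripVowelsRegex(pointed);
console.log(stripped === '\u05E9\u05DC\u05D5\u05DD'); // true
```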



Great Unicode Charts

January 22, 2005 – 11:42 pm

I just ran across these great Unicode charts from Matt Corks.