
Google Base and Unicode

Posted August 11, 2009 – 8:41 am by Yakov Shafranovich in Website

For quite some time, Google Base feeds for some of my projects were either partially ingested or rejected outright with the message "Required attribute missing". I ran xmllint and several online validation tools, and they found nothing wrong. But thanks to a Mac blog, I finally figured it out.

It seems that while Google Base officially supports Unicode and UTF-8 encoding in XML feeds, it doesn't support them fully. Instead of accepting plain UTF-8 text, Google Base apparently requires non-ASCII characters to be written as numeric character references like &#xxxx; where xxxx is the Unicode code point (in decimal, or in hexadecimal using the form &#xhhhh;). This was originally found by another blogger, credited below.

The solution, in XSLT at least, is to set the output encoding to us-ascii, which forces the serializer to write every non-ASCII character as a character reference. In Perl, you can probably use Encode.pm or iconv.
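
The same idea can be sketched outside XSLT. The snippet below is a minimal illustration in Python, which is not something I used for these feeds. The function name to_ascii_xml_text is made up for the example. It escapes the XML special characters first, then encodes to ASCII so that anything outside ASCII becomes a decimal character reference.

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return text that is safe to place inside an XML element and is pure ASCII.

    Hypothetical helper for illustration only.
    """
    # Escape &, < and > first so that the ampersands in the character
    # references added in the next step are not escaped a second time.
    escaped = escape(text)
    # "xmlcharrefreplace" turns each non-ASCII character into &#NNNN;,
    # where NNNN is the decimal Unicode code point.
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_xml_text("Café & Crème"))
```

The order matters: escaping after the conversion would turn &#233; into &amp;#233; and the feed would show the reference as literal text. Whether Google Base accepts the result is only what the observations above suggest, so treat this as a starting point rather than a guaranteed fix.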

Many thanks to Michael Fourman of the Mac Tips blog for this.
