
January 13, 2005

Note: CogDogBlog has a new WordPress powered home at http://cogdogblog.com/. All entries from this version have been moved there, so as a guide dog service try finding this article in its new home by title search.

Oh, those messy character encodings..

I recently wrote of some experiments to improve Feed2JS (see the updates fed to the site, bottom of the main page).

Specifically, based on a request from a user in Germany, I attempted to change the output to encode content as UTF-8 using the new features in Magpie RSS 0.7. However, I have received an email and a comment from people with apparently French-language sites who say it has broken their French accents and characters.

However, when I preview the feeds in question from our site using the Build a Feed tool, they look okay.

For one comment to the site, I was suspicious, since the URL provided had its own encoding set in the HEAD meta tags as iso-8859-1... does that mean French-language sites break under UTF-8? I am really ignorant of this stuff. But if it breaks more sites than it helps, I will have to revert the encoding to what it was before (Magpie does not allow a per-feed encoding setting; it is all or nothing).

What's a character to do?

Update: Until I can sort this out, I am reverting Feed2JS so it uses the default iso-8859-1 encoding. Feeds may need an hour to refresh from our cache.

Another Update: Another attempt. A new parameter, utf=8, sent to the script on our server should fork it to a different Magpie for the UTF encoding (see the examples on the Feed2JS log site).
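To make the idea of an opt-in parameter concrete, here is a minimal sketch in Python. It is not the Feed2JS code (which is PHP and uses Magpie); the function name `pick_encoding` and the `src` parameter in the example are assumptions made up for illustration. It only shows the branching: a request that carries a `utf` parameter gets UTF-8, everything else keeps the old default.

```python
from urllib.parse import parse_qs


def pick_encoding(query_string: str) -> str:
    """Return the output encoding for one request.

    Hypothetical helper: the presence of a 'utf' parameter opts the
    request in to UTF-8; otherwise the old ISO-8859-1 default is kept.
    """
    params = parse_qs(query_string)
    return "utf-8" if "utf" in params else "iso-8859-1"


# The query strings below are invented examples, not real Feed2JS URLs.
print(pick_encoding("src=http://example.com/feed.xml&utf=8"))
print(pick_encoding("src=http://example.com/feed.xml"))
```

Keeping the old encoding as the default means existing users see no change unless they ask for it.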

blogged January 13, 2005 10:27 AM :: category [ rss ]
Comments About "Oh, those messy character encodings.."

If you have English text as UTF-8 and display it on a page with ISO-8859-1 or US ASCII, it works fine because UTF-8 is backwards compatible with US ASCII (which in turn is the base for ISO-8859-1).

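Unfortunately, Finnish, Swedish, French, German and many other European languages that use ISO-8859-? have accented characters whose single-byte ISO-8859-? encoding differs from their multi-byte UTF-8 encoding, so UTF-8 output shown on an ISO-8859-? page comes out garbled.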
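A small Python sketch of the mismatch, for illustration only (the helper name `view_as_latin1` is made up here): it encodes text as UTF-8 and then reads the bytes back as ISO-8859-1, which is roughly what a browser does when UTF-8 content lands on a page declared as ISO-8859-1.

```python
def view_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


# Plain ASCII survives, because both encodings agree on those bytes.
print(view_as_latin1("hello"))

# An accented letter is two bytes in UTF-8, so it shows up as two
# unrelated ISO-8859-1 characters.
print(view_as_latin1("café"))

# Byte lengths of the same character in each encoding.
print(len("é".encode("utf-8")), len("é".encode("iso-8859-1")))
```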

You have to:

a) Explicitly ask your users to use UTF-8 as the default encoding for their web pages. As feed2js is targeted to less tech-savvy people, this is not a good idea.

b) Implement an option into your feed2js script to encode the content in UTF-8 or convert the UTF-8 output to some target encoding, for example ISO-8859-1. Obviously you can't easily convert UTF-8 to all encoding formats in existence so you will have to limit your target audience (or sort the problem with a really kick ass UTF-8 to ??? conversion library).
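Option b) might look something like the following Python sketch. It is only an illustration under assumptions: the function name `convert_feed_text` and its `target` argument are hypothetical, and feed2js itself is a PHP script. Characters that the target encoding cannot represent are written as numeric character references, which is one way to handle the "can't convert everything" problem when the output ends up in an HTML page.

```python
def convert_feed_text(utf8_bytes: bytes, target: str = "iso-8859-1") -> bytes:
    """Re-encode UTF-8 feed content into a target encoding.

    Hypothetical helper. Characters missing from the target encoding
    become numeric character references such as &#8364; instead of
    raising an error.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(target, errors="xmlcharrefreplace")


# 'é' exists in ISO-8859-1, so it becomes the single byte 0xE9.
print(convert_feed_text("café".encode("utf-8")))

# The euro sign does not exist in ISO-8859-1, so it is written as a
# numeric character reference.
print(convert_feed_text("5 €".encode("utf-8")))
```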

Commented by: Teemu Arina on January 13, 2005 11:05 AM
