Thursday, July 14, 2011

Removing Illegal Characters and Preventing Unicode Errors

I hate Unicode errors like the one below, and I kept getting them intermittently on a table I was writing some scripts against. Here is an example of an error I received:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4:
ordinal not in range(128)

After much internal debate, I decided to just remove the illegal (non-ASCII) characters when exporting or reading values in that table. First I check whether the value returned is a unicode type, and only then do I apply the operation.

userValue = row.getValue(field)
if isinstance(userValue, unicode):
    # keep only the characters in the ASCII range (code points 0-127)
    val = ''.join([x for x in userValue if ord(x) < 128])

    # do something with val #

Now, the illegal characters are gone!
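
The check-then-strip step can also be wrapped in a small helper so that values that are not text pass through untouched. This is only a sketch: the name strip_non_ascii is hypothetical, and the type(u"") trick is an assumption added so the same code runs on Python 2 (where it is unicode) and Python 3 (where it is str).

```python
# Hypothetical helper; the name strip_non_ascii is not part of the original script.
text_type = type(u"")  # unicode on Python 2, str on Python 3


def strip_non_ascii(value):
    """Return value with every character above code point 127 removed.

    Values that are not text (numbers, None, ...) are returned unchanged.
    """
    if isinstance(value, text_type):
        return u"".join(ch for ch in value if ord(ch) < 128)
    return value
```

With a helper like this, the loop over the table could call strip_non_ascii(row.getValue(field)) for every field without checking the type first.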

Here is a simpler example using IDLE:

>>> userValue = "abcdéf"
>>> val = ''.join([x for x in userValue if ord(x) < 128])
>>> print val
abcdf

Notice that the expression simply dropped the é character and produced 'abcdf'.
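
For comparison, the built-in encode method takes an error handler that gives a similar result. The sketch below is an alternative to the list comprehension, not the code used on the table: "ignore" drops the characters the ASCII codec cannot encode, while "replace" leaves a "?" placeholder where each one was. The sample string is the same as above, with é written as an escape so the snippet does not depend on the file encoding.

```python
text = u"abcd\u00e9f"  # the sample string from above, é written as \u00e9

# errors="ignore" silently drops characters the ASCII codec cannot encode.
stripped = text.encode("ascii", "ignore").decode("ascii")

# errors="replace" substitutes "?" for each such character instead.
replaced = text.encode("ascii", "replace").decode("ascii")

print(stripped)  # abcdf
print(replaced)  # abcd?f
```

The "replace" form keeps a visible marker where a character was removed, which can be useful when the cleaned value is later validated or compared.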

Hope this helps!

1 comment:

Marc said...

I found this entry when looking for something else, but I thought I should add something: it's not good security to strip invalid Unicode sequences. You should always put something in their place; otherwise it can be possible to bypass validation and wreak havoc on your systems with crafted Unicode strings.