Thursday, July 14, 2011

Removing Illegal Characters and Preventing Unicode Errors

I hate Unicode errors like the one below, and I kept getting them intermittently on a table I was writing some scripts against. Here is an example of an error I received:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4:
ordinal not in range(128)


After much internal debate, I decided to just remove the illegal (non-ASCII) characters when exporting or reading values in that table. First I check whether the value returned is a unicode type, and if it is, I keep only the characters whose ordinal is below 128.

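userValue = row.getValue(field)
# Only unicode values need the non-ASCII characters stripped
if isinstance(userValue, unicode):
    # Keep only the characters in the ASCII range (ordinals 0-127)
    val = u''.join([x for x in userValue if ord(x) < 128])

    # do something with val #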
Now, the illegal characters are gone!
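
As a rough sketch, the same check can be wrapped in a small helper so that every value read from the table goes through one place. The names strip_non_ascii and text_type are my own for illustration and are not part of arcpy; the try/except is only there so the snippet also runs on Python 3, where the unicode type is called str.

```python
# Hypothetical helper, not part of arcpy: strips non-ASCII characters from text.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3 (assumption: run outside the original Python 2 setup)


def strip_non_ascii(value):
    """Return value with non-ASCII characters removed if it is text.

    Non-text values (numbers, None, dates) are returned unchanged.
    """
    if isinstance(value, text_type):
        # Keep only characters whose ordinal falls in the ASCII range
        return u''.join(ch for ch in value if ord(ch) < 128)
    return value


print(strip_non_ascii(u"abcd\u00e9f"))  # abcdf
print(strip_non_ascii(42))              # 42
```

With something like this in place, the cursor code could call strip_non_ascii(row.getValue(field)) and not worry about the type of each field.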

Here is a simpler example using IDLE:

>>> userValue = "abcdéf"
>>> val = ''.join([x for x in userValue if ord(x) < 128])
>>> print val
abcdf

Notice that the expression simply dropped the é character and produced 'abcdf'.
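
For comparison, here is a minimal sketch of another way to get the same result, using the error handlers built into encode() instead of a list comprehension. The function name to_ascii is hypothetical. The 'ignore' handler drops characters that cannot be encoded, and 'replace' substitutes a question mark, which can be handy if you want to see where characters were lost.

```python
def to_ascii(text, errors="ignore"):
    """Encode text to ASCII with the given error handler, then decode back to text.

    to_ascii is an illustrative name, not a standard library function.
    """
    return text.encode("ascii", errors).decode("ascii")


print(to_ascii(u"abcd\u00e9f"))             # abcdf
print(to_ascii(u"abcd\u00e9f", "replace"))  # abcd?f
```

Either approach throws information away, so it only makes sense when the non-ASCII characters really are unwanted in the output.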

Hope this helps!