Every once in a while, I end up in this weird place called Encoding Hell, and it takes me about a day to get out of it. Usually it's related to MySQLdb, the MySQL driver for Python; however, this time it was related to URLs.
I was trying to do a MySQL import. I kept getting a lot of warnings, which I usually try to fix. I couldn't even figure out what the warnings were because I was using the mysqlimport tool. After a while, I figured out that if you do the import from within the MySQL shell, you can run "SHOW WARNINGS;".
Anyway, I got a warning like "Incorrect string value: '\xF1os' for column 'category' at row 76997". I traced it back to a URL like "http://www.example.com/themes/keywords/southside%20sure%F1as". That's an ASCII URL, so I couldn't figure out what the problem was.
I had some code that was splitting the URL into two other parts. It used a regex to pull out the parts I wanted, and then it unurlencoded the parts. It turns out that once you unurlencode 'southside%20sure%F1as', you are left with 'southside sure\xf1as'. I tried to .decode('UTF-8') it, but it didn't work. I finally figured out, thanks to Vim's automatic encoding detection, that I needed to .decode('Latin-1') it, and I ended up with 'southside sureñas' (whatever that is).
What's interesting is that I started off with a perfectly fine ASCII URL and ended up with some Latin-1 that I wasn't expecting. That's a good reminder that user-submitted data is pervasively dangerous--those could have been control characters or something.
I was trying to do a MySQL import. I kept getting a lot of warnings, which I usually try to fix. I couldn't even figure out what the warnings were because I was using the mysqlimport tool. After a while, I figured out that if you do the import from within the MySQL shell, you can run "SHOW WARNINGS;".
Anyway, I got a warning like "Incorrect string value: '\xF1os' for column 'category' at row 76997". I traced it back to a URL like "http://www.example.com/themes/keywords/southside%20sure%F1as". That's an ASCII URL, so I couldn't figure out what the problem was.
I had some code that was splitting the URL into two other parts. It used a regex to pull out the parts I wanted, and then it unurlencoded the parts. It turns out that once you unurlencode 'southside%20sure%F1as', you are left with 'southside sure\xf1as'. I tried to .decode('UTF-8') it, but it didn't work. I finally figured out, thanks to Vim's automatic encoding detection, that I needed to .decode('Latin-1') it, and I ended up with 'southside sureñas' (whatever that is).
What's interesting is that I started off with a perfectly fine ASCII URL and ended up with some Latin-1 that I wasn't expecting. That's a good reminder that user-submitted data is pervasively dangerous--those could have been control characters or something.
Comments
"southside southern women"
I further guess its latin women.
this is why typing http://example.com/ä will give me http://example.com/%E4 in the browser address bar, but typing http://example.com/äЖ will give me http://example.com/%C3%A4%D0%96 where ä is %C3%A4. neat, eyh? means you have to have at least one fallback encoding scheme ready for when UTF-8 fails.
the nasty thing here is that (1) there is to my knowledge no way for the client to include information about URL encoding used; (2) you can only have one fallback encoding scheme that is activated by UTF-8 decoding failure—schemes like Latin-1 do not have a big chance to throw an exception on string.decode; (3) what fallback scheme to use will depend on the geolocation of the majority of your clients; (4) since URLs are typically rather short pieces of text, a heuristic encoding detection (such as the ones browser employ for web pages) should be difficult to impossible to implement; (5) if the secondary URL encoding is really governed by system settings it may prove exceedingly difficult to testdrive whatever setup you choose.
all told, it's sort of a shame that in 2009 we still have technology that shows erratic %4D%75%6D%62%6F%4A%75%6D%62%6F instead of readable text. this should have been in the past by now.