Every once in a while, I end up in this weird place called Encoding Hell, and it takes me about a day to get out of it. Usually it's related to MySQLdb, the MySQL driver for Python; however, this time it was related to URLs.
I was trying to do a MySQL import. I kept getting a lot of warnings, which I usually try to fix. I couldn't even figure out what the warnings were because I was using the mysqlimport tool. After a while, I figured out that if you do the import from within the MySQL shell, you can run "SHOW WARNINGS;".
Anyway, I got a warning like "Incorrect string value: '\xF1os' for column 'category' at row 76997". I traced it back to a URL like "http://www.example.com/themes/keywords/southside%20sure%F1as". That's an ASCII URL, so I couldn't figure out what the problem was.
I had some code that split the URL into two parts. It used a regex to pull out the parts I wanted, and then it URL-decoded them. It turns out that once you URL-decode 'southside%20sure%F1as', you are left with the byte string 'southside sure\xf1as'. I tried to .decode('UTF-8') it, but that failed, because a lone \xf1 byte followed by 'a' isn't valid UTF-8. I finally figured out, thanks to Vim's automatic encoding detection, that I needed to .decode('Latin-1') it, and I ended up with 'southside sureñas' (whatever that is).
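For what it's worth, here's a minimal sketch of that decoding step. It assumes Python 3 (my original code was Python 2), and the function name decode_url_component is made up for illustration. It tries UTF-8 first and falls back to Latin-1, which is only a guess about what the sender meant; Latin-1 never fails to decode, so a wrong guess produces garbage rather than an error.

```python
from urllib.parse import unquote_to_bytes


def decode_url_component(component):
    """Percent-decode a URL component and guess its text encoding.

    Hypothetical helper: UTF-8 is tried first, and Latin-1 is the
    assumed fallback because it accepts any byte sequence.
    """
    # Undo the percent-encoding, but keep the result as raw bytes.
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume Latin-1 (an assumption, not a fact).
        return raw.decode("latin-1")


print(decode_url_component("southside%20sure%F1as"))  # southside sureñas
```

Keeping the bytes around until the encoding question is settled is the whole point here; decoding too early is how I usually end up in Encoding Hell in the first place.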
What's interesting is that I started off with a perfectly fine ASCII URL and ended up with some Latin-1 that I wasn't expecting. That's a good reminder that user-submitted data can be dangerous in ways you don't see coming. Those escaped bytes could just as easily have been control characters or something.
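If you want to guard against the control-character case, something like the following might do as a first pass. It's a rough sketch, again assuming Python 3, and has_control_chars is a hypothetical name. It only checks for Unicode category "Cc" (the C0 and C1 control characters), which is one reasonable definition of "control character" but certainly not the only one.

```python
import unicodedata


def has_control_chars(text):
    """Return True if any character in text is a Unicode control character.

    Hypothetical helper: "control character" is assumed to mean
    Unicode general category Cc.
    """
    return any(unicodedata.category(ch) == "Cc" for ch in text)


# A decoded value like the one from the URL above is fine...
print(has_control_chars("southside sure\u00f1as"))
# ...but a stray NUL byte that was percent-encoded as %00 is not.
print(has_control_chars("southside\x00"))
```

Note that this would also flag tabs and newlines, so whether to reject or strip such values depends on what the column is supposed to hold.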