PyCon: Status of Unicode in Python 3

Status of Unicode in Python 3

The talk was by Victor Stinner. I went to dinner with him and a few other people. He was a nice French guy.

The encoding for source code defaults to UTF-8 in Python 3.
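Here's a small sketch of what that default means in practice. It compiles source bytes that contain a non-ASCII character and no coding declaration. The variable names and the "<example>" filename are made up for illustration.

```python
# Source code as raw bytes: UTF-8 encoded, with no "# -*- coding -*-" line.
source_bytes = 'greeting = "héllo"\n'.encode("utf-8")

# With no declaration present, Python 3 decodes the bytes as UTF-8.
namespace = {}
exec(compile(source_bytes, "<example>", "exec"), namespace)

print(namespace["greeting"])
```

If the bytes had been encoded as Latin-1 instead, the compile step would fail unless the source carried an explicit coding declaration.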

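Surrogate escapes (the surrogateescape error handler from PEP 383) let you deal with data that can't be decoded as UTF-8. For instance, you can decode a filename from bytes to a str without losing data even if the decoding isn't clean: each undecodable byte is mapped to a lone surrogate code point and turned back into the original byte when you encode again. The error handler itself dates from Python 3.1; Python 3.2 builds on it with helpers such as os.fsencode() and os.fsdecode().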
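A minimal sketch of the round trip, assuming a made-up filename that contains one byte (0xFF) that is never valid in UTF-8:

```python
# Hypothetical filename bytes; 0xFF cannot appear in valid UTF-8.
raw_name = b"report-\xff.txt"

# A strict decode would raise UnicodeDecodeError. With surrogateescape the
# bad byte becomes the lone surrogate U+DCFF instead.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))
print(restored == raw_name)
```

The resulting str is fine to pass back to the operating system, but encoding it without the surrogateescape handler (say, to print it or write it to a UTF-8 file) raises UnicodeEncodeError.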

There are still issues to work on.

Victor ran into bootstrap issues while implementing all of this, and it took a lot of hard work to get it into shape.

Check out Programming with Unicode, which is a book that Victor wrote.

Victor has even more Unicode fixes in store for Python 3.3.

Side note: I had an idea. It'd be cool to create a tool that shows you a call tree for your application and marks where all the encodes and decodes happen. That would help you decide where they belong, which would really help when porting from Python 2 to Python 3. Figuring out where to encode and decode is a lot more subtle than you might think. It doesn't always make sense to do it at the very edges of your application.
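A rough first step toward that idea can be sketched with the standard ast module. This is only a static approximation: it lists every .encode() and .decode() call site along with the function that contains it, rather than building a real runtime call tree. The function name find_codec_calls and the SAMPLE source are hypothetical.

```python
import ast


def find_codec_calls(source):
    """Return sorted (line, enclosing function, method) tuples for each
    .encode() or .decode() call found in the given source text."""
    found = []

    def visit(node, scope):
        # Track the innermost function we are currently inside.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            scope = node.name
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in ("encode", "decode")):
            found.append((node.lineno, scope, node.func.attr))
        for child in ast.iter_child_nodes(node):
            visit(child, scope)

    visit(ast.parse(source), "<module>")
    return sorted(found)


# Made-up module to scan.
SAMPLE = '''\
def read_name(raw):
    return raw.decode("utf-8")

def write_name(name):
    return name.encode("utf-8")
'''

for lineno, scope, method in find_codec_calls(SAMPLE):
    print(f"line {lineno}: {scope} calls .{method}()")
```

It matches on the method name alone, so it will also flag encode/decode methods on objects that aren't str or bytes. A real tool would need type information or runtime tracing to tell them apart.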