Friday, March 23, 2012

PyCon: Advanced Python Tutorials

I took Raymond Hettinger's Advanced Python I and II tutorials. These are my notes. See the website for more details: I and II.

Here's the source code for Python 2 and Python 3.

Raymond is the author of itertools, the set class, the key argument to the sort function, parts of the decimal module, etc.

He said nice things about "Python Essential Reference".

He said nice things about the library reference for Python. If you install Python, it'll get installed.

Read the docs for the built-in functions. It's time well-invested.

He likes Emacs and Idle. He uses a Mac.

Use the dis module to disassemble code. That's occasionally useful.

Use rlcompleter to add tab completion to the Python shell.

Use "python -m test.pystone" to test how fast your machine is.

Show "python -m turtle" to your kids.

Don't be afraid to look at the source code for a module.

He likes "itty", a tiny web framework.

The decimal module is 6000 lines long!

Idle has more stuff than I thought, although I still think PyCharm is better.

He seems to use Idle to browse code and Emacs to edit code.

Use function.__name__ to get a function's name.

Use a bound method to save on method lookup in a tight loop. Notice the naming pattern:
s = []
s_append = s.append
He is very optimistic about PyPy. He thinks it'll become the de facto standard for Python use.

Here are some optimization tips:
  • Replace global lookups (and builtin lookups) by setting local aliases.
  • Use bound methods to avoid method lookups.
  • Minimize pure-python function calls inside a loop.
A new stack frame is created for every function call.

You should only need to use speedups like the above in a handful of places such as inner loops.
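
A minimal sketch of the first two tips, localizing a builtin and hoisting a bound method out of the loop. The function names are hypothetical, and whether this is actually faster depends on the interpreter, so measure before keeping it:

```python
def squares_plain(n):
    result = []
    for i in range(n):
        # Looks up result.append and the builtin abs on every iteration.
        result.append(abs(i) * abs(i))
    return result


def squares_tuned(n):
    result = []
    result_append = result.append   # bound method, looked up once
    _abs = abs                      # local alias for a builtin
    for i in range(n):
        v = _abs(i)
        result_append(v * v)
    return result
```

Both functions return the same list; only the number of name lookups inside the loop differs.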

Listening to him explain how expensive even simple things in Python are makes me want to switch to Go ;)

Manually inline function calls in some cases.

Here's how to time code:
from timeit import Timer
print min(Timer(stmt, setup).repeat(7, 20))
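
For a self-contained version of the timing idiom above, here is a sketch in Python 3 syntax. The stmt and setup strings are made-up placeholders; substitute the code you actually want to measure:

```python
from timeit import Timer

setup = "data = list(range(1000))"      # hypothetical setup code, run once per trial
stmt = "sorted(data, reverse=True)"     # hypothetical statement being timed

# repeat(7, 20) runs 7 trials of 20 executions each; the minimum is the
# trial least disturbed by other activity on the machine.
best = min(Timer(stmt, setup).repeat(7, 20))
print(best)
```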
"Loop invariant code motion" is an optimization technique where you move stuff outside the loop where possible.

"Vectorization" [according to him] means replacing CPython's eval-loop with a C function that does all the work for you. For instance, he suggests moving from list comprehensions to map where it makes sense.

Use to parallelize map.
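
As a rough illustration of moving from a list comprehension to map, here is a dot product written both ways. The data is made up, and the speed difference is something to verify with timeit rather than assume:

```python
import operator

xs = [1, 2, 3]
ys = [4, 5, 6]

# List comprehension: each multiplication is executed by the bytecode eval loop.
dot_loop = sum([x * y for x, y in zip(xs, ys)])

# map with a C-implemented function: the per-item work stays in C.
dot_map = sum(map(operator.mul, xs, ys))
```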

He keeps plugging PyPy.

itertools.repeat(2, 100) repeats 2 over-and-over again, 100 times.

itertools.count(0) counts starting at 0.
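
A quick sketch of both functions. islice is used here only to cut the infinite counter short:

```python
from itertools import count, islice, repeat

twos = list(repeat(2, 5))                 # [2, 2, 2, 2, 2]
first_five = list(islice(count(0), 5))    # [0, 1, 2, 3, 4]

# repeat without a count is handy for feeding a constant argument to map.
cubes = list(map(pow, range(5), repeat(3)))   # [0, 1, 8, 27, 64]
```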

In some cases, switching to itertools can get you most of the performance benefits that you might get by switching to C.

Here are the optimization techniques he covered: vectorize, localize, use bound methods, move loop invariants out of the loop, and reduce the number of Python function calls.

itertools now has new functions called permutations, combinations, and product (which gives you the Cartesian product of its input sequences).

You can use these to generate all the possible test cases given a set of states.
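
For instance, a sketch of generating every combination of states for a test matrix. The state names below are invented for illustration:

```python
from itertools import product

# Hypothetical states for a feature under test.
browsers = ['firefox', 'chrome']
logged_in = [True, False]
locales = ['en', 'fr', 'ja']

# One tuple per combination: 2 * 2 * 3 = 12 cases.
cases = list(product(browsers, logged_in, locales))
```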

Think of itertools.product as the functional approach to nested for loops:
for t in product([0, 1], repeat=3): print t
is the same as:
for a in [0, 1]:
    for b in [0, 1]:
        for c in [0, 1]:
            print (a, b, c)
Other useful things:




vars(foo) == foo.__dict__

Use dir(foo) to get the public API for foo.

sorted(vars(collections).keys()) gives the same list as dir(collections) for a typical module. Filter out the names that start with an underscore to see only the public ones.
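
A small sketch of poking at a module's namespace this way. Filtering on a leading underscore is my own convention for "public" here, not something dir does for you:

```python
import collections

same_object = vars(collections) is collections.__dict__   # True
names = sorted(vars(collections))        # every name in the module namespace
public = [n for n in names if not n.startswith('_')]
```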

"Everything in Python is based on dictionaries."

He said that Guido added OOP to Python in a weekend.

Raymond showed code that simulated classes using just functions and dicts.

"import antigravity" launches the famous XKCD cartoon on Python in a browser.

Use "" to open a URL in a browser.

ChainMap is a new tool in Python 3.3 to do a chain of lookups in a list of dictionaries.
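
A minimal sketch of a ChainMap used for layered settings, assuming Python 3.3 or later. The dictionaries and keys are hypothetical:

```python
from collections import ChainMap

defaults = {'color': 'red', 'user': 'guest'}
environment = {'user': 'alice'}
command_line = {'color': 'blue'}

# Lookups search the mappings left to right and return the first hit.
settings = ChainMap(command_line, environment, defaults)
color = settings['color']   # 'blue', from command_line
user = settings['user']     # 'alice', from environment
```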

"I used to be a high frequency trader. I helped destroy the world's economy. Sorry 'bout that."

Using collections.namedtuple is a great way to improve the readability of code.

There are many useful, valid uses for exec and eval. He criticized people who think that exec and eval are universally evil.

collections.namedtuple is based on exec.

He showed Python code generation (i.e. generating code as a string and then passing it to exec). Using a piece of Python data that acts as a DSL, you can generate some Python code and pass it to exec. You can generate code for other programming languages just as well as Python.
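
This is not the code he showed, just a sketch of the idea under my own assumptions: a small dict acts as the "DSL", a template turns it into class source, and exec runs the result. The spec format and template are invented:

```python
# Hypothetical "DSL": a class name plus its field names.
spec = {'name': 'Point', 'fields': ['x', 'y']}

template = '''
class {name}(object):
    def __init__(self, {args}):
{assignments}
'''

source = template.format(
    name=spec['name'],
    args=', '.join(spec['fields']),
    assignments='\n'.join(
        '        self.{0} = {0}'.format(f) for f in spec['fields']),
)

namespace = {}
exec(source, namespace)       # run the generated code in its own namespace
Point = namespace['Point']
p = Point(3, 4)
```

Printing source before calling exec is a cheap way to debug the generator.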

He thinks that showing a little bit of code is better than letting people download slides.

Here's a trick: subclass an object, and add methods for all the double under methods in order to add logging. This lets you track how the method was used. You can use this to evaluate stuff symbolically instead of arithmetically. For instance, subclass int, add methods for things like __add__, and keep track of how __add__ was called.
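
A minimal sketch of the trick for a single operator. TracedInt and its log attribute are hypothetical names, and a real version would cover the other dunder methods as well:

```python
class TracedInt(int):
    """Hypothetical int subclass that records every addition."""

    log = []

    def __add__(self, other):
        TracedInt.log.append('%d + %d' % (self, other))
        # Convert to plain ints so the real addition does not recurse.
        return TracedInt(int(self) + int(other))


total = TracedInt(2) + 3 + 4
```

After this runs, total is 9 and TracedInt.log holds '2 + 3' followed by '5 + 4'.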

Polymorphism and operator overloading let you create custom classes that do additional stuff that numbers can't.

He showed function dispatch, like:
getattr(self, 'do_' + cmd)(arg)
See the cmd module.
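
A sketch of that dispatch pattern in a standalone class. The class and its commands are made up; the cmd module uses the same do_ prefix convention:

```python
class Shell(object):
    """Hypothetical command handler using the do_<name> convention."""

    def do_greet(self, arg):
        return 'hello ' + arg

    def do_shout(self, arg):
        return arg.upper()

    def dispatch(self, line):
        cmd, _, arg = line.partition(' ')
        handler = getattr(self, 'do_' + cmd, None)
        if handler is None:
            return 'unknown command: ' + cmd
        return handler(arg)
```

Adding a command is just a matter of adding another do_ method; no lookup table needs updating.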

Python's grammar is in Grammar/Grammar in the source code.

He showed how PLY puts Lex and Yacc expressions in docstrings. I.e. PLY uses docstrings to hold a DSL that PLY understands.

He showed loops with else clauses.
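
For reference, a small example of the construct (the function is my own, not from the tutorial). The else block runs only when the loop finishes without hitting break:

```python
def find_first_negative(numbers):
    for n in numbers:
        if n < 0:
            result = n
            break
    else:
        # Reached only if the loop was never broken out of.
        result = None
    return result
```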

Knuth was the one who first came up with the idea of adding something like an else clause to a loop.

The idea that you shouldn't return in the middle of a function is advice from days gone by that no longer makes sense.

The nice thing about the way Python's half-open intervals work is:
s[2:5] + s[5:8] == s[2:8]
Copy a list: c = s[:]

Clear a list: del s[:]

Another way to clear a list: s[:] = []

In Python 3.X, a copy method was added to the list class. They're also adding a clear method to lists, to match all the other collections.
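
A short sketch of the slicing idioms above. The point of del s[:] is that it empties the list in place, so every other reference to the same list sees the change:

```python
s = list('abcdefghij')
halves_join = s[2:5] + s[5:8] == s[2:8]   # True: adjacent slices never overlap

c = s[:]       # shallow copy, independent of s
alias = s      # second name for the same list
del s[:]       # clears the list in place
```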

You can use itemgetter and attrgetter for the key function when calling list.sort. There's also methodcaller.
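
A sketch of two of these as key functions, with made-up data:

```python
from operator import itemgetter, methodcaller

pairs = [('pear', 3), ('apple', 7), ('fig', 5)]
by_count = sorted(pairs, key=itemgetter(1))          # sort on the second field

words = ['banana', 'Cherry', 'apple']
by_lower = sorted(words, key=methodcaller('lower'))  # case-insensitive sort
```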

Use locale.strxfrm for the key function when sorting strings for locale-aware sorting.

Sort has a keyword argument named reverse.

To sort with two keys, use two passes:
s.sort(key=attrgetter('lastname'))           # Secondary key
s.sort(key=attrgetter('age'), reverse=True) # Primary key
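
The two-pass approach works because the sort is stable, even with reverse=True. Here is a runnable sketch with a hypothetical Person record:

```python
from collections import namedtuple
from operator import attrgetter

Person = namedtuple('Person', 'lastname age')   # hypothetical record type
s = [Person('Smith', 30), Person('Adams', 40),
     Person('Jones', 30), Person('Baker', 40)]

s.sort(key=attrgetter('lastname'))            # secondary key first
s.sort(key=attrgetter('age'), reverse=True)   # primary key last
names = [p.lastname for p in s]               # ['Adams', 'Baker', 'Jones', 'Smith']
```

The people are ordered by age descending, and within each age they stay in last-name order.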
"deque" is pronounced "deck". It gives you O(1) appends and pops from both ends.

"deque" is a "double ended queue".

He also mentioned defaultdict, Counter, and OrderedDict. Counter is a dict that knows how to count.
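
A brief sketch of Counter and deque in use:

```python
from collections import Counter, deque

counts = Counter('abracadabra')
top = counts.most_common(1)      # [('a', 5)]

d = deque([1, 2, 3])
d.appendleft(0)                  # O(1) at the left end
d.append(4)                      # O(1) at the right end
left, right = d.popleft(), d.pop()
```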

Here's how to use a namedtuple:
from collections import namedtuple
Point = namedtuple('Point', 'x y z')
p = Point(10, 20, 30)
Here's how to use a defaultdict:
d = defaultdict(list)
If you look up a key that isn't in the dict, the lookup calls __missing__ when the class defines one. You can subclass dict and just add a __missing__ method.
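
A sketch showing both a defaultdict and a hand-written __missing__. ZeroDict is a hypothetical class; unlike defaultdict, it does not store the missing key:

```python
from collections import defaultdict

groups = defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)     # a missing key starts out as []


class ZeroDict(dict):
    """Hypothetical dict subclass: missing keys read as 0."""

    def __missing__(self, key):
        return 0


z = ZeroDict(a=1)
```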

Idle has nice tab completion in the shell. It also has a nice menu item to look up modules by name so you can find the source easily.

You can use __getattr__ to introduce tracing.

He pronounces "__missing__" as "dunder missing". "dunder" is an abbreviation for "double underscore".

Writing "d.x" implicitly calls __getattribute__ which works as follows:
  • Check the instance.
  • Check the class tree:
      • If it's a descriptor, invoke it.
  • Check for __getattr__:
      • If it exists, invoke it.
      • Otherwise, raise AttributeError.
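
The lookup steps above can be watched with a small class (hypothetical names). __getattr__ only runs once the instance and the class tree have both come up empty:

```python
class Traced(object):
    kind = 'class attribute'

    def __init__(self):
        self.name = 'instance attribute'

    def __getattr__(self, attr):
        # Called only after the normal lookup has failed.
        return 'fallback for ' + attr


t = Traced()
found_on_instance = t.name     # 'instance attribute'
found_on_class = t.kind        # 'class attribute'
found_nowhere = t.other        # 'fallback for other'
```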
OrderedDict is really helpful when you must remember the order. This helps if you're going to move to a dict temporarily and then want stuff to come back out in the same order that it went in.

Each of the methods in OrderedDict has the same big O as the respective methods in dict. (Presumably, the constants are different.)

Here is Raymond's documentation on descriptors.

Here's a descriptor:
class Desc(object):

    def __get__(self, obj, objtype):
        # obj will be None if the descriptor is invoked on the class.
        print "Invocation!"
        return obj.x + 10

class A(object):

    def __init__(self, x):
        self.x = x

    plus_ten = Desc()

a = A(5)
If you attach a descriptor to an instance instead of a class, it won't work.
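
To see the difference, here is a sketch along the lines of the example above, without the print so it runs on Python 2 or 3. PlusTen and the attached attribute are hypothetical names:

```python
class PlusTen(object):
    """Hypothetical descriptor mirroring the Desc example."""

    def __get__(self, obj, objtype=None):
        if obj is None:          # accessed on the class itself
            return self
        return obj.x + 10


class A(object):
    plus_ten = PlusTen()         # lives in the class dict

    def __init__(self, x):
        self.x = x


a = A(5)
on_class = a.plus_ten            # 15: __get__ was invoked
a.attached = PlusTen()           # stored in the instance dict
on_instance = a.attached         # the PlusTen object itself; __get__ is not invoked
```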

There is more than one __getattribute__ method:
A.x => type.__getattribute__(A, 'x')
a.x => object.__getattribute__(a, 'x')
By overriding __getattribute__, you "own the dot".

"Super Considered Super" was a blog post he wrote to refute "Super Considered Harmful".

__mro__ gives you the method resolution order.

super() doesn't necessarily go to the parent class of the current class. It's all about the instance's ancestor tree. super() might go to some other part of the instance's MRO, some part that your class doesn't necessarily know about.
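
A sketch with a diamond hierarchy (class names are made up). Left knows nothing about Right, yet on a Child instance its super() call lands on Right:

```python
class Base(object):
    def who(self):
        return ['Base']


class Left(Base):
    def who(self):
        return ['Left'] + super(Left, self).who()


class Right(Base):
    def who(self):
        return ['Right'] + super(Right, self).who()


class Child(Left, Right):
    def who(self):
        return ['Child'] + super(Child, self).who()


via_child = Child().who()   # ['Child', 'Left', 'Right', 'Base']
via_left = Left().who()     # ['Left', 'Base']
```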

Functions are descriptors. If you attach a function to a class dictionary, it'll add the magic for bound methods.

A.__dict__['func'] returns a normal function. In Python 2, A.func returns an unbound method (in Python 3 it is just the plain function). A().func returns a bound method.
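
A sketch of the binding step done by hand. Calling the function's __get__ is what attribute access on the instance does behind the scenes:

```python
class A(object):
    def func(self):
        return 'called on ' + type(self).__name__


a = A()
raw = A.__dict__['func']       # the plain function stored in the class dict
bound = raw.__get__(a, A)      # the descriptor protocol produces a bound method
result = bound()               # same as a.func()
```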

Here is an example of using slots.

Here is another example of using slots:
class A(object):
    __slots__ = 'x', 'y'
If you have an instance of a class that uses slots, then it won't have a __dict__ attribute.
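
A quick check of that behavior, using a hypothetical Point class. Without a __dict__ there is nowhere to put an attribute that isn't listed in __slots__:

```python
class Point(object):
    __slots__ = 'x', 'y'


p = Point()
p.x = 1
has_dict = hasattr(p, '__dict__')      # False

try:
    p.z = 3                            # 'z' is not a declared slot
except AttributeError:
    rejected = True
else:
    rejected = False
```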

The type metaclass controls how classes are created. It supplies them with __getattribute__.

"Descriptors are how most of the modern features of Python were built."

At this point in the day, my brain was dead, and he was about to start talking about Unicode. I'm not sure that saving Unicode for the end of the day is the best strategy ;)

He said that "unicode" should be called "unitable".

Unicode is a dictionary of code points to strings. The glyphs are not part of Unicode. They're part of a font rendering engine.

There are more than 100k unicode code points.

Microsoft and Apple worked hard on Arial so that it has glyphs for almost every codepoint.

from unicodedata import category, name

Arabic and Chinese have their own glyphs for digits. int works correctly with all the different ways to write numbers.
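
A sketch using Arabic-Indic digits, which belong to the Unicode decimal digit category (Nd). I have only checked this for Nd characters, so treat it as an illustration of that case:

```python
from unicodedata import category, name

five = u'\u0665'
five_name = name(five)            # 'ARABIC-INDIC DIGIT FIVE'
five_category = category(five)    # 'Nd', a decimal digit

forty_two = int(u'\u0664\u0662')  # Arabic-Indic digits four and two
```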

There are two ways to write an umlaut O because of combining characters.

Use "unicodedata.normalize('NFC', s)" to normalize the combining characters.
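
A sketch of the two spellings of an o with an umlaut and how normalization reconciles them:

```python
import unicodedata

composed = u'\u00f6'        # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = u'o\u0308'     # 'o' followed by COMBINING DIAERESIS

differ = composed != decomposed                                   # True
same_nfc = unicodedata.normalize('NFC', decomposed) == composed   # True
same_nfd = unicodedata.normalize('NFD', composed) == decomposed   # True
```

Normalizing both sides before comparing or using strings as dict keys avoids surprises.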

Arabic and Hebrew are written right-to-left--but not for numbers!

There are unicode control characters to switch which direction you're writing:
U+200E is for left-to-right
U+200F is for right-to-left
If you slice a string, you might accidentally chop off the unicode control character which causes the text to be backwards.

Just google for "bidi unicode" to get lots of help.

Most machines are little endian, but the Internet is big endian. Computers byte swap a lot, but they do it in hardware.

Code pages assume that the only people in the world are "us and the Americans." Everyone else gets question marks.

Encodings with "utf" in them do not lose information for any language. Any other encoding does.

If you use UTF-8, you lose the ability to get O(1) random access to characters in the string.

UTF-8 gives you some compression compared to fixed-width encodings, but not much.

The three main unicode problems are "many-to-one, one-to-many, and bidi."

Doubly encoding something or doubly decoding something is a super common problem.

If some characters don't display, it's probably a font problem. Try Arial.

The "one true encoding" is "UTF-8" (according to Tim Berners-Lee).

UTF-8 is a superset of ASCII.

UTF-8 has holes. I.e. there are some number combinations that are not valid.
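
A sketch of these UTF-8 properties: variable width, ASCII compatibility, and byte sequences that are simply invalid:

```python
text = u'na\u00efve'                  # 5 characters
encoded = text.encode('utf-8')        # 6 bytes: the i with diaeresis takes two

ascii_same = u'abc'.encode('utf-8') == u'abc'.encode('ascii')   # True

try:
    b'\xff'.decode('utf-8')           # 0xFF never appears in valid UTF-8
except UnicodeDecodeError:
    invalid = True
else:
    invalid = False
```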

There's a lot of data in the world that is still encoded in UCS2. It's a two byte encoding.

It was a presidential order that caused us to move from EBCDIC to ASCII.

It was the Chinese government that decided UCS2 was not acceptable.

UTF-16_be is a superset of UCS2.

There are only a handful of Chinese characters that don't fit into UCS2. The treble clef is a character that won't fit in UCS2.
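
A sketch with the treble clef, assuming Python 3 (where ord works on a single astral character). Its code point is above U+FFFF, so UTF-16 needs a surrogate pair, which is four bytes instead of two:

```python
import unicodedata

clef = u'\U0001D11E'
clef_name = unicodedata.name(clef)      # 'MUSICAL SYMBOL G CLEF'
beyond_two_bytes = ord(clef) > 0xFFFF   # True

utf16 = clef.encode('utf-16-be')        # surrogate pair D834 DD1E
```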

To figure out what encoding something is in, HTTP has headers and email has MIME types.

If a browser wants to guess at an encoding, it'll try all the encodings and look for character frequency distributions. You can fool such a browser by giving it a page that says, "which witch has which witch?"

Mojibake is when you get your characters mixed up because you guessed the encoding wrong.
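
A sketch of mojibake and one way to undo it, assuming the wrong guess was Latin-1 and no bytes were lost along the way:

```python
original = u'caf\u00e9'
raw = original.encode('utf-8')          # b'caf\xc3\xa9'

# Wrong guess: each UTF-8 byte becomes its own Latin-1 character.
garbled = raw.decode('latin-1')         # displays as "cafÃ©"

# Reverse the wrong decode, then decode correctly.
repaired = garbled.encode('latin-1').decode('utf-8')
```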
