Friday, March 21, 2008

PyCon: Core Python Containers--Under the Hood

Core Python Containers--Under the Hood

This was perhaps my favorite talk.

A list is a fixed-length array of pointers.

realloc is called occassionally to grow the list. However, overallocation is used in a very intelligent way to minimize the number of times this is necessary. Rather than simply doubling the size of the list, it's more of a curve. No more than 12.5% of the list is ever empty. The list shrinks when it is half empty.

Memory allocation under Windows is slow.

Inserting to the middle of or shrinking from the middle of a list is O(n). Appending is (on average) O(1).

Use a deque if you want to push or pop from the ends of a list. That's much more efficient than trying to inserting an element at index 0 of a normal list.

Sets are based on fixed-length hash tables. They are kept very sparse. Anytime it becomes 2/3 full, it is grown by a factor of 4. The hash table never needs to be resized for the keyword args dict (assuming you don't modify it).

The builtin set type is faster than implementing your own with dicts because he's able to use some shortcuts when implementing set operations.

On average, there are no more than 1.5 probes per lookup.

Building dicts and sets is expensive.

Dicts are the most finely tuned data structure in the language.

Using a dict in Python is way faster than using a poorly implemented map-like object in C.

2 comments:

lotrpy said...

jjinuxland, thanks for the info. looking forward to more :)
and, what's the mean of "On average, there are no more than 1.5 probes per lookup." maybe it refers to the set operations?

Shannon -jj Behrens said...

It does refer to set operations. I think it has to do with how hashes are implemented. I think it means that hash collisions are low enough that you only need to do on average 1.5 object comparisons when checking for set inclusion. I could be totally wrong ;)