Showing posts from March, 2012

PyCon: Parsing Sentences with the OTHER Natural Language Tool: Link Grammar

See the website.

When it comes to NLTK (the Natural Language Toolkit), some assembly is definitely required. If you're not a linguist, it's not so easy.

Link Grammar is a theory of parsing sentences. It is also a specific implementation written in C. It's from CMU. It started in 1991. The latest release was in 2004.

Link Grammar does not use "explicit constituents" (e.g. noun phrases).

It puts an emphasis on lexicality.

Sometimes, specific words have a large and important meaning in the language. For instance, consider the word "on" in "I went to work on Friday."

pylinkgrammar is a Python wrapper for Link Grammar. (Make sure you use the version of pylinkgrammar on BitBucket.)

Often, there are multiple valid linkages for a specific sentence.

It can produce a sentence tree. It can even generate Postscript containing the syntax tree. (The demo was impressive.)

A link grammar is a set of rules defining how words can be linked together to form sent…

PyCon: Parsing Horrible Things with Python

See the website.

He's trying to parse MediaWiki text. MediaWiki is based on lots of regex replacements. It doesn't have a proper parser.

He's doing this for the Mozilla wiki.

He tried Pyparsing. (Looking at it, I think I like PLY better, syntactically at least.) He had problems with debugging. Pyparsing is a recursive descent parser.
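To make "recursive descent" concrete, here is a minimal sketch of such a parser for arithmetic expressions. It is not Pyparsing's API, and it has nothing to do with MediaWiki; the grammar and the names tokenize, Parser, and evaluate are made up for illustration. The idea is that each grammar rule becomes one method, and that method calls the methods for the rules it refers to.

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    # Grammar (hypothetical, for illustration):
    #   expr   := term (('+' | '-') term)*
    #   term   := factor (('*' | '/') factor)*
    #   factor := NUMBER | '(' expr ')'

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        token = self.advance()
        if token == "(":
            value = self.expr()  # recursion: a factor can contain a whole expr
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is None or not token.isdigit():
            raise SyntaxError("expected a number, got %r" % (token,))
        return int(token)


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("unexpected token %r" % parser.peek())
    return value


print(evaluate("1 + 2 * (3 - 1)"))  # 5
```

The call stack mirrors the grammar, which is what makes this style easy to write by hand and, as he found, sometimes hard to debug.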

He tried PLY. He really likes it. It is LALR or LR(1). PLY has stood the test of time, and it has great debugging output.

However, it turns out that MediaWiki's syntax is a bit too sloppy for PLY.

LALR or LR(1) just doesn't work for MediaWiki.

Next, he tried Pijnu. It supports PEG, parsing expression grammars. He got it to parse MediaWiki. However, it has no tests, it's not written Pythonically, it's way too slow, and it eats up a ton of RAM!

He wrote his own parser called Parsimonious. His goals were to make it fast, short, frugal on RAM usage, minimalistic, understandable, idiomatic, well tested, and readable. He wanted to …

PyCon: Keynote: Guido van Rossum

Take the survey at

Videos are already being uploaded to

Guido is wearing a "Python is for girls" T-shirt.

He showed a Python logo made out of foam on a latte. Yuko Honda really did it.

He had some comments about trolls.

Troll, n:
A leading question whose main purpose is to provoke an argument that cannot be won.
A person who asks such questions.

Guido said that Python, Ruby, and Perl are exactly the same language from 10,000 feet. They should be our friends. Lisp is not all that different from Python, although you might not notice it. Ruby is not kicking Python's butt in any particular case.

They added unicode literals back to Python 3.3 just to make compatibility with Python 2.7 easier.

"I should have used a real presentation tool instead of Google Docs." (Doh! He's a Googler!)

More and more projects are being ported to Python 3.

People are actually capable of using one code base for Python 2.7 and Python 3.3. There's a …

PyCon: Python for Data Lovers: Explore It, Analyze It, Map It

See the website.

I missed the beginning of this talk, and since I'm not a data lover, I'm afraid my notes may not do it justice.

There is lots of interesting "open data."

There is a lot of data that is released by cities.

She's a geographer and obviously a real data lover. She gets excited about all this data.

csvkit is an amazing set of utilities for working with CSV. It replaces the csv module.

"Social network analysis is focused on uncovering the patterning of people's interactions."

They used QGIS.

She relies heavily on Google Refine.

PySAL is really great for spatial analysis.

She recommended "Social Network Analysis for Startups" from O'Reilly. Her advisor wrote it.

PyCon: Storm: the Hadoop of Realtime Stream Processing

See the website.

"Storm: Keeping it Real(time)."

Storm is from dotCloud, which is a platform to scale web apps.

They're in the MEGA-DATA zone.

They were using RRD.

Storm is a real-time computation framework.

It can do distributed RPC and stream processing.

It focuses on continuous computation, such as counting all the things passing by on a stream.

Storm does for real-time what Hadoop does for batch processing.

It is a high-volume, distributed, horizontally scalable, continuous system.

Even if the control layer goes down, computation can keep going.

Its strategy for handling failures is to die and recover quickly.

It is fault tolerant, but not fault proof.

Data is processed at least once. With more work and massaging, they have support for "exactly once".

Storm does not handle persistence.

If failures happen, it resubmits stuff through the system.
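Since failed work gets resubmitted, a naive counter can count the same item twice. Here is a minimal sketch of one common way to get an "exactly once" effect on top of at-least-once delivery, assuming every message carries a unique ID. This is not Storm's API; IdempotentCounter and the sample stream are hypothetical.

```python
from collections import Counter


class IdempotentCounter:
    """Counts words from messages that may be delivered more than once."""

    def __init__(self):
        self.counts = Counter()
        # In a real system this set would have to live in persistent storage.
        self.seen_ids = set()

    def process(self, message_id, word):
        # A replayed message has an ID we have already handled; skip it.
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.counts[word] += 1
        return True


counter = IdempotentCounter()
# Message 1 shows up twice, as it would after a failure and a replay.
stream = [(1, "python"), (2, "storm"), (1, "python"), (3, "python")]
for message_id, word in stream:
    counter.process(message_id, word)

print(counter.counts["python"])  # 2, not 3
```

The extra bookkeeping is the "more work and massaging": the stream may deliver a message again, so the consumer has to make processing it a second time harmless.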

It doesn't process batches reliably.

It complements Hadoop, but does not attempt to replace Hadoop.

It does not protect ag…

PyCon: Pragmatic Unicode, or, How do I stop the pain?

See the website.

See the slides.

This was one of the best talks. The room was packed. This is the best unicode talk I've ever been to!

Computers deal with bytes: files, networks, everything. We assign meaning to bytes using convention. The first such convention was ASCII.

The world needs more than 256 symbols. Character codes were a system that mapped single bytes to characters. However, this still limited us to 256 symbols.


Then, they tried two bytes.

Finally, they came up with Unicode.

Unicode assigns characters to code points. There are 1.1 million code points. Only 110k are assigned at this point. All major writing systems have been covered.

"Klingon is not in Unicode. I can explain later."

Unicode has many funny symbols, like a snowman and a pile of poo.

"U+2602 UMBRELLA" is a Unicode character.

Encodings map unicode code points to bytes.

UTF-16, UTF-32, UCS-2, UCS-4, UTF-8 are all encodings.
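Here is a small Python 3 sketch of one code point turning into different bytes under different encodings. I picked the big-endian variants of UTF-16 and UTF-32 so that no byte order mark gets prepended; that choice is mine, not something from the talk.

```python
umbrella = "\u2602"  # U+2602 UMBRELLA, a single code point

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    encoded = umbrella.encode(encoding)
    print(encoding, encoded.hex(), len(encoded))

# utf-8 e29882 3
# utf-16-be 2602 2
# utf-32-be 00002602 4

# Decoding reverses the mapping, as long as you name the right encoding.
assert b"\xe2\x98\x82".decode("utf-8") == umbrella
```

Same character, three different byte sequences. The bytes mean nothing until you know which encoding produced them.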

UTF-8 is the king of encodings. It use…

PyCon: How the PyPy JIT Works

See the website.

"If the implementation is hard to explain, it's a bad idea." (Except PyPy!)

The JIT is interpreter agnostic.

It's a tracing JIT. They compile only the code that's run repeatedly through the interpreter.

They have to remove all the indirection that's there because it's a dynamic language.

They try to optimize simple, idiomatic Python. That is not an easy task.

(The room is packed. I guess people were pretty excited about David Beazley's keynote.)

There's a metainterpreter. It traces through function calls, flattening the loop.

JIT compiler optimizations are different than compiler optimizations. You're limited by speed. You have to do the optimizations fast.

If objects are allocated in a loop and they don't escape the loop, they don't need to use the heap and they can remove boxing.
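As a rough illustration, this is the kind of loop I think they mean. The Point class and the function are my own made-up example, and whether PyPy really removes the allocation here depends on what the trace looks like.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def sum_of_squared_lengths(n):
    total = 0
    for i in range(n):
        p = Point(i, i + 1)  # allocated on every iteration...
        total += p.x * p.x + p.y * p.y  # ...but only its fields are read
        # p is never stored anywhere or returned, so it does not escape the loop
    return total


print(sum_of_squared_lengths(3))  # 19
```

Because nothing outside the loop body can ever see p, a JIT is free to keep x and y in registers and skip building the object at all.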

They do unrolling to take out the loop invariants.

They have a JIT viewer.

Generating assembly is surprisingly easy. They use a linear register…

PyCon: Why PyPy by Example

See the website.

PyPy is a fast, open source Python VM.

It's a 9-year-old project.

PyPy is not a silver bullet.

For speed comparisons, see

PyPy is X times faster than CPython. If it's not faster than CPython, it's a bug.

Hardcore number crunching in a loop is much, much faster in PyPy.

(When I think about PyPy, V8, and all the various versions of Ruby, it makes me think that it's an amazing time for VMs!)

If you think of the history of software engineering, GC was hard to get right, but now it's mostly done. Now we talk about how to use multiple cores. It's a mess with locks, semaphores, events, etc. However, one day, using multiple cores will be something that is somewhat automatic like GC is.

He said nice things about transactional memory. It promises to give multicore usage. It has hard integration issues just like GC did. His solution is to run everything in transactional memory. I.e. let the decision about when to use transactional memory b…

PyCon: Let's Talk About ????

David Beazley gave the keynote on the second day of PyCon. He decided to talk about PyPy.

PyPy made his code run 34x faster, without changing anything.

In theory, it's easier to add new features to Python using PyPy than CPython.

He's been tinkering with PyPy lately.

IPython Notebook is cool.

Is PyPy's implementation only for evil geniuses?

PyPy scares him because there is a lot of advanced computer science inside.

He doesn't know if you can mess around with PyPy.

It takes a few hours to build PyPy.

It needs more than 4G of RAM.

PyPy translates RPython to C. It generates 10.4 million lines of C code!

PyPy is implemented in RPython, which is a restricted subset of Python.

"RPython is [defined to be] everything that our translation toolchain can accept."

The PyPy docs are hard to read.

4513 .py files, 1.25 million non-blank lines of Python. converts RPython code to C.

The PyPy version is faster than the C version of Fibonacci! Although, if you turn on C optimiza…

PyCon: Welcome Message on the Second Day

There were 2300 people at PyCon.

180 people came to the PyCon 5K race. There were 5 people who finished in under 20 minutes.

Steve Holden is the current chairman of the Python Software Foundation. However, he's letting someone else take over. He kind of gave up on OSS before coming to Python, but has since changed his mind.

There was still a tremendous gender imbalance at PyCon, but there were a lot more women this year. There was one or more women in every row when I looked around.

Yesterday, the keynote had dancing robots. You can control them with Python.

PyCon: Lightning Talks

Numba is a Python compiler for NumPy and SciPy. It replaces byte-code on the stack with simple type-inferencing. It translates to LLVM. The code then gets inserted into the NumPy runtime. They use LLVM-PY. They have a @numba.compile decorator. It's from Continuum Analytics.

… is a replacement for … He doesn't trust … … does not require the use of a mouse--it's for hackers. You can run it locally so that you don't have to give another web site your bank passwords.

Why do so many talks fall flat? Your talk should tell a story. People are story tellers. People care about people. Show puzzles, not solutions. Hacking is a skill, not a piece of knowledge.

He was measuring the Python 3 support for packages on PyPI. 54-58% of the top 50 projects on PyPI support Python 3. We planned on moving to Python 3 over the course of 5 years. We're at year 3. Update your Trove classifiers to say that your project supports Python 3.


PyCon: Introspecting Running Python Processes

See the website.

What is your application doing?

Logging is your application's diary, but there are some drawbacks.

gdb-heap, eventlet's backdoor module, and Werkzeug's debugger are all useful tools.

These all have tradeoffs.

What's missing compared to the JVM? Look at JMX.

jconsole connects to a running JVM.

jstack sends a signal to the JVM to dump the stack of every thread.

You can expose metrics via JMX.

New Relic and Graphite are also useful.

New Relic does hosted web app monitoring.

Graphite is a scalable graphing system for time series data.

socketconsole is a tool that can provide stack trace dumps for Python processes. It even works with multi-processed and multi-threaded apps. It does not use UNIX signals.
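I don't know how socketconsole does it internally, but the standard library already has the pieces for a jstack-style dump. Here is a minimal sketch; the function name is made up, and note that sys._current_frames() is a semi-private CPython API.

```python
import sys
import threading
import traceback


def format_all_thread_stacks():
    """Return a string with the current stack of every thread in the process."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "unknown")
        lines.append("Thread %s (%s):\n" % (thread_id, name))
        # format_stack returns one already-newline-terminated string per frame.
        lines.extend(traceback.format_stack(frame))
    return "".join(lines)


print(format_all_thread_stacks())
```

A real tool still needs a way to trigger this from outside the process (a socket, in socketconsole's case). The standard library's faulthandler module offers a similar dump that can be wired to a signal.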

mmstats is "the /proc filesystem for your application." It uses shared memory to expose data from an app. It has a simple API.
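To show the general idea of exposing a value through shared memory, here is a minimal sketch using a memory-mapped file from the standard library. This is not the mmstats API; the file name, the single-counter layout, and the variable names are all assumptions of mine.

```python
import mmap
import os
import struct
import tempfile

COUNTER = struct.Struct("<Q")  # one unsigned 64-bit little-endian integer

# Hypothetical location for the stats file.
path = os.path.join(tempfile.mkdtemp(), "stats.mmap")

# The file must already be as large as the region we want to map.
with open(path, "wb") as f:
    f.write(b"\x00" * COUNTER.size)

# Writer side: the application maps the file and updates the counter in place.
with open(path, "r+b") as f:
    writer = mmap.mmap(f.fileno(), COUNTER.size)
    COUNTER.pack_into(writer, 0, 42)
    writer.flush()
    writer.close()

# Reader side: a monitoring tool maps the same file read-only.
with open(path, "rb") as f:
    reader = mmap.mmap(f.fileno(), COUNTER.size, access=mmap.ACCESS_READ)
    (requests_served,) = COUNTER.unpack_from(reader, 0)
    reader.close()

print(requests_served)  # 42
```

The appeal is that the application pays almost nothing to publish a number, and a reader can look at it without interrupting the app.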

mmash is a web server that exposes stuff from mmstats.

He uses Nagios. He has pretty graphs.

See also: Scales, pystuck

Projects used in…

PyCon: Python Metaprogramming for Mad Scientists and Evil Geniuses

See the website.

This was one of the best talks.

Python is ideal for mad scientists (because it's cool) and evil geniuses (because it has practical applications).

Equipment: synthetic functions, classes, and modules; monkey patching.

"Synthetic" means building something without the normally required Python source code.

Synthetic functions can be created using exec.

Synthetic classes can be created using type('name', (), d).
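Here is a minimal Python 3 sketch of both tricks. The names add, describe, and Gadget are made up for illustration.

```python
# A synthetic function: build the source as a string and exec it
# in a private namespace.
namespace = {}
source = "def add(a, b):\n    return a + b\n"
exec(source, namespace)
add = namespace["add"]


# A synthetic class: type(name, bases, attribute_dict).
def describe(self):
    return "Gadget #%d" % self.serial


Gadget = type("Gadget", (), {"serial": 7, "describe": describe})

print(add(2, 3))            # 5
print(Gadget().describe())  # Gadget #7
```

Neither add nor Gadget ever appeared in a def or class statement in a source file, which is the sense of "synthetic" here.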

(exec and eval are very popular at PyCon this year. Three talks have shown good uses for them. I wonder if this is partially inspired by Ruby.)

Here's how to create a synthetic module:

    import new  # Python 2 only; the new module was removed in Python 3
    import sys

    module = new.module(...)
    sys.modules['name'] = module

Functions, classes, and modules are just objects in memory.
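The new module only exists in Python 2. Here is a minimal sketch of the same trick on Python 3, using types.ModuleType instead. The module name synthetic_example and its greeting attribute are hypothetical.

```python
import sys
import types

module = types.ModuleType("synthetic_example")  # hypothetical module name
module.greeting = "hello"
sys.modules["synthetic_example"] = module

# The import system checks sys.modules first, so no file on disk is needed.
import synthetic_example

print(synthetic_example.greeting)  # hello
```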

Patching third-party objects is more robust than patching third-party code.

You can use these tricks to implement Aspect-Oriented Programming.

(I wonder if it's possible to implement "call by name" using the dis mo…

PyCon: Make Sure Your Programs Crash

See the website.

This talk was given by Moshe Zadka from VMware.

Think about how to crash and then recover from the crash.

If your application recovers quickly, stuff can crash and no one will see.

Even Python code occasionally crashes due to C bugs, untrapped exceptions, infinite loops, blocking calls, thread deadlocks, inconsistent resident state, etc. These things happen!

Recovery is important.

A system failure can usually be considered to be the result of two program errors. The second error is in the recovery routine.

When a program crashes, it leaves data that was written in an arbitrary program state.

Avoid storage: caches are better than master copies.

Databases are good at transactions and at recovering from crashes.

File rename is an atomic operation in modern OSs.
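A minimal sketch of how that usually gets used: write the new contents to a temporary file in the same directory, then rename it over the real one. The helper name atomic_write is made up, and this assumes Python 3.3 or later for os.replace.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to path so readers see the old or new contents, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # the atomic step
    except BaseException:
        os.unlink(tmp_path)
        raise
```

If the program crashes before the rename, the old file is untouched and only a stray temp file is left behind. If it crashes after, the new file is complete.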

Think of efficient caches and reliable masters. Mark cache inconsistency.

He seems to be skeptical of the ACID nature of MySQL and PostgreSQL. I'm not sure why.

Don't write proper shutdown code. Always crash so th…

PyCon: Apache Cassandra and Python

See the website.

See the slides.

He doesn't cover setting up a production cluster.

Using a schema is optional.

Cassandra is like a combination of Dynamo from Amazon and BigTable from Google.

It uses timestamps for conflict resolution. The clients determine the time. There are other approaches to conflict resolution as well.

Data in Cassandra looks like a multi-level dict.
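Here is a rough sketch of both ideas in plain Python: rows as nested dicts, and "newest timestamp wins" when two copies disagree. This is only a model, not Cassandra's implementation or pycassa's API; the keys, values, and the merge function are hypothetical.

```python
# row key -> column name -> (value, timestamp supplied by the client)
replica_a = {"user:1": {"email": ("old@example.com", 100), "name": ("Ann", 100)}}
replica_b = {"user:1": {"email": ("new@example.com", 250)}}


def merge(a, b):
    """Merge two replicas, keeping the newest write for each column."""
    merged = {}
    for replica in (a, b):
        for row_key, columns in replica.items():
            row = merged.setdefault(row_key, {})
            for name, (value, timestamp) in columns.items():
                if name not in row or timestamp > row[name][1]:
                    row[name] = (value, timestamp)
    return merged


merged = merge(replica_a, replica_b)
print(merged["user:1"]["email"][0])  # new@example.com
print(merged["user:1"]["name"][0])   # Ann
```

Since the clients supply the timestamps, a client with a badly skewed clock can win or lose conflicts it shouldn't.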

By default, Cassandra eats 1/2 of your RAM. You might want to change that ;)

He uses pycassa for his client. It's the simplest approach.

telephus is a Cassandra client for Twisted.

cassandra-dbapi2 is a Cassandra client that supports DBAPI2. It's based on Cassandra's new CQL interface.

Don't use pure Thrift to talk to Cassandra.

Cassandra is good about scaling up linearly.

There's a batch interface and a streaming interface.

There's a lot of flexibility concerning column families. You can even have columns representing different periods in time.

Pycassa supports different data types.

Pycassa has an …

PyCon: Code Generation in Python: Dismantling Jinja

See the website.

See also

Is eval evil? How does it impact security and performance?

Use repr to get something safe to pass to eval for a given type.

Eval code in a different namespace to keep namespaces clean.
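A minimal Python 3 sketch of those two tips together. The variable names are made up, and this only covers plain built-in types whose repr is a valid Python literal.

```python
value = {"name": "O'Reilly", "tags": ["a", "b"], "count": 3}

# repr() produces a Python literal, with quoting and escaping handled for us.
# (The one-element tuple keeps % from treating the dict as a mapping argument.)
source = "result = %r" % (value,)

namespace = {}  # a fresh namespace, separate from our own globals
exec(source, namespace)

assert namespace["result"] == value

# exec adds __builtins__ to the namespace; everything else is ours.
print(sorted(k for k in namespace if not k.startswith("__")))  # ['result']
```

Building the source with string concatenation instead of repr would have broken on the apostrophe in "O'Reilly".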

Using code generation results in faster code than writing a custom interpreter in Python.

Here is a little "Eval 101".

Here is how to compile a string to a code object:

    code = compile('a = 1 + 2', '', 'exec')
    ns = {}
    exec code in ns  # exec(code, ns) in Python 3.
    ns['a'] == 3

In Python 2.6 or later, use "ast.parse('a = 1 + 2')", and then pass the result to the compile function.

You can modify the ast (abstract syntax tree).

You can assign line numbers.

You don't have to pass strings to eval and exec. You can handle the compilation to bytecode explicitly. You can also execute the code in an explicit namespace.
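Here is a minimal sketch of that whole pipeline: parse, modify the tree, compile, and exec in an explicit namespace. The DoubleConstants transform is a made-up example, and this assumes Python 3.8 or later, where literals are ast.Constant nodes.

```python
import ast

tree = ast.parse("a = 1 + 2")


class DoubleConstants(ast.NodeTransformer):
    """Hypothetical transform: replace every integer constant n with n * 2."""

    def visit_Constant(self, node):
        if isinstance(node.value, int):
            return ast.copy_location(ast.Constant(value=node.value * 2), node)
        return node


tree = DoubleConstants().visit(tree)
ast.fix_missing_locations(tree)  # fill in any line numbers that are missing

code = compile(tree, "<generated>", "exec")
ns = {}
exec(code, ns)
print(ns["a"])  # 6, because the code now says a = 2 + 4
```

No source string was ever edited; the change happened on the tree, which is roughly what a template engine's code generator gets to do as well.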

Jinja mostly has Python semantics, but not exactly. It uses different scoping rules.

Lexer -> P…

PyCon: Advanced Python Tutorials

I took Raymond Hettinger's Advanced Python I and II tutorials. These are my notes. See the website for more details: I and II.

Here's the source code for Python 2 and Python 3.

Raymond is the author of itertools, the set class, the key argument to the sort function, parts of the decimal module, etc.

He said nice things about "Python Essential Reference".

He said nice things about the library reference for Python. If you install Python, it'll get installed.

Read the docs for the built-in functions. It's time well-invested.

He likes Emacs and Idle. He uses a Mac.

Use the dis module to disassemble code. That's occasionally useful.
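For example, here is a minimal sketch (the add function is just something to disassemble, and dis.get_instructions assumes Python 3.4 or later):

```python
import dis


def add(a, b):
    return a + b


# Prints the bytecode; the exact opcodes vary between Python versions.
dis.dis(add)

# The same information as data, if you want to inspect it from code.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```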

Use rlcompleter to add tab completion to the Python shell.

Use "python -m test.pystone" to test how fast your machine is.

Show "python -m turtle" to your kids.

Don't be afraid to look at the source code for a module.

He likes "itty", a tiny web framework.

The decimal module is 6000 lines long!

Idle has mo…

Personal: Booth Babe

Yesterday at GDC, I achieved my goal of becoming a booth babe. I stood at a booth for 7 hours and answered questions about integrating YouTube video upload functionality into video games. Man are my feet sore! Oh well, at least I didn't have to wear heels ;)

Ruby: Using YouTube APIs for Education

I gave a talk at the East Bay Ruby Meetup and the San Francisco Ruby Meetup called Using YouTube APIs for Education. In the talk, I covered Google client libraries for Ruby, OAuth2, and doing TDD with web services using Pry and WebMock.

See also this talk on