
PyCon Notes: Introduction to SQLAlchemy

At PyCon, Mike Bayer, the author of SQLAlchemy, gave a three-hour tutorial on it. Here's the video. What follows are my notes.

He used something called sliderepl. sliderepl is a nice ASCII tool that's a mix of slides and a REPL. You can flip through his source code / slides in the terminal. It's actually pretty neat.

At the lowest level, SQLAlchemy sends strings to the database and interprets the responses.

It took him 10 years to write it. He started the project in 2005. He's finally going to hit 1.0.

SQLAlchemy forces you to be aware of transactions.

Isolation levels have to do with how ongoing transactions see each other's in-progress work.


SQLAlchemy's goals:

    Provide helpers, etc.

    Provide a fully featured facade over the Python DBAPI.

    Provide an industrial strength, but optional, ORM.

    Act as a base for in-house tools.


    Make the usage of different DBs and adaptors as consistent as possible.

    But still expose unique features in each backend.

    It's not realistic for an ORM to perfectly match any DB. He mentioned
    leaky abstractions.

    Don't hide the DB. You must continue to think in SQL.

    Provide automation and DRY.

    Allow expression using declarative patterns.

Here's the stack:

    SQLAlchemy ORM
    SQLAlchemy Core:
        SQL expression language
        Connection pooling
        Dialects (there are different libraries for different DBs)

Reddit and Dropbox use SQLAlchemy Core without the ORM.

An "Engine" is a registry which provides connectivity to a particular DB server.

A "Dialect" interprets generic SQL and database commands in terms of a specific DBAPI and DB backend.

A "Connection Pool" holds a collection of DB connections in memory for fast reuse. It also handles reconnecting if the connection drops.

The "SQL Expression Language" allows SQL to be written using Python expressions.

"Schema/Types" uses Python objects to represent tables, columns, and datatypes.

The ORM:

    Allows construction of Python objects which can be mapped to relational DB

    Transparently persists objects into their corresponding DB tables using
    the "unit of work" pattern.

    Provides a query system which loads objects and attributes using SQL
    generated from mappings.

    Builds on top of the Core.

    Presents a slightly more object centric perspective as opposed to a schema
    centric perspective.

SQLAlchemy is like an onion:

    ORM:
        SQL Expressions:
            Table Metadata, Reflection, DDL:
                Engine, Connection, Transactions

Level 1: Engine, Connection, Transactions

The DBAPI is the Python Database API. It's the de facto standard for providing Python DB interfaces.

Most DBs have more than one DBAPI implementation.

MySQL has more than 10, and SQLAlchemy has to support about 6 of them.

They vary wildly.

He showed an example of using the DBAPI directly.

His favorite DBAPI implementation is psycopg2 (for PostgreSQL).

The DBAPI assumes that a transaction is always in progress. There is no begin() method, only commit() and rollback().

The DBAPI encourages bound parameters, via the execute() and executemany() methods, but it has six different formats.
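Here's a minimal sketch of that shape using the stdlib sqlite3 driver, whose bound-parameter format ("paramstyle") is the "?" qmark style; the table and values are invented for the demo:

```python
import sqlite3

# sqlite3 is one DBAPI implementation among many.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, username TEXT)")

# execute()/executemany() take bound parameters; never interpolate by hand.
cur.executemany("INSERT INTO user (username) VALUES (?)",
                [("ed",), ("jack",)])

# Note there's no begin(): a transaction is implicitly in progress.
conn.commit()

cur.execute("SELECT username FROM user WHERE username = ?", ("ed",))
row = cur.fetchone()
print(row)  # ('ed',)
conn.close()
```

Other DBAPIs use different formats for the same thing, e.g. `%s` or `:name`, which is part of what SQLAlchemy papers over.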

All DBAPIs have inconsistencies regarding lots of things.

The DBAPI has its own exception hierarchy.

The first layer in SQLAlchemy is known as the "Engine", which is the object that maintains the classical DBAPI interaction.

SQLAlchemy has a lazy initialization pattern. It does a lot of lazy stuff.

An engine is a "factory" for connections.

    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///some.db")
    result = engine.execute("somequery ...=:empid", empid=...)
    row = result.fetchone()

At this level of the API, the syntax for quoting variables (:empid) is whatever the DBAPI expects.

row is a tuple as well as a dict.

You can loop over the result:

    for row in result:
        print row

The actual cursor is result.cursor.

Under the covers, result.cursor.description has the names of the fields.

There's also result.fetchall().
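A runnable sketch of this round trip (the employee table is invented for the demo; note that `engine.execute()` was the pre-1.0 spelling used in the talk, while current releases want an explicit connection and `text()` for raw SQL):

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # in-memory SQLite

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE employee (emp_id INTEGER, name TEXT)"))
    conn.execute(text("INSERT INTO employee VALUES (1, 'ed'), (2, 'jack')"))

with engine.connect() as conn:
    result = conn.execute(
        text("SELECT name FROM employee WHERE emp_id = :empid"),
        {"empid": 1})
    row = result.fetchone()
    print(row[0])    # rows support index access...
    print(row.name)  # ...and named-tuple style access by column name
```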

You can control the scope of the connection using connect():

    conn = engine.connect()
    result = conn.execute(query)


    trans = conn.begin()

If you don't use a transaction explicitly, it'll use autocommit: each statement gets wrapped in its own transaction.

The DBAPI doesn't use autocommit by default.

Using with:

    with engine.begin() as conn:
        conn.execute(query)

Turning on debugging:

    engine.echo = True

Connecting directly via the engine is called "connectionless execution". The engine connects and disconnects for us.

Using a Connection explicitly lets you control scope.

There's a connection pool.

Whatever you give to engine.execute() gets passed to the DBAPI directly.

Level 2: Table Metadata, Reflection, DDL

He was inspired by "Patterns of Enterprise Application Architecture" when he wrote the SQL generation code and the ORM.

You can use SQLAlchemy to generate a schema.

You can also have SQLAlchemy use reflection to load an existing schema.

    from sqlalchemy import MetaData, Table, Column, \
        Integer, String
    metadata = MetaData()
    user_table = Table('user', metadata,
        Column('id', Integer, primary_key=True),
        Column('username', String),
        Column('fullname', String)
    )

String is a varchar.

user_table.c has all the columns:

Using it:

    user_table.c.username == 'asdf'

Creating tables:

    metadata.create_all(engine)

Other column types:

    Numeric(10, 2)
    Enum('a', 'b', 'c')


    from sqlalchemy import ForeignKey
    Column('user_id', Integer, ForeignKey('user.id'))

The references are lazy, so the user table can be created later.

Composite foreign keys:

    Column('story_id', Integer),
    Column('version_id', Integer),
    ForeignKeyConstraint(
        ['story_id', 'version_id'],
        ['story.story_id', 'story.version_id']
    )

There's a special setting (use_alter=True on the constraint) to tell it to add foreign key dependencies later. Use this if you have mutually dependent tables.

Nullable is True by default. Here's how to make it False:

    Column('fullname', String, nullable=False)
You can combine lines and leave out the type for foreign keys:

    Column('owner_id', ForeignKey('user.id'))


Reflecting an existing table from the database:

    metadata2 = MetaData()
    user_reflected = Table('user', metadata2, autoload=True,
                           autoload_with=engine)

It took a really long time for him to learn how to meet everyone's needs.

Don't use "bound metadata". It's an antipattern.

Another system (not using metadata):

    from sqlalchemy import inspect
    inspector = inspect(engine)
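A runnable sketch of the inspector (the table here is made up for the demo; `get_table_names()` and `get_columns()` are the real inspection calls):

```python
from sqlalchemy import create_engine, inspect, text

engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE user (id INTEGER PRIMARY KEY, name VARCHAR)"))

# The inspector queries the database's catalogs directly.
inspector = inspect(engine)
print(inspector.get_table_names())                         # ['user']
print([c["name"] for c in inspector.get_columns("user")])  # ['id', 'name']
```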

Level 3: Types and Querying


Among the types:

    Numeric (a decimal)

Create and drop:

    metadata.create_all(engine)
    metadata.drop_all(engine)
These are classes. They have magic methods like __eq__, etc.


This returns a BinaryExpression to be used as part of an expression:

    user_table.c.username == 'ed'


    ((user_table.c.username == 'ed') |
     (user_table.c.username == 'jack'))

There's also and_ and or_.

There's also >, etc.

There's also == None. It translates to "is NULL".

There are lots of operators.
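A sketch of how these expressions render; printing an expression compiles it against a default dialect (exact strings can vary by version):

```python
from sqlalchemy import MetaData, Table, Column, Integer, String, or_

metadata = MetaData()
user_table = Table('user', metadata,
    Column('id', Integer, primary_key=True),
    Column('username', String))

expr = user_table.c.username == 'ed'  # a BinaryExpression, not a bool
print(expr)                           # renders with a bound parameter

print(user_table.c.username == None)  # renders as IS NULL
print(or_(user_table.c.username == 'ed',
          user_table.c.username == 'jack'))  # OR of two comparisons
```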



Executing a select built from expressions:

    engine.execute(
        select([user_table]).where(user_table.c.username == 'ed'))


Expressions can be compiled for a specific dialect:

    from sqlalchemy.dialects import postgresql
    print((user_table.c.username == 'ed').compile(dialect=postgresql.dialect()))


    insert_stmt = user_table.insert().values(
        username='ed',
        fullname='Ed Jones'
    )
    conn = engine.connect()
    result = conn.execute(insert_stmt)

Inserting many:

    conn.execute(user_table.insert(), [
        {'username': 'jack', ...},
    ])


Selecting:

    from sqlalchemy import select

    result = conn.execute(
        select([user_table.c.username, user_table.c.fullname])
            .where(user_table.c.username == 'ed'))

Select all:

    conn.execute(select([user_table]))
If you use where multiple times, it ands them together.
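A sketch showing the AND-ing in the compiled SQL (this uses the positional `select()` spelling of later releases; the talk-era spelling was `select([...])`):

```python
from sqlalchemy import MetaData, Table, Column, Integer, String, select

metadata = MetaData()
user_table = Table('user', metadata,
    Column('id', Integer, primary_key=True),
    Column('username', String),
    Column('fullname', String))

# Two chained .where() calls...
stmt = (select(user_table.c.username)
        .where(user_table.c.username != 'wendy')
        .where(user_table.c.fullname != None))

print(stmt)  # ...compile to a single WHERE joined with AND
```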

The result for an insert will have the primary key.

Looking at a statement:

    stmt = ...
    print stmt

Look at the rowcount on the result to see how many rows were affected.

Level 4: ORM

The OO classes are called a "domain model".

The most basic level is mapping classes to tables, with object attributes mapped to columns and each object representing a row.

Some ORMs can represent multiple rows as domain objects.

Most ORMs can do composition using foreign key associations.

SQLAlchemy can represent class inheritance hierarchies.

SQLAlchemy can handle sharding.

Flavors of ORMs:

    Active Record: Domain objects handle their own persistence.

    Data Mapper: It tries to keep the details of persistence separate from the
    object being persisted.

There are also different ways of configuring them. Most use an "all-at-once", or declarative, style where class and table information live together.

Another style is to have the class declaration and the ORM mapper configured separately.

Hibernate is more pure, but it's tedious and verbose.

That's how SQLAlchemy worked in the beginning, but then he realized it was too verbose.

The SQLAlchemy ORM is a data mapper on the bottom, but it has a declarative style on top.

.save() is lazy. It doesn't flush immediately.

Objects are identical if they have the same primary key.

SQLAlchemy uses lazy loading.

SQLAlchemy is not compatible with Tornado and Twisted because they're callback oriented, and the lazy loading conflicts with that.

It also supports eager loading.

It also has method chaining.

Using the ORM:

    from sqlalchemy.ext.declarative import declarative_base
    Base = declarative_base()

    class User(Base):
        __tablename__ = 'user'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        fullname = Column(String)

        def __repr__(self):
            return "<User(%r, %r)>" % (, self.fullname)


The Mapper links a class to a table:

    User.__mapper__

Other stuff:


Creating a user:

    ed_user = User(name='ed', ...)

Setting stuff up:

    from sqlalchemy.orm import Session

    engine = ...
    session = Session(bind=engine)

The session is an ORM object.



    our_user = session.query(User).filter_by(name='ed').first()

Instead of where, you use filter_by.

As soon as the session begins its work, it starts a transaction.

It keeps the identity map, so "is" works for ORM objects. Generally, other Active Record ORMs get this wrong.
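A runnable sketch of the identity map (the model is invented for the demo; the `declarative_base` import location is the 1.4+ one):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(bind=engine)
session.add(User(name='ed'))
session.commit()

# Two separate queries for the same row hand back the *same* object.
a = session.query(User).filter_by(name='ed').first()
b = session.query(User).filter_by(name='ed').first()
print(a is b)  # True
```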

Add all:

    session.add_all([
        User(...), ...
    ])

session.dirty has the changed objects. has the new objects.

session.commit() flushes everything and commits the transaction.

session.flush() flushes changes to the database without doing a commit.

After a commit, there is no transaction. All data is invalidated. Accessing data will start a new transaction and re-load the data from the database.

He learned stuff from Storm and Hibernate about how to do things this way.

Expiring after the commit is the right thing to do.

session.rollback() also ends the transaction.

The object is a proxy to a row in the database. You only know about stuff in the database when there's a transaction in play.

One of the challenges with SQLAlchemy is performance, especially when setting a ton of attributes.

The core will do whatever you want. The ORM is more opinionated. It's clearer about how things should be done.

He has to be really careful about adding features. A lot of times, he's had to remove features that were not well thought out.

Here's a fuller example:

    class Network(Base):
        __tablename__ = 'network'
        network_id = Column(Integer, primary_key=True)
        name = Column(String(100), nullable=False)

    session.add_all([Network(...), ...])

nullable is True by default.

Inner workings:

    User.__mapper__.tables[0]
    User.__table__ == 'ed'

But, you can write higher-level code such as: == 'ed'


    query = session.query(User).filter( == 'ed')

    for name, fullname in session.query(, User.fullname):
        print name, fullname

Create a dict really easily:

    d = dict(session.query(, User))

Using order, limit, offset:

    session.query(User).order_by(User.fullname).limit(2).offset(1)

filter is for full-blown expressions:

    .filter(User.something == "something")

filter_by is for filtering by a specific field:

    .filter_by(something="something")

If you use multiple filters in a chain, they are anded.

.first() limits things to one row.

.one() asserts that there is one and only one row. This may raise NoResultFound or MultipleResultsFound.
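Sketching the difference against an empty table (the setup boilerplate is invented for the demo):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.orm.exc import NoResultFound

Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(bind=engine)

print(session.query(User).first())  # None: zero rows is fine for first()

try:
    session.query(User).one()       # but one() insists on exactly one row
except NoResultFound:
    print("no rows")
```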


    q = session.query(User.fullname).order_by(User.fullname)

    q2 = q.filter(or_( == 'mary', == 'ed'))
    print q2[1]

Advanced ORM Usage

Start with the same mappings as above.

What he showed above is a little verbose. There are shortcuts.

Unlike other ORMs, SQLAlchemy doesn't assume you're going to have a primary key named id. He was trying to make it appealing for existing databases.

It doesn't decide what the database should look like. This makes it really explicit.

There are ways to enforce conventions using mixins.

    class Address(Base):
        __tablename__ = 'address'
        id = Column(Integer, primary_key=True)
        email_address = Column(String)
        user_id = Column(Integer, ForeignKey('user.id'))

        # This gives you address.user and user.addresses.
        user = relationship("User", backref="addresses")

This stuff works with reflection too.

    jack.addresses = [
        Address(...),
    ]
You can do anything to jack.addresses that you can do with a list.

If you run session.add(jack), it adds everything else automatically.

This is where ORMs can save you a lot of time. They make persisting stuff much easier.
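A runnable sketch of that cascade (classes as above, with invented names and emails): adding jack alone persists his addresses too.

```python
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Address(Base):
    __tablename__ = 'address'
    id = Column(Integer, primary_key=True)
    email_address = Column(String)
    user_id = Column(Integer, ForeignKey('user.id'))
    user = relationship("User", backref="addresses")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(bind=engine)

jack = User(name='jack')
jack.addresses = [Address(email_address='jack@example.com'),
                  Address(email_address='j@example.com')]

session.add(jack)  # the two Address objects come along for free
session.commit()
print(session.query(Address).count())  # 2
```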

The collection stays in memory until the transaction ends.

There is a way to use eager loading to avoid the N+1 queries problem.

As soon as you begin working with the session, it starts a new transaction.

The only time you need to use session.begin() is if you explicitly turned on autocommit.

If you don't care about the transaction, just .close() it. Then the connection pool will do whatever it does.

When he does a POST in a web app, he'll do an explicit commit().

SQLAlchemy uses a connection pool by default.

Collections and references are updated by manipulating objects, not primary / foreign key values.

An implicit join:

    session.query(User, Address).filter( == Address.user_id)

Without the filter, you get the cartesian product.

Here's an explicit join:

    session.query(User, Address).join(
        Address, == Address.user_id)

Using the relationship:

    session.query(User, Address).join(User.addresses).all()

Self join:

    from sqlalchemy.orm import aliased

    a1, a2 = aliased(Address), aliased(Address)
    session.query(User).join(a1).join(a2).filter(
        a1.email_address == '...'
    ).filter(a2.email_address == '...').all()

There's a group_by.

Add .subquery() to the end to make it a subquery:

    subq = session.query(...).subquery()

    ...outerjoin(subq, == subq.c.user_id)...

Here's how to avoid the N+1 problem:

    from sqlalchemy.orm import subqueryload

    for user in session.query(User).options(
            subqueryload(User.addresses)):
        print(user, user.addresses)

There's a really good chapter in the docs that covers this stuff.

Use Alembic for migrations.

People used to use Elixir on top of SQLAlchemy, but Elixir isn't being maintained anymore. Everything you can do in Elixir can be done using Declarative in SQLAlchemy.

