Monday, November 10, 2014

Play Framework Essentials, Learning Scala, and Functional Programming Principles in Scala

I've been really busy over the last six months or so:

First, I was a technical editor for Play Framework Essentials. It's a fun book because every example is first shown in Scala and then in Java, which makes it kind of a Rosetta Stone for the two languages.

Then, I was a technical editor for Learning Scala. That was a great book! Jason did a great job making a really complex language very approachable. This wasn't the first book I've read on Scala, but it's certainly one of my favorites!

Finally, I took Martin Odersky's class on Coursera, Functional Programming Principles in Scala. I really liked that as well because it gave me a much stronger grasp of functional programming fundamentals. Furthermore, I participated in a study group at Twitter, which made the whole experience much more rewarding.

My hope is that I'll get to the point where I can teach a few Scala classes at work. Hence, I've been working my butt off two hours a day on BART and a couple hours on Saturdays in order to get up to speed with Scala. I still haven't written anything large, but at least I'm feeling a lot more comfortable with it these days.


Wednesday, June 11, 2014

Python: Custom App Labels in Django

Django has had a long-standing missing feature that made it impossible to give your apps friendly names. See #3591 and #14251. Thankfully, this will be fixed in Django 1.7, but switching to the latest version of Django is not an option for me right now. This was making my admin look really ugly because I have app names such as "mps" and "budget_cost_data". Those would show up in random places in the admin as "Mps", "Budget_cost_data", "Budget cost data", etc. What I wanted was "MPS" and "Budget Cost Data".

There are many ways to try to solve this problem. You may be able to switch to Django Suit or Grappelli and then use localization to fix it. However, when I tried that approach with Django Suit, the breadcrumbs were still broken.

I finally found this snippet which got me started down a viable path. I grabbed a copy of the Django egg I was using, unzipped it, and copied django/contrib/admin/templates/admin to my project's templates directory. Then I started hacking. I created two filters:

from django import template
from django.conf import settings
from django.template.defaultfilters import stringfilter

register = template.Library()


@register.filter(name='custom_app_label')
@stringfilter
def custom_app_label(value):
    """This is used to change the Django app labels.

    See: https://snipt.net/chrisdpratt/overriding-app_label-in-django-admin/

    """
    # CUSTOM_APP_LABELS is a dict in settings, e.g. {'mps': 'MPS'}.
    return settings.CUSTOM_APP_LABELS.get(value.lower(), value)


@register.filter(name='custom_title')
@stringfilter
def custom_title(value):
    """This is used to change the <h1> in admin/base.html."""
    if value == "Site administration":
        return "Site Admin"
    if value.endswith(" administration"):
        try:
            # This only works for one-word app names; anything else falls
            # through and returns the original value.
            app, administration = value.split()
            app = custom_app_label(app.lower())
            return "%s Admin" % app
        except ValueError:
            pass
    return value

Then, I applied one of those two filters everywhere in the admin templates that I needed to. Make sure you remember to load your filter library at the top of each template you modify.
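
For example, near the top of a copied admin template you load the library and then pipe values through the filters. Something roughly like this, where admin_extras is whatever you named your filter module and the exact variables depend on which admin template you're editing:

{% load admin_extras %}

{{ app.name|custom_app_label }}
<h1>{{ title|custom_title }}</h1>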

You may want to give your models friendly names as well, such as:

class PlatformBudgetCostData(models.Model):
  ...
  class Meta(object):
    verbose_name = "Platform Budget Cost Data"
    verbose_name_plural = verbose_name

At the end of the day, I got something that wasn't as beautiful as, say, Django Suit, but at least the app names, titles, capitalization, etc. looked right.

Friday, June 06, 2014

A Blog Post on Excel, 4x4s, Moonshots, and General Purpose Software

Using Excel is like driving a 4x4. Writing custom software is like building an interstate freeway.

With a modest amount of training, you can do amazing things with a 4x4. You can go to places where there are no roads. If you don't like where you are, you can just as easily drive somewhere new. The downside is that when you're driving off-road, you generally have to drive pretty slowly, and you have to bring your own gas.

You do have to be a specialist to build an interstate freeway. In fact, it requires an amazing amount of work from a large number of people. Yet, once it's built, 100,000 people a day can fly by at 70 mph. It offers them tremendous convenience; they don't even need to bring their own gas! On the other hand, a freeway is only useful if it goes in the direction you need to go.

Hence, Excel offers tremendous flexibility, whereas custom software offers tremendous convenience. Excel is easy to get started with, but grows unwieldy as you scale the size of your models and teams. Custom software is fairly difficult to get started with but is easier to scale in terms of size and usage.

I have to interact with Excel a lot, and yet I've spent most of my career writing custom software. What amazes me about Excel (or, rather, spreadsheets in general) is that it's so amazingly general purpose. Think of how useful spreadsheets are to businesses all over the world. There are few pieces of software that are as general purpose or as widely useful as spreadsheets are.

To some degree, it seems to me that Silicon Valley is drunk with the power of being able to write custom software. Pretty much my whole career has been spent writing it. But what about writing new general purpose software like Excel? As far as I can tell, not many people get to work on general purpose software, and even fewer people get to create new types of general purpose software.

The first commercial spreadsheet was VisiCalc. What an amazing invention! Although VisiCalc was used by far fewer people than, say, Excel, it's notable because it represented a new "type" of general purpose software; it changed the world.

Now, think about all the other types of general purpose software. Most of it was invented a long time ago. Word processors, operating systems, web servers, databases, ERP systems, etc. were all first created decades ago. That's ancient history as far as computing goes! (Although, to be fair, all of these things are incredibly recent in terms of the history of civilization or the history of mankind.)

So, where are the new types of general purpose software? It's actually kind of hard to recognize them because they haven't been around long enough or copied enough times to be recognized as a "type" of software. However, I think that we're far enough along to recognize Facebook is an instance of the type "social network." Similarly, Twitter is a "microblog" (among other things). Both of these things have had a fairly worldwide impact.

Facebook and Twitter also illustrate another thing. You could say that both of them are really just custom software, whereas the truly general pieces of software that they're built on are web browsers, web servers, databases, programming languages, and operating systems. On the other hand, as I said above, they're also instances of new types of software: social networks and microblogs. Innovation is like that. An instance of one type of thing can become its own new type of thing. Innovation is always pushing toward greater specialization.

Now consider programming languages. We've seen a lot of evolutionary improvement, but a lot of the revolutionary ideas first happened decades ago. Think of subroutines, OOP, exceptions, processes, threads, coroutines, types, recursion, closures, compilers, interpreters, virtual machines, etc. These concepts have been around for decades. Sure, we have new programming languages that each have a different philosophy on how these things should work, but most of the really big ideas happened decades ago. I thought software transactional memory might be a counterexample since it seems fairly recent, but even that started about two decades ago.

Imagine, if you were to invent a fundamentally new replacement for subroutines, create a new way to package data and/or code together, or create a new form of concurrency, what would it look like?

Of course the flip side is that it's really hard to get people to understand and accept fundamentally new ideas. It took at least a decade or two before OOP was generally considered a useful technique, and that's still not universally accepted ;) The same is happening now for functional programming. And who knows where we'll go with concurrency in the future!

Truly new ideas just don't come along that often. It takes a while for a new instance of a thing to be recognized as a new type of thing. There aren't that many "moon shots" as Google likes to put it. And when they do come along, they often take a really long time before they gain widespread acceptance. Furthermore, the first mover often isn't the one who reaps the greatest rewards. Excel wasn't the first spreadsheet. Google wasn't the first search engine. Facebook wasn't the first social network.

I'm not really saying anything new. Kevin Kelly did a much better job explaining all of this in his wonderful book, What Technology Wants, but I'll leave you with a quote from another excellent book, The Myths of Innovation: "Don't worry about people stealing your ideas. If they're any good, you're going to have to cram them down people's throats!"

Saturday, May 31, 2014

Security: Scam Involving the "assoc" Command on Windows

My dad sent me the following:

Today I received a call from a Mark Atkison. He claims to be with Windows Technical Services, located in (or on) Brainbridge Island, Washington. Phone number 206-201-2413

Mark claims for the last two weeks my computer has been downloading online infections, junk files and miscellaneous viruses. I asked him about my “online ID number” Mark said my “customer license Security Identification number is: 888DCA60-FC0A-11CF-8F0F-[deleted]“. Mark said I could verify this by pressing the Windows key and r at the same time.... That would open a “run box” When the run box opens I was to type ASSOC. When I hit the Windows key + r, I saw a box open with “cmd”... which I figured stands for “command”. If I remember correctly, I erased the “cmd”. I was to type ASSOC. When I did, I saw something come up with “exe”. By the way, when I typed in ASSOC, I would not hit enter. I thought this might be some kink of scam or bull shit. I told Mark I was going to contact my son who is a high end programmer. Mark said I could call him back at the number listed above and refer to, “Docket number Yash 120695”. Mark told me they will show me the error and warning reports they have been receiving from my computer or lap top operating system.

This evening, I looked up Brainbridge Island, Washington... I found there was no Brainbridge Island, Washington. There was however a Bainbridge Island, Washington (no “r”). Did I make a mistake? I'm not so sure I did. I had him spell out everything. I did a Google search for the phone number he gave me.... I found the following:

Match Found! We found phone number (206)201-2413

See Full Results

Received a call from (206)201-2413? View the comments below or add a comment of your own for 2062012413. Remember to not reveal personal information. Tell us about 206-201-2413. What time did they call and what was the call about?

Anonymous Monday, 19 May, 2014 15:19
Yes this is a scam call, beware do NOT install anything on your computer. They will records all your personal info

Anonymous Monday, 19 May, 2014 15:12
was this a scam call???

Anonymous Friday, 16 May, 2014 16:00
They told me windows was receiving a virus report on from my computer.

I think my instincts were good and your assessments were right on. Needless to say, I will not be calling Mark.

Best wishes to you and yours,
Dad

Apparently, the assoc command in Windows can be used to change file associations. The attacker could use this to convince you to treat .txt files as .exe files. Then, he could give you an executable that has a .txt extension. You would think it was safe, but when you opened it, it would run the executable, thereby taking over your computer.

At least, that's what I think is going on. I'm not 100% sure. It kind of seems like a lot of work for the attacker since it involves him calling people manually.

Friday, May 30, 2014

Being Turing Complete Ain't All That and a Bag of Chips

I was talking to someone the other day. He said that given two Turing Complete programming languages, A and B, if you can write a program in A, you can write a similar program in B. Is that true? I suspect not.

I never took a class on computability theory, but I suspect it only works for a limited subset of programs--ones that only require the features provided by a Turing machine. Let me provide a counterexample. Let's suppose that language A has networking APIs and language B doesn't. Nor does language B have any way to access networking APIs. It's entirely possible for language B to be Turing Complete without actually providing such APIs. In such a case, you can write a program in language A that you can't write in language B.

Of course, I could be completely wrong because I don't even understand the definitions fully. Like I said, I've never studied computability theory.

Friday, May 16, 2014

Raspberry Pi: Building an LED Digital Clock

As I mentioned in a previous post, I really enjoyed reading Programming Raspberry Pi: Getting Started with Python. One of the chapters in the book teaches you how to build an LED digital clock. It took some futzing around, but I finally got it done :)

The first problem I had was that I didn't know how to solder. My buddy Chris Dudte gave me a kit to learn. I watched a bunch of YouTube videos with the kids in my Raspberry Pi class, and then we put the circuit board together. Problem solved.

The next two problems I encountered were with i2c7segment, the author's library that talks to the LED display over SMBus. One of them resulted in my saying quite a few less than charitable words under my breath: the Python code kept giving me the error message "IOError: [Errno 5] Input/output error".

I finally figured it out. On line 42 of i2c7segment.py, the code is hardcoded to use smbus.SMBus(0). However, sometimes you need to use smbus.SMBus(1). You can run "sudo i2cdetect -y 0" and "sudo i2cdetect -y 1" to figure out which bus to use. You should see "70" in one of these two. The hardest part of figuring out this problem was that I thought the software must be correct, and that I must have wired it or soldered it wrong. My guess is that this might be a Raspberry Pi model A vs. model B thing.

To hack around the problem, I edited i2c7segment.py to use bus number 1. A better approach would be to allow the calling code to pass a bus number or to try to autoselect it; however, I didn't bother.
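
For what it's worth, here's a minimal sketch of the "pass in a bus number" idea. The class and argument names are made up, not the real i2c7segment API:

import smbus

class SevenSegmentDisplay(object):
    def __init__(self, address=0x70, bus_number=1):
        # Bus 1 on rev 2 / model B boards; bus 0 on the original rev 1 boards.
        self.address = address
        self.bus = smbus.SMBus(bus_number)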

Anyway, I'm super excited that I got the project done, especially considering this is the first time I've ever done something like this. Special thanks to my buddy Chris Dudte for all his support and for giving me a Raspberry Pi in the first place!

Wednesday, May 14, 2014

Best Practices for Software Engineers

As I mentioned in my last blog post, I'm going to be giving my "Best Practices for Software Engineers" talk at both the East Bay Ruby Meetup and at BayPIGgies (the Bay Area Python Interest Group). We're planning on broadcasting the BayPIGgies meeting using a Google+ Hangout on Air. If you're interested, here's the event, and here's the direct YouTube link.

Thanks to @nicholsonjf for setting this up!

Friday, May 02, 2014

Best Practices for Software Engineers

I'm going to be giving my talk "Best Practices for Software Engineers" at two different user groups in May: the East Bay Ruby Meetup and BayPIGgies (the Bay Area Python Interest Group).

Here's the abstract:

Being a software engineer requires a lot more than knowing how to write good code.

This class covers a wide variety of topics such as making code reviews useful and effective, how to deal with team conflicts, networking in real life, and planning for your career. The goal is to help you not only be a solid asset for your team, but also to be the type of software engineer that others really enjoy working with.

I hope to see some of you there!

Tuesday, April 29, 2014

PyCon Notes: Introduction to SQLAlchemy

At PyCon, Mike Bayer, the author of SQLAlchemy, gave a three hour tutorial on it. Here's the video. What follows are my notes.

He used something called sliderepl. sliderepl is a nice ASCII tool that's a mix of slides and a REPL. You can flip through his source code / slides in the terminal. It's actually pretty neat.

At the lowest level, SQLAlchemy sends strings to the database and interprets the responses.

He started the project in 2005, so he's been working on it for nearly 10 years. He's finally about to hit 1.0.

SQLAlchemy forces you to be aware of transactions.

Isolation models have to do with how ongoing transactions see ongoing work amongst each other.

Goals:

    Provide helpers, etc.

    Provide a fully featured facade over the Python DBAPI.

    Provide an industrial strength, but optional, ORM.

    Act as a base for inhouse tools.

Philosophies:

    Make the usage of different DBs and adaptors as consistent as possible.

    But still expose unique features in each backend.

    It's not realistic for an ORM to perfectly match any DB. He mentioned
    leaky abstractions.

    Don't hide the DB. You must continue to think in SQL.

    Provide automation and DRY.

    Allow expression using declarative patterns.

Here's the stack:

    SQLAlchemy ORM
    SQLAlchemy Core:
        Schema/Types.
        SQL expression language.
        Engine:
            Connection pooling.
            Dialects.
    DBAPI
        There are different libraries for different DBs.
    Database

Reddit and Dropbox use SQLAlchemy Core without the ORM.

An "Engine" is a registry which provides connectivity to a particular DB server.

A "Dialect" interprets generic SQL and database commands in terms of a specific DBAPI and DB backend.

A "Connection Pool" holds a collection of DB connections in memory for fast reuse. It also handles reconnecting if the connection drops.

The "SQL Expression Language" allows SQL to be written using Python expressions.

"Schema/Types" uses Python objects to represent tables, columns, and datatypes.

The ORM:

    Allows construction of Python objects which can be mapped to relational DB
    tables.

    Transparently persists objects into their corresponding DB tables using
    the "unit of work" pattern.

    Provides a query system which loads objects and attributes using SQL
    generated from mappings.

    Builds on top of the Core.

    Presents a slightly more object centric perspective as opposed to a schema
    centric perspective.

SQLAlchemy is like an onion:

    ORM:
        SQL Expressions:
            Table Metadata, Reflection, DDL:
                Engine, Connection, Transactions

Level 1: Engine, Connection, Transactions

The DBAPI is the Python Database API. It's the de facto system for providing Python DB interfaces.

Most DBs have more than one DBAPI implementation.

MySQL has more than 10, and SQLAlchemy has to support about 6 of them.

They vary wildly.

He showed an example of using the DBAPI directly.

His favorite DBAPI implementation is psycopg2 (for PostgreSQL).

The DBAPI assumes that a transaction is always in progress. There is no begin() method, only commit() and rollback().

The DBAPI encourages bound parameters, via the execute() and executemany() methods, but it has six different formats.

All DBAPIs have inconsistencies regarding lots of things.

The DBAPI has its own exception hierarchy.

The first layer in SQLAlchemy is known as the "Engine", which is the object that maintains the classical DBAPI interaction.

SQLAlchemy has a lazy initialization pattern. It does a lot of lazy stuff.

An engine is a "factory" for connections.

    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///some.db")
    result = engine.execute("somequery ...=:empid", empid=...)
    row = result.fetchone()

At this level of the API, the syntax for quoting variables (:empid) is whatever the DBAPI expects.

row is a tuple as well as a dict.
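
In other words, you can access a row either way (the column name here is just illustrative):

    row[0]           # by position, like a tuple
    row['empid']     # by column name, like a dict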

You can loop over the result:

    for row in result:

The actual cursor is result.cursor.

Under the covers, result.cursor.description has the names of the fields.

There's also result.fetchall().

You can control the scope of the connection using connect():

    conn = engine.connect()
    result = conn.execute(query)
    result.fetchall()
    conn.close()

Transactions:

    trans = conn.begin()
    trans.commit()
    conn.close()

If you don't use a transaction explicitly, it'll use autocommit: each statement gets wrapped in its own transaction.

The DBAPI doesn't use autocommit by default.

Using with:

    with engine.begin() as conn:
        conn.execute(query)

Turning on debugging:

    engine.echo = True

Connecting directly via the engine is called "connectionless execution". The engine connects and disconnects for us.

Using a Connection explicitly lets you control scope.

There's a connection pool.

Whatever you give to engine.execute() gets passed to the DBAPI directly.

Level 2: Table Metadata, Reflection, DDL

He was inspired by "Patterns of Enterprise Application Architecture" when he wrote the SQL generation code and the ORM.

You can use SQLAlchemy to generate a schema.

You can also have SQLAlchemy use reflection to load an existing schema.

    from sqlalchemy import MetaData, Table, Column, \
        Integer, String
    metadata = MetaData()
    user_table = Table('user', metadata,
        Column('id', Integer, primary_key=True),
        ...
    )
    user_table.name

String is a varchar.

user_table.c has all the columns:

    user_table.c.name.type

Using it:

    user_table.select().where(user_table.c.fullname == 'asdf')

Creating tables:

    metadata.create_all(engine)

Types:

    String(50)
    DateTime
    Numeric(10, 2)
    Enum('a', 'b', 'c')
    ...

Constraints:

    from sqlalchemy import ForeignKey
    ...
    Column('user_id', Integer, ForeignKey('user.id'))

The references are lazy, so the 'user' table can be created later.

Composite foreign keys:

    ...
    Column('story_id', Integer),
    Column('version_id', Integer),
    ForeignKeyConstraint(
        ['story_id', 'version_id'],
        ['story.story_id', 'story.version_id']
    )

There's a special setting to tell it to add foreign key dependencies later. Use this if you have mutually dependent tables.
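
I believe the setting in question is use_alter, which emits the foreign key as a separate ALTER TABLE after both tables exist (the constraint name below is made up):

    ForeignKeyConstraint(
        ['story_id', 'version_id'],
        ['story.story_id', 'story.version_id'],
        use_alter=True, name='fk_story_version'
    )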

Nullable is True by default. Here's how to make it False:

    nullable=False

You can combine lines and leave out the type for foreign keys:

    Column('owner_id', ForeignKey('user.id'))

Reflection:

    metadata2 = MetaData()
    user_reflected = Table('user', metadata2, autoload=True,
                           autoload_with=engine)
    user_reflected.c

It took a really long time for him to learn how to meet everyone's needs.

Don't use "bound metadata". It's an antipattern.

Another system (not using metadata):

    from sqlalchemy import inspect
    inspector = inspect(engine)
    inspector.get_table_names()
    inspector.get_columns('address')
    inspector.get_foreign_keys('address')

Level 3: Types and Querying

Types:

    Integer
    String
    Unicode
    Boolean
    DateTime
    Float
    Numeric (a decimal)

Create and drop:

    metadata.create_all()
    table.create()
    metadata.drop_all()
    table.drop()

Columns are full-blown objects. They have magic methods like __eq__, etc.

    user_table.c.username.__eq__

This returns a BinaryExpression to be used as part of an expression:

    user_table.c.username == 'ed'

Or:

    ((user_table.c.username == 'ed') |
     (user_table.c.username == 'jack'))

There's also and_ and or_.

There's also >, etc.

There's also == None. It translates to "is NULL".

There are lots of operators.

    user_table.c.something.in_(...)

Execute:

    engine.execute(
        user_table.select().where(user_table.c.username == 'ed')
    )

Dialects:

    from sqlalchemy.dialects import postgresql
    expression.compile(dialect=postgresql.dialect())

Inserts:

    insert_stmt = user_table.insert().values(
        username='ed',
        fullname='Ed Jones'
    )
    conn = engine.connect()
    result = conn.execute(insert_stmt)

Inserting many:

    conn.execute(user_table.insert(), [
        {'username': 'jack', ...},
    ])

Select:

    conn.execute(
        select([user_table.c.username, user_table.c.fullname])
            .where(user_table.c.username == 'ed'))

Select all:

    conn.execute(select([user_table]).where....)

If you use where multiple times, it ands them together.
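
For example, both conditions below end up in a single AND-ed WHERE clause:

    stmt = (select([user_table])
            .where(user_table.c.username == 'ed')
            .where(user_table.c.fullname == 'Ed Jones'))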

The result for an insert will have the primary key.
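
If I remember right, it's exposed as inserted_primary_key (a list, since primary keys can be composite):

    result = conn.execute(insert_stmt)
    result.inserted_primary_key    # e.g. [1]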

Looking at a statement:

    stmt = ...
    print stmt
    conn.execute(stmt)

Look at the rowcount on the result to see how many rows were affected.

Level 4: ORM

The OO classes are called a "domain model".

The most basic level is:

    object.save()
    class.load()

Some ORMs can represent multiple rows as domain objects.

Most ORMs can do composition using foreign key associations.

SQLAlchemy can represent class inheritance hierarchies.

SQLAlchemy can handle sharding.

Flavors of ORMs:

    Active Record: Domain objects handle their own persistence.

    Data Mapper: It tries to keep the details of persistence separate from the
    object being persisted.

There's also different ways of configuring them. Most use an "all-at-once", or declarative style, where class and table information is together.

Another style is to have the class declaration and the ORM mapper configured separately.

Hibernate is more pure, but it's tedious and verbose.

That's how SQLAlchemy worked in the beginning, but then he realized it was too verbose.

The SQLAlchemy ORM is a data mapper on the bottom, but it has a declarative style on top.

.save() is lazy. It doesn't flush immediately.

Objects are identical if they have the same primary key.

SQLAlchemy uses lazy loading.

SQLAlchemy is not compatible with Tornado and Twisted because they're callback oriented, and the lazy loading conflicts with that.

It also supports eager loading.

It also has method chaining.

Using the ORM:

    from sqlalchemy.ext.declarative import declarative_base
    Base = declarative_base()

    class User(Base):
        __tablename__ = 'user'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        ...

        def __repr__(self):
          return "" % (self.name, self.fullname)

    User.__table__

The Mapper links a class to a table:

    User.__mapper__

Other stuff:

    Base.metadata
    Base._decl_class_registry

Creating a user:

    ed_user = User(name='ed', ...)

Setting stuff up:

    engine = ...
    session = Session(bind=engine)

The session is an ORM object.

    session.add(ed_user)

Query:

    our_user = session.query(User).filter_by(name='ed').first()

Instead of where, you use filter_by.

As soon as the session begins its work, it starts a transaction.

It keeps the identity map, so "is" works for ORM objects. Generally, other Active Record ORMs get this wrong.
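
A quick sketch of what the identity map buys you:

    u1 = session.query(User).filter_by(name='ed').first()
    u2 = session.query(User).filter_by(name='ed').first()
    u1 is u2    # True -- both names point at the same object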

Add all:

    session.add_all([
        User(...), ...
    ])

session.dirty has state changes.

session.new has the new objects.

session.commit() flushes everything and commits the transaction.

session.flush() flushes changes to the database without doing a commit.
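
Putting those pieces together, a tiny sketch:

    ed_user.name = 'edward'
    session.dirty       # ed_user shows up here as modified
    session.new         # anything added via add()/add_all() but not yet flushed
    session.flush()     # emits the SQL
    session.commit()    # flushes anything pending and commits the transaction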

After a commit, there is no transaction. All data is invalidated. Accessing data will start a new transaction and re-load the data from the database.

He learned stuff from Storm and Hibernate about how to do things this way.

Expiring after the commit is the right thing to do.

session.rollback() also ends the transaction.

The object is a proxy to a row in the database. You only know about stuff in the database when there's a transaction in play.

One of the challenges with SQLAlchemy is performance, especially when setting a ton of attributes.

The core will do whatever you want. The ORM is more opinionated. It's clearer about how things should be done.

He has to be really careful about adding features. A lot of times, he's had to remove features that were not well thought out.

Here's a fuller example:

    class Network(Base):
        __tablename__ = 'network'
        network_id = Column(Integer, primary_key=True)
        name = Column(String(100), nullable=False)

    Base.metadata.create_all(engine)
    session
    session.add_all([Network(...), ...])

nullable is True by default.

Inner workings:

    User.name.property.columns[0]
    User.__table__
    User.__table__.c.name == 'ed'

But, you can write higher-level code such as:

    User.name == 'ed'

Query:

    query = session.query(User).filter(
        User.name == 'ed'
    ).order_by(User.id)
    query.all()

    for name, fullname in session.query(
        User.name, User.fullname
    ):
        ...

Create a dict really easily:

    d = dict(session.query(User.name, User))

Using order, limit, offset:

    session.query(User).order_by(...)[1:3]

filter takes full-blown expressions:

    .filter(User.something == "something")

filter_by takes keyword arguments naming specific fields:

    .filter_by(something="something")

If you use multiple filters in a chain, they are anded.

.first() limits things to one row.

.one() asserts that there is one and only one row. This may raise NoResultFound or MultipleResultsFound.

Examples:

    q = session.query(User.fullname).order_by(User.fullname)
    q.all()

    q2 = q.filter(or_(User.name == 'mary', User.name == 'ed'))
    print q2[1]

Advanced ORM Usage

Start with the same mappings as above.

What he showed above is a little verbose. There are shortcuts.

Unlike other ORMs, SQLAlchemy doesn't assume you're going to have a primary key named id. He was trying to make it appealing for existing databases.

It doesn't decide what the database should look like. This makes it really explicit.

There are ways to enforce conventions using mixins.

    class Address(Base):

        ...
        user_id = Column(Integer, ForeignKey('user.id'))

        # This gives you address.user and user.addresses.
        user = relationship("User", backref="addresses")

This stuff works with reflection too.

    jack.addresses = [
        Address(...),
        ...
    ]

You can do anything to jack.addresses that you can do with a list.

If you run session.add(jack), it adds everything else automatically.

This is where ORMs can save you a lot of time. They make persisting stuff much easier.

The collection stays in memory until the transaction ends.

There is a way to use eager loading to avoid the N+1 queries problem.

As soon as you begin working with the session, it starts a new transaction.

The only time you need to use session.begin() is if you explicitly turned on autocommit.

If you don't care about the transaction, just .close() it. Then the connection pool will do whatever it does.

When he does a POST in a web app, he'll do an explicit commit().

SQLAlchemy uses a connection pool by default.

Collections and references are updated by manipulating objects, not primary / foreign key values.
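
In other words, you assign objects and let the ORM fill in the key columns (the email address is made up):

    address = Address(email_address='jack@example.com')
    address.user = jack    # address.user_id gets populated at flush time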

An implicit join:

    session.query(User, Address).filter(
        User.id == Address.user_id
    ).all()

Without the filter, you get the cartesian product.

Here's an explicit join:

    session.query(User, Address).join(
        Address, User.id == Address.user_id
    ).all()

Using the relationship:

    session.query(User, Address).join(User.addresses).all()

Self join:

    from sqlalchemy.orm import aliased

    a1, a2 = aliased(Address), aliased(Address)
    session.query(User).join(a1).join(a2).filter(
        a1.email_address == '...'
    ).filter(a2.email_address == '...').all()

There's a group_by.

Add .subquery() to the end to make it a subquery:

    subq = session.query(
        func.count(Address.id).label('count'),
        User.id.label('user_id')
    ).join(Address.user).group_by(User.id).subquery()

    ...outerjoin(subq, User.id == subq.c.user_id)...

Here's how to avoid the N+1 problem:

    for user in session.query(User).options(
        subqueryload(User.addresses)
    ):
        print(user, user.addresses)

There's a really good chapter in the docs that covers this stuff.

Use Alembic for migrations.

People used to use Elixir on top of SQLAlchemy, but Elixir isn't being maintained anymore. Everything you can do in Elixir can be done using Declarative in SQLAlchemy.

Wednesday, April 23, 2014

PyCon Notes: PostgreSQL Proficiency for Python People

In summary, this tutorial was fantastic! I learned more in three hours than I would have learned if I had read a whole book!

Here's the video. Here are the slides. Here are my notes:

Christoph Pettus was the speaker. He's from PostgreSQL Experts.

PostgreSQL is a rich environment.

It's fully ACID compliant.

It has the richest set of features of any modern, production RDBMS. It has even more features than Oracle.

PostgreSQL focuses on quality, security, and spec compliance.

It's capable of very high performance: tens of thousands of transactions per second, petabyte-sized data sets, etc.

To install it, just use your package management system (apt, yum, etc.). Those systems will usually take care of initialization.

There are many options for OS X. Heroku even built a Postgres.app that runs more like a foreground app.

A "cluster" is a single PostgreSQL server (which can manage multiple databases).

initdb creates the basic file structure. It has to be run before PostgreSQL can be up and running.

To create a database:

sudo su - postgres
psql

create database this_new_database;

To drop a database:

drop database this_new_database;

Debian runs initdb for you. Red Hat does not.

Debian has a cluster management system. Use it. See, for instance, pg_createcluster.

Always create databases as UTF-8. Once you've created it, you can't change it.

Don't use SQL_ASCII. It's a nightmare. Don't use "C locale".

pg_ctl is a built-in command to start and stop PostgreSQL:

cd POSTGRES_DIRECTORY
pg_ctl -D . start

Usually, pg_ctl is wrapped by something provided by your platform.

On Ubuntu, start PostgreSQL via:

service postgresql start

Always use "-m fast" when stopping.

Postgres puts its own data in a top-level directory. Let's call it $PGDATA.

Don't monkey around with that data.

pg_clog and pg_xlog are important. Don't mess with them.

On most systems, configuration lives in $PGDATA.

postgresql.conf contains server configuration.

pg_hba.conf contains authentication settings.

postgresql.conf can feel very overwhelming.

Avoid making a lot of changes to postgresql.conf. Instead, add the following to it:

include "postgresql.conf.include"

Then, mess with "postgresql.conf.include".

The important parameters fall into these categories: logging, memory, checkpoints, and the planner.

Logging:

Be generous with logging. It has a very low impact on the system. It's your best source of info for diagnosing problems.

You can log to syslog or log CSV to files. He showed his typical logging configuration.

He showed his guidelines / heuristics for all the settings, including how to finetune things. They're really good! See his slides.

As of version 9.3, you don't need to tweak Linux kernel parameters anymore.

Do not mess with fsync or  synchronous_commit.

Most settings require a server reload to take effect. Some things require a server restart. Some can be set on a per-session basis. Here's how to do that. This is also an example of how to use a transaction:

begin;
set local random_page_cost = 2.5;
show random_page_cost;
abort;

PostgreSQL has users and roles. Roles are like groups. They form a hierarchy.

A user is just a role with login privs.

Don't use the "postgres" superuser for anything application-related.

Sadly, you probably will have to grant schema-modification privs to your app user if you use migrations, but if you don't have to, don't.

By default, DB traffic is not encrypted. Turn on SSL if you are running in a cloud provider.

In pg_hba.conf, "trust" means if they can log into the server, they can access Postgres too. "peer" means they can have a Postgres user that matches their username. "md5" is an md5 hash password.

It's a good idea to restrict the IP addresses allowed to talk to the server fairly tightly.

The WAL

The Write-Ahead Log is key to many Postgres operations. It's the basis for replication, crash recovery, etc.

When each transaction is committed, it is logged to the write-ahead log.

Changes in the transaction are flushed to disk.

If the system crashes, the WAL is "replayed" to bring the DB to a consistent state.

It's a continuous record of changes since the last checkpoint.

The WAL is stored in 16MB segments in the pg_xlog directory.

Never delete anything from pg_xlog.

archive_command is a way to move the WAL segments to someplace safe (like a
different system).

By default, synchronous_commit is on, which means that commits do not return until the WAL flush is done. If you turn it off, they'll return when the WAL flush is queued. You might lose transactions in the case of a crash, but there's no risk of database corruption.

Backup and Recovery

Experience has shown that 20% of the time, your EBS volumes will not reattach when you reboot in AWS.

pg_dump is a built-in dump/restore tool.

It takes a logical snapshot of the database.

It doesn't lock the database or prevent writes to disk.

pg_restore restores the database. It's not fast.

It's great for simple backups but not suitable for fast recovery from major failures.

pgbench is the built-in benchmarking tool.

pg_dump -Fc --verbose example > example.dump

Without the -Fc, it dumps SQL commands instead of its custom format.

pg_restore --dbname=example_restored --verbose example.dump

pg_restore takes a long time because it has to recreate indexes.

pg_dumpall --globals-only

Back up each database with pg_dump using --format=custom.

To do a parallel restore, use --jobs=.

If you have a large database, pg_dump may not be appropriate.

A disk snapshot + every WAL segment is enough to recreate the database.

To start a PITR (point in time recovery) backup:

select pg_start_backup(...);

Copy the disk image and any WAL files that are created.

select pg_stop_backup();

Make sure you have all the WAL segments.

The disk image + all the WAL segments are enough to create the DB.

See also github.com/wal-e/wal-e. It's highly recommended.

It automates backups to S3.

He explained how to do a PITR.

With PITR, you can rollback to a particular point in time. You don't have to replay everything.

This is super handy for application failures.

RDS is something that scripts all this stuff for you.

Replication

Send the WAL to another server.

Keep the server up to date with the primary server.

That's how PostgreSQL replication works.

The old way was called "WAL Archiving". Each 16MB segment was sent to the secondary when complete. Use rsync, WAL-E, etc., not scp.

The new way is Streaming Replication.

The secondary gets changes as they happen.

It's all setup via recovery.conf in your $PGDATA.

He showed a recovery.conf for a secondary machine, and showed how to let it become the master.

Always have a disaster recovery strategy.

pg_basebackup is a utility for doing a snapshot of a running server. It's the easiest way to take a snapshot to start a new secondary. It's also useful for archival backups. It's not the fastest thing, but it's pretty foolproof.

Replication:

The good:

Easy to setup.

Schema changes are replicated.

Secondaries can handle read-only queries for load balancing.

It either works or it complains loudly.

The bad:

You get the entire DB cluster or none of it.

No writes of any kind to the secondary, not even temporary tables.

Some things aren't replicated like temporary tables and unlogged tables.

His advice is to start with WAL-E. The README tells you everything. It fixes a ton of problems.

The biggest problem with WAL-E is that writing to S3 can be slow.

Another way to do funky things is trigger-based replication. There's a bunch of third-party packages to do this.

Bucardo is one that lets you do multi-master setups.

However, they're fiddly and complex to set up. They can also fail quietly.

Transactions, MVCC, and Vacuum

BEGIN;
INSERT ...;
INSERT ...;
COMMIT;

By the way, no bank works this way ;)

Everything runs inside of a transaction.

If there is no explicit transaction, each statement is wrapped in one for you.

Everything that modifies the database is transactional, even schema changes.

\d shows you all your tables.

With a transaction, you can even rollback a table drop.

South (the Django migration tool) runs the whole migration in a single transaction.

Many resources are held until the end of a transaction. Keep your transactions brief and to the point.

Beware of "IDLE IN TRANSACTION" sessions. This is a problem for Django apps.

A tuple in Postgres is the same thing as a row.

Postgres uses Multi-Version Concurrency Control. Each transaction sees its own version of the database.

Writers only block writers to the same tuple. Nothing else causes blocking.

Postgres will not allow two snapshots to "fork" the database. If two people try to write to the same tuple, Postgres will block one of them.

There are higher isolation modes. His description of them was really interesting.

He suggested that new apps use SERIALIZABLE. This will help you find the concurrency errors in your app.

Deleted tuples are not usually immediately freed.

Vacuum's primary job is to scavenge tuples that are no longer visible to any transaction.

autovacuum generally handles this problem for you without intervention (since version 8).

Run analyze after a major database change to help the planner out.

If someone tells you "vacuum's not working", they're probably wrong.

The DB generally stabilizes at 20% to 50% bloat. That's acceptable.

The problem might be that there are long-running transactions or idle-in-transaction sessions. They'll block vacuuming. So will manual table locking.

He talked about vacuum issues for rare situations.

Schema Design

Normalization is important, but don't obsess about it.

Pick "entities". Make sure that no entity-level info gets pushed into the subsidiary items.

Pick a naming scheme and stick with it.

Plural or singular? DB people tend to like plural. ORMs tend to like singular.

You probably want lower_case to avoid quoting.

Calculated denormalization can sometimes be useful; copied denormalization is almost never useful.

Joins are good.

PostgreSQL executes joins very efficiently. Don't be afraid of them.

Don't worry about large tables joined with small tables.

Use the typing system. It has a rich set of types.

Use domains to create custom types.

A domain is a core type + a constraint.

Don't use polymorphic fields (fields whose interpretation is dependent on another field).

Don't use strings to store multiple types.

Use constraints. They're cheap and fast.

You can create constraints across multiple columns.

Avoid Entity-Attribute-Value schemas. They cause great pain. They're very inefficient. They make reports very difficult.

Consider using UUIDs instead of serials as synthetic keys.

The problem with serials for keys is that merging tables can be hard.

Don't have generic "Thing" or "Object" tables.

If a table has a few frequently-updated fields and a few slowly-updated fields, consider splitting the table. Split the fast-moving stuff out into a separate 1-to-1 table.

Arrays are a first-class type in PostgreSQL. It's a good substitute for using a subsidiary table.

A list of tags is a good fit for arrays.

He talked about hstore. It's much better than Entity-Attribute-Value. It's great for optional, variable attributes. It's like a hash. It can be indexed, searched, etc. It lets you add attributes to tables for users. Don't use it as a way to avoid all table modifications.

json is now a built in type.

There's also jsonb.

Avoid indexes on big things, like 10k character strings.

NULL is a total pain in the neck.

Only use it to mean "missing value".

Never use it to represent a meaningful value.

Let's call anything 1MB or more a "very large object". Store them in files. Store the metadata in the database. The database API is just not a good fit for this.

Many-to-many tables can get extremely large. Consider replacing them with array fields (either one way or both directions). You can use a trigger to maintain integrity.

You don't want more than about 250k entries in an array.

Use UTF-8. Period.

Always use TIMESTAMPTZ (which Django uses by default). Don't use TIMESTAMP. TIMESTAMPTZ is a timestamp converted to UTC.

Index types:

B-Tree

Use a B-Tree on a column if you frequently query on that column,
use one of the comparison operators, only get back 10-15% of the rows,
and run that query frequently.

It won't use the index if you're going to get back more than 15% of the rows because it's faster to scan the table than to scan the index.

Use a partial index if you can ignore most of the rows.

The entire tuple has to be copied into the index.

GiST

It's a framework to create indexes.

KNN indexes are the K-nearest neighbors.

GIN

Generalized inverted index. Used for full-text search.

The others either are not good or very specific.

Why isn't it using my index?

Use explain analyze to look at the query.

If it thinks it's going to require most of the rows, it'll do a table scan.

If it's wrong, use analyze to update the planner stats.

Sometimes, it can't use the index.

Two ways to create an index:

create index

create index concurrently

reindex rebuilds an index from scratch.

pg_stat_user_indexes tells you about how your indexes are being used.

What do you do if a query is slow:

Use explain or explain analyze.

explain doesn't actually run the query.

"Cost" is measured in arbitrary units. Traditionally, they have been "disk fetches". Costs are inclusive of subnodes.

I think explain analyze actually runs the query.

Things that are bad:

Joins between 2 large tables.

Cross joins (cartesian products). These often happen by accident.

Sequential scans on large tables.

select count(*) is slow because it results in a full table scan since you
have to see if the tuples are alive or dead.

offset / limit. These actually run the query and then throw away that many
rows. Beware that GoogleBot is relentless. Use other keys.

If the database is slow:

Look at pg_stat_activity:

select * from pg_stat_activity;

tail -f the logs.

Too much I/O? iostat 5.

If the database isn't responding:

Try connecting with it using psql.

pg_stat_activity

pg_locks

Python Particulars

psycopg2 is the only real option in Python 2.

The result set of a query is loaded into client memory when the query completes. If there are a ton of rows, you could run out of memory. If you want to scroll through the results, use a "named" cursor. Be sure to dispose of it properly.
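
Here's a minimal sketch of a named (server-side) cursor in psycopg2; the connection string, table, and cursor name are made up:

import psycopg2

conn = psycopg2.connect("dbname=example")
cur = conn.cursor(name='big_query')    # giving the cursor a name makes it server-side
cur.execute("SELECT * FROM big_table")
for row in cur:                        # rows are fetched in batches, not all at once
    print row
cur.close()                            # dispose of it properly
conn.close()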

The Python 3 situation is not so great. There's py-postgresql. It's pure Python.

If you are using Django 1.6+, use the @atomic decorator.
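
For example (the view is hypothetical):

from django.db import transaction

@transaction.atomic
def update_budget(request):
    # Everything in here runs in a single transaction.
    ...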

Cluster all your writes into small transactions. Leave read operations outside.

Do all your writes at the very end of the view function.

Multi-database works very nicely with hot standby.

Point the writes at the primary, and the reads at the secondary.

For Django 1.5, use the @xact decorator.

Sloppy transaction management can cause the dreaded Django idle-in-transaction problem.

Use South for database migration. South is getting merged into Django in version 1.7 of Django.

You can use manual migrations for stuff the Django ORM can't specify.

Special Situations

Upgrade to 9.3.4. Upgrade minor versions promptly.

Major version upgrades require more planning. pg_upgrade has to be run when the database is not running.

A full pg_dump / pg_restore is always the safest, although not the most practical.

Always read the release notes.

All parts of a replication set must be upgraded at once (for major versions).

Use copy, not insert, for bulk loading data. psycopg2 has a nice interface. Do a vacuum afterwards.
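
A hedged sketch using psycopg2's copy_from, assuming a users(name, email) table and a tab-separated file (both made up):

import psycopg2

conn = psycopg2.connect("dbname=example")
cur = conn.cursor()
with open('users.tsv') as f:
    # copy_from expects tab-separated values by default.
    cur.copy_from(f, 'users', columns=('name', 'email'))
conn.commit()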

AWS

Instances can disappear and come back up without instance storage.

EBS can fail to reattach after reboot.

PIOPS are useful (but pricey) if you are using EBS.

Script everything: instance creation, PostgreSQL setup, etc. Use Salt. Use a VPC.

Scale up and down as required to meet load. If you're just using them to rent a server, it's really expensive.

PostgreSQL RDS is a managed database instance. Big plus: automatic failover! Big minus: you can't read from the secondary. It's expensive. It's a good place to start.

Sharding

Eventually, you'll run out of write capacity on your master.

postgres-xc is an open source fork of PostgreSQL.

Bucardo provides multi-master write capability.

He talked about custom sharding.

Instagram wrote a nice article about it.

Pooling

Opening a connection is expensive. Use a pooler.

pgbouncer is a pooler.

pgPool II can even do query analysis. However, it has higher overhead and is more complex to configure.

Tools

Monitor everything.

check_postgres.pl is a plugin to monitor PostgreSQL.

pgAdmin III and Navicat are nice clients.

pgbadger is for log analysis. So is pg_stat_statements.

Closing

MVCC works by each tuple having a range of transaction IDs that can see that
tuple.

Failover is annoying to do in the real world. People use HAProxy, some pooler, etc. with some scripting, or they have a human do the failover.

HandyRep is a server-based tool designed to allow you to manage a PostgreSQL "replication cluster", defined as a master and one or more replicas on the same network.

Dagger: A Dependency Injection Framework for Android and Java

Dagger is a new dependency injection framework for Android and Java. I went to a meetup yesterday to learn more about it. These are my notes:

The talk was by Jake Wharton who works at Square.

Every single app has some form of DI. You can do DI even if you're not using a library for doing it. The goal of DI is to separate the behavior of something from its required classes. If you've ever used a constructor to receive stuff, you've done a simple version of DI.

Square used Guice heavily.

Problems with Guice:
Config problems fail at runtime. 
Slow initialization, slow injection, and memory problems.

These are worse on Android. It causes the OS to load all the code for your app at once. This caused their app to take 2 seconds to start.

They called Dagger "Object Graph" initially.

Goals of Dagger:

Static analysis of all dependencies and injections. 
Fail as early as possible--compile time, not runtime.
Eliminate the need to do reflection of methods and annotations at runtime. Reflection in Dalvik is really slow.
Have negligible memory impact.

Jesse Wilson wrote it over the course of 5 weeks. He previously worked on Guice and Dalvik.

Square switched from Guice to Dagger in a fairly short period of time.

The name Dagger refers to "directed acyclic graph".

An ObjectGraph is the central dependency manager and injector.

@Module + @Provides

@Inject

@Singleton

Modules are meant to be composed together.

@Inject is required.

Field injection or constructor injection.

Dependencies can be stored in private final fields.

If you have @Inject, you don't have to say @Provides.

Injected fields cannot be private or final. They can be package protected.

Object graphs can be scoped. One object graph is a superset of another. For instance, you might create a new object graph once the user logs in that contains all of the objects that require a user object.

Android

The Android platform makes it really hard to test your apps. He showed how they deal with it.

The ObjectGraph is just another object. Hence, you can pass it around like a normal object.

There's one pain point in Dagger. All injection points must be listed on a module. This is used for aggressive static analysis.

Use overrides to facilitate testing.

He showed how to integrate with Gradle.

They have a "Debug Drawer" in their apps. It is hidden in the UI, but lets you configure all sorts of debug settings.

U+2020 is a sample app to show how to do all of this.

Using DI Incorrectly

Do NOT ignore the pattern.

Do NOT make every class use the pattern. Use it for the big things, such as the things that talk to a remote API.

Do NOT store dependencies as static fields.

Other Stuff

The Android docs say not to use DI. That advice is stale. Those complaints don't apply to Dagger.

Dagger is developer and debugger friendly.

The Future

Dagger has been out for 18 months.

They're working on the next major version, version 2.0.

Google is leading the development of the next version.

It won't use any reflection.

Dagger is not Android specific.

They're getting rid of injects lists.

Components encapsulate dependencies.

There will be dedicated annotations to denote scopes.

See squ.re/dagger2.

Questions

They use protocol buffers for their APIs. They use a schema for their APIs.

They have code that can generate a GraphViz file that shows you your dependency graph.

In their apps, they have a network module, an Android module, an app module, etc.

There's an IntelliJ plugin that lets you jump between @Inject and @Provides.

Friday, March 28, 2014

Books: Two Scoops of Django: Best Practices For Django 1.6

I just finished reading the book Two Scoops of Django: Best Practices For Django 1.6. I had already reviewed the previous edition, so I was anxious to see what had changed. In short, I loved it!

It's not an introduction, tutorial, or a reference for Django. In fact, it assumes you've already gone through the Django tutorial, and it occasionally refers you to the Django documentation. Rather, it tells you what you should and shouldn't do to use Django effectively. It's very prescriptive, and it has strong opinions. I've always enjoyed books like that. Best of all, it's only about 400 pages long, and it's very easy to read.

This edition is 100 pages longer than the previous edition, and I really enjoyed the new chapters. It has even more silly drawings and creamy ice cream analogies than the original, and even though I'm lactose intolerant, that made the book a lot of fun to read.

Having read the book cover-to-cover, even though I'm fairly new to Django, I kind of feel like I know what I'm doing at this point--at least a little ;) I was afraid when I started using Django that I wouldn't have that feeling until I had used it for a couple years.

So, if you're using Django, I highly recommend it!

Thursday, February 27, 2014

Python: A Response to Glyph's Blog Post on Concurrency

If you haven't seen it yet, Glyph wrote a great blog post on concurrency called Unyielding. This blog post is a response to that blog post.

Over the years, I've tried each of the approaches he talks about while working at various companies. I've known all of the arguments for years. However, I think his blog post is the best blog post I've read for the arguments he is trying to make. Nice job, Glyph!

In particular, I agree with his statements:

What I hope I’ve demonstrated is that if you agree with me that threading has problematic semantics, and is difficult to reason about, then there’s no particular advantage to using microthreads, beyond potentially optimizing your multithreaded code for a very specific I/O bound workload.
There are no shortcuts to making single-tasking code concurrent. It's just a hard problem, and some of that hard problem is reflected in the difficulty of typing a bunch of new concurrency-specific code.

In this blog post, I'm not really disputing his core message. Rather, I'm just pointing out some details and distinctions.

First of all, it threw me off when he mentioned JavaScript since JavaScript doesn't have threads. In the browser, it has web workers which are like processes, and in Node, it has a mix of callbacks, deferreds, and yield. However, reading his post a second time, all he said was that JavaScript had "global shared mutable state". He never said that it had threads.

The next thing I'd like to point out is that there are some real readability differences between the different approaches. Glyph did a good job of arguing that it's difficult to reason about concurrency when you use threads. However, if you ignore race conditions for a moment: I think it's certainly true that threads, explicit coroutines, and green threads are easier to read than callbacks and deferreds. That's because they let you write code in a more traditional, linear fashion. Even though I can do it, using callbacks and deferreds always causes my brain to hurt ;) Perhaps I just need more practice.

Another thing to note is that the type of application matters a lot when you need to address concurrency concerns. For instance, if you're building a UI, you don't want any computationally heavy work to be done on the UI thread. In Android, for example, you do as little CPU-heavy and IO-heavy work as possible on the UI thread, and instead push that work off onto other threads.

Other things to consider are whether your workload is IO bound or CPU bound, and whether your application is stateful or stateless.

Threads are fine if all of the following are true:

  • You're building a stateless web app.
  • You're IO bound.
  • All mutable data is stored in a per-request context object, in per-request instances, or in thread-local storage (see the sketch after this list).
  • You have no module-level or class-level mutable data.
  • You're not doing things like creating new classes or modules on the fly.
  • In general, threads don't interact with each other.
  • You keep your application state in a database.
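
Here's a minimal sketch of the thread-local storage idea, using the standard library's threading.local; the start_request and get_current_user helpers are hypothetical:

import threading

# Each thread sees its own independent copy of the attributes set on this object.
_request_context = threading.local()

def start_request(user):
  # Hypothetical hook called at the start of each request, on that request's thread.
  _request_context.user = user

def get_current_user():
  # Hypothetical helper: returns the user for the current thread's request, if any.
  return getattr(_request_context, "user", None)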

Sure, there's always going to be some global, shared, mutable data such as sys.modules, but in practice, Python itself protects that using the GIL.

I've built apps such as the above in a multithreaded way for years, and I've never run into any race conditions. The difference between this sort of app and the app that led to Glyph's "buggiest bug" is that he was writing a very stateful application server.

I'd also like to point out that it's important to not overlook the utility of UNIX processes. Everyone knows how useful the multiprocessing module is and that processes are the best approach in Python for dealing with CPU bound workloads (because you don't have to worry about the GIL).
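
For example, here's a minimal sketch of farming a CPU bound function out to a pool of worker processes using multiprocessing.Pool; expensive_computation is just a stand-in for real work:

import multiprocessing

def expensive_computation(n):
  # Stand-in for real CPU bound work.
  total = 0
  for i in range(n):
    total += i * i
  return total

if __name__ == "__main__":
  # One worker process per CPU by default; each process has its own GIL.
  pool = multiprocessing.Pool()
  results = pool.map(expensive_computation, [10 ** 6] * 8)
  pool.close()
  pool.join()
  print(results)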

However, using a pre-fork model is also a great way of building stateless web applications. If you have to handle a lot of requests, but you don't have to handle a very large number simultaneously, pre-forked processes are fine. The upside is that the code is both easy to read (because it doesn't use callbacks or deferreds), and it's easy to reason about (because you don't have the race conditions that threads have). Hence, a pre-fork model is great for programmer productivity. The downside is that each process can eat up a lot of memory. Of course, if your company makes it to the point where hardware efficiency costs start outweighing programmer efficiency costs, you have what I like to call a "nice to have problem". PHP and Ruby on Rails have both traditionally used a pre-fork approach.
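
For instance, assuming you have a WSGI app exposed as app in a module named myapp, a pre-fork server such as gunicorn will fork a fixed pool of worker processes for you:

# Fork 4 worker processes, each handling requests independently.
gunicorn --workers 4 myapp:app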

I'm also a huge fan of approaches such as Erlang that give you what is conceptually a process, without the overhead of real UNIX processes.

As Glyph hinted at, this is a really polarizing issue, and there really are no easy, perfect-in-every-way solutions. Any time concurrency is involved, there are always going to be some things you need to worry about regardless of which approach to concurrency you take. That's why we have things like databases, transactions, etc. It's really easy to fall into religious arguments about the best approach to concurrency at an application level. I think it's really helpful to be frank and honest about the pros and cons of each approach.

That being said, I do look forward to one day trying out Guido's asyncio module.


Thursday, January 30, 2014

Python: A lightning quick introduction to virtualenv, nose, mock, monkey patching, dependency injection, and doctest

pip

pip is a tool for installing Python packages. Python 2.7.9 and later include pip by default. If you don't have pip, you'll need to install it.
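
For instance, you can check whether pip is already available and, if it isn't, install it using the official get-pip.py bootstrap script:

# Check whether pip is already installed.
pip --version

# If it isn't, download and run the official bootstrap script.
curl -O https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py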

virtualenv

virtualenv is a tool for installing Python packages locally (i.e. local to a particular project) instead of globally. Here's how to get everything set up:

# Make sure you're using the version of Python and pip that you want to use.
sudo which python
sudo which pip

sudo pip install virtualenv

Now, let's setup a new project:

mkdir ~/Desktop/sfpythontesting
cd ~/Desktop/sfpythontesting
virtualenv env

# Do this anytime you want to work on the application.
. env/bin/activate

# Make sure that pip is running from within the env.
which pip

pip install nose
pip install mock
pip freeze > requirements.txt

# Now that you've created a requirements.txt, other people can just run:
# pip install -r requirements.txt
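
When you're done working on the project, you can leave the virtualenv:

deactivate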

nose

Nose is a popular Python testing library. It's simple and powerful.

Set up the project structure:

mkdir sfpythontesting
touch sfpythontesting/__init__.py
mkdir tests

Create a file, ~/Desktop/sfpythontesting/sfpythontesting/main.py, with the following:

import random

def sum(a, b):
  return a + b

Now, create another file, ~/Desktop/sfpythontesting/tests/test_main.py, with the following:

from nose.tools import assert_equal, assert_raises
import mock

from sfpythontesting import main

def test_sum():
  assert_equal(main.sum(1, 2), 3)

To run the tests:

nosetests --with-doctest

Testing a function that raises an exception

Add the following to main.py:

def raise_an_exception():
  raise ValueError("This is a ValueError")

And the following to test_main.py:

def test_raise_an_exception():
  with assert_raises(ValueError) as context:
    main.raise_an_exception()
  assert_equal(str(context.exception), "This is a ValueError")

Your tests should still be passing.

Monkeypatching

Sometimes there are parts of your code that are difficult to test because they involve randomness, they are time dependent, or they involve external things such as third-party web services. One approach to solving this problem is to use a mocking library to mock out those sorts of things.

Add the following to main.py:

def make_a_move_with_mock_patch():
  """Figure out what move to make in a hypothetical game.

  Use random.randint in part of the decision making process.

  In order to test this function, you have to use mock.patch to monkeypatch random.randint.

  """
  if random.randint(0, 1) == 0:
    return "Attack!"
  else:
    return "Defend!"

Now, add the following to test_main.py. This code dynamically replaces random.randint with a mock (that is, a fake version), thereby allowing you to make it return the same value every time.

@mock.patch("sfpythontesting.main.random.randint")
def test_make_a_move_with_mock_patch_can_attack(randint_mock):
  randint_mock.return_value = 0
  assert_equal(main.make_a_move_with_mock_patch(), "Attack!")

@mock.patch("sfpythontesting.main.random.randint")
def test_make_a_move_with_mock_patch_can_defend(randint_mock):
  randint_mock.return_value = 1
  assert_equal(main.make_a_move_with_mock_patch(), "Defend!")

Your tests should still be passing.

Here's a link to a more detailed article on the mock library.

Dependency injection

Another approach to this same problem is to use dependency injection. Add the following to main.py:

def make_a_move_with_dependency_injection(randint=random.randint):
  """This is another version of make_a_move.

  Accept the randint *function* as a parameter so that the test code can inject a different
  version of the randint function.

  This is known as dependency injection.

  """
  if randint(0, 1) == 0:
    return "Attack!"
  else:
    return "Defend!"

And add the following to test_main.py. Instead of letting make_a_move_with_dependency_injection use the normal version of randint, we pass in our own special version:

def test_make_a_move_with_dependency_injection_can_attack():
  def randint(a, b): return 0
  assert_equal(main.make_a_move_with_dependency_injection(randint=randint), "Attack!")

def test_make_a_move_with_dependency_injection_can_defend():
  def randint(a, b): return 1
  assert_equal(main.make_a_move_with_dependency_injection(randint=randint), "Defend!")

To learn more about dependency injection in Python, see this talk by Alex Martelli.

Since monkeypatching and dependency injection can solve similar problems, you might be wondering which one to use. This turns out to be sort of a religious argument akin to asking whether you should use Vi or Emacs. Personally, I recommend using a combination of PyCharm and Sublime Text ;)

My take is to use dependency injection when you can, but fall back to monkeypatching when using dependency injection becomes impractical. I also recommend that you not get bent out of shape if someone disagrees with you on this subject ;)

doctest

One benefit of using nose is that it can automatically support a wide range of testing APIs. For instance, it works with the unittest testing API as well as its own testing API. It also supports doctests, which are tests embedded inside the docstrings of normal Python code. Add the following to main.py:

def hello_doctest(name):
  """This is a Hello World function for using Doctest.

  >>> hello_doctest("JJ")
  'Hello, JJ!'

  """
  return "Hello, %s!" % name

Notice that the docstring serves both as a useful example and as an executable test. Doctests have fallen out of favor in the last few years because if you overuse them, they can make your docstrings really ugly. However, if you use them to make sure your usage examples keep working, they can be very helpful.
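
As an aside, you don't even need nose to run doctests; the standard library's doctest module can run them directly. For instance, from the project root:

# Import main.py and run its doctests; -v prints each example as it runs.
python -m doctest -v sfpythontesting/main.py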

Conclusion

Ok, there's my lightning quick introduction to virtualenv, nose, mock, monkey patching, dependency injection, and doctest. Obviously I've only just scratched the surface. However, hopefully I've given you enough to get started!

As I mentioned above, people tend to have really strong opinions about the best approaches to testing, so I recommend being pragmatic with your own tests and tolerant of other people's strong opinions on testing. Furthermore, testing is a skill, kind of like coding Python is a skill. To get really good at it, you're going to need to learn a lot more (perhaps by reading a book) and practice. It'll get easier with time.

If you enjoyed this blog post, you might also enjoy my other short blog post, The Zen of Testing. Also, here's a link to the code I used above.