Posts

Showing posts from 2008

Vim: ctags

ctags is a tool that figures out where various functions, classes, etc. are defined. Using ctags, you can use a hot key to jump to the definition of the symbol under the cursor.

To get started, install exuberant-ctags. In Ubuntu, this is just "apt-get install exuberant-ctags". Now, from within Vim:

    :cd project_root
    :!ctags -R .
    :set tags=tags

To jump to the definition of the symbol under the cursor, use Ctrl-]. To get back to where you were, use Ctrl-o.

There's also a taglist plugin for Vim. Once you install that, you can use ":TlistToggle" to open up a window on the left that shows all the things defined in your open files. I have that mapped to "T" by putting the following in my .vimrc: "map T :TlistToggle<CR>".

Thanks to Benjamin Sergeant for helping me get started with ctags.

Editors: I Dig Komodo Edit

Image
I think I'll switch to Komodo Edit for editing HTML, CSS, JavaScript, Python, Ruby, Perl, and PHP. I'll still use Vim for random text editing and for editing my outline files, and I'll still use Emacs for editing Erlang, Haskell, and Lisp, but I think Komodo Edit is better suited for Web programming.

This is going to be a fairly long review, so let me break it down into sections:

The Good Parts

One thing I really like about this editor is that it is more sophisticated than a default installation of Vim or Emacs, but less sophisticated than a full-blown IDE. I don't feel overwhelmed like I do with Eclipse. The download is only 37 MB compared to 134 MB for Aptana Studio, and you can really feel the difference. So far, it's been very easy to learn and use rather than feeling frighteningly complex.

Let's start with the basics. As you might expect, it does a beautiful job highlighting the various languages. It handles HTML that contains JavaScript and CSS quite easi…

Emacs: nXhtml

Image
In response to the comments in Software Engineering: The Right Editor for the Right Job, I took a look at nXhtml for Emacs.

The scope of nXhtml is impressive. Take a look at the picture. This is a snippet of HTML / JavaScript that I was testing as a part of something else. I hit tab on every line to make it indent things. nXhtml isn't getting the indentation perfectly correct, nor is it getting the syntax highlighting completely correct (why is "beacon" in red?); however, this is worlds better than what comes with Aquamacs by default.

I think nXhtml is a promising project.

Next up, I'm going to check out Komodo Edit. It does make sense to me that since Emacs is written in Lisp, it would be one of the best editors for Lisp, whereas since Komodo Edit is based on XUL (aka Firefox), it would be one of the best editors for editing HTML, CSS, and JavaScript. Of course, I'll have to wait and see.

Software Engineering: The Right Editor for the Right Job

Imagine if you were reasonably skilled with all text editors and all IDEs. Which would you prefer for which tasks?

Clearly, if you're coding elisp, Vim would be a bad choice. Of course, what would be the point? More seriously, Emacs is written in Lisp and has SLIME, the Superior Lisp Interaction Mode for Emacs. Duh, no brainer.

For Scheme, there's something nice to be said about DrScheme's editor. Although, if we stick with the premise of knowing all text editors reasonably well, I'm guessing you might still stick with Emacs.

However, Emacs isn't perfect for everything. For instance, it may have a built-in Web browser, but I can guarantee you that I won't be giving up Firefox just so that I can use Emacs form widgets.

Similarly, Emacs is a little weak on the HTML, CSS, JavaScript side. Aquamacs comes with a fantastic mode for LaTeX, but if you want to edit an HTML file that has CSS and JavaScript in it, it's less than pleasant. mmm-mode and nXhtml-mode a…

Web: Robust Click-through Tracking

I have a web service that provides recommendations. I want to know when people click on the links. The site showing the links (imagine a book store) is separate from my web service.

Let's imagine a situation. My server generates some recommendations. The site shows those recommendations. After 10 minutes, my server goes down because both of my datacenters go down. I want to know if the user clicks on a link, but if my server is down, that must not block the user from surfing to that link.

I see how Google does click-through tracking. It's simple, non-obtrusive, and effective. However, as far as I can tell, it requires the server to be up. Well, they're Google ;) It's different when you're a simple web service that must never ever cause the customer's site to stop working.

I came up with the following:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">

    <html>
    <head>

Python: Web Beacons in Pylons

A Web beacon is usually an image tag that refers to a 1x1 clear gif on a remote server. The remote server is able to track that the gif was seen when the browser tries to download it. If you're using Pylons, here's how to implement that beacon in a way that won't be cached:

    CLEAR_GIF = 'GIF89a\x01\x00\x01\x00\x91\xff\x00\xff\xff\xff\x00\x00\x00\xff\xff\xff\x00\x00\x00!\xff\x0bADOBE:IR1.0\x02\xde\xed\x00!\xf9\x04\x01\x00\x00\x02\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02T\x01\x00;'
    ...
    def some_action(self):
        # Do interesting things here...
        response.headers['Content-Type'] = 'image/gif'
        response.headers['Cache-Control'] = 'no-cache'
        response.write(CLEAR_GIF)
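
For readers who aren't on Pylons, here is a minimal sketch of the same idea as a plain WSGI callable in present-day Python. It is not the code from my app: the names beacon_app and hits are made up for illustration, and the "tracking" is just an in-memory list standing in for whatever logging you'd really do. The GIF bytes are the same ones used above.

```python
# Bytes of the 1x1 clear GIF, copied from the Pylons snippet above.
CLEAR_GIF = (
    b'GIF89a\x01\x00\x01\x00\x91\xff\x00\xff\xff\xff\x00\x00\x00\xff\xff\xff'
    b'\x00\x00\x00!\xff\x0bADOBE:IR1.0\x02\xde\xed\x00!\xf9\x04\x01\x00\x00'
    b'\x02\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02T\x01\x00;'
)

# Hypothetical stand-in for real tracking: remember each query string seen.
hits = []


def beacon_app(environ, start_response):
    """Record the request, then serve the GIF with caching discouraged."""
    hits.append(environ.get('QUERY_STRING', ''))
    start_response('200 OK', [
        ('Content-Type', 'image/gif'),
        ('Content-Length', str(len(CLEAR_GIF))),
        ('Cache-Control', 'no-cache'),
    ])
    return [CLEAR_GIF]
```

To try it locally, something like wsgiref.simple_server.make_server('', 8000, beacon_app).serve_forever() should be enough; anything the page wants to report goes in the image URL's query string.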

Python: Timesheet Calculator

Little programs are so much fun to write ;) Here's one that adds up the hours in my time sheet.

    #!/usr/bin/env python

    """Add up the hours in my hours.otl file.

    The file should have the following format::

        12/21/2008
            3.25 hours working on project-specific domain names.

    The date must be in margin 0. The number of hours must be indented.

    Testing::

        nosetests --with-doctest addhours.py

    Note, I'm positive that this script could be replaced by a one line
    awk script, but whatever. It was fun to write.

    """

    from cStringIO import StringIO
    from optparse import OptionParser
    import re
    import sys

    TEST_DATA = """\
    12/18/2008
        7 Hours programming.

    12/19/2008
        8 Hours hacking.
    """

    hours_regex = re.compile(r"^\s+([0-9.]+)")

    __docformat__ = "restructuredtext"


    def process_file(f):
        """Add up and return the hours in the given open file handle.

        This may raise a ValueError if the file is malformed.

        Test::
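
The excerpt above is cut off before the body of process_file, so here is a small stand-alone sketch of the idea in present-day Python. It is not the original function: add_hours and SAMPLE are names invented for this example, and the only thing borrowed from the script is the regular expression, which picks out indented lines that start with a number.

```python
import io
import re

# Same pattern as hours_regex above: an indented line starting with a number.
HOURS_REGEX = re.compile(r"^\s+([0-9.]+)")


def add_hours(lines):
    """Sum the leading numbers on indented lines; other lines are ignored."""
    total = 0.0
    for line in lines:
        match = HOURS_REGEX.match(line)
        if match:
            # float() raises ValueError for something like "1.2.3".
            total += float(match.group(1))
    return total


# Hypothetical sample in the same layout as TEST_DATA above.
SAMPLE = """\
12/18/2008
    7 Hours programming.

12/19/2008
    8 Hours hacking.
"""

total = add_hours(io.StringIO(SAMPLE))
```

Date lines start in column 0, so the pattern skips them, and blank lines fail because there is no digit after the whitespace.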

Emacs: vimoutliner

I've been drinking too much caffeine lately, and if you know me, you know what that means--I start getting weird urges to play with Emacs.

One of the things that always drives me crazy about Emacs is indentation. It's hard to get it to do what I want it to do in cases where there is no mode that matches what I'm coding. I have a ton of files written using vimoutliner, and I don't feel like switching them to Emacs' own format. It's a simple outline format. Four space wide tabs are used for nesting.

I could never figure out how to get Emacs to just "do the right thing" with these .otl files. I finally figured out the right magical incantation, thanks to some hints from Jesse Montrose. Updated:

    ;; Add support for Vim outline files.
    (defun otl-setup ()
      (setq outline-regexp "\t+")
      (setq indent-tabs-mode t) ;; Use real tabs.
      (setq tab-width 4))

    (setq auto-mode-alist
          (cons '("\\.otl$" . outline-mode)
                auto-mode-alist))
    (ad…

Programming: If programming languages were religions...

Python and Ruby: Regular Expression Anchors

In Python regular expressions, multiline mode is off by default. The documentation says:

    When [multiline mode is] specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

In Ruby regular expressions, the multiline modifier (m) is also off by default. However, '^' still matches the beginning of each line.

Hence, in Python, the following does not match:

    re.match(r'^foo', '\nfoo\nbar')

Interestingly enough, this does not match in Perl either:

    "\nfoo\nbar" =~ /^foo/

In Ruby, it does:

    /^foo/.match("\nfoo\nbar")

Both Python and Ruby support …
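
Here's a small sketch of the Python side for anyone who wants to poke at it. One detail worth separating out: re.match only ever tries the start of the string, so it cannot show what multiline mode changes. re.search is the fairer comparison. The variable names are just for this example.

```python
import re

TEXT = '\nfoo\nbar'

# Without MULTILINE, '^' only matches at the very start of the string.
default_hit = re.search(r'^foo', TEXT)

# With MULTILINE, '^' also matches right after each newline.
multiline_hit = re.search(r'^foo', TEXT, re.MULTILINE)

# re.match only attempts a match at position 0, so the flag does not help.
match_hit = re.match(r'^foo', TEXT, re.MULTILINE)
```

So Ruby's default behavior for '^' corresponds to Python's re.search with re.MULTILINE turned on.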

Computer History: Doug Engelbart

I went to a talk yesterday. It was the 40th anniversary of Doug Engelbart's 1968 "mother of all demos". In the demo, Engelbart demonstrated:

The first computer mouse
The first graphical user interface
The first personal, interactive, networked computer
The first use of hypertext (i.e. text with links)

I had heard about the demo but never watched it. It's available on YouTube, and it's definitely a must see. Doug had a grand vision of using the computer as a tool to help people accelerate how quickly they could solve problems. That goal has always fascinated me.

Robert Taylor, whose funding led to the creation of the ARPANET, told a pretty good joke, which he himself said was probably apocryphal. Rather than retell it, I grabbed a copy from here:

    Whenever you build an airplane, you have to make sure that each part weighs no more than allocated by the designers, and you have to control where the weight is located to keep the center of gravity within limits. So there i…

Books: RESTful Web Services

I just finished reading RESTful Web Services. I'll summarize. At its worst, it was boring and dogmatic. At its best, it helped me to formalize my understanding of REST, and it gave me a protocol-level introduction to a variety of topics like the Atom Publishing Protocol, microformats, S3, del.icio.us, HTML 5, etc.

One thing I found particularly frustrating is the author's attitude toward RPC. Basically, his stance is that RPC is synonymous with all things evil. Consider the following quote:

    This is why making up your own HTTP methods is a very, very bad idea: your custom vocabulary puts you in a community of one. You might as well be using XML-RPC. [p. 105]

Ok, so using XML-RPC is just as bad as sending "EAT / HTTP/1.0" to a server. WTF?

I've implemented services using CORBA, JRMI, XML-RPC, and a couple times with REST. At the risk of calling the emperor naked, I liked XML-RPC the most. REST might look nicer at the wire level, but at least in Python, XML-RP…

Grammar: Predicates

I've noticed that certain programmers love grammar, so I hope you won't mind the following:

"The predicate is the subject of this sentence."

What's the subject? "The predicate" is the subject of this sentence.

What's the predicate? The predicate is "is the subject of this sentence."

Python: Class Methods Make Good Factories

Alex Martelli explained something to me a while back. One of the best uses of class methods is as constructors. For instance, if you want to have multiple constructors, but don't want to rely on one method that simply accepts different sorts of arguments, then use different class methods. The datetime module does this; it has class methods like fromordinal and fromtimestamp to create new datetime instances.

My first thought was that you could just as well use standalone factory functions. However, he brought up a good point. If I use a factory function, the class name is hard coded in the factory function. It can't easily return an instance of some subclass of the class. That's not the case with class methods.

Let me show you what I mean:

    class MyClass:

        def __init__(self):
            # This is the "base" constructor.
            pass

        @classmethod
        def one_constructor(klass, foo):
            # This is one special constructor.
            self = klass()
            self.foo =…
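
Since the excerpt stops mid-example, here is a separate, complete sketch of the subclassing point. Shape, Square, with_sides, and make_shape are names I made up for illustration; they are not from the original post.

```python
class Shape:
    def __init__(self):
        # The "base" constructor takes no arguments.
        self.sides = None

    @classmethod
    def with_sides(cls, sides):
        # cls is whatever class the method was called on, so subclasses
        # get instances of themselves without overriding anything.
        shape = cls()
        shape.sides = sides
        return shape


class Square(Shape):
    pass


def make_shape(sides):
    # A standalone factory has the class name hard coded.
    shape = Shape()
    shape.sides = sides
    return shape


square = Square.with_sides(4)   # an instance of Square
plain = make_shape(4)           # always a plain Shape
```

The class method follows the subclass for free, whereas the factory function would need an extra parameter (or a second function) to do the same.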

Auto: Square Pistons

(Disclaimer: I am mostly ignorant of auto tech.)

Why must pistons be round? I'm guessing that it's because it's easy to machine something really accurately if it's round, and there's probably also something to be said for even pressure distribution. However, I'm thinking that if you used a square piston with rounded corners, you could get a larger "cylinder" to fit in the same block without compromising the thickness of the walls.

Also, why must ports be round? I can imagine ports that are triangles with rounded corners. This could be used to tune how much air is allowed in or out as the piston is going up and down. This would be a tunable, just like a camshaft.

PC-BSD

I tried out PC-BSD 7.0.1 under VMware Fusion on my MacBook.

From the guide:

    PC-BSD is basically FreeBSD with [a modern version of KDE,] a nice installer, some pre-configuration, kernel tweaks, PBI package management, a couple pre-selected packages and some handy (GUI) utilities to make PC-BSD suitable for desktop use.

I worked on FreeBSD GUIs (both desktop and Web user interfaces) for five years. Let me tell you, I'm thankful that PC-BSD finally happened! For some reason, FreeBSD developers tend to either despise GUIs or own a Mac. Hence, it seemed to me that FreeBSD's GUI support actually got worse over the years. It's about time someone finally came along and "pulled an Ubuntu"!

Overall, I was pretty impressed. It reminds me of the early days of Ubuntu where you could see the potential, but you could also see some places that needed some polish. Here are some things I found worthy of note:

KDE looks really nice these days! It seemed a little unstable, but tha…

VMware Euphoria

I've been playing around with VMware since about 2000, but I've never had a computer powerful enough to really run it. Yesterday, I bought another 1gig stick of RAM for my MacBook, which puts me at 2gigs. That's not a heck of a lot, but it's enough.

I now have OS X, Ubuntu, and NetBSD running full screen on different OS X Spaces. I set up VMware Fusion to allow Ubuntu to use both CPUs and 1gig of RAM, whereas I only allocated 1 CPU and 256mb of RAM for NetBSD. OS X does fine with whatever the other two don't use. Ubuntu now has enough horsepower that I can even play the video game I wrote at full speed.

With a simple hot key, I can be in OS X, Ubuntu, or NetBSD. Even better: I can shut the lid of my laptop, and all three suspend without crashing. They all share my Mac's wireless connection, which tends to be pretty stable. If something is giving me a hard time installing under MacPorts, I can just install it on Ubuntu.

Being a minimalist, I only have one c…

Misspelled Variables

Care to guess what happens when you execute the following PHP?

    define('FOO', 'Hi');
    print(FO);

It prints 'FO'.

I do believe PHP got this from Perl:

    perl
    print FOO . "FOO"; # Prints FOOFOO

It works even if you're strict:

    perl -w
    use strict;
    print FOO . "FOO"; # Prints FOOFOO

Ruby behaves differently depending on whether you try to print an undefined variable/method or an undefined attribute:

    irb
    >> print a
    NameError: undefined local variable or method `a' for main:Object
            from (irb):1
    >> print @a
    nil=> nil

Python raises an exception:

    python
    >>> print a
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'a' is not defined

In a compiled language, these sorts of errors would be caught at compile time. However, a compiled language would never let me do something like:

    python
    >>> var_name = 'a'
    >>> locals()[var_name] = 'yep'
    >>> pr…

AI: Thankful for Bad AI

Imagine if the first computers man was able to create worked in pretty much the same way the human brain works. Imagine if they were pretty decent at reasoning, and terrible at calculating things quickly without error. Imagine that instead of having a quest for artificial intelligence, we had a quest for a "really fast, really accurate data cruncher." It'd be a different world. It definitely makes me grateful that we have humans *and* computers, each very useful in their own way.

The question of whether computers can think is like the question of whether submarines can swim -- Edsger W. Dijkstra

NetBSD: X11 Forwarding over SSH

I installed NetBSD 4.0.1 under VMware Fusion 2.0.1 on my OS X 10.5 box, and I had a heck of a time getting X11 forwarding working. I was getting the sshd configuration slightly wrong. Anyway, on the server I edited /etc/ssh/sshd_config:

    X11Forwarding yes
    X11DisplayOffset 10
    # X11UseLocalhost yes
    XAuthLocation /usr/X11R6/bin/xauth

Then I ran:

    rm /home/jj/.Xauthority
    /etc/rc.d/sshd restart

To log in from my Mac, I ran:

    ssh -YA jj@192.168.64.128

Voilà! xterm now works!

Ruby: An Interesting Block Pattern

Ruby has blocks, which enable all sorts of interesting idioms. I'm going to show one that will be familiar to Rails enthusiasts, but was new to me.

I was reading some code in a book, and it had the following:

    def if_found(obj)
      if obj
        yield
      else
        render :text => "Not found.", :status => "404 Not Found"
        false
      end
    end

Here's how you call it:

    if_found(obj) do
      # We have a valid obj. Render something with it.
    end

The code in the block will only execute if the obj was found. If it wasn't found, the response will already have been taken care of.

I've been in the same situation in Python (using Pylons), and I coded something like:

    def handle_not_found(obj):
        if not obj:
            return render_404_page()
        return None

Here's how you call it:

    response = handle_not_found(obj)
    if response:
        return response
    # Otherwise, continue normally.

Pylons likes to return responses, whereas render in Ruby works as a side effect whose return value isn…

Python: Some Notes on lxml

I wrote a webcrawler that uses lxml, XPath, and Beautiful Soup to easily pull data from a set of poorly formatted Web pages. In summary, it works, and I'm quite happy :)

The script needs to pull data from hundreds of Web pages, but not millions, so I opted to use threads. The script actually takes the list of things to look for as a set of XPath expressions on the command line, which makes it super flexible. Let me give you some hints for the parts that I found difficult.

First of all, here's how to install it. If you're using Ubuntu, then:

    apt-get install libxslt1-dev libxml2-dev
    # I also have python-dev, build-essential, etc. installed.
    easy_install lxml
    easy_install BeautifulSoup

If you're using MacPorts, do:

    port install py25-lxml
    easy_install BeautifulSoup

The FAQ states that if you use MacPorts, you may encounter difficulties because you will have multiple versions of libxml and libxslt installed. For instance, the following may segfault:

    python -c "import webbrow…

Python: Permission denied: '/var/www/.python-eggs'

I have a Pylons app, and I got the following exception in my logs:

    The following error occurred while trying to extract file(s) to the Python egg
    cache:

      [Errno 13] Permission denied: '/var/www/.python-eggs'

    The Python egg cache directory is currently set to:

      /var/www/.python-eggs

    Perhaps your account does not have write access to this directory? You can
    change the cache directory by setting the PYTHON_EGG_CACHE environment
    variable to point to an accessible directory.

The problem is that the app was running as www-data (which was the user created for nginx and Apache). www-data's home directory is /var/www, but it doesn't have write access to it. (I'm afraid of allowing write access so that it can unpack eggs into that directory because that directory is the web root. In general, you should be careful of what you put in the web root.)

There are a few ways to address this problem. One is to make sure to always use --always-unzip when installing eggs. Another is to cr…
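
The error message itself points at the PYTHON_EGG_CACHE environment variable, so here is a rough sketch of that route. This is an illustration rather than what I deployed: use_private_egg_cache is a made-up helper, and it assumes you can run it early in your startup script, before anything tries to unpack an egg.

```python
import os
import tempfile


def use_private_egg_cache(parent=None):
    """Point PYTHON_EGG_CACHE at a fresh directory only this user can access.

    tempfile.mkdtemp creates the directory readable and writable only by
    the user running the process, and well away from the web root.
    """
    cache_dir = tempfile.mkdtemp(prefix='python-eggs-', dir=parent)
    os.environ['PYTHON_EGG_CACHE'] = cache_dir
    return cache_dir
```

A fresh directory per start means eggs get unpacked again each time; a fixed directory owned by www-data outside the web root would avoid that.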

Web: Flock

I've been using Flock for a few months now, and I finally noticed that it's not open source. Time for me to switch to Firefox 3 ;)

Web: REST Verbs

I find it curious that REST enthusiasts insist on viewing the world through the five verbs GET, HEAD, PUT, POST, and DELETE. It reminds me of a story:

Back in the early '80s, I worked for DARPA. During the height of the Cold War, we were really worried about being attacked by Russia. My team was charged with designing a RESTful interface to a nuclear launch site; as far as technology goes, we were way ahead of our time.

Anyway, I wanted the interface to be "PUT /bomb". However, my co-worker insisted that it should be "DELETE /russia". One of my other buddies suggested that we compromise on something more mainstream like "POST /russia/bomb".

Finally, my boss put an end to the whole fiasco. He argued that any strike against the USSR would necessarily be in retaliation to an attack from them. Hence, he suggested that it be "GET /even", so that's what we went with.

You have to understand, back then, GETs with side effects weren't yet …

IPv6 T-shirt

Image
Here's a shout out to all my homies in the IPv6 world! If you can't read it, it says "There is no place like 127.0.0.1 (except maybe ::1)". Thanks go to Tarek Ziade (ziade.tarek at gmail.com) for the custom T-shirt design.

Books: Expert Python Programming

I just received my copy of Expert Python Programming. I was the technical editor, and I also wrote the foreword. This is the first time I've ever been mentioned on the front cover of a book, so I'm very excited!

I really enjoyed editing this book. It's the first expert-level book on Python I've read. For a long time, I considered writing one. Tarek beat me to the punch, and I think he did a fantastic job!

A Python Programmer's Perspective on C#

Being a language fanatic, I was really excited when I met a really smart guy named Corey Kosak who gave me a tour of C#'s newest features. I had heard a lot of good things about C# lately, including that it had been strongly influenced by Haskell, which makes sense since Microsoft actually funds research on Haskell. Anyway, a lot of C#'s newest features are a lot more like Python than Java. Let me show you some examples.

Here is a sample C# iterator:

    foreach(var x in CountForeverFrom(123).Take(5)) {
        Console.WriteLine(x);
    }

In Python, I'd write:

    for i in itertools.islice(itertools.count(123), 5):
        print i

C# also has iterators that are similar to Python's generators. Here is the C#:

    public static IEnumerable<int> CountForeverFrom(int start) {
        while(true) {
            yield return start;
            start++;
        }
    }

In Python, I'd write:

    def count_forever_from(start):
        while True:
            yield start
            start += 1

C#'s LINQ syntax is similar to Python's generator expressions…

Python: Debugging Memory Leaks

I wrote a simple tool that could take Web logs and replay them against a server in "real time". I was performance testing my Web app over the course of a day by hitting it with many days' worth of Web logs at the same time.

By monitoring top, I found out that it was leaking memory. I was excited to try out Guppy, but it didn't help. Neither did playing around with the gc module. I had too many objects coming and going to make sense of it all.

Hence, I fell back to a simple process of elimination. Divide-and-conquer! I would make a change to the code, then I would exercise the code in a loop and monitor the output from top for ever-increasing memory usage.
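
The exercise-in-a-loop idea can be sketched in present-day Python with the standard tracemalloc module instead of eyeballing top. To be clear, this is a swapped-in technique and not what I did at the time: tracemalloc only sees memory allocated through Python's allocators, so a leak inside a C library might not show up, whereas top sees the whole process. memory_growth, leaky, and tidy are names invented for the example.

```python
import tracemalloc


def memory_growth(exercise, iterations=1000):
    """Return the bytes of traced memory gained by calling exercise() in a loop."""
    tracemalloc.start()
    try:
        exercise()  # Warm up so one-time allocations are not counted.
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            exercise()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


_kept = []


def leaky():
    # Hypothetical leak: every call keeps another 1 KB alive forever.
    _kept.append(bytearray(1024))


def tidy():
    # The buffer is released as soon as the call returns.
    bytearray(1024)
```

Comparing memory_growth(leaky) with memory_growth(tidy) gives the same signal as watching top: one number keeps climbing with the iteration count, the other stays near zero. Swap suspect pieces of code in and out of exercise() to do the divide-and-conquer.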

Several hours later, I was able to nail it down to this simple repro:

    # This program leaks memory rather quickly. Removing the charset
    # parameter fixes it.

    import MySQLdb
    import sys


    while True:
        connection = MySQLdb.connect(user='user', passwd='password',
                                     host='localh…

Software Engineering: Reuse Has Finally Arrived

Have you noticed that code reuse works these days? For a long time, software engineers struggled with the difficulty of reusing existing software, but it's now commonplace.

Let me give you some examples. I use Linux, Nginx, MySQL, and Python, not to mention a Web browser. These days, very few people need to write a custom kernel, Web server, database, or programming language to solve their particular problem. Sure it happens, but it's far more common to reuse something existing.

I even make use of an existing Web framework, Pylons, and an existing templating engine, Mako. Those things are often written from scratch, but I didn't need to. They were fine.

Even within my own code, I find plenty of places for reuse. Each of my clients has a pretty different setup. Their input formats and output formats are often pretty different, but by using a UNIXy "small tools that can be pieced together" approach, I usually write only a small amount of code when I get a new …

Free Software: Stallman and Births

Since I have four children, I found the following quote from Stallman to be very disturbing:

    Hundreds of thousands of babies are born every day. While the whole phenomenon is menacing, one of them by itself is not newsworthy. Nor is it a difficult achievement--even some fish can do it.

When a fellow Emacs developer said that he had just become a father, Stallman replied, "I am sorry to hear it."

Perhaps he was just trolling. Well, Stallman's right. Even fish can reproduce. However, even a dog knows not to piss on his friend's leg.

Python: Bambi Meets Godzilla

I just re-read a blog post that I read a couple years ago called Bambi Meets Godzilla, and I enjoyed it just as much the second time around. It's a brief history of Smalltalk, Java, Perl, Python, and Ruby, and it talks about why hype is vitally important. It also spends a fair amount of time critiquing Python's culture. If you haven't read it yet, stop reading my post, and go read it instead ;)

It reminds me of The UNIX-HATERS Handbook, which I also love. The funny thing is that to some degree, he's right about Python's culture. I've seen it with my own eyes.

Don't believe me? If I were to admit that I preferred Ruby on Rails over Django, how long do you think it would take for someone to flame me in a comment calling me either an idiot, a troll, a loser, or a heretic, or to say something like "You can recognize good design by the inanity of its detractors"?

Web: SilverStripe

A couple years ago, I built my church's website using Plone. I had to read most of "The Definitive Guide to Plone", but I did it and it worked.

Recently, I realized it was time to overhaul the website. My buddy is a Plone expert, and he told me I would have an easier time rebuilding the website than trying to migrate it since my version of Plone is so old. After two years, I had forgotten much of what I knew about Plone, and I knew that my book was out of date.

I went looking for something that didn't have quite the same learning curve. Plone is fantastic if you're a Plone expert, but I'm not. I just needed "an overly simplistic content management system." I tried out Drupal and Joomla, but for long and complicated reasons, some of which involved my ISP, I decided against them; I'm sure they're quite nice.

My buddy Leon Atkinson told me that he had seen a cool demo for SilverStripe. SilverStripe is PHP, but I decided to watch the video a…

Python for Unix and Linux System Administration

The good news is that I was a lead technical editor of Python for Unix and Linux System Administration, which just came out.

The bad news is that as my wife called me to tell me that my copy of the book had arrived, I noticed that someone had clipped my car in the parking lot and torn off part of the bumper. It looks like I'll have to replace the whole bumper.

C'est la vie.

Anyway, about the book, it's exactly what the title says it is. If you have a computer science background, this book is not for you. However, if you're a sysadmin trying to learn Python, it's perfect. In fact, when I think of all the sysadmins I've met who do a bit of scripting, this book matches them perfectly.

Linux: Trac and Subversion on Ubuntu with Nginx and SSL

I just set up Trac and Subversion on Ubuntu. I decided to proxy tracd behind Nginx so that I could use SSL. I used ssh to access svn. I got email and commit hooks for everything working. I used runit to run tracd. In all, it took me about four days. Here's a brain dump of my notes:

Set up Trac and Subversion:
    Setup runit:
        touch /etc/inittab # Latest Ubuntu uses "upstart" instead of the sysv init.
        apt-get install runit
        initctl start runsvdir
        initctl status runsvdir
    While still on oldserver, I took care of some Trac setup:
        Setup permissions:
            See: http://trac.edgewall.org/wiki/TracPermissions
            trac-admin:
                permission list
                permission remove anonymous '*'
                permission remove authenticated '*'
                permission add authenticated BROWSER_VIEW CHANGESET_VIEW FILE_VIEW LOG_VIEW MILESTONE_VIEW REPORT_SQL_VIEW REPORT_VIEW ROADMAP_VIEW SEARCH_VIEW TICKET_CREATE TICKET_MODIFY TICKET_VIEW TIMELINE_VIEW WIKI_CREATE WI…

Books: Basics of Compiler Design

I started reading Basics of Compiler Design. I think, perhaps, it might have helped if I had actually taken the course rather than simply trying to read the book.

Here's a simple rule of thumb:

    Never use three pages of complicated mathematics to explain that which can be explained using either a simple picture or a short snippet of pseudo code.

The section on "Converting an NFA to a DFA" had me at the point of tears. After a couple hours, I finally understood it. However, even after I understood it, I knew I could do a better job teaching it. A little bit of Scheme written by the SICP guys would have been infinitely clearer.

I hate to be harsh, but it seemed like the author was just having a good time playing with TeX. I picked this book because it was short and didn't dive into code too much. What I found is that it uses math instead of code. I'd prefer code.

The worst part of reading this book by myself is that even if I make it to the end, I won't know if I…

Humor: I've Been Simpsonized!

Image
Thanks to Dean Fraser (jericho at telusplanet dot net) at Springfield Punx for the artwork.

Books: The Art of UNIX Programming

I just finished reading The Art of UNIX Programming. In short, I liked it a lot.

Here are a few fun quotes:

    Controlling complexity is the essence of computer programming -- Brian Kernighan [p. 14]

    Software design and implementation should be a joyous art, a kind of high-level play...To do Unix philosophy right, you need to have (or recover) that attitude. [p. 27]

    Microsoft actually admitted publicly that NT security is impossible in March 2003. [p. 69, Unfortunately, the URL he provided no longer works.]

    One good test for whether an API is well designed is this one: if you try to write a description of it in purely human language (with no source-code extracts allowed), does it make sense? It is a very good idea to get into the habit of writing informal descriptions of your APIs before you code them. [p. 85, this is a good explanation for why I write docstrings before I write code.]

    C++ is anticompact--the language's designer has admitted that he doesn't expect any one programmer …

Python: the csv module and mysqlimport

Here's one way to get Python's csv module and mysqlimport to play nicely with one another.

When exporting something with the csv module, use:

    csv.writer(fileobj, dialect='excel-tab', lineterminator='\n')

When importing with mysqlimport, use:

    mysqlimport \
        --user=USERNAME \
        --password \
        --columns=COLUMNS \
        --compress \
        --fields-optionally-enclosed-by='"' \
        --fields-terminated-by='\t' \
        --fields-escaped-by='' \
        --lines-terminated-by='\n' \
        --local \
        --lock-tables \
        --verbose \
        DATABASE INPUT.tsv

In particular, the "--fields-escaped-by=''" took me a while to figure out. With that option set, the csv module and mysqlimport agree that '"' is escaped via '""' rather than '\"'.
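
To see what the csv side actually emits, here's a small sketch in present-day Python. The to_tsv helper and the sample rows are made up for illustration; the csv.writer arguments are the ones given above.

```python
import csv
import io


def to_tsv(rows):
    """Render rows with the same writer settings as the export above."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, dialect='excel-tab', lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()


output = to_tsv([
    ['id', 'note'],
    ['1', 'say "hi"'],      # Contains the quote character.
    ['2', 'tab\there'],     # Contains the field delimiter.
])
# output == 'id\tnote\n1\t"say ""hi"""\n2\t"tab\there"\n'
```

Fields are only quoted when they need to be, and an embedded double quote is written as two double quotes. That is the convention mysqlimport has to be told about.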

Math: pi

As of today, I am roughly 33π×10^7 seconds old.
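
For anyone who wants to check the arithmetic, here's a quick sketch, reading the figure as 33π times ten to the seventh power seconds. The 365.25-day year is my assumption for the conversion.

```python
import math

# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

age_seconds = 33 * math.pi * 10**7
age_years = age_seconds / SECONDS_PER_YEAR
```

The trick is that π×10^7 seconds is within about half a percent of one year, so the 33 in front is very nearly an age in years (a bit under 33).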

Linux: LinuxWorld, BeOS, Openmoko

I went to LinuxWorld Conference & Expo again this year like I always do. My mentor Leon Atkinson and I always go together. Here are a few notes.

There was a guy who had a booth for the New York Times. I asked him what it had to do with Linux. He said, "Nothing, but I've sold about 40 subscriptions in the last two days and made about $2000. Wanna buy a subscription?" I felt like I had been hit with a 5lb chunk of pink meat right in the face. There was another booth selling office chairs and another selling (I think) foot massages.

I didn't see Novell, HP, O'Reilly, Slashdot, GNOME, KDE, or a ton of other booths I expected to see. I talked with the lead editor at another "very large, but purposely unnamed" publisher, and he said that they wouldn't be back next year either.

There was a pretty cool spherical sculpture made of used computer parts. I was also pleased to see a bunch of guys putting together used computers and loading Linux on th…

SICP: Truly Conquering SICP

This guy is my hero:

"I’ve written 52 blog posts (not including this one) in the SICP category, spread over 10 months... Counting with the cloc tool (Count Lines Of Code), the total physical LOC count for the code I’ve written during this time: 7,300 LOC of Common Lisp, 4,100 LOC of Scheme."

Geez, and I was excited when I finished the videos. I feel so inadequate ;)

Python: sort | uniq -c via the subprocess module

Here is "sort | uniq -c" pieced together using the subprocess module:

    from subprocess import Popen, PIPE

    # Start sort, feed it the data, and close its stdin so it sees end-of-file.
    p1 = Popen(["sort"], stdin=PIPE, stdout=PIPE)
    p1.stdin.write(b'FOO\nBAR\nBAR\n')
    p1.stdin.close()
    # Only now start uniq, reading from sort's output.
    p2 = Popen(["uniq", "-c"], stdin=p1.stdout, stdout=PIPE)
    for line in p2.stdout:
        print(line.decode().rstrip())

Note, I'm not bothering to check the exit status. You can see my previous post about how to do that.

Now, here's the question. Why does the program freeze if I put the two Popen lines together? I don't understand why I can't set up the pipeline, then feed it data, then close the stdin, and then read the result.
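One likely explanation, offered as a sketch rather than a definitive diagnosis: under Python 2, child processes inherit the parent's open file descriptors by default. If uniq is started before p1.stdin is closed, uniq ends up holding its own copy of the write end of sort's stdin pipe, so closing p1.stdin in the parent never delivers end-of-file to sort. Passing close_fds=True (the default on POSIX in Python 3) should avoid that. The helper name sort_uniq_c is hypothetical, and the code assumes sort and uniq are on the PATH.

```python
from subprocess import Popen, PIPE


def sort_uniq_c(data):
    """Run data (bytes) through `sort | uniq -c`; return (count, value) pairs."""
    # Set up the whole pipeline first. close_fds=True keeps uniq from
    # inheriting the write end of sort's stdin pipe.
    p1 = Popen(["sort"], stdin=PIPE, stdout=PIPE, close_fds=True)
    p2 = Popen(["uniq", "-c"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
    p1.stdout.close()  # the parent no longer needs its copy of this end

    # Then feed the data and close stdin so sort sees end-of-file.
    p1.stdin.write(data)
    p1.stdin.close()

    # Finally, read the result.
    counts = []
    for line in p2.stdout:
        count, value = line.decode().split(None, 1)
        counts.append((int(count), value.rstrip("\n")))
    p2.stdout.close()
    p1.wait()
    p2.wait()
    return counts


print(sort_uniq_c(b"FOO\nBAR\nBAR\n"))
```

Writing everything before reading anything is fine here because sort has to consume all of its input before it can produce any output.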

Python: Memory Conservation Tip: Temporary dbms

A dbm is an on-disk hash mapping from strings to strings. The shelve module is a simple wrapper around the anydbm module that takes care of pickling the values. It's nice because it mimics the dict API so well. It's simple and useful. However, one thing that isn't so simple is trying to use a temporary file for the dbm.

The problem is that shelve uses anydbm, which uses whichdb. When you create a temporary file securely, it hands you an open file handle. There's no secure way to get a temporary file that isn't opened yet. Since the file already exists, whichdb tries to figure out what format it uses. Since it doesn't contain anything yet, you get a big explosion.

The solution is to use a temporary directory. The next question is, how do you make sure that temporary directory gets cleaned up without reams of code? Well, just like with temporary files, you can delete the temporary directory even if your code still has an open file handle referencing a file …
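A minimal sketch of the temporary-directory idea in present-day Python 3, where anydbm has become the dbm package. Note that this swaps in tempfile.TemporaryDirectory for the cleanup, which is not necessarily the approach the post goes on to describe. The helper name and the file name "scratch" are made up for illustration.

```python
import os
import shelve
import tempfile


def roundtrip_through_temp_shelf(items):
    """Store (key, value) pairs in a shelf inside a temp dir and read them back."""
    # The directory is created securely and removed when the block exits.
    with tempfile.TemporaryDirectory() as tmpdir:
        # The dbm file does not exist yet, so the dbm layer has nothing
        # to mis-detect when it opens the path.
        path = os.path.join(tmpdir, "scratch")
        shelf = shelve.open(path)
        try:
            for key, value in items:
                shelf[key] = value  # values are pickled for us
            return {key: shelf[key] for key in shelf}
        finally:
            shelf.close()


print(roundtrip_through_temp_shelf([("a", [1, 2]), ("b", {"x": 1})]))
```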

Python: Memory Conservation Tip: sort Tricks

The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.

Suppose you have a bunch of pairs:

    a b
    b c
    a c
    a c
    b d
    ...

You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:

    referrers[left].add(right)
    referrers[right].add(left)

Notice, I used a set because I only want to know if two things are related, not how many times they are related.
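For completeness, here is a small self-contained sketch of the in-memory version. Using collections.defaultdict for the dict of sets is my assumption (the post only shows the two add() calls), and build_referrers is a made-up name.

```python
from collections import defaultdict


def build_referrers(pairs):
    """Map each atom to the set of atoms it appears in a pair with."""
    referrers = defaultdict(set)
    for left, right in pairs:
        referrers[left].add(right)
        referrers[right].add(left)
    return referrers


pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "c"), ("b", "d")]
referrers = build_referrers(pairs)
print(sorted(referrers["a"]))
```

The duplicate ("a", "c") pair has no extra effect, which is the point of using sets.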

My situation is strange. It's small enough so that I don't need to use a cluster. However, it's too big for such a dict to fit into memory. It's not too big for the data to fit in /tmp.

The question is, how do you get this sort of a hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use s…
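The excerpt is cut off above, so the following is only a guess at the kind of trick the title has in mind, not the author's actual code. It emits each pair in both directions and lets "sort -u" group and de-duplicate them. The function name related_pairs is hypothetical, and LC_ALL=C is set only to make the ordering predictable.

```python
import os
import subprocess


def related_pairs(pairs):
    """Return the unique (atom, related_atom) pairs, sorted, via `sort -u`."""
    lines = []
    for left, right in pairs:
        # Emit both directions so every atom shows up in the first column.
        lines.append(f"{left} {right}\n")
        lines.append(f"{right} {left}\n")
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["sort", "-u"],
        input="".join(lines),
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return [tuple(line.split()) for line in result.stdout.splitlines()]


pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "c"), ("b", "d")]
print(related_pairs(pairs))
```

This toy version builds the input in memory for brevity. With real data you would stream the lines to a file or to sort's stdin, and read the sorted output one line at a time, so that all of an atom's relations arrive together.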

Python: Memory Conservation Tip: Nested Dicts

I'm working with a large amount of data, and I have a data structure that looks like:

    pair_counts[(a, b)] = count

It turns out that in my situation, I can save memory by switching to:

    pair_counts[a][b] = count

Naturally, the normal rules of premature optimization apply: I wrote for readability, waited until I ran out of memory, did lots of profiling, and then optimized as little as possible.

In my small test case, this dropped my memory usage from 84 MB to 61 MB.
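Here is a minimal sketch of converting between the two layouts. The function name nest and the use of defaultdict are my own choices for illustration. Whether the nested form actually saves memory depends on the data, so measure before and after as described above.

```python
from collections import defaultdict


def nest(flat_counts):
    """Convert {(a, b): count} into {a: {b: count}}."""
    nested = defaultdict(dict)
    for (a, b), count in flat_counts.items():
        nested[a][b] = count
    return nested


flat = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 5}
nested = nest(flat)
print(nested["a"]["b"])
```

Lookups change from pair_counts[(a, b)] to pair_counts[a][b], and the nested form no longer needs a separate tuple object for every key.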

Python: Memory Conservation Tip: intern()

I'm working with a lot of data, and running out of memory is a problem. When I read a line of data, I've often seen the same data before. Rather than have two pointers that point to two separate copies of "foo", I'd prefer to have two pointers that point to the same copy of "foo". This makes a lot of sense in Python since strings are immutable anyway.

I knew that this was called the flyweight design pattern, but I didn't know if it was already implemented somewhere in Python. (Strictly speaking, I thought it was called the "flywheel" design pattern, and my buddy Drew Perttula corrected me.)

My first attempt was to write code like:

    >>> s1 = "foo"
    >>> s2 = ''.join(['f', 'o', 'o'])
    >>> s1 == s2
    True
    >>> s1 is s2
    False
    >>> identity_cache = {}
    >>> s1 = identity_cache.setdefault(s1, s1)
    >>> s2 = identity_cache.setdefault(s2, s2)
    >>> s1 == …
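For comparison, in Python 3 the intern() built-in that this post's title refers to is spelled sys.intern. Here is a minimal sketch of using it for the same sharing. The helper name share is hypothetical.

```python
import sys


def share(strings):
    """Return the strings with equal values collapsed onto one shared object."""
    return [sys.intern(s) for s in strings]


s1 = "foo"
s2 = "".join(["f", "o", "o"])  # equal to s1, but built at run time
a, b = share([s1, s2])
print(a == b, a is b)
```

After interning, equal strings are the very same object, so each distinct value is stored once no matter how many times it shows up in the data.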