Showing posts from August, 2008

Linux: Trac and Subversion on Ubuntu with Nginx and SSL

I just set up Trac and Subversion on Ubuntu. I decided to proxy tracd behind Nginx so that I could use SSL. I used ssh to access svn. I got email and commit hooks working for everything. I used runit to run tracd. In all, it took me about four days. Here's a brain dump of my notes:

Set up Trac and Subversion:

Set up runit:

    touch /etc/inittab # Latest Ubuntu uses "upstart" instead of the sysv init.
    apt-get install runit
    initctl start runsvdir
    initctl status runsvdir

While still on oldserver, I took care of some Trac setup:

Set up permissions:

    permission list
    permission remove anonymous '*'
    permission remove authenticated '*'

Books: Basics of Compiler Design

I started reading Basics of Compiler Design. I think, perhaps, it might have helped if I had actually taken the course rather than simply try to read the book.

Here's a simple rule of thumb:

Never use three pages of complicated mathematics to explain that which can be explained using either a simple picture or a short snippet of pseudo code.

The section on "Converting an NFA to a DFA" had me at the point of tears. After a couple hours, I finally understood it. However, even after I understood it, I knew I could do a better job teaching it. A little bit of Scheme written by the SICP guys would have been infinitely clearer.
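In the spirit of that rule of thumb, here is a minimal sketch of the NFA-to-DFA conversion (the subset construction) in Python rather than Scheme. It is not taken from the book. The function names and the dict-based NFA representation (`delta` for symbol moves, `epsilon` for epsilon moves) are assumptions chosen for illustration.

```python
def epsilon_closure(states, epsilon):
    """Return every NFA state reachable from `states` using epsilon moves only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in epsilon.get(state, ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def nfa_to_dfa(start, alphabet, delta, epsilon):
    """Subset construction: each DFA state is a frozenset of NFA states.

    delta maps (nfa_state, symbol) -> set of NFA states.
    epsilon maps nfa_state -> set of NFA states.
    """
    dfa_start = epsilon_closure({start}, epsilon)
    dfa = {}
    todo = [dfa_start]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue  # Already expanded this set of NFA states.
        dfa[current] = {}
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= delta.get((state, symbol), set())
            if not moved:
                continue  # No transition on this symbol.
            target = epsilon_closure(moved, epsilon)
            dfa[current][symbol] = target
            todo.append(target)
    return dfa_start, dfa


# Hypothetical NFA for "any string of a's and b's ending in ab".
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa_start, dfa = nfa_to_dfa(0, "ab", delta, epsilon={})
for dfa_state, moves in dfa.items():
    print(sorted(dfa_state), {sym: sorted(t) for sym, t in moves.items()})
```

A DFA state counts as accepting if it contains at least one accepting NFA state; here that would be any set containing state 2.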

I hate to be harsh, but it seemed like the author was just having a good time playing with TeX. I picked this book because it was short and didn't dive into code too much. What I found is that it uses math instead of code. I'd prefer code.

The worst part of reading this book by myself is that even if I make it to the end, I won't know if I…

Humor: I've Been Simpsonized!

Thanks to Dean Fraser (jericho at telusplanet dot net) at Springfield Punx for the artwork.

Books: The Art of UNIX Programming

I just finished reading The Art of UNIX Programming. In short, I liked it a lot.

Here are a few fun quotes:

Controlling complexity is the essence of computer programming -- Brian Kernighan [p. 14]

Software design and implementation should be a joyous art, a kind of high-level play...To do Unix philosophy right, you need to have (or recover) that attitude. [p. 27]

Microsoft actually admitted publicly that NT security is impossible in March 2003. [p. 69, Unfortunately, the URL he provided no longer works.]

One good test for whether an API is well designed is this one: if you try to write a description of it in purely human language (with no source-code extracts allowed), does it make sense? It is a very good idea to get into the habit of writing informal descriptions of your APIs before you code them. [p. 85, this is a good explanation for why I write docstrings before I write code.]

C++ is anticompact--the language's designer has admitted that he doesn't expect any one programmer …

Python: the csv module and mysqlimport

Here's one way to get Python's csv module and mysqlimport to play nicely with one another.

When exporting something with the csv module, use:

    csv.writer(fileobj, dialect='excel-tab', lineterminator='\n')

When importing with mysqlimport, use:

    mysqlimport \
        --user=USERNAME \
        --password \
        --columns=COLUMNS \
        --compress \
        --fields-optionally-enclosed-by='"' \
        --fields-terminated-by='\t' \
        --fields-escaped-by='' \
        --lines-terminated-by='\n' \
        --local \
        --lock-tables \
        --verbose \
        DATABASE INPUT.tsv

In particular, the "--fields-escaped-by=''" took me a while to figure out. With it, the csv module and mysqlimport will agree that '"' is escaped via '""' rather than '\"'.
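To make the quoting convention concrete, here is a small sketch of the export side. It writes two made-up rows to an in-memory buffer with the same csv.writer settings and shows the embedded double quote being doubled. The helper name `write_tsv` and the sample rows are assumptions for illustration only.

```python
import csv
import io


def write_tsv(rows, fileobj):
    """Write rows as tab-separated values with '\\n' line endings."""
    writer = csv.writer(fileobj, dialect='excel-tab', lineterminator='\n')
    writer.writerows(rows)


buf = io.StringIO()
write_tsv([["1", 'say "hi"'], ["2", "plain"]], buf)
tsv_text = buf.getvalue()

# The field containing a double quote is wrapped in quotes,
# and the embedded quotes are doubled rather than backslash-escaped.
print(repr(tsv_text))
```

Only the field that contains a quote character gets enclosed, which is why the import side uses "optionally enclosed by".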

Linux: LinuxWorld, BeOS, Openmoko

I went to LinuxWorld Conference & Expo again this year like I always do. My mentor Leon Atkinson and I always go together. Here are a few notes.

There was a guy who had a booth for the New York Times. I asked him what it had to do with Linux. He said, "Nothing, but I've sold about 40 subscriptions in the last two days and made about $2000. Wanna buy a subscription?" I felt like I had been hit with a 5lb chunk of pink meat right in the face. There was another booth selling office chairs and another selling (I think) foot massages.

I didn't see Novell, HP, O'Reilly, Slashdot, GNOME, KDE, or a ton of other booths I expected to see. I talked with the lead editor at another "very large, but purposely unnamed" publisher, and he said that they wouldn't be back next year either.

There was a pretty cool spherical sculpture made of used computer parts. I was also pleased to see a bunch of guys putting together used computers and loading Linux on th…

SICP: Truly Conquering SICP

This guy is my hero:

I’ve written 52 blog posts (not including this one) in the SICP category, spread over 10 months...Counting with the cloc tool (Count Lines Of Code), the total physical LOC count for the code I’ve written during this time: 7,300 LOC of Common Lisp, 4,100 LOC of Scheme.

Geez, and I was excited when I finished the videos. I feel so inadequate ;)

Python: sort | uniq -c via the subprocess module

Here is "sort | uniq -c" pieced together using the subprocess module:

    from subprocess import Popen, PIPE

    # sort reads from a pipe we control; uniq reads sort's output.
    p1 = Popen(["sort"], stdin=PIPE, stdout=PIPE)
    p2 = Popen(["uniq", "-c"], stdin=p1.stdout, stdout=PIPE)
    for line in p2.stdout:
        print(line.rstrip())

Note, I'm not bothering to check the exit status. You can see my previous post about how to do that.

Now, here's the question. Why does the program freeze if I put the two Popen lines together? I don't understand why I can't set up the pipeline, then feed it data, then close the stdin, and then read the result.
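Here is a hedged sketch of the "set up the pipeline, feed it, close stdin, read the result" sequence, written for Python 3. One plausible explanation for the freeze, not verified against the original setup, is that Python 2's Popen defaulted to close_fds=False, so the uniq child inherited the write end of sort's stdin pipe and sort never saw end-of-file. Python 3 defaults to close_fds=True on POSIX. The function name `sort_uniq_count` is an assumption, and the sketch assumes non-empty input lines and that `sort` and `uniq` are on the PATH.

```python
from subprocess import Popen, PIPE


def sort_uniq_count(lines):
    """Run lines through `sort | uniq -c` and return (count, value) pairs."""
    p1 = Popen(["sort"], stdin=PIPE, stdout=PIPE)
    p2 = Popen(["uniq", "-c"], stdin=p1.stdout, stdout=PIPE)
    # Close our copy of sort's stdout so that only uniq holds the read end.
    p1.stdout.close()

    # Feed the data, then close stdin so that sort sees end-of-file.
    p1.stdin.write("".join(line + "\n" for line in lines).encode())
    p1.stdin.close()

    results = []
    for raw in p2.stdout:
        # uniq -c prints a padded count, whitespace, then the line itself.
        count, value = raw.decode().strip().split(None, 1)
        results.append((int(count), value))
    p2.stdout.close()
    p1.wait()
    p2.wait()
    return results


print(sort_uniq_count(["b", "a", "b"]))
```

On Python 2, passing close_fds=True to both Popen calls would be the analogous thing to try.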

Python: Memory Conservation Tip: Temporary dbms

A dbm is an on-disk hash mapping from strings to strings. The shelve module is a simple wrapper around the anydbm module that takes care of pickling the values. It's nice because it mimics the dict API so well. It's simple and useful. However, one thing that isn't so simple is trying to use a temporary file for the dbm.

The problem is that shelve uses anydbm, which uses whichdb. When you create a temporary file securely, it hands you an open file handle. There's no secure way to get a temporary file that isn't opened yet. Since the file already exists, whichdb tries to figure out what format it uses. Since it doesn't contain anything yet, you get a big explosion.

The solution is to use a temporary directory. The next question is, how do you make sure that temporary directory gets cleaned up without reams of code? Well, just like with temporary files, you can delete the temporary directory even if your code still has an open file handle referencing a file …
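A minimal sketch of the temporary-directory idea, written for Python 3 (where anydbm is now called dbm). Note that this variant is a swapped-in, more conservative approach: it removes the directory when the shelf is closed instead of deleting it while the file is still open. The helper name `temporary_shelf` and the file name "shelf" are assumptions.

```python
import contextlib
import os
import shelve
import shutil
import tempfile


@contextlib.contextmanager
def temporary_shelf():
    """Yield a shelf stored in a fresh temporary directory, then clean up."""
    # mkdtemp creates a private directory; the dbm file inside it does not
    # exist yet, so the dbm layer is free to create it in its own format.
    tmpdir = tempfile.mkdtemp()
    try:
        shelf = shelve.open(os.path.join(tmpdir, "shelf"))
        try:
            yield shelf
        finally:
            shelf.close()
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)


with temporary_shelf() as shelf:
    shelf["numbers"] = [1, 2, 3]
    print(shelf["numbers"])
```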

Python: Memory Conservation Tip: sort Tricks

The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.

Suppose you have a bunch of pairs:

    a b
    b c
    a c
    a c
    b d
    ...

You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:

    referrers[left].add(right)
    referrers[right].add(left)

Notice, I used a set because I only want to know if two things are related, not how many times they are related.
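For reference, here is a small self-contained sketch of the in-memory dict-of-sets version, using the sample pairs above. The function name `build_referrers` and the use of collections.defaultdict are assumptions, since the post only shows the two add() lines.

```python
from collections import defaultdict


def build_referrers(pairs):
    """Map each atom to the set of atoms it appears alongside."""
    referrers = defaultdict(set)
    for left, right in pairs:
        # Record the relationship in both directions.
        referrers[left].add(right)
        referrers[right].add(left)
    return referrers


pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "c"), ("b", "d")]
referrers = build_referrers(pairs)
for atom in sorted(referrers):
    print(atom, sorted(referrers[atom]))
```

The duplicate "a c" pair has no effect, because adding to a set is idempotent.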

My situation is strange. It's small enough so that I don't need to use a cluster. However, it's too big for such a dict to fit into memory. It's not too big for the data to fit in /tmp.

The question is, how do you get this sort of a hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use s…

Python: Memory Conservation Tip: Nested Dicts

I'm working with a large amount of data, and I have a data structure that looks like:

    pair_counts[(a, b)] = count

It turns out that in my situation, I can save memory by switching to:

    pair_counts[a][b] = count

Naturally, the normal rules of premature optimization apply: I wrote for readability, waited until I ran out of memory, did lots of profiling, and then optimized as little as possible.

In my small test case, this dropped my memory usage from 84 MB to 61 MB.
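A short sketch of converting between the two layouts. A likely reason for the savings (an assumption, not something measured here) is that the flat form allocates one tuple object per pair, while the nested form does not. The function name `nest` and the sample data are made up for illustration.

```python
from collections import defaultdict


def nest(pair_counts):
    """Convert {(a, b): count} into {a: {b: count}}."""
    nested = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        nested[a][b] = count
    return nested


flat = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 5}
nested = nest(flat)
print(nested["a"]["b"])
print(len(nested))
```

Whether the nested form actually wins depends on the data, so profiling both on real input is the only reliable check.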

Python: Memory Conservation Tip: intern()

I'm working with a lot of data, and running out of memory is a problem. When I read a line of data, I've often seen the same data before. Rather than have two pointers that point to two separate copies of "foo", I'd prefer to have two pointers that point to the same copy of "foo". This makes a lot of sense in Python since strings are immutable anyway.

I knew that this was called the flyweight design pattern, but I didn't know if it was already implemented somewhere in Python. (Strictly speaking, I thought it was called the "flywheel" design pattern, and my buddy Drew Perttula corrected me.)

My first attempt was to write code like:

    >>> s1 = "foo"
    >>> s2 = ''.join(['f', 'o', 'o'])
    >>> s1 == s2
    >>> s1 is s2
    >>> identity_cache = {}
    >>> s1 = identity_cache.setdefault(s1, s1)
    >>> s2 = identity_cache.setdefault(s2, s2)
    >>> s1 == &#…
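For comparison, here is a small sketch of the built-in route the post title refers to. In Python 2, intern() is a builtin; in Python 3 it lives at sys.intern(), which is what this sketch uses. The variable names are made up for illustration.

```python
import sys

s1 = "foo"
s2 = "".join(["f", "o", "o"])

# Equal values, but not necessarily the same object before interning.
same_value = s1 == s2

# sys.intern returns the one canonical copy for a given string value.
i1 = sys.intern(s1)
i2 = sys.intern(s2)
same_object = i1 is i2

print(same_value, same_object)
```

After interning, the duplicate copy can be garbage collected once nothing else refers to it.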