Saturday, December 27, 2008

Vim: ctags

ctags is a tool that figures out where various functions, classes, etc. are defined. Using ctags, you can use a hot key to jump to the definition of the symbol under the cursor.

To get started, install exuberant-ctags. In Ubuntu, this is just "apt-get install exuberant-ctags". Now, from within Vim:
:cd project_root
:!ctags -R .
:set tags=tags
To jump to the definition of the symbol under the cursor, use Ctrl-]. To get back to where you were, use Ctrl-o (or Ctrl-t, which pops the tag stack).
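To get a feel for what ctags actually writes, here is a minimal Python sketch that reads a tags file and looks up where a symbol is defined. It assumes the common tab-separated layout (tag name, file, Ex search command, then optional fields after `;"`). The `parse_tags` helper and the sample data are made up for illustration; they are not part of ctags or Vim.

```python
def parse_tags(lines):
    """Map each tag name to a list of (file, ex_command) pairs."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        # Lines starting with !_TAG_ are metadata about the tags file itself.
        if not line or line.startswith("!_TAG_"):
            continue
        fields = line.split("\t", 2)
        if len(fields) < 3:
            continue  # Not a well-formed tag line; skip it.
        name, path, rest = fields
        # Anything after ;" is extra information such as the tag's kind.
        ex_command = rest.split(';"', 1)[0]
        index.setdefault(name, []).append((path, ex_command))
    return index


# Hypothetical sample in the same shape as a generated tags file.
SAMPLE_TAGS = (
    '!_TAG_FILE_FORMAT\t2\t/extended format/\n'
    'main\taddhours.py\t/^def main():$/;"\tf\n'
    'process_file\taddhours.py\t/^def process_file(f):$/;"\tf\n'
)

index = parse_tags(SAMPLE_TAGS.splitlines())
for path, ex_command in index["process_file"]:
    print(path, ex_command)
```

The Ex command stored with each tag is what Vim runs after opening the file, which is how Ctrl-] lands on the right line.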

There's also a taglist plugin for Vim. Once you install that, you can use ":TlistToggle" to open up a window on the left that shows all the things defined in your open files. I have that mapped to "T" by putting the following in my .vimrc: "map T :TlistToggle<CR>".

Thanks to Benjamin Sergeant for helping me get started with ctags.

Friday, December 26, 2008

Editors: I Dig Komodo Edit


I think I'll switch to Komodo Edit for editing HTML, CSS, JavaScript, Python, Ruby, Perl, and PHP. I'll still use Vim for random text editing and for editing my outline files, and I'll still use Emacs for editing Erlang, Haskell, and Lisp, but I think Komodo Edit is better suited for Web programming.

This is going to be a fairly long review, so let me break it down into sections:
The Good Parts
One thing I really like about this editor is that it is more sophisticated than a default installation of Vim or Emacs, but less sophisticated than a full-blown IDE. I don't feel overwhelmed like I do with Eclipse. The download is only 37 MB compared to 134 MB for Aptana Studio, and you can really feel the difference. So far, it's been very easy to learn and use rather than feeling frighteningly complex.

Let's start with the basics. As you might expect, it does a beautiful job highlighting the various languages. It handles HTML that contains JavaScript and CSS quite easily. It doesn't have a smart indent mode like Emacs, but I've always disliked that feature anyway.

It supports Vi key bindings (more on that later). It knows how to reflow a paragraph, even if that paragraph has "#" at the beginning of each line. It can increase or decrease the indentation level. It has a column selection (aka rectangular selection) mode, which, by the way, is one of those things that separates great editors from mediocre editors. It has code folding, although that's never been all that important to me.

It can do autocomplete for symbols within the current file. Even better, it has code assist. If you type "import os; os.", it'll tell you what your options are. If you type "import os; os.path.join(", it'll tell you what the API for the method is. The code assist is very helpful when you're editing CSS. It'll give you a drop down for things like "background-color" as well as a drop down for the possible values. It can also jump to the definition of a function (more on that later).

It has the notion of a project, but it doesn't require you to set much stuff up. You just say, "This directory is a project." This lets you do project-wide searches, and it shows your files in a file explorer pane on the left. It does create a project file, which is of type .kpf, but that file contains only 7 lines of XML. It basically says to figure out everything on the fly. I don't feel like I have to convert to a new religion or convince all my coworkers to switch before I can start using it.

It recognizes syntax errors. If I improperly indent some Python, it complains. If I forget the ":" after a for loop, it complains. Forgetting the ":" is perhaps my single most common syntax error, so that's helpful. However, it doesn't have built-in support for PyChecker or PyLint like Pydev does (or so I've heard). Hence, it doesn't complain if I print a local variable before setting the local variable even though PyChecker could catch that.

It knows how to run external commands and do something useful with the output like Emacs does. I told it to run "make test", and I purposely made a test fail. I was able to click on a filename in the exception to go directly to the file. What's better, it has the concept of a toolbox where I can set up all sorts of little things like "make test" and click on them when I need them.

Using the Open/Find toolbar does indeed make it easy to do a project-wide grep. "Find in Current Project" is even easier since it understands the root of your project. It allows you to search using a string or a regex. The UI is pleasant. All the matches are shown in a pane so that you can click on them one at a time.

Some of the smaller niceties include the following. There's a line at the 80 column mark. (I can only get Vim to do that using an awful hack, and that drives me crazy.) It opens up the project and files I had open when I last used it. Macros work, even when you're using Vi commands. It's pleasant to look at (it could be prettier, but it ain't bad).
The Bad Parts
It tends to freeze the UI if you ask it to do something really hard like a project-wide search in a directory containing 1.7 GB. That sort of stuff should run on a background thread so that the UI never freezes. What's worse, I had to force quit it when it froze while I was playing around with running external commands like "svn diff" (which did work at least once, by the way).

It doesn't provide good error messages when it doesn't like what you're doing. It tends to just ignore you instead. I found several cases of this.

"Go to Definition" only works for the current file. It doesn't know how to use something like ctags for the project as a whole. You can't even use it for the standard library. This is, perhaps, my single largest complaint. WingIDE is much better in this regard. The essential problem is that the "API catalogs" are pre-made. I found this comment on the issue:
We are going to document this CIX codeintel structure, and later, some tools to help build CIX API catalogs, we shall let everyone know when we have this ready.
In general, the code intelligence behaves a bit strangely. It is easily confused. It also doesn't give me nearly as much information as WingIDE does.

It claims to support SCP, but I couldn't get it to work. Even when I used a server on a standard port and provided a username and password (even though I have an ssh key), it still wouldn't connect. Furthermore, instead of giving me a useful error message, it just timed out.

The Help / Help for Languages section is less than spectacular. When I clicked on the Python Reference, I got some ad page for Komodo. The other languages were better. They should probably have links for CSS, HTML, and JavaScript too a la W3Schools.

The Open/Find toolbar is useful, but I keep ending up in the wrong field when I hit tab for autocomplete. Furthermore, it doesn't add the trailing "/" when autocompleting directories like a shell would do. Last of all, when you run out of room in the widget, it doesn't scroll to the right; hence, you end up not being able to see what you're typing. This definitely isn't as nice as opening up a file with "C-x C-f" in Emacs. (By the way, Vim is even nicer in that it passes the path through the shell when using ":e". That means you can use environment variables, etc. That means it even understands wacky zsh syntax.)

Project / Import from File System... doesn't show me a dialog. It just ignores me. I have no clue why or what it does.

When I tried to create a new project using an existing directory it kind of just ignored me. It turns out that it actually did create the project file. I would have expected it to automatically open the project, but it didn't. Nonetheless, I was able to open the project manually after I created it.

When you preview an HTML file, the CSS doesn't work. However, once I copied the file:// URL and gave it to Firefox, Firefox rendered the file properly. I'm not sure why.
Vi Key Bindings
Komodo Edit does support Vi and Emacs key bindings. Its support is useful, but, as you might expect, it's far from perfect. Perhaps I'm too accustomed to Vim.

"gg" does not go to the top of the file. You have to use ":0" instead.

You can't use "gq" to reflow the paragraph, but you can use "Shift-Apple-Q" instead.

If you use "shift-v" to highlight multiple lines and then ">" to indent them, the cursor must not be in the first column. Otherwise, the last line won't get indented.

"control-o" and "control-i" don't work. Hence, there's no easy way to jump to wherever you were recently.

Surprisingly, "*" and "#" do work for searching for the symbol under the cursor.

Rectangle select works, but doesn't do anything useful. According to the documentation:
With Vi emulation enabled, mouse selection and Vi visual blockwise selection ('Ctrl'+'V') will not trigger column editing. While in Input mode, use 'Alt'+'Shift' with the direction keys to make the column selection, and 'Esc' to exit column editing mode.
Using ">}" to indent the current paragraph doesn't work. "}" by itself does move the cursor. You can use "v}>" to achieve the same goal.

"cw tab tab tab" inserts three things into the undo list instead of just one. Of course, this is a pedantic complaint.

Using "%" to jump between "{" and "}" works, but it's off by one character.

"50i. esc" does not insert 50 periods.

":e" is worthless since it doesn't do autocomplete.
Why Not ...?
You might wonder why I like Komodo Edit over some of the alternatives.

Why not TextMate? Because it's commercial. I might tolerate closed source software in some situations, but I'd rather write my own editor than spend 8 hours a day using someone else's proprietary editor. Call me crazy.

Why not WingIDE? Because Komodo Edit is probably more useful for a wider range of languages.

Why not Komodo IDE? Because it's commercial. Sure, it has an integrated debugger, Python shell, and revision control functionality. However, those things add a lot of complexity. That would destroy the simplicity that I admire so much in Komodo Edit. Besides, I'm a shell junky anyway, and I can use the shell for those things.

Why not Vim? I like the code intelligence features of Komodo Edit. I can still make use of Vi key bindings.

Why not Emacs? Emacs ain't so hot when it comes to HTML with embedded JavaScript and CSS. nXhtml is very promising, but it isn't there yet. Komodo Edit can do more with less configuring and less learning.

Why not jEdit? I've just never felt very drawn to it.
Conclusion
Well, I like it. It's far from perfect, but I can definitely see how to be productive with it. If you'd like to see it in action, check out the Komodo IDE screencast, and just ignore all the things that aren't in Komodo Edit.

Thursday, December 25, 2008

Emacs: nXhtml


In response to the comments in Software Engineering: The Right Editor for the Right Job, I took a look at nXhtml for Emacs.

The scope of nXhtml is impressive. Take a look at the picture. This is a snippet of HTML / JavaScript that I was testing as a part of something else. I hit tab on every line to make it indent things. nXhtml isn't getting the indentation perfectly correct, nor is it getting the syntax highlighting completely correct (why is "beacon" in red?); however, this is worlds better than what comes with Aquamacs by default.

I think nXhtml is a promising project.

Next up, I'm going to check out Komodo Edit. It does make sense to me that since Emacs is written in Lisp, it would be one of the best editors for Lisp, whereas since Komodo Edit is based on XUL (aka Firefox), it would be one of the best editors for editing HTML, CSS, and JavaScript. Of course, I'll have to wait and see.

Software Engineering: The Right Editor for the Right Job

Imagine if you were reasonably skilled with all text editors and all IDEs. Which would you prefer for which tasks?

Clearly, if you're coding elisp, Vim would be a bad choice. Of course, what would be the point? More seriously, Emacs is written in Lisp and has SLIME, the Superior Lisp Interaction Mode for Emacs. Duh, no brainer.

For Scheme, there's something nice to be said about DrScheme's editor. Although, if we stick with the premise of knowing all text editors reasonably well, I'm guessing you might still stick with Emacs.

However, Emacs isn't perfect for everything. For instance, it may have a built-in Web browser, but I can guarantee you that I won't be giving up Firefox just so that I can use Emacs form widgets.

Similarly, Emacs is a little weak on the HTML, CSS, JavaScript side. Aquamacs comes with a fantastic mode for LaTeX, but if you want to edit an HTML file that has CSS and JavaScript in it, it's less than pleasant. mmm-mode and nXhtml-mode aim to fix this, but (from what I've heard) they're less than fun to set up. Pretty much out-of-the-box (i.e. turn on syntax highlighting, auto-indent, etc.), Vim is much nicer for editing an HTML file with CSS and JavaScript in it. From what I can see in the Aptana IDE videos, Aptana is even slicker.

What about Python? Emacs has very good Python integration, including integration with the shell. However, Vim is also pretty pleasant to use for Python. I've heard multiple times that Wing IDE (commercial) is the best Python IDE available, but Pydev (for Eclipse) also seems very active.

Ruby seems to be a no brainer. The entire core team uses TextMate. Of course, the choice is tougher if you object to using a closed source editor. I've seen other Rails coders use RadRails inside Aptana.

If you're coding Erlang, you should probably stick with Emacs. It was the standard editor among the guys who wrote Erlang. I've heard people joke that the only way to make sure you haven't gotten ".", ";", and "," confused is to make sure Emacs is indenting it right.

Similarly, Emacs is probably a good fit for Haskell, at least based on the Haskell coders I've met.

What do you use to edit config files on a remote system? The conventional wisdom is Vi, of course. However, these days, many editors (including Emacs, Vim, and Gedit) support editing over scp. Hence, you don't have to put up with HP-UX's version of Vi (I've heard it's awful) just because you're on a remote system--assuming you have network access.

Concerning Emacs vs. Vim specifically, I think that if there's a well-written Emacs mode, you're better off with Emacs. In other cases, you're better off with Vim. In general, Vim's understanding of most programming languages is much weaker, but it comes with built-in support for many, many more of them. Furthermore, Vim is much better out of the box at dealing with multi-mode files like HTML files that contain CSS and JavaScript.

Furthermore, it's so nice to be able to say something like ":set sw=4 sts=4 et ai" which means "set the shift width to 4 spaces, set soft tab stops to 4 spaces, expand tabs into spaces, auto indent". That might not be as smart as smart indentation mode in Emacs, but it sure is a time saver if there is no smart indentation mode for the syntax you're editing.

I still think that Vim is the fastest editor for straight text editing if you're a touch typist and you really know it well. A lot of "switch hitters" agree with this sentiment. "2dw" = "delete two words". "j." = "go down a line and do it again". Nice ;)

What about Java? Because of the nature of Java, I know very few people who don't use an IDE for Java. I've heard many people say IntelliJ is the best, but it's commercial. Eclipse is the big open source option. Surprisingly, I've heard a lot of nice things about NetBeans; I think they must have put some serious effort into it lately.

If you need something super general purpose and multi-platform, I've heard lots of good things about jEdit, but I can't think of any language for which jEdit is a must have compared to all other editors.

Ok, last tip: if you're coding in Turbo Pascal, any editor will do--as long as it's made by Borland and uses a yellow on blue font ;)

Happy Hacking!

Monday, December 22, 2008

Web: Robust Click-through Tracking

I have a web service that provides recommendations. I want to know when people click on the links. The site showing the links (imagine a book store) is separate from my web service.

Let's imagine a situation. My server generates some recommendations. The site shows those recommendations. After 10 minutes, my server goes down because both of my datacenters go down. I want to know if the user clicks on a link, but if my server is down, that must not block the user from surfing to that link.

I see how Google does click-through tracking. It's simple, non-obtrusive, and effective. However, as far as I can tell, it requires the server to be up. Well, they're Google ;) It's different when you're a simple web service that must never ever cause the customer's site to stop working.

I came up with the following:
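<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>Click-through Tracking</title>
    <script type="text/javascript">
      // Named trackClick rather than click: inside an inline handler, a
      // bare "click" can resolve to the link's own click() method.
      function trackClick(elem) {
        // Fire-and-forget image request; the href is left untouched.
        (new Image()).src = 'http://localhost:5000/api/beacon';
        return true;  // Let the browser follow the link as usual.
      }
    </script>
  </head>

  <body>
    <p>
      <a href="http://www.google.com"
         onclick="return trackClick(this);">click me!</a>
    </p>
  </body>
</html>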
Note a few things. It doesn't mess with the href. It works whether or not the third-party server (localhost) is up. It does talk to a third-party server, but it does so using an image request; hence, the normal cross-site JavaScript constraints aren't imposed. It has all the qualities I want, and I actually think it's a pretty clever trick. However, I'm worried.

I like the fact that loading an image is asynchronous. I'm depending on that. However, what if it takes the browser 1 second to connect to my server, and only 0.1 seconds to move on to Google (because that's what the link links to)? It's a race condition. As long as the browser makes the request at all, I'm fine. However, if it gives up on the request because DNS takes too long, I'm hosed.

Does anyone have any idea how the browsers will behave? Do my requirements make sense? Is there an easier way?

Sunday, December 21, 2008

Python: Web Beacons in Pylons

A Web beacon is usually an image tag that refers to a 1x1 clear gif on a remote server. The remote server is able to track that the gif was seen when the browser tries to download it. If you're using Pylons, here's how to implement that beacon in a way that won't be cached:
CLEAR_GIF = 'GIF89a\x01\x00\x01\x00\x91\xff\x00\xff\xff\xff\x00\x00\x00\xff\xff\xff\x00\x00\x00!\xff\x0bADOBE:IR1.0\x02\xde\xed\x00!\xf9\x04\x01\x00\x00\x02\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02T\x01\x00;'
...
def some_action(self):
    # Do interesting things here...
    response.headers['Content-Type'] = 'image/gif'
    response.headers['Cache-Control'] = 'no-cache'
    response.write(CLEAR_GIF)
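For anyone not on Pylons, the same idea can be sketched as a bare WSGI app. This is only an illustration under a few assumptions: it targets Python 3, `beacon_app` and `fake_start_response` are names I made up, and the tracking itself is left as a comment. The GIF bytes are the ones from the Pylons snippet above.

```python
# Same bytes as the CLEAR_GIF above, written as a bytes literal.
CLEAR_GIF = (
    b'GIF89a\x01\x00\x01\x00\x91\xff\x00\xff\xff\xff\x00\x00\x00\xff\xff\xff'
    b'\x00\x00\x00!\xff\x0bADOBE:IR1.0\x02\xde\xed\x00!\xf9\x04\x01\x00\x00'
    b'\x02\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02T\x01\x00;'
)


def beacon_app(environ, start_response):
    """WSGI app that answers every request with an uncacheable clear GIF."""
    # Record the hit here (log it, bump a counter, ...) before responding.
    headers = [
        ("Content-Type", "image/gif"),
        ("Cache-Control", "no-cache"),
        ("Content-Length", str(len(CLEAR_GIF))),
    ]
    start_response("200 OK", headers)
    return [CLEAR_GIF]


# Exercise the app without a real server by faking start_response.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


body = b"".join(beacon_app({}, fake_start_response))
print(captured["status"], captured["headers"]["Cache-Control"], len(body))
```

To serve it for real, the standard library's `wsgiref.simple_server.make_server('', 5000, beacon_app).serve_forever()` should be enough for local testing.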

Python: Timesheet Calculator

Little programs are so much fun to write ;) Here's one that adds up the hours in my time sheet.
#!/usr/bin/env python

"""Add up the hours in my hours.otl file.

The file should have the following format::

    12/21/2008
        3.25 hours working on project-specific domain names.

The date must be in margin 0.  The number of hours must be indented.

Testing::

    nosetests --with-doctest addhours.py

Note, I'm positive that this script could be replaced by a one line
awk script, but whatever.  It was fun to write.

"""

from cStringIO import StringIO
from optparse import OptionParser
import re
import sys

TEST_DATA = """\
12/18/2008
    7 Hours programming.

12/19/2008
    8 Hours hacking.
"""

hours_regex = re.compile(r"^\s+([0-9.]+)")

__docformat__ = "restructuredtext"


def process_file(f):
    """Add up and return the hours in the given open file handle.

    This may raise a ValueError if the file is malformed.

    Test::

        >>> process_file(StringIO(TEST_DATA))
        15.0

    """
    total = 0
    for line in f:
        match = hours_regex.match(line)
        if match is not None:
            total += float(match.group(1))
    return total


def main():
    """Run the program.

    Deal with optparse, printing nice error messages, etc.

    """
    parser = OptionParser("usage: %prog < hours.otl")
    (options, args) = parser.parse_args()
    if args:
        parser.error("No arguments expected")
    try:
        print process_file(sys.stdin)
    except ValueError, e:
        parser.error("Malformed file: %s" % e)


if __name__ == '__main__':
    main()
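The script above is Python 2 (print statement, cStringIO, "except ValueError, e"). For readers on Python 3, the core of it might look roughly like the sketch below. `total_hours` and `SAMPLE` are my own names for illustration, and the optparse handling is left out to keep it short.

```python
import io
import re

# An indented line that starts with a number counts as an hours entry.
HOURS_REGEX = re.compile(r"^\s+([0-9.]+)")


def total_hours(lines):
    """Add up the hours found in an iterable of text lines.

    float() raises ValueError for a malformed number such as "1.2.3".
    """
    total = 0.0
    for line in lines:
        match = HOURS_REGEX.match(line)
        if match is not None:
            total += float(match.group(1))
    return total


SAMPLE = """\
12/18/2008
    7 Hours programming.

12/19/2008
    8 Hours hacking.
"""

if __name__ == "__main__":
    # Uses the built-in sample; pass sys.stdin instead for a real file.
    print(total_hours(io.StringIO(SAMPLE)))
```

Date lines start in column 0, so the leading-whitespace requirement in the regex is what keeps "12/18/2008" from being counted as hours.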

Friday, December 19, 2008

Emacs: vimoutliner

I've been drinking too much caffeine lately, and if you know me, you know what that means--I start getting weird urges to play with Emacs.

One of the things that always drives me crazy about Emacs is indentation. It's hard to get it to do what I want it to do in cases where there is no mode that matches what I'm coding. I have a ton of files written using vimoutliner, and I don't feel like switching them to Emacs' own format. It's a simple outline format. Four space wide tabs are used for nesting.

I could never figure out how to get Emacs to just "do the right thing" with these .otl files. I finally figured out the right magical incantation, thanks to some hints from Jesse Montrose. Updated:
;; Add support for Vim outline files.
(defun otl-setup ()
  (setq outline-regexp "\t+")
  (setq indent-tabs-mode t)  ;; Use real tabs.
  (setq tab-width 4))

(setq auto-mode-alist
      (cons '("\\.otl$" . outline-mode)
            auto-mode-alist))
(add-hook 'outline-mode-hook
          'otl-setup)
Voilà! Editing .otl files just became possible!

Thursday, December 11, 2008

Python and Ruby: Regular Expression Anchors

In Python regular expressions, multiline mode is off by default. The documentation says:
When [multiline mode is] specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
In Ruby regular expressions, the multiline modifier (m) is also off by default. However, '^' still matches the beginning of each line.

Hence, in Python, the following does not match:
re.search(r'^foo', '\nfoo\nbar')  # re.match would anchor at the start regardless of '^'
Interestingly enough, this does not match in Perl either:
"\nfoo\nbar" =~ /^foo/
In Ruby, it does:
/^foo/.match("\nfoo\nbar")
Both Python and Ruby support the "\A" operator which explicitly matches the beginning of the string (not the line).

To make matters even more confusing, in Python "\Z" matches the "end of the string." In Ruby, "\Z" matches the end of the string except for the final newline, whereas "\z" matches the end of the string. Ruby is similar to Perl in this regard.
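The Python half of this is easy to check with a few assertions. This is just a quick sketch using the standard `re` module; the sample string is made up, and I use `re.search` because `re.match` always anchors at the start of the string no matter what the pattern says.

```python
import re

text = "\nfoo\nbar\n"

# Default mode: '^' only matches at the very start of the string.
assert re.search(r"^foo", text) is None

# re.MULTILINE: '^' also matches right after each newline, which is
# how Ruby's '^' always behaves.
assert re.search(r"^foo", text, re.MULTILINE) is not None

# '\A' stays pinned to the start of the string, even in multiline mode.
assert re.search(r"\Afoo", text, re.MULTILINE) is None

# '$' matches before a trailing newline; '\Z' only at the true end.
assert re.search(r"bar$", text) is not None
assert re.search(r"bar\Z", text) is None
assert re.search(r"bar\n\Z", text) is not None
```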

I was surprised to discover such subtle differences existed. Things like that make expert-level proficiency in multiple languages extremely difficult.

Wednesday, December 10, 2008

Computer History: Doug Engelbart

I went to a talk yesterday. It was the 40th anniversary of Doug Engelbart's 1968 "mother of all demos". In the demo, Engelbart demonstrated:
  • The first computer mouse
  • The first graphical user interface
  • The first personal, interactive, networked computer
  • The first use of hypertext (i.e. text with links)
I had heard about the demo but never watched it. It's available on YouTube, and it's definitely a must see. Doug had a grand vision of using the computer as a tool to help people accelerate how quickly they could solve problems. That goal has always fascinated me.

Robert Taylor, whose funding led to the creation of the ARPANET, told a pretty good joke, which he himself said was probably apocryphal. Rather than retell it, I grabbed a copy from here:
Whenever you build an airplane, you have to make sure that each part weighs no more than allocated by the designers, and you have to control where the weight it located to keep the center of gravity with limits. So there is an organization called weights which tracks that.

For the 747-100, one of the configuration items was the software for the navigation computer. In those days (mid-1960s), the concept of software was not widely understood. The weight of the software was 0. The weights people didn't understand this so they sent a guy to the software group to understand this. The software people tried mightily to explain that the software was weightless, and the weights guy eventually went away, dubious.

The weights guy comes back a few days later with a box of punch cards (if you don't know what a punch card is, e-mail me and I will explain). The box weighed about 15 pounds. The weights guy said "This box contains software". The software guys inspected the cards and it was, in fact, a computer program. "See?", the weights guy said, "This box weighs about 15 pounds". "You don't understand", the software guys responded, "The software is in the holes".
Alan Kay was also there. Alan is always fascinating to listen to. He said something that I thought was useful. He said that there is a difference between "new" and "news". "News" is when something happens and you get an update that it happened. News is simple and easy to assimilate. Something is "new" when it changes the rules of the game. When something is "new", it's impossible to fully understand the ramifications.

He had a great example. When the printing press came out, people thought it was "news". Suddenly, it was cheaper to print books. What they didn't understand was that it was actually "new". The printing press allowed ideas to spread more quickly, more broadly, and more accurately than ever before. It was impossible for them to understand just how profoundly the printing press would affect the world.

Monday, December 01, 2008

Books: RESTful Web Services

I just finished reading RESTful Web Services. I'll summarize. At its worst, it was boring and dogmatic. At its best, it helped me to formalize my understanding of REST, and it gave me a protocol-level introduction to a variety of topics like the Atom Publishing Protocol, microformats, S3, del.icio.us, HTML 5, etc.

One thing I found particularly frustrating is the author's attitude toward RPC. Basically, his stance is that RPC is synonymous with all things evil. Consider the following quote:
This is why making up your own HTTP methods is a very, very bad idea: your custom vocabulary puts you in a community of one. You might as well be using XML-RPC. [p. 105]
Ok, so using XML-RPC is just as bad as sending "EAT / HTTP/1.0" to a server. WTF?

I've implemented services using CORBA, JRMI, XML-RPC, and a couple times with REST. At the risk of calling the emperor naked, I liked XML-RPC the most. REST might look nicer at the wire level, but at least in Python, XML-RPC is a heck of a lot easier to code for. None of the author's client examples can match the simplicity of using an XML-RPC service in Python. Sure, you can say that XML is ugly, but I never actually had to deal with XML when using XML-RPC. The libraries did it for me. However, with REST, I have to slog through XML all the time.
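To show what "the libraries did it for me" means, here is a small sketch using the standard library's XML-RPC module (`xmlrpc.client` in Python 3; it was `xmlrpclib` in Python 2). The remote method `add` is hypothetical. No server is involved: the sketch only marshals a call to XML and back so you can see what you never have to write by hand.

```python
import xmlrpc.client

# Marshal a call to a hypothetical remote method add(2, 3).
payload = xmlrpc.client.dumps((2, 3), methodname="add")
print(payload)  # The XML that would go over the wire.

# Unmarshal it again, as the server-side library would.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

Against a real service, `xmlrpc.client.ServerProxy(url).add(2, 3)` does both steps and the HTTP round trip in one line, where `url` is whatever endpoint the service exposes.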

Another thing that frustrated me about this book was how often it relies on standards that aren't yet standard. The back of the book says, 'This book puts the "Web" back into web services.' Unfortunately, this book makes use of HTML 5, a couple different IETF Internet-Drafts, and HTTP methods that my browser doesn't actually support. It should have said, "This book shows you how great REST would be if we had the perfect web."

One thing I really liked about this book was the checklist for creating Resource-Oriented Architectures, which I'll quote here:
  1. Figure out the data set.
  2. Split the data set into resources.
  3. For each kind of resource:
    1. Name the resources with URIs
    2. Expose a subset of the uniform interface
    3. Design the representation(s) accepted from the client
    4. Design the representation(s) served to the client.
    5. Integrate this resource into existing resources, using hypermedia links and forms.
    6. Consider the typical course of events: what's supposed to happen? Standard control flows like the Atom Publishing Protocol can help (see Chapter 9).
    7. Consider error conditions: what might go wrong? Again, standard control flows can help. [p. 216]
Aside from that, here is a list of quotes that I found surprising, frustrating, interesting, or simply entertaining (especially when taken out of context):
Now, lots of architectures are technically RESTful...More than you'd think. The Google SOAP API for web search technically has a RESTful architecture...But these are bad architectures for web services, because they look nothing like the Web. [p. 13]
Service-Oriented Architecture...This is a big industry buzzword...A book on service-oriented architecture should work on a slightly higher level, showing how to use services as software components, how to integrate them into a coherent whole. I don't cover that sort of thing in this book. [p. 20]
ProgrammableWeb...is the most popular web service directory...Its terminology isn't as exact as I'd like (it tends to classify REST-RPC hybrids as "REST" services). [p. 368]
I do my bit to promote WADL as a resource-oriented alternative to WSDL. I think it's the simplest and most elegant solution. [p. 25]
If a web service designer has never heard of REST, or thinks that hybrid services are "RESTful," there's little you can do about it. Most existing services are hybrids or full-blown RPC services. [p. 27]
Another uniform interface consists solely of HTTP GET and overloaded POST...This interface is perfectly RESTful, but, again, it doesn't conform to my Resource-Oriented Architecture. [p. 125]
Web services are just web sites for robots. [p. 132]
I need to truly capture the capabilities of my service. XHTML 5 has a feature called the repetition model, which allows me to express an arbitrary number of text boxes without writing an infinitely long HTML page. [p. 136]
You may have noticed a problem in Example 6-3. Its form specifies an HTTP method of PUT...I'm using the as-yet-unreleased XHTML 5 to get around the shortcomings of the current version of HTML. [p. 153]
The two derivations from the HTML you're familiar with are in the method attribute...and the brand-new template attribute, which inserts a form variable ("username") into the URI using the URI Templating standard (http://www.ietf.org/internet-drafts/draft-gregorio-uritemplate-00.txt). [p. 155]
Note that the Internet-Draft itself says:
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
I had to hack Rails to get the behavior I want, instead of the behavior Rails creator David Heinemeier Hansson wants. [p. 168]
It's been a while since I presented any code. Indeed, coming up with the code is currently a general problem for REST advocates. [p. 167]
There's no good way to do that in Rails. [p. 189]
The if_found method sends a response code of 404 ("Not Found") if the user tries to GET or DELETE a nonexistent user. [p. 199; Hey, I thought DELETE was supposed to be idempotent!]
Of course, using the web service just means writing more code. [p. 209]
Clients could work on a higher level than HTTP...The idea here is to apply higher-level conventions than REST's, so that the client programmer doesn't have to write as much code. [p. 212]
The ActiveResource/ActiveRecord approach won't work for all web services, or even all Rails web services. It doesn't work very well on this service...As of the time of writing, it's more a promising possibility than a real-world solution to a problem. [p. 212]
If you want to do without PUT and DELETE altogether, it's entirely RESTful to expose safe operations on resources through GET, and all other operations through overloaded POST. Doing this violates my Resource-Oriented Architecture, but it conforms to the less restrictive rules of REST. [p. 220]
But every resource works basically the same way and can be accessed with a universal client. This is a big part of the success of the Web. The restrictions imposed by the uniform interface (safety for GET and HEAD, idempotence for PUT and DELETE) make HTTP more reliable. [p. 222; The web was successful despite the fact that GET is often not safe and PUT and DELETE aren't available in Web browsers.]
How can you DELETE two resources at once?...You might be wondering what HTTP status code to send in response to a batch operation...You can use an extended HTTP status code created by the WebDAV extension to HTTP: 207 ("Multi-Status"). [p. 230]
Yet again, the way to deal with an action that doesn't fit the uniform interface is to expose the action itself as a resource. [p. 232]
I'll translate. If you feel the need to use verbs other than GET, PUT, POST, and DELETE, just convert your verb to a noun. REST consists of converting all interesting verbs into nouns so that you only have to use basic verbs like GET, PUT, POST, DELETE, HEAD, and OPTIONS.
[Browsers only support GET and POST.] If the server supports it, a client can get around these limitations by tunneling PUT and DELETE requests through overloaded POST...Include the "real" HTTP method in the query string. Ruby on Rails defines a hidden form field called _method...Restlet uses the method variable...The second way is to include the "real" HTTP action in the X-HTTP-Method-Override HTTP request header. [p. 252; That's a pretty good summary of how to stick to the web's uniform interface.]
A web service that sends HTTP cookies violates the principle of statelessness...What about cookies that really do contain application state? What if you serialize the actual session hash and send it as a cookie...This can be RESTful, but it's usually not. [p. 252]
POST Once Exactly (POE) is a way of making HTTP POST idempotent, like PUT and DELETE...Post was defined by Mark Nottingham in an IETF draft that expired in 2005. [p. 283; He is relying on a long-expired Internet-Draft to show how to implement a feature.]
I cover four hypermedia technologies in this section. As of the time of writing, XHTML 4 is the only hypermedia technology in active use. [p. 285]
As of the time of writing, WADL is more talked about than used. [p. 290]
If all you're doing is serializing a data structure for transport across the wire (as happens in the weblogs.com ping service), consider JSON as your representation format. [p. 308]
In 2006, IBM and Microsoft shut down their public UDDI registry after publicly declaring it a success. [p. 309]
Suffice it to say that security concepts are much better specified and deployed in SOAP-based protocols than in native HTTP protocols. [p. 311; Which isn't to say I like SOAP.]
Two-phase commit requires a level of control over and trust in the services you're coordinating. This works well when all the services are yours, but not so well when you need to work with a competing bank...I generally think it's inappropriate for RESTful web services. [p. 313]
Refer [is a] request header...Yes, it's misspelled. [p. 401; That would explain why I always misspell it!]
Anyway, sorry for going so long. I hope some of those quotes entertained you as much as they entertained me.
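The POST-tunneling trick quoted above (p. 252) is easier to picture in code. Here is a minimal Python 3 sketch of how a server might work out the "real" method. The function name effective_method and the plain dicts for the form fields and headers are my own inventions for illustration; only the _method field and the X-HTTP-Method-Override header come from the quote.

```python
# Hypothetical helper: not taken from Rails, Restlet, or the book.
ALLOWED_OVERRIDES = {"PUT", "DELETE"}


def effective_method(method, form=None, headers=None):
    """Return the HTTP method that a (possibly tunneled) request really means."""
    form = form or {}
    headers = headers or {}
    if method.upper() != "POST":
        # Only POST is allowed to carry an overloaded method.
        return method.upper()
    # Assumes header names have already been normalized to this spelling.
    override = headers.get("X-HTTP-Method-Override") or form.get("_method") or ""
    override = override.upper()
    if override in ALLOWED_OVERRIDES:
        return override
    return "POST"


print(effective_method("POST", form={"_method": "delete"}))
print(effective_method("GET", form={"_method": "delete"}))
```

Ignoring the override on anything other than POST is a design choice on my part: it keeps a GET from ever being turned into something unsafe.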

Grammar: Predicates

I've noticed that certain programmers love grammar, so I hope you won't mind the following:

"The predicate is the subject of this sentence."

What's the subject? "The predicate" is the subject of this sentence.

What's the predicate? The predicate is "is the subject of this sentence."

Thursday, November 27, 2008

Python: Class Methods Make Good Factories

Alex Martelli explained something to me a while back. One of the best uses of class methods is as constructors. For instance, if you want to have multiple constructors, but don't want to rely on one method that simply accepts different sorts of arguments, then use different class methods. The datetime module does this; it has class methods like fromordinal and fromtimestamp to create new datetime instances.

My first thought was that you could just as well use standalone factory functions. However, he brought up a good point. If I use a factory function, the class name is hard coded in the factory function. It can't easily return an instance of some subclass of the class. That's not the case with class methods.

Let me show you what I mean:
class MyClass:

    def __init__(self):
        # This is the "base" constructor.
        pass

    @classmethod
    def one_constructor(klass, foo):
        # This is one special constructor.
        self = klass()
        self.foo = foo
        return self

    @classmethod
    def another_constructor(klass, bar):
        # This is another special constructor.
        # ...
        pass


class MySubclass(MyClass):
    # This does some necessary customizations.
    pass


obj = MySubclass.one_constructor('foo')
Here I am instantiating an instance of MySubclass, but I am using the class method one_constructor from the superclass as the constructor.
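The same thing can be seen with the standard library's own class-method constructors. The sketch below uses Python 3 syntax; the Birthday subclass and the make_date function are hypothetical names I made up for the comparison, while date.fromordinal is the real class method.

```python
from datetime import date


class Birthday(date):
    """A hypothetical date subclass, used only for this illustration."""


def make_date(ordinal):
    # Standalone factory function: the class name is hard coded.
    return date.fromordinal(ordinal)


plain = make_date(730920)
special = Birthday.fromordinal(730920)

print(type(plain).__name__)    # date
print(type(special).__name__)  # Birthday
```

The factory function can only ever hand back a plain date, whereas the inherited class method builds whatever class it was called on.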

If you've followed me so far, then perhaps you can imagine why Java's "public static void main" sometimes makes sense for Python too.

Auto: Square Pistons

(Disclaimer: I am mostly ignorant of auto tech.)

Why must pistons be round? I'm guessing that it's because it's easy to machine something really accurately if it's round, and there's probably also something to be said for even pressure distribution. However, I'm thinking that if you used a square piston with rounded corners, you could get a larger "cylinder" to fit in the same block without compromising the thickness of the walls.

Also, why must ports be round? I can imagine ports that are triangles with rounded corners. This could be used to tune how much air is allowed in or out as the piston is going up and down. This would be a tunable, just like a camshaft.

Friday, November 21, 2008

PC-BSD

I tried out PC-BSD 7.0.1 under VMware Fusion on my MacBook.

From the guide:
PC-BSD is basically FreeBSD with [a modern version of KDE,] a nice installer, some pre-configuration, kernel tweaks, PBI package management, a couple pre-selected packages and some handy (GUI) utilities to make PC-BSD suitable for desktop use.
I worked on FreeBSD GUIs (both desktop and Web user interfaces) for five years. Let me tell you, I'm thankful that PC-BSD finally happened! For some reason, FreeBSD developers tend to either despise GUIs or own a Mac. Hence, it seemed to me that FreeBSD's GUI support actually got worse over the years. It's about time someone finally came along and "pulled an Ubuntu"!

Overall, I was pretty impressed. It reminds me of the early days of Ubuntu where you could see the potential, but you could also see some places that needed some polish. Here are some things I found worthy of note:

KDE looks really nice these days! It seemed a little unstable, but that could just be my setup.

I wasn't able to get the same resolution I can get in Linux under VMware, i.e. 1280x800. VMware has a ton of kernel modules for Linux that I'm guessing simply aren't available for FreeBSD. Hence, I was stuck at 1024x768.

The fonts don't look so hot on my laptop. The anti-aliasing looks wrong. There is some discussion about why this is the case here.

PC-BSD supports FreeBSD's normal packaging system and the ports system, but it also has a packaging system called PBI (Push Button Installer or PC-BSD Installer). These packages work a lot like installing a Windows application. You can find them by going to pbidir.com. You download them and then double click to install them. Uninstalling them is a lot like uninstalling a program on Windows. They even get installed into a /Programs directory instead of integrated into the normal hierarchy.

The installer gives you the opportunity to install a bunch of these PBIs. They all tend to be large applications like Firefox and OpenOffice.org. I even noticed Opera in the list. I installed pretty much everything except for Opera. I was surprised to see that it took 5.7 gigs of disk space.

In general, looking at / is a bit strange. Aside from /Programs, there's also /PCBSD and a few other surprises.

By default, the system comes with sshd installed and running, but a firewall blocking access to it (Packet Filter from OpenBSD). This is actually the opposite of Ubuntu which does not come with sshd installed, nor does it autoconfigure a firewall.

I don't use FreeBSD much these days, but if you're one of the handful of people besides me who actually care about FreeBSD on the desktop, this is a really cool development :)

Thursday, November 20, 2008

VMware Euphoria

I've been playing around with VMware since about 2000, but I've never had a computer powerful enough to really run it. Yesterday, I bought another 1gig stick of RAM for my MacBook, which puts me at 2gigs. That's not a heck of a lot, but it's enough.

I now have OS X, Ubuntu, and NetBSD running full screen on different OS X Spaces. I set up VMware Fusion to allow Ubuntu to use both CPUs and 1gig of RAM, whereas I only allocated 1 CPU and 256mb of RAM for NetBSD. OS X does fine with whatever the other two don't use. Ubuntu now has enough horsepower that I can even play the video game I wrote at full speed.

With a simple hot key, I can be in OS X, Ubuntu, or NetBSD. Even better: I can shut the lid of my laptop, and all three suspend without crashing. They all share my Mac's wireless connection, which tends to be pretty stable. If something is giving me a hard time installing under MacPorts, I can just install it on Ubuntu.

Being a minimalist, I only have one computer, so it's kind of hard to play around with fringe OSs, which I used to love doing. That's about to change. Next up, MINIX 3 and PC-BSD!

Oh my gosh that's cool!!!

Tuesday, November 18, 2008

Misspelled Variables

Care to guess what happens when you execute the following PHP?
define('FOO', 'Hi');
print(FO);
It prints 'FO'.

I do believe PHP got this from Perl:
perl
print FOO . "FOO"; # Prints FOOFOO
It works even if you're strict:
perl -w
use strict;
print FOO . "FOO"; # Prints FOOFOO
Ruby behaves differently depending on whether you try to print an undefined variable/method or an undefined attribute:
irb
>> print a
NameError: undefined local variable or method `a' for main:Object
from (irb):1
>> print @a
nil=> nil
Python raises an exception:
python
>>> print a
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'a' is not defined
In a compiled language, these sorts of errors would be caught at compile time. However, a compiled language would never let me do something like:
python
>>> var_name = 'a'
>>> locals()[var_name] = 'yep'
>>> print a
yep
This example is a bit contrived, but I've definitely done things like it.

Personally, I like the flexibility of a scripting language. I like it even more when there's a tool like pychecker that can catch these sorts of errors. However, just because a scripting language doesn't have a compilation step that can catch stupid spelling mistakes doesn't mean it should accept them at runtime. I'd much rather deal with an exception than spend half an hour fighting a bug caused by a spelling error!

As a general rule, I think software should fail fast rather than glossing over bugs that will surely cause trouble later. I can handle the exception if I need to, but what I can't handle is a silent bug.
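Here is a small Python 3 sketch of the difference between glossing over a spelling mistake and failing fast. The settings dict and both lookup functions are hypothetical names invented for the example.

```python
settings = {"timeout": 30}


def lenient_lookup(name):
    # Glosses over a misspelled key by quietly returning None.
    return settings.get(name)


def strict_lookup(name):
    # Fails fast: a misspelled key raises KeyError right here.
    return settings[name]


# The typo goes unnoticed; the None causes trouble somewhere else later.
print(lenient_lookup("timeuot"))

try:
    strict_lookup("timeuot")
except KeyError as error:
    print("Caught the typo immediately:", error)
```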

Monday, November 17, 2008

AI: Thankful for Bad AI

Imagine if the first computers man was able to create worked in pretty much the same way the human brain works. Imagine if they were pretty decent at reasoning, and terrible at calculating things quickly without error. Imagine that instead of having a quest for artificial intelligence, we had a quest for a "really fast, really accurate data cruncher." It'd be a different world. It definitely makes me grateful that we have humans *and* computers, each very useful in their own way.

The question of whether computers can think is like the question of whether submarines can swim -- Edsger W. Dijkstra

NetBSD: X11 Forwarding over SSH

I installed NetBSD 4.0.1 under VMware Fusion 2.0.1 on my OS X 10.5 box, and I had a heck of a time getting X11 forwarding working. I was getting the sshd configuration slightly wrong. Anyway, on the server I edited /etc/ssh/sshd_config:
X11Forwarding yes
X11DisplayOffset 10
# X11UseLocalhost yes
XAuthLocation /usr/X11R6/bin/xauth
Then I ran:
rm /home/jj/.Xauthority
/etc/rc.d/sshd restart
To login from my Mac, I ran:
ssh -YA jj@192.168.64.128
Voilà! xterm now works!

Wednesday, October 29, 2008

Ruby: An Interesting Block Pattern

Ruby has blocks, which enable all sorts of interesting idioms. I'm going to show one that will be familiar to Rails enthusiasts, but was new to me.

I was reading some code in a book, and it had the following:
def if_found(obj)
  if obj
    yield
  else
    render :text => "Not found.", :status => "404 Not Found"
    false
  end
end
Here's how you call it:
if_found(obj) do
  # We have a valid obj. Render something with it.
end
The code in the block will only execute if the obj was found. If it wasn't found, the response will already have been taken care of.

I've been in the same situation in Python (using Pylons), and I coded something like:
def handle_not_found(obj):
    if not obj:
        return render_404_page()
    return None
Here's how you call it:
response = handle_not_found(obj)
if response:
    return response
# Otherwise, continue normally.
Pylons likes to return responses, whereas render in Ruby works as a side effect whose return value isn't important. However, that's not my point.

My point is that the Python code uses "if response:" whereas the Ruby code uses "if_found(obj) do". Python uses an explicit if statement, whereas Ruby hides the actual if statement in a block. Similarly, Rubyists tend to write "my_list.each do |i|..." (even though Ruby has a for statement), whereas Pythonistas use "for i in my_list".
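For what it's worth, Python can hide the if statement too, by passing a function where Ruby would pass a block. This is only a sketch in Python 3 syntax: if_found, show_user, and the stand-in render_404_page below are hypothetical, not Pylons APIs.

```python
def render_404_page():
    # Stand-in for whatever the framework uses to build a 404 response.
    return "404 Not Found"


def if_found(obj, action):
    """Call action(obj) only if obj was found; otherwise return a 404."""
    if not obj:
        return render_404_page()
    return action(obj)


def show_user(user):
    return "Hello, %s" % user


print(if_found("jj", show_user))
print(if_found(None, show_user))
```

It works, but having to name the callback (or squeeze it into a lambda) is probably why Pythonistas tend to reach for the explicit if statement instead.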

Ok, now that I've totally made a mountain out of a molehill, please note that I'm not saying either is better than the other. I'm just saying it's interesting to note the difference.

Tuesday, October 28, 2008

Python: Some Notes on lxml

I wrote a webcrawler that uses lxml, XPath, and Beautiful Soup to easily pull data from a set of poorly formatted Web pages. In summary, it works, and I'm quite happy :)

The script needs to pull data from hundreds of Web pages, but not millions, so I opted to use threads. The script actually takes the list of things to look for as a set of XPath expressions on the command line, which makes it super flexible. Let me give you some hints for the parts that I found difficult.

First of all, here's how to install it. If you're using Ubuntu, then:
apt-get install libxslt1-dev libxml2-dev
# I also have python-dev, build-essential, etc. installed.
easy_install lxml
easy_install BeautifulSoup
If you're using MacPorts, do
port install py25-lxml
easy_install BeautifulSoup
The FAQ states that if you use MacPorts, you may encounter difficulties because you will have multiple versions of libxml and libxslt installed. For instance, the following may segfault:
python -c "import webbrowser; from lxml import etree, html"
Whereas the following shouldn't:
env DYLD_LIBRARY_PATH=/opt/local/lib \
python -c "import webbrowser; from lxml import etree, html"
You also have to be careful of thread safety issues. I was sharing an effectively read-only instance of the etree.XPath class between multiple threads, but that ended up causing bus errors. Ah, the joys of extensions written in C! It's a good reminder that the safest way to do multithreaded programming is to have each thread live in its own process ;)
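One way to stay with threads and still avoid sharing is to give each thread its own copy of the object. Below is a sketch using threading.local in Python 3. The per_thread helper is a hypothetical name, and I'm assuming (not claiming) that the crashes came from sharing the compiled object; in the crawler, the factory would be something like lambda: etree.XPath(expression), but here object stands in for it so the example has no dependencies.

```python
import threading

_local = threading.local()


def per_thread(key, factory):
    """Return this thread's own object for key, building it on first use."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    if key not in cache:
        cache[key] = factory()
    return cache[key]


results = {}


def worker(index):
    # object() is a stand-in for an expensive, non-thread-safe object.
    first = per_thread("title", object)
    second = per_thread("title", object)
    results[index] = (first, second)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Within a thread the object is reused; across threads it is not shared.
print(results[0][0] is results[0][1])
print(results[0][0] is results[1][0])
```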

lxml permits access to regular expressions from within XPath expressions. That's super useful. I had a hard time getting it working though. I forgot to pass in the right XML namespace in one part of the code. For some reason, I wasn't getting an error message. (As a general rule, I love it when software fails fast and complains loudly when I do something stupid.) Furthermore, my knowledge of XSLT was weak enough that I had a really hard time figuring out how to combine the XPath expression with the regex. Anyway, here's how to create an etree.XPath instance containing a regex:
from lxml import etree
XPATH_NAMESPACES = dict(re='http://exslt.org/regular-expressions')
xpath = etree.XPath("re:replace(//title/text(), 'From', '', 'To')",
                    namespaces=XPATH_NAMESPACES)
match = xpath(tree)
Anyway, lxml is frickin' awesome, and so is BeautifulSoup. With the two together, I can take really, really crappy HTML, and access it seamlessly.

Friday, October 17, 2008

Python: Permission denied: '/var/www/.python-eggs'

I have a Pylons app, and I got the following exception in my logs:
The following error occurred while trying to extract file(s) to the Python egg
cache:

[Errno 13] Permission denied: '/var/www/.python-eggs'

The Python egg cache directory is currently set to:

/var/www/.python-eggs

Perhaps your account does not have write access to this directory? You can
change the cache directory by setting the PYTHON_EGG_CACHE environment
variable to point to an accessible directory.
The problem is that the app was running as www-data (which was the user created for nginx and Apache). www-data's home directory is /var/www, but it doesn't have write access to it. (I'm afraid of allowing write access so that it can unpack eggs into that directory because that directory is the web root. In general, you should be careful of what you put in the web root.)

There are a few ways to address this problem. One is to make sure to always use --always-unzip when installing eggs. Another is to create a place for www-data to store its eggs by either changing its home directory or by setting the environment variable PYTHON_EGG_CACHE.
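For the PYTHON_EGG_CACHE option, the variable can also be set from Python itself, as long as that happens before anything tries to unpack an egg. Here's a rough sketch in Python 3 syntax; the directory name is a made-up example, so pick whatever location suits your setup.

```python
import os
import tempfile

# Hypothetical location: any directory the app's user can write to,
# as long as it is NOT under the web root.
egg_cache = os.path.join(tempfile.gettempdir(), "myapp-python-eggs")
os.makedirs(egg_cache, exist_ok=True)

# This has to run before anything tries to extract an egg.
os.environ["PYTHON_EGG_CACHE"] = egg_cache
print(os.environ["PYTHON_EGG_CACHE"])
```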

I decided the simplest thing to do was to simply create a new user with a proper home directory.
adduser myapp  # Used a throwaway password.
vipw # Set the shell to /bin/false.
Once I did that, I updated the app to run as the myapp user and made sure it had access to all the directories it needed.

Trac requires its own user. I figure it's reasonable for my app to have its own user too.

Wednesday, October 15, 2008

Web: Flock

I've been using Flock for a few months now, and I finally noticed that it's not open source. Time for me to switch to Firefox 3 ;)

Tuesday, September 30, 2008

Web: REST Verbs

I find it curious that REST enthusiasts insist on viewing the world through the five verbs GET, HEAD, PUT, POST, and DELETE. It reminds me of a story:

Back in the early '80s, I worked for DARPA. During the height of the Cold War, we were really worried about being attacked by Russia. My team was charged with designing a RESTful interface to a nuclear launch site; as far as technology goes, we were way ahead of our time.

Anyway, I wanted the interface to be "PUT /bomb". However, my co-worker insisted that it should be "DELETE /russia". One of my other buddies suggested that we compromise on something more mainstream like "POST /russia/bomb".

Finally, my boss put an end to the whole fiasco. He argued that any strike against the USSR would necessarily be in retaliation to an attack from them. Hence, he suggested that it be "GET /even", so that's what we went with.

You have to understand, back then, GETs with side effects weren't yet considered harmful.

IPv6 T-shirt


Here's a shout out to all my homies in the IPv6 world! If you can't read it, it says "There is no place like 127.0.0.1 (except maybe ::1)". Thanks go to Tarek Ziade (ziade.tarek at gmail.com) for the custom T-shirt design.

Books: Expert Python Programming

I just received my copy of Expert Python Programming. I was the technical editor, and I also wrote the foreword. This is the first time I've ever been mentioned on the front cover of a book, so I'm very excited!

I really enjoyed editing this book. It's the first expert-level book on Python I've read. For a long time, I considered writing one. Tarek beat me to the punch, and I think he did a fantastic job!

Thursday, September 25, 2008

A Python Programmer's Perspective on C#

Being a language fanatic, I was really excited when I met a really smart guy named Corey Kosak who gave me a tour of C#'s newest features. I had heard a lot of good things about C# lately, including that it had been strongly influenced by Haskell, which makes sense since Microsoft actually funds research on Haskell. Anyway, a lot of C#'s newest features are a lot more like Python than Java. Let me show you some examples.
Here is a sample C# iterator:
foreach(var x in CountForeverFrom(123).Take(5)) {
    Console.WriteLine(x);
}
In Python, I'd write:
for i in itertools.islice(itertools.count(123), 5):
    print i

C# also has iterators that are similar to Python's generators. Here is the C#:
public static IEnumerable<int> CountForeverFrom(int start) {
    while(true) {
        yield return start;
        start++;
    }
}
In Python, I'd write:
def count_forever_from(start):
    while True:
        yield start
        start += 1

C#'s LINQ syntax is similar to Python's generator expressions. Here's the C#:
var names=new[] { "bill", "bob", "tim", "tom", "corey",
                  "carl", "jj", "sophie" };
foreach(var x in (from name in names where name.Length>5 select name)) {
    Console.WriteLine(x);
}
In Python, I'd write:
names = ["bill", "bob", "tim", "tom", "corey", "carl", "jj", "sophie"]
for x in (name for name in names if len(name) > 5):
    print x

Here's a pretty amazing example that ties a lot of things together. It shows LINQ, a "group by" clause, an anonymous but strongly-typed class ("new {...}"), and even some type inferencing ("var item" and "item.FirstChar"):
var crap=from n in names
         group n by n[0]
         into g
         select new { FirstChar=g.Key,
                      Data=(from x in g select x).ToArray() };

foreach(var item in crap) {
    Console.WriteLine(
        "First group is {0} which has length {1}. The contents are:",
        item.FirstChar, item.Data.Length);
    foreach(var x in item.Data) {
        Console.WriteLine(x);
    }
}
Corey said that C#'s type inferencing is still pretty basic. It can figure out the type of a local variable, but it's definitely not as sophisticated as ML's type system. Also note that the anonymous class is more impressive than an inner class in Java because it didn't require you to use a name or an interface.

"Loosely translated", in Python I'd write:
crap = itertools.groupby(names, lambda n: n[0])
for first_char, subiter in crap:
    group = list(subiter)
    print "Group is %s which has length %s. The contents are:\n%s" % (
        first_char, len(group), "\n".join(group))
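One reason this is only a loose translation: itertools.groupby only groups consecutive items, so the Python version works here because the names happen to be clustered by first letter already. For arbitrary input, sorting by the same key first is the usual approach. A small sketch in Python 3 syntax, with a deliberately shuffled list of my own:

```python
import itertools

names = ["bill", "tim", "bob", "tom"]  # Deliberately not grouped by letter.


def first_char(name):
    return name[0]


# Without sorting, each run of consecutive items becomes its own group.
unsorted_keys = [key for key, _ in itertools.groupby(names, first_char)]

# Sorting by the same key first yields one group per letter.
grouped = {}
for key, subiter in itertools.groupby(sorted(names, key=first_char), first_char):
    grouped[key] = list(subiter)

print(unsorted_keys)  # ['b', 't', 'b', 't']
print(grouped)        # {'b': ['bill', 'bob'], 't': ['tim', 'tom']}
```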

C#'s Select method can be used like map in Python. Notice the use of an anonymous function!
var newInts=ints.Select(x => x*x);
In Python, I'd write:
new_ints = map(lambda x: x * x, ints)
The C# version runs lazily (i.e. "on the fly"), which means it only computes as much as requested. Python's map function isn't lazy. However, itertools.imap is.
The above example can also be written in LINQ style:
var newInts2=(from temp in ints select temp*temp);
In Python I'd write:
new_ints2 = (temp * temp for temp in ints)
Both the C# and the Python are lazy in this case.
If you don't want newInts to be lazy, you can do:
var intArray=newInts.ToArray();
or
var intList=new List<int>(newInts);
In Python, I'd write:
list(new_ints)
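To see the laziness for yourself, record when the work actually happens. This is a sketch in Python 3 syntax; the calls list and the square function are just scaffolding for the demonstration.

```python
ints = [1, 2, 3]
calls = []


def square(x):
    calls.append(x)  # Record when the work actually happens.
    return x * x


lazy = (square(x) for x in ints)
print(calls)       # [] because nothing has been computed yet
print(next(lazy))  # 1
print(calls)       # [1]
print(list(lazy))  # [4, 9]
```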

Since C# has anonymous functions, it should come as no surprise that it also has nested scopes and first-class functions (i.e. you can return a function). Although you can't nest named functions, it's easy enough to fake with anonymous functions:
private static Action<int> NestedFunctions() {
    int x=5;

    Action<int> addToX=newValue => {
        x+=newValue;
    };

    addToX(34);
    addToX(57);
    Console.WriteLine(x);

    return addToX;
}
In Python, I'd write:
def nested_functions():

    def add_to_x(new_value):
        add_to_x.x += new_value

    add_to_x.x = 5
    add_to_x(34)
    add_to_x(57)
    print add_to_x.x
    return add_to_x

C# also has closures:
private static void BetterExampleOfClosures() {
    var a=MakeAction(5);
    a();
    a();
    a();
}

private static Action MakeAction(int x) {
    return () => Console.WriteLine(x++);
}
Python has closures too. (There's a small caveat here. You can modify a variable that's in an outer scope, but there's no syntax for rebinding that variable. Python 3000 fixes this with the introduction of a nonlocal keyword. In the meantime, it's trivial to work around this problem.):
def better_example_of_closures():
    a = make_action(5)
    a()
    a()
    a()


def make_action(x):

    def action():
        print action.x
        action.x += 1

    action.x = x
    return action
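For comparison, here is roughly what the same closure looks like with the nonlocal keyword. It needs Python 3, and I've made action return the value instead of printing it so that the result is easy to check.

```python
def make_action(x):

    def action():
        nonlocal x  # Rebind the x that belongs to make_action.
        current = x
        x += 1
        return current

    return action


a = make_action(5)
print(a(), a(), a())  # 5 6 7
```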

C#'s generics are a bit more powerful than Java's generics since they don't suffer from erasure. I can't say I'm an expert on the subject. Nonetheless, I'm pretty sure you can't easily translate this example into Java. It creates a new instance of the same class as the instance that was passed as a parameter:
public abstract class Animal {
    public abstract void Eat();
}

public class Cow : Animal {
    public override void Eat() {
    }
}

public class Horse : Animal {
    public override void Eat() {
    }
}

public static T Func<T>(T a, List<T> list) where T : Animal, new() {
    return new T();
}
Corey told me that while C#'s generics are stronger than Java's generics, they still weren't as strong as C++'s generics since C++ generics act in an almost macro-like way.

Python has duck typing, so it doesn't have or need generics. Here's what I would write in Python:
class Animal():
    def eat(self):
        raise NotImplementedError

class Cow(Animal):
    def eat(self):
        pass

class Horse(Animal):
    def eat(self):
        pass

def func(a, list_of_a):
    return a.__class__()

Unfortunately, those are all the examples I have, but let me mention a few other things he showed me.

C# has a method called Aggregate that is the same as what other languages call inject or reduce.
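I don't have the C# for that one, but the Python counterpart is reduce. A tiny sketch in Python 3 syntax, where reduce lives in functools:

```python
from functools import reduce

ints = [1, 2, 3, 4]

# Fold the list into a single value, starting from 0.
total = reduce(lambda acc, x: acc + x, ints, 0)
print(total)  # 10
```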

C# has Lisp-like macros! You can pass an AST (abstract syntax tree) around, play with it, and then compile it at runtime.

C# has an interesting feature called "extension methods". They're somewhat like a mixin or reopening a class in Ruby. Using an extension method, you can set things up so that you can write "5.Minutes()". Unlike a mixin or reopening a class, they're pure syntax and do not actually affect the class. Hence, the above translates to something like "SomeClass.Minutes(5)". Although "5" looks like the object being acted upon, it's really just a parameter to a static method.

Another thing that impressed me was just how hard Visual Studio works to keep your whitespace neat. It doesn't just indent your code. It also adds whitespace within your expressions.

Ok, that's it. As usual, I hope you've enjoyed a look at another language. I'd like to thank Corey Kosak for sending me the C# code. If I've gotten anything wrong, please do not be offended, just post a correction in the comments.

Tuesday, September 23, 2008

Python: Debugging Memory Leaks

I wrote a simple tool that could take Web logs and replay them against a server in "real time". I was performance testing my Web app over the course of a day by hitting it with many days' worth of Web logs at the same time.

By monitoring top, I found out that it was leaking memory. I was excited to try out Guppy, but it didn't help. Neither did playing around with the gc module. I had too many objects coming and going to make sense of it all.

Hence, I fell back to a simple process of elimination. Divide-and-conquer! I would make a change to the code, then I would exercise the code in a loop and monitor the output from top for ever-increasing memory usage.
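I did the monitoring by eye with top, but the same loop-and-watch idea can be scripted. The sketch below is in Python 3 syntax and is not the tool I actually used: peak_memory_samples and leaky are hypothetical names, and it relies on the Unix-only resource module, whose ru_maxrss field reports the peak resident set size of the process.

```python
import resource


def peak_memory_samples(suspect, rounds=5, calls_per_round=1000):
    """Exercise suspect() in a loop and sample the process's peak RSS."""
    samples = []
    for _ in range(rounds):
        for _ in range(calls_per_round):
            suspect()
        usage = resource.getrusage(resource.RUSAGE_SELF)
        samples.append(usage.ru_maxrss)  # Kilobytes on Linux, bytes on OS X.
    return samples


leaked = []


def leaky():
    # A deliberately leaky stand-in for the code under suspicion.
    leaked.append("x" * 1000)


print(peak_memory_samples(leaky))
```

Since this measures the whole process rather than Python objects, it should notice leaks in C extensions too. If the numbers keep climbing round after round, the suspect is worth splitting in half and testing again.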

Several hours later, I was able to nail it down to this simple repro:
# This program leaks memory rather quickly.  Removing the charset
# parameter fixes it.

import MySQLdb
import sys


while True:
    connection = MySQLdb.connect(user='user', passwd='password',
                                 host='localhost', db='development',
                                 charset='utf8')
    try:
        cursor = connection.cursor()
        cursor.execute('select * from mytable where false')
        sys.stdout.write('.')
        sys.stdout.flush()
    finally:
        connection.close()
It makes sense that if the memory leak is at the C level, I might not be able to find it with Python-level tools. I'll go hunting tomorrow to see if the MySQLdb team has already fixed it, and if not, I'll submit a bug.

Friday, September 12, 2008

Software Engineering: Reuse Has Finally Arrived

Have you noticed that code reuse works these days? For a long time, software engineers struggled with the difficulty of reusing existing software, but it's now commonplace.

Let me give you some examples. I use Linux, Nginx, MySQL, and Python, not to mention a Web browser. These days, very few people need to write a custom kernel, Web server, database, or programming language to solve their particular problem. Sure it happens, but it's far more common to reuse something existing.

I even make use of an existing Web framework, Pylons, and an existing templating engine, Mako. Those things are often written from scratch, but I didn't need to. They were fine.

Even within my own code, I find plenty of places for reuse. Each of my clients has a pretty different setup. Their input formats and output formats are often pretty different, but by using a UNIXy "small tools that can be pieced together" approach, I usually write only a small amount of code when I get a new customer.

What has changed? Why is it suddenly so easy to reuse code? Has object-oriented programming finally paid off? Maybe. However, I think the more likely culprit is open source. Small companies are now viable because they have access to a huge corpus of freely available source code. They don't have to pay for it. They can look at the source if the documentation is inadequate. They can contribute bug fixes if they encounter bugs. They can even hack it in deep ways to accomplish special tasks. This is particularly common in the BSD world.

Last of all, testing and a strong dedication to docstrings help me with reusing my own code. Per agile thinking, I don't try to get it right the first time. If I need to add a feature to make use of code in an unexpected way, I can. The docstrings help me understand what's already there, and the tests help make sure I don't break it.

Thursday, September 11, 2008

Free Software: Stallman and Births

Since I have four children, I found the following quote from Stallman to be very disturbing:
Hundreds of thousands of babies are born every day. While the whole phenomenon is menacing, one of them by itself is not newsworthy. Nor is it a difficult achievement—even some fish can do it.
When a fellow Emacs developer said that he had just become a father, Stallman replied, "I am sorry to hear it."

Perhaps he was just trolling. Well, Stallman's right. Even fish can reproduce. However, even a dog knows not to piss on his friend's leg.

Python: Bambi Meets Godzilla

I just re-read a blog post that I read a couple years ago called Bambi Meets Godzilla, and I enjoyed it just as much the second time around. It's a brief history of Smalltalk, Java, Perl, Python, and Ruby, and it talks about why hype is vitally important. It also spends a fair amount of time critiquing Python's culture. If you haven't read it yet, stop reading my post, and go read it instead ;)

It reminds me of The UNIX-HATERS Handbook, which I also love. The funny thing is that to some degree, he's right about Python's culture. I've seen it with my own eyes.

Don't believe me? If I were to admit that I preferred Ruby on Rails over Django, how long do you think it would take for someone to flame me in a comment calling me either an idiot, a troll, a loser, or a heretic, or to say something like "You can recognize good design by the inanity of its detractors"?

Tuesday, September 09, 2008

Web: SilverStripe

A couple years ago, I built my church's website using Plone. I had to read most of "The Definitive Guide to Plone", but I did it and it worked.

Recently, I realized it was time to overhaul the website. My buddy is a Plone expert, and he told me I would have an easier time rebuilding the website than trying to migrate it since my version of Plone is so old. After two years, I had forgotten much of what I knew about Plone, and I knew that my book was out of date.

I went looking for something that didn't have quite the same learning curve. Plone is fantastic if you're a Plone expert, but I'm not. I just needed "an overly simplistic content management system." I tried out Drupal and Joomla, but for long and complicated reasons, some of which involved my ISP, I decided against them; I'm sure they're quite nice.

My buddy Leon Atkinson told me that he had seen a cool demo for SilverStripe. SilverStripe is PHP, but I decided to watch the video anyway. I was amazed. It's worth the five minutes it takes to watch the screencast.

I decided to actually try it out. Within three hours, I had installed it, read one page of the tutorial, and actually built out a decent portion of the website.

What I like about SilverStripe is that it's super simple. It uses TinyMCE, so you can get a lot done with just a WYSIWYG editor. However, it also encourages you to dip into flat files in the filesystem to edit templates. It's like the best of both worlds for me. I'm almost done with the website, and I still haven't actually had to code any PHP yet. TinyMCE is occasionally a bit fickle and I've seen weird caching problems, but overall, I'm really happy.

Permit me to wax philosophic. For a hundred different reasons, I prefer Python over PHP. However, there's no denying that there's a ton of really good projects written in PHP. Consider php forum, MediaWiki, WordPress, Flickr, etc.

I have a pet theory about why this is so. I care an awful lot about how I build something, but I don't care much at all about what I build. Hence, I can use Python to build whatever, and I'm happy. Seriously, I know a ton of stuff, and I write beautiful code, but I never have any interesting ideas about what to code ;)

Most people aren't like me. For them, a programming language is just a tool to build something they want. They don't care how it gets coded as long as it does get coded. Product people often build beautiful things using not-so-beautiful code. I'm not saying that PHP can't be beautiful. I'm just saying that sometimes it doesn't matter.

Perhaps I'm just feeling a bit bipolar.

Wednesday, September 03, 2008

Python for Unix and Linux System Administration

The good news is that I was a lead technical editor of Python for Unix and Linux System Administration, which just came out.

The bad news is that as my wife called me to tell me that my copy of the book had arrived, I noticed that someone had clipped my car in the parking lot and torn off part of the bumper. It looks like I'll have to replace the whole bumper.

C'est la vie.

Anyway, about the book, it's exactly what the title says it is. If you have a computer science background, this book is not for you. However, if you're a sysadmin trying to learn Python, it's perfect. In fact, when I think of all the sysadmins I've met who do a bit of scripting, this book matches them perfectly.

Tuesday, August 26, 2008

Linux: Trac and Subversion on Ubuntu with Nginx and SSL

I just set up Trac and Subversion on Ubuntu. I decided to proxy tracd behind Nginx so that I could use SSL. I used ssh to access svn. I got email and commit hooks for everything working. I used runit to run tracd. In all, it took me about four days. Here's a brain dump of my notes:
Set up Trac and Subversion:
Setup runit:
touch /etc/inittab # Latest Ubuntu uses "upstart" instead of the sysv init.
apt-get install runit
initctl start runsvdir
initctl status runsvdir
While still on oldserver, I took care of some Trac setup:
Setup permissions:
See: http://trac.edgewall.org/wiki/TracPermissions
trac-admin:
permission list
permission remove anonymous '*'
permission remove authenticated '*'
permission add authenticated BROWSER_VIEW CHANGESET_VIEW FILE_VIEW LOG_VIEW MILESTONE_VIEW REPORT_SQL_VIEW REPORT_VIEW ROADMAP_VIEW SEARCH_VIEW TICKET_CREATE TICKET_MODIFY TICKET_VIEW TIMELINE_VIEW WIKI_CREATE WIKI_MODIFY WIKI_VIEW
Note: The above matches the default, but with no anonymous access.
permission add jj TRAC_ADMIN
Went through the admin section in the GUI and setup everything.
Fixed inconsistent version field ("" vs. None):
sqlite3 db/trac.db:
update ticket set version = null;
apt-get install subversion-tools python-subversion
apt-get install python-pysqlite2
easy_install docutils:
/usr/bin/rst2newlatex.py
/usr/bin/rst2xml.py
/usr/bin/rstpep2html.py
/usr/bin/rst2s5.py
/usr/bin/rst2latex.py
/usr/bin/rst2pseudoxml.py
/usr/bin/rst2html.py
easy_install pygments:
/usr/bin/pygmentize
easy_install pytz
Setup users:
Used "adduser" to create users.
Grabbed their passwords from /etc/shadow on oldserver.
addgroup committers
Added the users to the committers group.
Setup svn:
mkdir -p /var/lib/svn
svnadmin create /var/lib/svn/example
Copied our svn repository db from oldserver to /var/lib/svn/example/db.
chgrp -R committers /var/lib/svn/example/db
Setup trac:
easy_install Trac:
/usr/bin/trac-admin
/usr/bin/tracd
+Genshi-0.5.1-py2.5-linux-i686.egg
mkdir -p /var/lib/trac
cd /var/lib/trac
trac-admin example initenv:
I pointed it at the svn repo path, but otherwise used the default
settings.
Copied stuff from our trac instance on oldserver to
/var/lib/trac/example/attachments and /var/lib/trac/example/db.
I chose not to keep our trac.ini since Trac has changed so much.
I chose not to keep our passwords file since they were too easy.
htpasswd -c /var/lib/trac/example/conf/users.htpasswd jj
Edited /var/lib/trac/example/conf/trac.ini.
adduser trac # Used a throwaway password.
vipw # Changed home to /var/lib/trac and set shell to /bin/false.
chown -R trac:trac /var/lib/trac # Per the instructions. Weird.
find /var/lib/trac/example/attachments -type d -exec chmod 755 '{}' \;
find /var/lib/trac/example/attachments -type f -exec chmod 644 '{}' \;
trac-admin /var/lib/trac/example resync
Setup trac under runit:
Setup logging:
mkdir -p /etc/sv/trac/log
mkdir -p /var/log/trac

cat > /etc/sv/trac/log/run << __END__
#!/bin/sh

exec 2>&1
exec chpst -u trac:trac svlogd -tt /var/log/trac
__END__

chmod +x /etc/sv/trac/log/run
chown -R trac:trac /var/log/trac
Setup trac:

cat > /etc/sv/trac/run << __END__
#!/bin/sh

exec 2>&1
exec chpst -u trac:trac tracd -s --hostname=localhost --port 9115 --basic-auth='*',/var/lib/trac/example/conf/users.htpasswd,'24 Hr. Diner' /var/lib/trac/example
__END__

chmod +x /etc/sv/trac/run
ln -s /etc/sv/trac /etc/service/
Setup Nginx to proxy to Trac and handle SSL:
cd /etc/nginx
openssl req -new -x509 -nodes -out development.example.com.crt \
-keyout development.example.com.key
Edit sites-available/default.
/etc/init.d/nginx restart
Setup post-commit hook:
cd /var/lib/svn/example/hooks
wget http://trac.edgewall.org/browser/trunk/contrib/trac-post-commit-hook?format=txt \
-O trac-post-commit-hook
chmod +x trac-post-commit-hook
cp post-commit.tmpl post-commit
chmod +x post-commit
Edited post-commit.
mkdir /var/lib/trac/example/.egg-cache
chown -R trac:committers \
/var/lib/trac/example/.egg-cache \
/var/lib/trac/example/db
chmod 775 /var/lib/trac/example/.egg-cache \
/var/lib/trac/example/db
chmod 664 /var/lib/trac/example/db/trac.db
Setup trac notifications:
Edit /var/lib/trac/example/conf/trac.ini.
sv restart trac
Here's the most important part of Nginx's sites-available/default:
# Put Trac on HTTPS on port 9443.
server {
    listen 9443;
    server_name development.example.com;

    access_log /var/log/nginx/development.access.log;
    error_log /var/log/nginx/development.error.log;

    ssl on;
    ssl_certificate /etc/nginx/development.example.com.crt;
    ssl_certificate_key /etc/nginx/development.example.com.key;

    ssl_session_timeout 5m;

    ssl_protocols SSLv2 SSLv3 TLSv1;
    ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP;
    ssl_prefer_server_ciphers on;

    location / {
        root html;
        index index.html index.htm;
        proxy_pass http://127.0.0.1:9115;
    }
}
Here's the most important part of svn's post-commit hook:
REPOS="$1"
REV="$2"
MAILING_LIST="commits@example.com"
TRAC_ENV="/var/lib/trac/example"

/usr/share/subversion/hook-scripts/commit-email.pl "$REPOS" "$REV" \
"$MAILING_LIST"
/usr/bin/python /var/lib/svn/example/hooks/trac-post-commit-hook \
-p "$TRAC_ENV" -r "$REV"
Here are the changes I made to trac.ini:
===================================================================
--- var/lib/trac/example/conf/trac.ini (revision 464)
+++ var/lib/trac/example/conf/trac.ini (revision 475)
@@ -58,13 +58,13 @@
mime_encoding = base64
smtp_always_bcc =
smtp_always_cc =
-smtp_default_domain =
-smtp_enabled = false
-smtp_from = trac@localhost
+smtp_default_domain = example.com
+smtp_enabled = true
+smtp_from = trac@development.example.com
smtp_from_name =
smtp_password =
smtp_port = 25
-smtp_replyto = trac@localhost
+smtp_replyto = ops@example.com
smtp_server = localhost
smtp_subject_prefix = __default__
smtp_user =
@@ -152,7 +152,7 @@
authz_file =
authz_module_name =
auto_reload = False
-base_url =
+base_url = https://development.example.com:9443
check_auth_ip = true
database = sqlite:db/trac.db
default_charset = iso-8859-15
@@ -166,7 +166,7 @@
repository_type = svn
show_email_addresses = false
timeout = 20
-use_base_url_for_redirect = False
+use_base_url_for_redirect = True

[wiki]
ignore_missing_pages = false
Wow, that was painful!

Monday, August 25, 2008

Books: Basics of Compiler Design

I started reading Basics of Compiler Design. I think, perhaps, it might have helped if I had actually taken the course rather than simply trying to read the book.

Here's a simple rule of thumb:
Never use three pages of complicated mathematics to explain that which can be explained using either a simple picture or a short snippet of pseudo code.
The section on "Converting an NFA to a DFA" had me at the point of tears. After a couple hours, I finally understood it. However, even after I understood it, I knew I could do a better job teaching it. A little bit of Scheme written by the SICP guys would have been infinitely clearer.

I hate to be harsh, but it seemed like the author was just having a good time playing with TeX. I picked this book because it was short and didn't dive into code too much. What I found is that it uses math instead of code. I'd prefer code.

The worst part of reading this book by myself is that even if I make it to the end, I won't know if I truly mastered the material because I won't have a compiler to show for my work. After all, there's no one around to grade my written assignments, and the book doesn't actually take you all the way through writing a real compiler.

Thursday, August 21, 2008

Humor: I've Been Simpsonized!

Thanks to Dean Fraser (jericho at telusplanet dot net) at Springfield Punx for the artwork.

Books: The Art of UNIX Programming

I just finished reading The Art of UNIX Programming. In short, I liked it a lot.

Here are a few fun quotes:
Controlling complexity is the essence of computer programming -- Brian Kernighan [p. 14]
Software design and implementation should be a joyous art, a kind of high-level play...To do Unix philosophy right, you need to have (or recover) that attitude. [p. 27]
Microsoft actually admitted publicly that NT security is impossible in March 2003. [p. 69, Unfortunately, the URL he provided no longer works.]
One good test for whether an API is well designed is this one: if you try to write a description of it in purely human language (with no source-code extracts allowed), does it make sense? It is a very good idea to get into the habit of writing informal descriptions of your APIs before you code them. [p. 85, this is a good explanation for why I write docstrings before I write code.]
C++ is anticompact--the language's designer has admitted that he doesn't expect any one programmer to ever understand it all. [p. 89]
One thing Raymond does very well is document things that the rest of us implicitly assume. For instance, he described the various cultures revolving around UNIX. Now I know why I'm so mixed up! I sympathize with several different cultures such as:
  • Old-school UNIX hackers
  • The Open Source movement
  • The Free Software movement
  • BSD hackers
  • MIT Lisp hackers
  • The IETF
My copy of the book is from 2004, and as timeless as this book is, I still wish I could get a "post-modern" opinion on several topics. For instance:
  • Linux is so commonplace these days, what should we do now that everyone takes it for granted?
  • OS X has really won the hearts of a lot of developers. Is there any hope that the rest of the world will move closer to the Free Software ideal? (Please see my post A Hybrid World of Open and Closed Source Software.)
  • I'd love to get his take on Eclipse, TextMate, and modern-day Emacs and Vim.
  • I'd also love to get his opinions on Ruby and Rails.
In general, I think it's a fair critique that there weren't enough critiques of Unix. He mostly saved them until the last chapter. I would have enjoyed more critiques throughout. As much as I love Unix, one of my favorite books is The UNIX-HATERS Handbook.

Similarly, all of his discussion on Emacs vs. Vi seemed a bit biased. I know it's hard not to be biased on this topic, but I was a bit frustrated when he called all of Emacs' complexity "optional complexity" and all of Vi's complexity "accidental and ad-hoc complexity." Because of his statements I even gave Emacs another shot. However, as usual, I was reminded that in theory Emacs is my favorite editor, but in practice I'm a Vim user.

Nonetheless, I do have high praise for this book. When I was totally burnt out and couldn't code for two months, I found this book refreshing and relaxing. I owe Raymond my thanks :)

Tuesday, August 19, 2008

Python: the csv module and mysqlimport

Here's one way to get Python's csv module and mysqlimport to play nicely with one another.

When exporting something with the csv module, use:
csv.writer(fileobj, dialect='excel-tab', lineterminator='\n')
When importing with mysqlimport, use:
mysqlimport \
--user=USERNAME \
--password \
--columns=COLUMNS \
--compress \
--fields-optionally-enclosed-by='"' \
--fields-terminated-by='\t' \
--fields-escaped-by='' \
--lines-terminated-by='\n' \
--local \
--lock-tables \
--verbose \
DATABASE INPUT.tsv
In particular, the "--fields-escaped-by=''" took me a while to figure out. With it set, the csv module and mysqlimport agree that '"' is escaped via '""' rather than '\"'.
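To see the quoting convention concretely, here is a small sketch in Python 3 syntax (the post's own snippets are Python 2). The helper name to_tsv and the sample row are made up for illustration. It shows the excel-tab dialect doubling an embedded quote instead of backslash-escaping it, which is the behavior the empty --fields-escaped-by setting is meant to match.

```python
import csv
import io


def to_tsv(rows):
    """Serialize rows with the writer settings suggested in the post."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, dialect="excel-tab", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample row: one field holds a quote, one holds a tab.
sample = [['say "hi"', "tab\there", "plain"]]
text = to_tsv(sample)

# Reading the text back with the same dialect recovers the original fields.
round_trip = list(csv.reader(io.StringIO(text, newline=""), dialect="excel-tab"))
print(repr(text))
print(round_trip)
```

Fields containing a quote or a tab get wrapped in double quotes, and plain fields are left alone, so the file stays readable by eye.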

Wednesday, August 13, 2008

Math: pi

As of today, I am roughly 33π × 10^7 seconds old.
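The joke works because π × 10^7 seconds is very close to one year. A quick check, assuming a Julian year of 365.25 days (my choice of constant, not something stated in the post):

```python
import math

# Assumption: use the Julian year (365.25 days) as "one year".
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

age_seconds = 33 * math.pi * 10**7
age_years = age_seconds / SECONDS_PER_YEAR

print("%.0f seconds is about %.2f years" % (age_seconds, age_years))
```

That comes out a little under 33 years.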

Saturday, August 09, 2008

Linux: LinuxWorld, BeOS, Openmoko

I went to LinuxWorld Conference & Expo again this year like I always do. My mentor Leon Atkinson and I always go together. Here are a few notes.

There was a guy who had a booth for the New York Times. I asked him what it had to do with Linux. He said, "Nothing, but I've sold about 40 subscriptions in the last two days and made about $2000. Wanna buy a subscription?" I felt like I had been hit with a 5lb chunk of pink meat right in the face. There was another booth selling office chairs and another selling (I think) foot massages.

I didn't see Novell, HP, O'Reilly, Slashdot, GNOME, KDE, or a ton of other booths I expected to see. I talked with the lead editor at another "very large, but purposely unnamed" publisher, and he said that they wouldn't be back next year either.

There was a pretty cool spherical sculpture made of used computer parts. I was also pleased to see a bunch of guys putting together used computers and loading Linux on them for schools.

Other than that, I think LinuxWorld may be dead or dying. The editor of that publishing company said that this happens to conferences. They "run their course." Since Linux and FOSS were almost a religious experience for me when I was in college, I'm sorry to see LinuxWorld fizzle out.

I talked to the Haiku guys. I've been watching them. They're trying to rebuild BeOS. I knew that Palm bought Be's IP, so I asked them whatever happened to BeOS's source code. A very knowledgeable person gave me the whole rundown. The summary is that a company now owns it but can't release it for legal reasons. There's too much software in there that they can't get a clear copyright on, and they also have proprietary codecs that they're not allowed to release. He said that there was nothing to fear; Haiku is coming along nicely. They have some of the original BeOS developers, and they are staying true to the super-finely threaded nature of the original BeOS kernel. Unfortunately, it's not yet ready for production use, but they've come a long way.

I talked to a guy at the Openmoko booth. I told him that I'd be very interested in running Openmoko hardware, which is fully open, with Android, which I'm guessing will be relatively polished by the end of the year. He said that they had been talking to Google about it, but it's still up to Google to decide on a timetable. Unfortunately, Openmoko still isn't ready for everyday use yet. I'm waiting hopefully.

Wednesday, August 06, 2008

SICP: Truly Conquering SICP

This guy is my hero:
I’ve written 52 blog posts (not including this one) in the SICP category, spread over 10 months...Counting with the cloc tool (Count Lines Of Code), the total physical LOC count for the code I’ve written during this time: 7,300 LOC of Common Lisp, 4,100 LOC of Scheme.
Geez, and I was excited when I finished the videos. I feel so inadequate ;)

Python: sort | uniq -c via the subprocess module

Here is "sort | uniq -c" pieced together using the subprocess module:
from subprocess import Popen, PIPE

p1 = Popen(["sort"], stdin=PIPE, stdout=PIPE)
p1.stdin.write('FOO\nBAR\nBAR\n')
p1.stdin.close()
p2 = Popen(["uniq", "-c"], stdin=p1.stdout, stdout=PIPE)
for line in p2.stdout:
    print line.rstrip()
Note, I'm not bothering to check the exit status. You can see my previous post about how to do that.

Now, here's the question. Why does the program freeze if I put the two Popen lines together? I don't understand why I can't set up the pipeline, then feed it data, then close the stdin, and then read the result.
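A likely explanation, offered as an assumption rather than a confirmed diagnosis: in Python 2, Popen defaults to close_fds=False, so when uniq is started second it inherits a copy of the write end of sort's stdin pipe. Closing p1.stdin in the parent is then not enough for sort to see end-of-file. The sketch below uses Python 3 syntax and passes close_fds=True explicitly (it is already the default on POSIX in Python 3), so the whole pipeline can be built before any data is written. The function name sort_uniq_count is made up for illustration.

```python
from subprocess import Popen, PIPE


def sort_uniq_count(text):
    """Run text through `sort | uniq -c`, building the whole pipeline first."""
    p1 = Popen(["sort"], stdin=PIPE, stdout=PIPE, close_fds=True)
    p2 = Popen(["uniq", "-c"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
    p1.stdout.close()  # Only uniq should hold the read end of sort's output.

    p1.stdin.write(text.encode("utf-8"))
    p1.stdin.close()  # sort sees EOF: no other process holds this pipe open.

    # Each output line looks like "      2 BAR"; split it into [count, value].
    rows = [line.decode("utf-8").split() for line in p2.stdout]
    p2.stdout.close()
    p1.wait()
    p2.wait()
    return rows


for count, value in sort_uniq_count("FOO\nBAR\nBAR\n"):
    print(count, value)
```

As in the snippet above, the exit statuses are collected but not checked.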

Tuesday, August 05, 2008

Python: Memory Conservation Tip: Temporary dbms

A dbm is an on-disk hash mapping from strings to strings. The shelve module is a simple wrapper around the anydbm module that takes care of pickling the values. It's nice because it mimics the dict API so well. It's simple and useful. However, one thing that isn't so simple is trying to use a temporary file for the dbm.

The problem is that shelve uses anydbm, which uses whichdb. When you create a temporary file securely, it hands you an open file handle. There's no secure way to get a temporary file that isn't opened yet. Since the file already exists, whichdb tries to figure out what format it uses. Since it doesn't contain anything yet, you get a big explosion.

The solution is to use a temporary directory. The next question is, how do you make sure that temporary directory gets cleaned up without reams of code? Well, just like with temporary files, you can delete the temporary directory even if your code still has an open file handle referencing a file in the temporary directory. Don't ya just love UNIX ;)

Here's some code:
import os
import shelve
import shutil
from tempfile import mkdtemp

tmpd = mkdtemp('', 'myprogram-')
filename = os.path.join(tmpd, 'mydbm')
dbm = shelve.open(filename, flag='n')
shutil.rmtree(tmpd)
# I can continue to use dbm for as long as I'd like.
On my system, the shelve module ends up using the dbm module which creates two files. Furthermore, my tests end up exercising this code in four different places. Despite all of that, since the tmpd is removed immediately, no matter how fast I type ls -l, I never even see the directory ;)
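The delete-while-open trick depends on UNIX semantics and on a dbm backend that keeps its files open. Where either is in doubt, a more conservative variant is to keep the temporary directory around and remove it when you're done. This is a different technique from the one above, sketched in Python 3 syntax; the name temporary_shelf is hypothetical.

```python
import os
import shelve
import shutil
from contextlib import contextmanager
from tempfile import mkdtemp


@contextmanager
def temporary_shelf(prefix="myprogram-"):
    """Yield a shelf stored in a fresh temp directory; clean up on exit."""
    tmpd = mkdtemp(prefix=prefix)
    try:
        shelf = shelve.open(os.path.join(tmpd, "mydbm"), flag="n")
        try:
            yield shelf
        finally:
            shelf.close()
    finally:
        # Remove the directory and whatever files the dbm backend created.
        shutil.rmtree(tmpd, ignore_errors=True)


with temporary_shelf() as dbm:
    dbm["answer"] = [4, 2]
    stored = dbm["answer"]
print(stored)
```

The cost is that the directory is visible on disk for the lifetime of the with block, and it is left behind if the process is killed outright.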

Monday, August 04, 2008

Python: Memory Conservation Tip: sort Tricks

The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.

Suppose you have a bunch of pairs:
a b
b c
a c
a c
b d
...
You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:
referrers[left].add(right)
referrers[right].add(left)
Notice, I used a set because I only want to know if two things are related, not how many times they are related.
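For reference, the in-memory version can be sketched as follows (Python 3 syntax; the function name build_referrers and the sample pairs are just for illustration):

```python
from collections import defaultdict


def build_referrers(pairs):
    """Map each atom to the set of atoms it appears with."""
    referrers = defaultdict(set)
    for left, right in pairs:
        referrers[left].add(right)
        referrers[right].add(left)
    return referrers


pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "c"), ("b", "d")]
referrers = build_referrers(pairs)
for atom in sorted(referrers):
    print(atom, sorted(referrers[atom]))
```

The duplicate pair ("a", "c") has no effect, since adding an element to a set twice is a no-op.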

My situation is strange. It's small enough so that I don't need to use a cluster. However, it's too big for such a dict to fit into memory. It's not too big for the data to fit in /tmp.

The question is, how do you get this sort of a hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use sort.

If you open up a two-way pipe to the sort command, you can output all the pairs, and then later read them back in:
a b
a c
b c
b d
...
sort is telling me that a is related to b and c, b is related to c and d, etc. Notice, it also removed the duplicate pair a c, and took care of the temp file handling. Best of all, you can stream data to and from the sort command. When you're dealing with a lot of data, you want to stream things as much as possible.

Now that I've shown you the general idea, let me give you a couple more hints. First of all, to shell out to sort, I use:
from subprocess import Popen, PIPE
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE)
I like to use the csv module when working with tab-separated data, so I create a reader and writer for pipe.stdout and pipe.stdin respectively. You may not need to in your situation.

When you're done writing to sort, you need to tell it you're done:
pipe.stdin.close()  # Tell sort we're ready.
Now here's the next trick. I don't want the rest of the program to worry about the details of piping out to sort. The rest of the program should have a nice clean iterator to work with. Remember, I'm streaming, and the part of the code that's reading the data from the pipe is far away.

Hence, instead of passing it a reference to the pipe, I instead send it a reference to a generator. That way the generator can do all the munging necessary, and no one even needs to know that I'm using a pipe.

The last trick is that when I read:
a b
a c
I need to recognize that b and c both belong to a. Hence, I use a generator I wrote called groupbysorted.

Putting it all together, the generator looks like:
def first((a, b)): return a
def second((a, b)): return b

def get_references():
    """This is a generator that munges the results from sort -u.

    When the streaming is done, make sure sort exited cleanly.

    """
    for (x, pairs) in groupbysorted(reader, keyfunc=first):
        yield (x, map(second, pairs))
    status = pipe.wait()
    if status != 0:
        # stderr wasn't piped above, so pipe.stderr is None; report only the status.
        raise RuntimeError("sort exited with status %s" % status)
Now, the outside world has a nice clean iterator to work with that will generate things like:
(a, [b, c])
(b, [c, d])
...
The pipe will get cleaned up as soon as the iterator is done.
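Here is a self-contained sketch of the whole idea in Python 3 syntax. Since groupbysorted isn't shown in this post, the sketch substitutes itertools.groupby from the standard library, which also groups adjacent items sharing a key. The function name related_atoms is made up, and forcing LC_ALL=C for a predictable sort order is my own assumption.

```python
import csv
import os
from itertools import groupby
from operator import itemgetter
from subprocess import Popen, PIPE


def related_atoms(pairs):
    """Stream pairs through `sort -u` and yield (atom, [related atoms])."""
    env = dict(os.environ, LC_ALL="C")  # Assumption: plain byte-order sorting.
    pipe = Popen(["sort", "-u"], stdin=PIPE, stdout=PIPE,
                 universal_newlines=True, env=env)

    writer = csv.writer(pipe.stdin, dialect="excel-tab", lineterminator="\n")
    writer.writerows(pairs)
    pipe.stdin.close()  # Tell sort we're done writing.

    reader = csv.reader(pipe.stdout, dialect="excel-tab")
    # groupby only merges adjacent rows, which is fine because sort ordered them.
    for left, group in groupby(reader, key=itemgetter(0)):
        yield left, [right for _, right in group]

    pipe.stdout.close()
    status = pipe.wait()
    if status != 0:
        raise RuntimeError("sort exited with status %s" % status)


pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "c"), ("b", "d")]
for atom, related in related_atoms(pairs):
    print(atom, related)
```

Callers just iterate; nothing outside the generator needs to know a subprocess is involved.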

Python: Memory Conservation Tip: Nested Dicts

I'm working with a large amount of data, and I have a data structure that looks like:
pair_counts[(a, b)] = count
It turns out that in my situation, I can save memory by switching to:
pair_counts[a][b] = count
Naturally, the normal rules of premature optimization apply: I wrote for readability, waited until I ran out of memory, did lots of profiling, and then optimized as little as possible.

In my small test case, this dropped my memory usage from 84mb to 61mb.
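A minimal sketch of the nested form, in Python 3 syntax with a made-up function name and sample data. Presumably the savings come from not allocating a separate tuple object for every (a, b) key, though that is my guess and I haven't measured it here.

```python
from collections import defaultdict


def count_pairs_nested(pairs):
    """Count occurrences of each pair as pair_counts[a][b]."""
    pair_counts = defaultdict(lambda: defaultdict(int))
    for a, b in pairs:
        pair_counts[a][b] += 1
    return pair_counts


pairs = [("a", "b"), ("a", "c"), ("a", "c"), ("b", "d")]
pair_counts = count_pairs_nested(pairs)
print(pair_counts["a"]["c"])
```

Whether this helps will depend on how often the same first element repeats, so it is worth profiling on real data before committing to it.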

Saturday, August 02, 2008

Python: Memory Conservation Tip: intern()

I'm working with a lot of data, and running out of memory is a problem. When I read a line of data, I've often seen the same data before. Rather than have two pointers that point to two separate copies of "foo", I'd prefer to have two pointers that point to the same copy of "foo". This makes a lot of sense in Python since strings are immutable anyway.

I knew that this was called the flyweight design pattern, but I didn't know if it was already implemented somewhere in Python. (Strictly speaking, I thought it was called the "flywheel" design pattern, and my buddy Drew Perttula corrected me.)

My first attempt was to write code like:
>>> s1 = "foo"
>>> s2 = ''.join(['f', 'o', 'o'])
>>> s1 == s2
True
>>> s1 is s2
False
>>> identity_cache = {}
>>> s1 = identity_cache.setdefault(s1, s1)
>>> s2 = identity_cache.setdefault(s2, s2)
>>> s1 == 'foo'
True
>>> s1 == s2
True
>>> s1 is s2
True
This code looks up the word "foo" by value and returns the same instance every time. Notice, it works.

However, Monte Davidoff pointed out that this is what the intern builtin is for. From the docs:
Enter string in the table of ``interned'' strings and return the interned string - which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup - if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys. Changed in version 2.3: Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.
Here it is in action:
>>> s1 = "foo"
>>> s2 = ''.join(['f', 'o', 'o'])
>>> s1 == s2
True
>>> s1 is s2
False
>>> s1 = intern(s1)
>>> s2 = intern(s2)
>>> s1 == 'foo'
True
>>> s1 == s2
True
>>> s1 is s2
True
Well, did it work? My program still functions, but I didn't get a tremendous savings in memory. It turns out that I don't have enough dups, and that's not where I'm spending all my memory anyway. Oh well, at least I learned about the intern() function.
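In Python 3, the builtin moved to sys.intern. Here is a small sketch of applying it while reading data; the function name intern_all and the sample values are hypothetical.

```python
import sys


def intern_all(values):
    """Return the values with equal strings collapsed to one shared object."""
    return [sys.intern(value) for value in values]


# Build two equal strings at runtime, as the examples above do with join().
raw = ["".join(["f", "o", "o"]), "".join(["f", "o", "o"]), "bar"]
interned = intern_all(raw)
print(interned[0] is interned[1])
```

As the quoted docs say, you have to hold on to the returned string to benefit, which is why the function returns a new list rather than interning in place.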