Friday, January 26, 2007

CSS: Hacking Copy-and-Paste

If you copy-and-paste the contents of an HTML table into a text editor or Excel, it "does the right thing". This is a useful feature. What happens, though, if you want a column to appear in the pasted copy, but not actually take up space on the screen? For instance, you might want to output the URL for a link in addition to the anchor text, with the URL in a separate column. Sure, you can generate a report in CSV format, but the following trick can be bolted onto existing tables. Here's the HTML and CSS:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">

<title>Hacking Cut-and-Paste</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta>
<style type="text/css">
/*
 * The "cutandpaste" class makes things invisible, unless you cut-and-paste
 * or print them.
 */
@media screen {
  .cutandpaste {
    display: none;
  }
}
</style>
<table>
  <tr><th></th><th class="cutandpaste">SKU</th></tr>
  <tr><td><a href="#">Shoes</a></td><td class="cutandpaste">192</td></tr>
  <tr><td><a href="#">Bikes</a></td><td class="cutandpaste">257</td></tr>
</table>
Here's what it looks like in my browser vs. what it looks like when you copy-and-paste it into an editor:

HTML: Browser Bug?

If you put an h1 inside a div, the spacing above the h1 caused by the h1 will go outside the div. However, subtle variations will make the spacing go inside the div. I'm confused. I would call this a browser bug, but it seems to be somewhat consistent among browsers. Here's a simple test case:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">

<title>H1 Whitespace Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta>
<style type="text/css">
#header {
  background-color: yellow;
  height: 20px;
}

#body {
  background-color: yellow;

  /* If you uncomment this, it behaves as I would expect.
  border: 1px solid black;
  */
}
</style>
<div id="header"></div>
<div id="body">
  <!--If you use text instead of an h1 here, it behaves as I would expect.-->
  <h1>This is an h1.</h1>
  <p>This is some other content.</p>
</div>
Note that both the HTML and CSS pass validation tests.

Here's what it looks like as is:

Here's what it looks like with a border around the div. Notice it loses the white space:

Here's what it looks like if you don't use an h1. Again, it loses the white space:

What's up with that? Is this a browser bug? Is it a spec bug? Is there something ugly that I don't know about h1s and the box model?

Tuesday, January 23, 2007

Python: Running __main__ from Another Script

Let's say you have a script, temp.py:
def f(n):
    print "n = %s" % n

if __name__ == '__main__':
    f(5)
How do you run the __main__ without calling the script on the command line? That is, how do you call it from within another Python script? Simply importing it isn't good enough. Here's how to run the other module's __main__:
>>> import temp
>>> execfile(temp.__file__)
n = 5
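
Note that execfile only exists in Python 2. As a rough sketch of the same idea in Python 3, the standard library's runpy.run_path can execute a file with __name__ set to "__main__". The script below is a stand-in for temp.py written to a temporary directory so that the example is self-contained; run_main and SCRIPT are names I made up for illustration.

```python
import contextlib
import io
import os
import runpy
import tempfile

# Stand-in for the temp.py script from the post (Python 3 print syntax).
SCRIPT = '''
def f(n):
    print("n = %s" % n)

if __name__ == '__main__':
    f(5)
'''


def run_main(path):
    """Execute the file at `path` as if it had been run from the command line."""
    # run_name="__main__" is what makes the script's __main__ guard fire.
    return runpy.run_path(path, run_name="__main__")


with tempfile.TemporaryDirectory() as tmpdir:
    script_path = os.path.join(tmpdir, "temp.py")
    with open(script_path, "w") as fh:
        fh.write(SCRIPT)

    # Capture stdout so the script's output can be inspected afterwards.
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        module_globals = run_main(script_path)

output = captured.getvalue()
print(output, end="")
```

run_path returns the globals the script ended up with, so the caller can also get at f afterwards without importing the module a second time.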

Tuesday, January 16, 2007

Python: groupbysorted

Updated: It turns out that I was wrong about itertools.groupby. It works exactly the same as this code, so you should use it instead.

This is a variation of itertools.groupby.

The itertools.groupby iterator assumes that the input is not sorted but will fit in memory. This iterator has the same API, but assumes the opposite.

__docformat__ = "restructuredtext"

class peekable:

    """Make an iterator peekable.

    This is implemented with an eye toward simplicity.  On the downside,
    you can't do things like peek more than one item ahead in the
    iterator.  On the bright side, it doesn't require anything from
    itertools, etc., so it's less likely to encounter strange bugs,
    which occasionally do happen.

    Example usage::

        >>> numbers = peekable(range(6))
        >>> numbers.peek()
        0
        >>> for i in numbers:
        ...     print i
        0
        1
        2
        3
        4
        5

    """

    _None = ()  # Perhaps None is a valid value.

    def __init__(self, iterable):
        self._iterable = iter(iterable)
        self._buf = self._None

    def __iter__(self):
        return self

    def _is_empty(self):
        return self._buf is self._None

    def peek(self):
        """Peek at the next element.

        This may raise StopIteration.

        """
        if self._is_empty():
            self._buf = self._iterable.next()
        return self._buf

    def next(self):
        if self._is_empty():
            return self._iterable.next()
        ret = self._buf
        self._buf = self._None
        return ret

def groupbysorted(iterable, keyfunc=None):

    """This is a variation of itertools.groupby.

    The itertools.groupby iterator assumes that the input is not sorted
    but will fit in memory.  This iterator has the same API, but assumes
    the opposite.

    Example usage::

        >>> for (key, subiter) in groupbysorted(
        ...     ((1, 1), (1, 2), (2, 1), (2, 3), (2, 9)),
        ...     keyfunc=lambda row: row[0]):
        ...     print "New key:", key
        ...     for x in subiter:
        ...         print "Row:", x
        New key: 1
        Row: (1, 1)
        Row: (1, 2)
        New key: 2
        Row: (2, 1)
        Row: (2, 3)
        Row: (2, 9)

    This requires the peekable class.  See my comment here_.

    Note, you must completely iterate over each subiter or groupbysorted will
    get confused.

    .. _here:

    """

    iterable = peekable(iterable)

    if not keyfunc:
        def keyfunc(x):
            return x

    def peekkey():
        return keyfunc(iterable.peek())

    def subiter():
        # Yield rows until the key changes; StopIteration from peekkey()
        # ends the generator when the input runs out (Python 2 semantics).
        while True:
            if peekkey() != currkey:
                break
            yield iterable.next()

    while True:
        currkey = peekkey()
        yield (currkey, subiter())
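
Since the update at the top says to use itertools.groupby instead, here is a minimal sketch of the docstring example rewritten that way. It's written for Python 3, and group_rows is just a name I picked for illustration. It assumes, as above, that the rows are already sorted by key.

```python
from itertools import groupby
from operator import itemgetter

# Same rows as the docstring example; they are already sorted by key.
rows = [(1, 1), (1, 2), (2, 1), (2, 3), (2, 9)]


def group_rows(rows):
    """Return [(key, [row, ...]), ...] for rows sorted by their first column."""
    # groupby starts a new group whenever the key changes, which is why
    # the input needs to be sorted (or at least clustered) by key.
    return [(key, list(subiter))
            for key, subiter in groupby(rows, key=itemgetter(0))]


grouped = group_rows(rows)
for key, group in grouped:
    print("New key:", key)
    for row in group:
        print("Row:", row)
```

The list(subiter) call is only there to make the result easy to inspect. For a huge input you would loop over each subiter directly instead of materializing it.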

Friday, January 12, 2007

Programmer Productivity: My Talk is Now Available

I did a talk for the Bay Area Python Users' Group on programmer productivity. It was held at Google as a Google tech talk. It's now available for your viewing pleasure.

Thursday, January 11, 2007

Humor: Binary

So a guy walks into a bar and asks for a spreadsheet. The bartender asks, "How do you want it?" The guy replies, "In binary--but put the ones first."

Apple: Missing My Mac (Display)

I have a Dell Inspiron 6400. It's actually a really nice laptop. It has an Intel Core Duo, and its resolution is 1680x1050.

For years, I've used various desktop backgrounds that were mostly gray. They all have something interesting to look at, but they all have very little color. In the past, I've had Apple notebooks, and I really liked the default blue Apple background; I find it quite comforting. I've tried to use the same background on a Dell, but for some reason it just irritates me.

I've had two theories about this. One is that the background doesn't match the color of the rest of the notebook. The other is that the Apple display is nicer. Well, I'm sure everyone already knows the answer.

Today, I held my Dell up to a big Dell cinema display being driven by a PowerBook. The difference was clear. Having seen them at the store, I wouldn't be surprised if the Apple cinema display was even nicer. It's depressing how faded my laptop looks.

So, as the title said, I'm missing my Mac display. I'll probably go back to using a grayish background.

Monday, January 08, 2007

Python: Dealing with Huge Data Sets in MySQLdb

I have a table that's about 750 MB. It has 25 million rows. To do what I need to do, I need to pull it all into Python. It's okay, the box has 8 gigs of RAM.

However, when I do the query, "cursor.execute" never seems to return. I look at top, and I see that Python is taking up 100% of the CPU and a steadily increasing amount of RAM. Tracing through the code, I see that the code is hung on:
# > /usr/lib/python2.4/site-packages/MySQLdb/
# -> return self._result.fetch_row(size, self._fetch_type)
I was hoping to stream data from the server, but it appears some C code is trying to store it completely. After a few minutes, "show processlist;" in MySQL reports that the server is done, even with sending the data. So why won't "cursor.execute" hurry up and return?
If you're wondering, unfortunately, I can't break this up into multiple queries. If I use LIMIT to go through the data one chunk at a time, I have to re-sort the data on every query. I can't do the sorting in Python nearly as conveniently as I can do it in MySQL. Furthermore, one simple query can result in one simple table scan, which is faster than a lot of the alternatives.
Anyway, I found out that MySQLdb has an under-documented streaming API. It all comes down to using a different type of cursor:
import MySQLdb
from MySQLdb.cursors import SSCursor

connection = MySQLdb.connect(...)

# Normally, you would use:
# cursor = connection.cursor()
# However, using this version, MySQLdb will read rows from the server one at a time.

cursor = SSCursor(connection)
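
Once you have a streaming cursor, you still have to consume it without building one giant list. Here's a minimal sketch of one way to do that, written for Python 3 and using only fetchmany, which is part of the standard DB-API. The names iter_rows, chunk_size, and FakeCursor are my own inventions; FakeCursor merely stands in for a real cursor so the example runs without a database.

```python
from itertools import islice


def iter_rows(cursor, chunk_size=1000):
    """Yield rows one at a time, fetching them from the cursor in chunks."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        for row in rows:
            yield row


class FakeCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a database."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self.batch_sizes = []  # Records how many rows each fetchmany returned.

    def fetchmany(self, size):
        batch = list(islice(self._rows, size))
        self.batch_sizes.append(len(batch))
        return batch


# Ten hypothetical (id, value) rows, consumed four at a time.
cursor = FakeCursor((n, n * n) for n in range(10))
total = 0
for _id, value in iter_rows(cursor, chunk_size=4):
    total += value
print(total)
```

With a real server-side cursor you would pass that cursor to iter_rows after calling execute. Only one chunk of rows is held in Python at any moment.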

Thursday, January 04, 2007

Business: Cisco to Acquire IronPort

The company I used to work for, IronPort, just got acquired by Cisco. We're all very excited. Yes, I bought my shares. Now, let's see if this play money turns into actual money at some point ;)

Tuesday, January 02, 2007

Python: Mako

There's a new Python templating engine called Mako. It's basically a modern, more-Pythonic version of Myghty, which is a Python version of Mason. It makes sense to switch if you're already using Myghty. It also makes sense to use if you're a Python guy who wants to avoid learning something new and just wants to dump a bit of Python in the middle of some HTML.

I like Mike Bayer, Mako's author, but I prefer Genshi. Nonetheless, if Mike wants to go out and write another templating engine, more power to him!

However, my feeling is that Python needs another templating engine like I need another open source kernel!
<sarcasm>Yeah, thanks a lot Apple! Sure, Darwin's great! Too bad I can't use my airport card!</sarcasm>
Seriously, I'd be a lot happier if they kept Darwin and released Cocoa. Now, that would be progress.

*sigh* ;)

Vim: snippetsEmu

Just this morning, one of my buddies was ragging on me that TextMate was cool because of snippet expansion. I personally think this is optimizing the wrong thing since the typing part of programming is the easy part. Nonetheless, I'm happy to see that Vim has a knockoff. Best of all, it's easy to use and pretty useful.

You can get the plugin here. Once you install it per the instructions, you can open up a Python file, insert the text "def", hit tab and get what's shown in the image. Hitting tab again jumps between the fields. Even better, there's a snippets file for Genshi.

Clustering: Hadoop

Google wrote a white paper called MapReduce: Simplified Data Processing on Large Clusters. It's a simple way to write software that works on a cluster of computers. Google also wrote a white paper on The Google File System.
Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Put simply, Hadoop is an open-source implementation of Google's map/reduce and distributed file system written in Java.

I needed something like that, so I decided to give it a whirl. I prefer to code in Python, so it's fortunate that Hadoop can "shell out" to Python on each of the remote systems. Shelling out once per system has negligible overhead, so that's fine.

You'll need to read the white paper to fully understand map/reduce, but let's look at some code. First, let's look at my input. It's a file:
Now, here's my mapper:
#!/usr/bin/env python

"""Figure out whether each number is even or odd."""

import sys

for line in sys.stdin:
    num, _ignored = line[:-1].split("\t")
    is_odd = int(num) % 2
    print "%s\t%s" % (is_odd, num)
Here's my reducer:
#!/usr/bin/env python

"""Count and sum the even and odd numbers."""

import sys

counts = {0: 0, 1: 0}
sums = counts.copy()
for line in sys.stdin:
    is_odd, num = map(int, line[:-1].split("\t"))
    counts[is_odd] += 1
    sums[is_odd] += num
for i in range(2):
    name = {0: "even", 1: "odd"}[i]
    print "%s\tcount:%s sum:%s" % (name, counts[i], sums[i])
This resulted in a single file:
even count:500 sum:249500
odd count:500 sum:250000
Once Hadoop is installed, executing this job is done at the shell via:
hadoop jar /usr/local/hadoop-install/hadoop/build/hadoop-streaming.jar \
-mapper -reducer -input input.txt -output out-dir
This was the first time I had ever written software for a cluster, and all in all, it was pretty easy. Too bad I didn't actually have a couple thousand machines to run this on ;)
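
If you don't have a cluster handy either, you can sanity-check the logic by simulating the map, sort, and reduce steps in a single process. This is only a sketch, written for Python 3, with the mapper and reducer turned into functions. I'm assuming the input was the numbers 0 through 999, each followed by a tab and a field the mapper ignores, which is consistent with the output above.

```python
def mapper(lines):
    """Tag each number with 1 if it is odd, 0 if it is even."""
    for line in lines:
        num, _ignored = line.rstrip("\n").split("\t")
        yield "%s\t%s" % (int(num) % 2, num)


def reducer(lines):
    """Count and sum the even and odd numbers."""
    counts = {0: 0, 1: 0}
    sums = {0: 0, 1: 0}
    for line in lines:
        is_odd, num = map(int, line.split("\t"))
        counts[is_odd] += 1
        sums[is_odd] += num
    return {name: (counts[i], sums[i])
            for i, name in enumerate(("even", "odd"))}


# Assumed input: the numbers 0..999, each with a tab-separated field
# that the mapper ignores.
input_lines = ["%d\tignored\n" % n for n in range(1000)]

# Hadoop sorts the mapper output by key before the reduce step;
# sorted() stands in for that here.
shuffled = sorted(mapper(input_lines))
result = reducer(shuffled)
for name, (count, total) in result.items():
    print("%s\tcount:%s sum:%s" % (name, count, total))
```

On a real cluster there may be several reducers, each seeing only some of the keys. This sketch assumes a single reducer, just like the single output file above.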

(By the way, during installation, I ran into a couple issues which I was able to work around easily. I won't bother repeating them here. You can find my workarounds on the mailing list. You may need to wait for the archive to be updated since I just posted them earlier today.)