Suppose you have a bunch of pairs:
a bYou want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:
referrers[left].add(right)Notice, I used a set because I only want to know if two things are related, not how many times they are related.
My situation is strange. It's small enough so that I don't need to use a cluster. However, it's too big for such a dict to fit into memory. It's not too big for the data to fit in /tmp.
The question is, how do you get this sort of a hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use sort.
If you open up a two-way pipe to the sort command, you can output all the pairs, and then later read them back in:
a bsort is telling me that a is related to b and c, b is related to c and d, etc. Notice, it also removed the duplicate pair a c, and took care of the temp file handling. Best of all, you can stream data to and from the sort command. When you're dealing with a lot of data, you want to stream things as much as possible.
Now that I've shown you the general idea, let me give you a couple more hints. First of all, to shell out to sort, I use:
from subprocess import Popen, PIPEI like to use the csv module when working with tab-separated data, so I create a reader and writer for pipe.stdout and pipe.stdin respectively. You may not need to in your situation.
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE)
When you're done writing to sort, you need to tell it you're done:
pipe.stdin.close() # Tell sort we're ready.Now here's the next trick. I don't want the rest of the program to worry about the details of piping out to sort. The rest of the program should have a nice clean iterator to work with. Remember, I'm streaming, and the part of the code that's reading the data from the pipe is far away.
Hence, instead of passing it a reference to the pipe, I instead send it a reference to a generator. That way the generator can do all the munging necessary, and no one even needs to know that I'm using a pipe.
The last trick is that when I read:
a bI need to recognize that b and c both belong to a. Hence, I use a generator I wrote called groupbysorted.
Putting it all together, the generator looks like:
def first((a, b)): return aNow, the outside world has a nice clean iterator to work with that will generate things like:
def second((a, b)): return b
"""This is a generator that munges the results from sort -u.
When the streaming is done, make sure sort exited cleanly.
for (x, pairs) in groupbysorted(reader, keyfunc=first):
yield (x, map(second, pairs))
status = pipe.wait()
if status != 0:
raise RuntimeError("sort exited with status %s: %s" %
(a, [b, c])The pipe will get cleaned up as soon as the iterator is done.
(b, [c, d])