
Python: Memory Conservation Tip: sort Tricks

The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.

Suppose you have a bunch of pairs:
a b
b c
a c
a c
b d
...
You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:
referrers[left].add(right)
referrers[right].add(left)
Notice, I used a set because I only want to know if two things are related, not how many times they are related.
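For concreteness, the in-memory version looks roughly like this (a minimal sketch; the build_referrers name and the pairs argument are just placeholders):
from collections import defaultdict

def build_referrers(pairs):
    # Hypothetical helper: map each atom to the set of atoms it's related to.
    referrers = defaultdict(set)
    for left, right in pairs:
        referrers[left].add(right)
        referrers[right].add(left)
    return referrers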

My situation is strange. It's small enough that I don't need to use a cluster. However, it's too big for such a dict to fit into memory. It's not too big for the data to fit in /tmp.

The question is, how do you get this sort of a hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use sort.

If you open up a two-way pipe to the sort command, you can output all the pairs, and then later read them back in:
a b
a c
b c
b d
...
sort is telling me that a is related to b and c, b is related to c and d, etc. Notice, it also removed the duplicate pair a c, and took care of the temp file handling. Best of all, you can stream data to and from the sort command. When you're dealing with a lot of data, you want to stream things as much as possible.

Now that I've shown you the general idea, let me give you a couple more hints. First of all, to shell out to sort, I use:
from subprocess import Popen, PIPE
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE, stderr=PIPE)
I like to use the csv module when working with tab-separated data, so I create a reader and writer for pipe.stdout and pipe.stdin respectively. You may not need to in your situation.
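Concretely, that setup might look something like this (a sketch; the tab delimiter and the pairs iterable are assumptions on my part):
import csv

# Write each pair out to sort -u as a tab-separated row.
writer = csv.writer(pipe.stdin, delimiter='\t', lineterminator='\n')
reader = csv.reader(pipe.stdout, delimiter='\t')
for left, right in pairs:
    writer.writerow([left, right])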

When you're done writing to sort, you need to tell it you're done:
pipe.stdin.close()  # Tell sort we're ready.
Now here's the next trick. I don't want the rest of the program to worry about the details of piping out to sort. The rest of the program should have a nice clean iterator to work with. Remember, I'm streaming, and the part of the code that's reading the data from the pipe is far away.

Hence, instead of passing it a reference to the pipe, I send it a reference to a generator. That way the generator can do all the munging necessary, and no one even needs to know that I'm using a pipe.

The last trick is that when I read:
a b
a c
I need to recognize that b and c both belong to a. Hence, I use a generator I wrote called groupbysorted.
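groupbysorted isn't shown here; as the comment thread below points out, on input that's already sorted it behaves just like itertools.groupby, so a minimal stand-in could be:
from itertools import groupby

def groupbysorted(iterable, keyfunc=None):
    # For already-sorted input, this is equivalent to itertools.groupby.
    return groupby(iterable, key=keyfunc)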

Putting it all together, the generator looks like:
def first((a, b)): return a
def second((a, b)): return b

def get_references():
    """This is a generator that munges the results from sort -u.

    When the streaming is done, make sure sort exited cleanly.

    """
    for (x, pairs) in groupbysorted(reader, keyfunc=first):
        yield (x, map(second, pairs))
    status = pipe.wait()
    if status != 0:
        raise RuntimeError("sort exited with status %s: %s" %
                           (status, pipe.stderr.read()))
Now, the outside world has a nice clean iterator to work with that will generate things like:
(a, [b, c])
(b, [c, d])
...
The pipe will get cleaned up as soon as the iterator is done.
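For example, the consuming code can be as simple as (a hypothetical usage sketch):
for atom, related in get_references():
    print '%s -> %s' % (atom, ', '.join(related))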

Comments

daniellindsley said…
Thanks for sharing the performance tips! Very interesting.
Bob Van Zant said…
This isn't really streaming though? sort(1) won't finish its sort until it receives EOF? So your script is only surviving now because sort(1) is more memory efficient than your python implementation. I bet you could make this work in nearly the same amount of memory but entirely in python :-) Depending on how long the strings are that you're storing you can try storing the 16 bytes of like an md5 hash instead of the actual string. And depending on how many strings there are, using just 8 bytes of the md5 might even be appropriate.

Of course that only gives you an incremental improvement. The best way to fix things up would be to devise an algorithm that doesn't need all of the data at once :-)
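(A rough sketch of the hashing idea described above; compact_key is a made-up name, and the right number of bytes depends on how many distinct strings you have:)
import hashlib

def compact_key(s, nbytes=8):
    # Truncating the digest saves memory but increases the collision risk.
    return hashlib.md5(s).digest()[:nbytes]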
Justin said…
I don't think itertools.groupby works how you think it does... The function you wrote does exactly the same thing as the one in itertools.


I have an implementation of external merge-sort in python somewhere... it worked and didn't use much memory, but it was kind of slow.
Hmm, I forgot to mention "sort | uniq -c". Also, quite useful.
Mark said…
Networkx (https://networkx.lanl.gov/wiki) is a very nice Python graph library from LANL that can handle large numbers of nodes and edges, and can enforce rules like "only one edge per pair of nodes". It's in Ubuntu (and probably Debian) as python-networkx.

I'm guessing that it could handle your problem in-memory, and it might be useful for your next-step manipulations on the connected atoms.
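(A rough sketch of what Mark is suggesting; the nodes here are made up:)
import networkx as nx

G = nx.Graph()             # undirected, at most one edge per pair of nodes
G.add_edge('a', 'b')
G.add_edge('a', 'c')
G.add_edge('a', 'c')       # duplicate pair is silently collapsed
print sorted(G.neighbors('a'))   # ['b', 'c']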
> pair_counts[normalize(a,b)] = count

I already do this.
> I don't think itertools.groupby works how you think it does...

You're right. I was mistaken. Thanks.
> This isn't really streaming though?

From the perspective of my code it is.

> sort(1) won't finish its sort until it receives EOF?

That's true, but sort is simply amazing at how little memory it uses. It's
far better at managing memory and temp files than I could hope to be.

> So your script is only surviving now because sort(1) is more memory
> efficient than your python implementation.

There's no shame in that. It is in keeping with Python's philosophy to
a) not reinvent the wheel and b) rely on C for performance.

> I bet you could make this work in nearly the same amount of memory but
> entirely in python :-)

Yes, I could, but it would require a lot of code. That's my point. Using
sort in clever ways can save you a lot of code.

> Depending on how long the strings are that you're storing you can try
> storing the 16 bytes of like an md5 hash instead of the actual string.

Even if I somehow compress them quite a bit, I have too many. I'm dealing
with 150gb log files.

> And depending on how many strings there are, using just 8 bytes of the md5
> might even be appropriate.

Unfortunately, not this case.

> Of course that only gives you an incremental improvement. The best way to
> fix things up would be to devise an algorithm that doesn't need all of the
> data at once :-)

Yes. Exactly. That's why I'm using sort. It gives me the data I want at the
optimal time. My program now uses an almost constant amount of memory, is
pretty dang fast (considering how much data there is), and requires a lot less
code than when I was doing fancy caching strategies.
Anonymous said…
Thanks for posting this - very helpful.

I had to add close_fds=True on Linux. When I tried to open two pipes at once, it hung on the read. Adding close_fds fixed it.
Thanks for the tip!
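(For reference, a sketch of that change; close_fds=True keeps each child from inheriting the other pipes' file descriptors:)
pipe = Popen(['sort', '-u'], bufsize=1, close_fds=True,
             stdin=PIPE, stdout=PIPE, stderr=PIPE)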
Clarissa Lucas said…
A very informative article on Python memory conservation. Thanks for sharing.