Python: Memory Conservation Tip: sort Tricks
The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.
Suppose you have a bunch of pairs:
a b
b c
a c
a c
b d
...

You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:

referrers[left].add(right)
referrers[right].add(left)

Notice, I used a set because I only want to know if two things are related, not how many times they are related.
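For context, a minimal in-memory version might look like this. It's only a sketch; the sample pairs and the defaultdict scaffolding are mine, and referrers is the only name taken from the post:

from collections import defaultdict

pairs = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('a', 'c'), ('b', 'd')]  # sample data

referrers = defaultdict(set)   # atom -> set of related atoms
for (left, right) in pairs:
    referrers[left].add(right)
    referrers[right].add(left)

print(sorted(referrers['a']))  # ['b', 'c']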
My situation is strange. It's small enough so that I don't need to use a cluster. However, it's too big for such a dict to fit into memory. It's not too big for the data to fit in /tmp.
The question is, how do you get this sort of a hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use sort.
If you open up a two-way pipe to the sort command, you can output all the pairs, and then later read them back in:
a b
a c
b c
b d
...

sort is telling me that a is related to b and c, b is related to c and d, etc. Notice, it also removed the duplicate pair a c, and took care of the temp file handling. Best of all, you can stream data to and from the sort command. When you're dealing with a lot of data, you want to stream things as much as possible.
Now that I've shown you the general idea, let me give you a couple more hints. First of all, to shell out to sort, I use:
from subprocess import Popen, PIPE
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE,
             stderr=PIPE)  # piped so the error handling below can read sort's stderr

I like to use the csv module when working with tab-separated data, so I create a reader and writer for pipe.stdout and pipe.stdin respectively. You may not need to in your situation.
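That reader/writer setup isn't shown in the post, but it presumably looks something like the sketch below. It continues from the pipe object above and assumes the Python 2 setting of the original code, where the pipe's file objects can be handed straight to the csv module (on Python 3 you would need text-mode pipes, e.g. universal_newlines=True):

import csv

# writer feeds tab-separated pairs into sort; reader will pull the sorted,
# de-duplicated pairs back out once sort has seen EOF.
writer = csv.writer(pipe.stdin, delimiter='\t', lineterminator='\n')
reader = csv.reader(pipe.stdout, delimiter='\t')

writer.writerow(('a', 'b'))   # one pair per row
writer.writerow(('b', 'c'))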
When you're done writing to sort, you need to tell it you're done:
pipe.stdin.close()  # Tell sort we're ready.

Now here's the next trick. I don't want the rest of the program to worry about the details of piping out to sort. The rest of the program should have a nice clean iterator to work with. Remember, I'm streaming, and the part of the code that's reading the data from the pipe is far away.
Hence, instead of passing it a reference to the pipe, I send it a reference to a generator. That way the generator can do all the munging necessary, and no one even needs to know that I'm using a pipe.
The last trick is that when I read:
a b
a c

I need to recognize that b and c both belong to a. Hence, I use a generator I wrote called groupbysorted.
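groupbysorted itself isn't shown in the post. My guess is that it behaves much like itertools.groupby applied to input that is already sorted by the key, so here is a minimal sketch under that assumption:

from itertools import groupby

def groupbysorted(iterable, keyfunc):
    """Group consecutive items that share a key; assumes the input is already sorted by that key."""
    return groupby(iterable, keyfunc)

# For example:
#   [(k, list(g)) for (k, g) in groupbysorted([('a', 'b'), ('a', 'c'), ('b', 'c')], lambda p: p[0])]
# gives [('a', [('a', 'b'), ('a', 'c')]), ('b', [('b', 'c')])].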
Putting it all together, the generator looks like:
def first((a, b)): return a
def second((a, b)): return b

def get_references():
    """This is a generator that munges the results from sort -u.

    When the streaming is done, make sure sort exited cleanly.

    """
    for (x, pairs) in groupbysorted(reader, keyfunc=first):
        yield (x, map(second, pairs))
    status = pipe.wait()
    if status != 0:
        raise RuntimeError("sort exited with status %s: %s" %
                           (status, pipe.stderr.read()))

Now, the outside world has a nice clean iterator to work with that will generate things like:

(a, [b, c])
(b, [c, d])
...

The pipe will get cleaned up as soon as the iterator is done.
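For instance, downstream code might just do the following (handle() is a hypothetical stand-in for whatever processing comes next, not something from the post):

for (atom, related) in get_references():
    handle(atom, related)   # e.g. handle('a', ['b', 'c'])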
Comments
Depending on how long the strings are that you're storing you can try storing the 16 bytes of like an md5 hash instead of the actual string. And depending on how many strings there are, using just 8 bytes of the md5 might even be appropriate. Of course that only gives you an incremental improvement. The best way to fix things up would be to devise an algorithm that doesn't need all of the data at once :-)
I have an implementation of external merge-sort in Python somewhere... it worked and didn't use much memory, but it was kind of slow.
NetworkX is a very nice Python graph library from LANL that can handle large numbers of nodes and edges, and can enforce rules like "only one edge per pair of nodes". It's in Ubuntu (and probably Debian) as python-networkx.
I'm guessing that it could handle your problem in-memory, and it might be useful for your next-step manipulations on the connected atoms.
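For readers who want to try that route, here is a minimal sketch (my illustration, not the commenter's code) of the same pair-grouping with NetworkX; an undirected Graph keeps at most one edge per pair of nodes automatically:

import networkx as nx

G = nx.Graph()   # undirected; adding the same edge twice keeps only one
for left, right in [('a', 'b'), ('b', 'c'), ('a', 'c'), ('a', 'c'), ('b', 'd')]:
    G.add_edge(left, right)

print(sorted(G.neighbors('a')))   # ['b', 'c']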
I already do this.
You're right. I was mistaken. Thanks.
From the perspective of my code it is.
> sort(1) won't finish its sort until it receives EOF?
That's true, but sort is simply amazing at how little memory it uses. It's
far better at managing memory and temp files than I could hope to be.
> So your script is only surviving now because sort(1) is more memory
> efficient than your python implementation.
There's no shame in that. It is in keeping with Python's philosophy to
a) not reinvent the wheel and b) rely on C for performance.
> I bet you could make this work in nearly the same amount of memory but
> entirely in python :-)
Yes, I could, but it would require a lot of code. That's my point. Using
sort in clever ways can save you a lot of code.
> Depending on how long the strings are that you're storing you can try
> storing the 16 bytes of like an md5 hash instead of the actual string.
Even if I somehow compress them quite a bit, I have too many. I'm dealing
with 150GB log files.
> And depending on how many strings there are, using just 8 bytes of the md5
> might even be appropriate.
Unfortunately, not this case.
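For reference, the technique being quoted boils down to hashing each string to a fixed-size key and storing that instead; a sketch with hashlib (my illustration, not the commenter's code):

import hashlib

def md5_key(s, nbytes=16):
    # 16-byte MD5 digest of the string; nbytes=8 gives the shorter variant discussed here.
    return hashlib.md5(s.encode('utf-8')).digest()[:nbytes]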
> Of course that only gives you an incremental improvement. The best way to
> fix things up would be to devise an algorithm that doesn't need all of the
> data at once :-)
Yes. Exactly. That's why I'm using sort. It gives me the data I want at the
optimal time. My program now uses an almost constant amount of memory, is
pretty dang fast (considering how much data there is), and requires a lot less
code than when I was doing fancy caching strategies.
I had to add close_fds=True on Linux. When I tried to open two pipes at once, it hung on the read. Adding close_fds fixed it.
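A sketch of what that looks like; close_fds is a real Popen parameter (it defaults to True on modern Python 3, so this mostly matters on Python 2):

from subprocess import Popen, PIPE

# close_fds=True keeps other inherited file descriptors (like the first pipe)
# out of the child process, which avoids the hang described above.
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE,
             close_fds=True)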