Python: Memory Conservation Tip: sort Tricks

The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.

Suppose you have a bunch of pairs:
a b
b c
a c
a c
b d
...
You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:
referrers[left].add(right)
referrers[right].add(left)
Notice, I used a set because I only want to know if two things are related, not how many times they are related.
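For comparison, here's a minimal sketch of the in-memory version (where pairs stands in for whatever yields your two-column rows):

from collections import defaultdict

referrers = defaultdict(set)    # atom -> set of related atoms
for left, right in pairs:       # pairs: any iterable of (left, right) tuples
    referrers[left].add(right)
    referrers[right].add(left)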

My situation is strange. The data is small enough that I don't need to use a cluster, but it's too big for such a dict to fit into memory. It does, however, fit in /tmp.

The question is, how do you get this sort of hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use sort.

If you open up a two-way pipe to the sort command, you can output all the pairs, and then later read them back in:
a b
a c
b c
b d
...
sort is telling me that a is related to b and c, b is related to c and d, etc. Notice, it also removed the duplicate pair a c, and took care of the temp file handling. Best of all, you can stream data to and from the sort command. When you're dealing with a lot of data, you want to stream things as much as possible.
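Here's a rough sketch of that round trip, ignoring the csv niceties discussed below and assuming the atoms are short, tab-free strings (pairs again stands in for whatever yields your raw rows):

from subprocess import Popen, PIPE

pipe = Popen(['sort', '-u'], stdin=PIPE, stdout=PIPE,
             universal_newlines=True)            # text-mode pipes on Python 3; harmless on Python 2
for left, right in pairs:
    pipe.stdin.write('%s\t%s\n' % (left, right)) # stream every pair out to sort
pipe.stdin.close()                               # EOF: now sort can finish its work
for line in pipe.stdout:                         # stream the sorted, unique pairs back in
    left, right = line.rstrip('\n').split('\t')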

Now that I've shown you the general idea, let me give you a couple more hints. First of all, to shell out to sort, I use:
from subprocess import Popen, PIPE
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE, stderr=PIPE)  # capture stderr so errors can be reported later
I like to use the csv module when working with tab-separated data, so I create a reader and writer for pipe.stdout and pipe.stdin respectively. You may not need to in your situation.
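The wiring might look something like this (a sketch; pipe is the Popen object above, and on Python 3 you'd also want text-mode pipes, e.g. universal_newlines=True in the Popen call):

import csv

writer = csv.writer(pipe.stdin, dialect='excel-tab', lineterminator='\n')  # tab-separated rows out to sort
reader = csv.reader(pipe.stdout, dialect='excel-tab')                      # and parsed rows back in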

When you're done writing to sort, you need to tell it you're done:
pipe.stdin.close()  # Tell sort we're ready.
Now here's the next trick. I don't want the rest of the program to worry about the details of piping out to sort. The rest of the program should have a nice clean iterator to work with. Remember, I'm streaming, and the part of the code that's reading the data from the pipe is far away.

Hence, instead of passing it a reference to the pipe, I send it a reference to a generator. That way the generator can do all the munging necessary, and no one even needs to know that I'm using a pipe.

The last trick is that when I read:
a b
a c
I need to recognize that b and c both belong to a. Hence, I use a generator I wrote called groupbysorted.
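(groupbysorted isn't reproduced here; since the input arrives already sorted, a minimal stand-in is essentially a thin wrapper around itertools.groupby, as a commenter notes below:)

from itertools import groupby

def groupbysorted(iterable, keyfunc):
    """Yield (key, group) pairs for consecutive runs that share a key.

    This assumes the input is already sorted by that key, which sort guarantees here.
    """
    return groupby(iterable, key=keyfunc)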

Putting it all together, the generator looks like:
def first((a, b)): return a
def second((a, b)): return b

def get_references():
    """This is a generator that munges the results from sort -u.

    When the streaming is done, make sure sort exited cleanly.

    """
    for (x, pairs) in groupbysorted(reader, keyfunc=first):
        yield (x, map(second, pairs))
    status = pipe.wait()
    if status != 0:
        raise RuntimeError("sort exited with status %s: %s" %
                           (status, pipe.stderr.read()))
Now, the outside world has a nice clean iterator to work with that will generate things like:
(a, [b, c])
(b, [c, d])
...
The pipe will get cleaned up as soon as the iterator is done.
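A hypothetical caller then never touches the pipe at all (handle below is a stand-in for whatever you do with each group):

for atom, related in get_references():
    handle(atom, related)   # e.g. write a summary row or update some counters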

Comments

Anonymous said…
Thanks for sharing the performance tips! Very interesting.
Bob Van Zant said…
This isn't really streaming though? sort(1) won't finish its sort until it receives EOF? So your script is only surviving now because sort(1) is more memory efficient than your python implementation. I bet you could make this work in nearly the same amount of memory but entirely in python :-) Depending on how long the strings are that you're storing, you can try storing the 16 bytes of like an md5 hash instead of the actual string. And depending on how many strings there are, using just 8 bytes of the md5 might even be appropriate.

Of course that only gives you an incremental improvement. The best way to fix things up would be to devise an algorithm that doesn't need all of the data at once :-)
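(A rough sketch of that suggestion, assuming a few bytes of the digest are enough to make collisions unlikely for the key space:)

import hashlib

def shrink(atom, nbytes=8):
    """Stand in a fixed-size prefix of the MD5 digest for a long string."""
    return hashlib.md5(atom.encode('utf-8')).digest()[:nbytes]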
Justin A said…
I don't think itertools.groupby works how you think it does... The function you wrote does exactly the same thing as the one in itertools.


I have an implementation of external merge-sort in python somewhere.. it worked and didn't use much memory, but it was kind of slow.
jjinux said…
Hmm, I forgot to mention "sort | uniq -c". Also, quite useful.
Anonymous said…
Networkx (https://networkx.lanl.gov/wiki) is a very nice Python graph library from LANL that can handle large numbers of nodes and edges, and can enforce rules like "only one edge per pair of nodes". It's in Ubuntu (and probably Debian) as python-networkx.

I'm guessing that it could handle your problem in-memory, and it might be useful for your next-step manipulations on the connected atoms.
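(For instance, a sketch of the in-memory version with networkx; an undirected Graph keeps at most one edge per pair of nodes:)

import networkx as nx

G = nx.Graph()                     # simple graph: duplicate edges are ignored
G.add_edge('a', 'b')
G.add_edge('a', 'c')
G.add_edge('a', 'c')               # the duplicate pair has no effect
print(sorted(G.neighbors('a')))    # ['b', 'c']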
jjinux said…
> pair_counts[normalize(a,b)] = count

I already do this.
jjinux said…
> I don't think itertools.groupby works how you think it does...

You're right. I was mistaken. Thanks.
jjinux said…
> This isn't really streaming though?

From the perspective of my code it is.

> sort(1) won't finish its sort until it receives EOF?

That's true, but sort is simply amazing at how little memory it uses. It's
far better at managing memory and temp files than I could hope to be.

> So your script is only surviving now because sort(1) is more memory
> efficient than your python implementation.

There's no shame in that. It is in keeping with Python's philosophy to
a) not reinvent the wheel b) rely on C for performance.

> I bet you could make this work in nearly the same amount of memory but
> entirely in python :-)

Yes, I could, but it would require a lot of code. That's my point. Using
sort in clever ways can save you a lot of code.

> Depending on how long the strings are that you're storing you can try
> storing the 16 bytes of like an md5 hash instead of the actual string.

Even if I somehow compress them quite a bit, I have too many. I'm dealing
with 150 GB log files.

> And depending on how many strings there are, using just 8 bytes of the md5
> might even be appropriate.

Unfortunately, not this case.

> Of course that only gives you an incremental improvement. The best way to
> fix things up would be to devise an algorithm that doesn't need all of the
> data at once :-)

Yes. Exactly. That's why I'm using sort. It gives me the data I want at the
optimal time. My program now uses an almost constant amount of memory, is
pretty dang fast (considering how much data there is), and requires a lot less
code than when I was doing fancy caching strategies.
Anonymous said…
Thanks for posting this - very helpful.

I had to add close_fds=True on Linux. When I tried to open two pipes at once, it hung on the read. Adding close_fds fixed it.
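(That is, something along these lines when more than one pipe is open at once:)

pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE,
             close_fds=True)   # don't let this child inherit the other pipes' fds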
jjinux said…
Thanks for the tip!
Clarissa Lucas said…
A very informative article on Python memory conservation. Thanks for sharing.
