Monday, January 08, 2007

Python: Dealing with Huge Data Sets in MySQLdb

I have a table that's about 750 MB. It has 25 million rows. To do what I need to do, I need to pull it all into Python. That's okay; the box has 8 GB of RAM.

However, when I do the query, "cursor.execute" never seems to return. I look at top, and I see that Python is taking up 100% of the CPU and a steadily increasing amount of RAM. Tracing through the code, I see that the code is hung on:
# > /usr/lib/python2.4/site-packages/MySQLdb/cursors.py(282)_fetch_row()
# -> return self._result.fetch_row(size, self._fetch_type)
I was hoping to stream the data from the server, but it appears some C code is trying to store the entire result set before returning. After a few minutes, "show processlist;" in MySQL reports that the server is done, including sending the data. So why won't "cursor.execute" hurry up and return?
If you're wondering, unfortunately, I can't break this up into multiple queries. If I use LIMITs to go through the data one chunk at a time, I have to continually re-sort the data on every query (see the sketch below). I can't do the sorting in Python nearly as conveniently as I can do it in MySQL. Furthermore, one simple query can result in one simple table scan, which is faster than a lot of the alternatives.
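For illustration, here's a minimal sketch of the chunked approach I'm trying to avoid; the table name, sort column, and process function are made up:

offset = 0
chunk_size = 100000
while True:
    # Each query makes MySQL sort (or scan) its way past `offset` rows
    # again before it can return the next chunk, so it gets slower and
    # slower as the offset grows.
    cursor.execute(
        "SELECT * FROM big_table ORDER BY sort_col LIMIT %s OFFSET %s",
        (chunk_size, offset))
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        process(row)  # Hypothetical per-row handler.
    offset += chunk_size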
Anyway, I found out that MySQLdb has an under-documented streaming API. It all comes down to using a different type of cursor:
import MySQLdb
from MySQLdb.cursors import SSCursor

connection = MySQLdb.connect(...)

# Normally, you would use:
#
# cursor = connection.cursor()
#
# However, using an SSCursor instead, MySQLdb will read rows from the server as you fetch them, rather than buffering the whole result set in memory.

cursor = SSCursor(connection)
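
Once the streaming cursor is in place, fetching works the same as usual; here's a minimal sketch (the query and the process function are placeholders). Just remember to read the whole result, or close the cursor, before running another query on the same connection, since the server keeps the result open while you stream it.

cursor.execute("SELECT * FROM big_table ORDER BY sort_col")

# With SSCursor, fetchone() pulls one row at a time from the server
# instead of reading from a client-side buffer.
row = cursor.fetchone()
while row is not None:
    process(row)  # Hypothetical per-row handler.
    row = cursor.fetchone()

cursor.close()
connection.close()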

2 comments:

puneeth said...

Hi

I have the same problem. Please let me know how you resolved this issue.

Mail me at puneethmech@gmail.com

Anonymous said...

@puneeth

Read more closely; it is already answered in the post. Use SSCursor instead of an ordinary cursor.