Comments on JJinuxLand: "Python: Memory Conservation Tip: sort Tricks"

Clarissa Lucas (2014-07-14):
A very informative article on Python memory conservation. Thanks for sharing.

jjinux (2009-10-03):
Thanks for the tip!

Anonymous (2009-10-02):
Thanks for posting this - very helpful.
I had to add close_fds=True on Linux. When I tried to open two pipes at once, it hung on the read. Adding close_fds=True fixed it.

jjinux (2008-08-08):
> This isn't really streaming though?

From the perspective of my code, it is.

> sort(1) won't finish its sort until it receives EOF?

That's true, but sort is simply amazing at how little memory it uses. It's far better at managing memory and temp files than I could hope to be.

> So your script is only surviving now because sort(1) is more memory efficient than your Python implementation.

There's no shame in that. It's in keeping with Python's philosophy to a) not reinvent the wheel and b) rely on C for performance.

> I bet you could make this work in nearly the same amount of memory but entirely in Python :-)

Yes, I could, but it would require a lot of code. That's my point: using sort in clever ways can save you a lot of code.

> Depending on how long the strings are that you're storing, you could try storing the 16 bytes of an md5 hash instead of the actual string.

Even if I somehow compress them quite a bit, I have too many. I'm dealing with 150 GB log files.

> And depending on how many strings there are, using just 8 bytes of the md5 might even be appropriate.

Unfortunately, not in this case.

> Of course, that only gives you an incremental improvement. The best way to fix things up would be to devise an algorithm that doesn't need all of the data at once :-)

Yes. Exactly. That's why I'm using sort. It gives me the data I want at the optimal time.
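(For readers arriving later: a minimal sketch of the pipe setup being discussed, with a made-up helper name and sample data. In Python 3, close_fds=True became the default, but the commenter's fix was needed in the Python 2 era.)

```python
import os
import subprocess

def sorted_lines(lines):
    """Stream lines through sort(1) and return them in sorted order."""
    proc = subprocess.Popen(
        ["sort"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        close_fds=True,  # the fix mentioned above: don't leak other pipes'
                         # fds into the child, or a reader can hang on EOF
        text=True,
        env={**os.environ, "LC_ALL": "C"},  # byte-wise order, locale-proof
    )
    # communicate() writes everything, closes stdin, then drains stdout,
    # so this sketch can't deadlock on a full pipe buffer.
    out, _ = proc.communicate("".join(line + "\n" for line in lines))
    return out.splitlines()
```

For truly huge inputs you would feed proc.stdin incrementally instead of buffering the whole payload in one string, but communicate() keeps the sketch simple and deadlock-free.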
My program now uses an almost constant amount of memory, is pretty dang fast (considering how much data there is), and requires a lot less code than when I was doing fancy caching strategies.

jjinux (2008-08-08):
> I don't think itertools.groupby works how you think it does...

You're right. I was mistaken. Thanks.

jjinux (2008-08-07):
> pair_counts[normalize(a, b)] = count

I already do this.

Anonymous (2008-08-07):
NetworkX (https://networkx.lanl.gov/wiki) is a very nice Python graph library from LANL that can handle large numbers of nodes and edges and can enforce rules like "only one edge per pair of nodes". It's in Ubuntu (and probably Debian) as python-networkx.

I'm guessing that it could handle your problem in memory, and it might be useful for your next-step manipulations on the connected atoms.

jjinux (2008-08-05):
Hmm, I forgot to mention "sort | uniq -c".
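(For the record, that one-liner on made-up data; the pairs file here is hypothetical.)

```shell
# Hypothetical input: one pair of atoms per line, as in the post.
printf 'b c\na b\nb c\n' > /tmp/pairs.txt

# sort brings equal lines together, uniq -c counts each run, and the
# trailing sort -rn lists the most frequent pairs first -- all without
# the caller ever holding the data set in memory.
sort /tmp/pairs.txt | uniq -c | sort -rn
```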
Also, quite useful.

Justin A (2008-08-04):
I don't think itertools.groupby works how you think it does... The function you wrote does exactly the same thing as the one in itertools.

I have an implementation of external merge sort in Python somewhere. It worked and didn't use much memory, but it was kind of slow.

Bob Van Zant (2008-08-04):
This isn't really streaming, though? sort(1) won't finish its sort until it receives EOF? So your script is only surviving now because sort(1) is more memory efficient than your Python implementation. I bet you could make this work in nearly the same amount of memory but entirely in Python :-) Depending on how long the strings are that you're storing, you could try storing the 16 bytes of an md5 hash instead of the actual string. And depending on how many strings there are, using just 8 bytes of the md5 might even be appropriate.

Of course, that only gives you an incremental improvement. The best way to fix things up would be to devise an algorithm that doesn't need all of the data at once :-)

Anonymous (2008-08-04):
Thanks for sharing the performance tips! Very interesting.
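On the groupby sub-thread: itertools.groupby only groups adjacent equal items, which is exactly why the data has to come out of sort(1) first. On sorted input it behaves like uniq -c, one (key, count) pair at a time in constant memory. A toy sketch, with a made-up helper name and data:

```python
import itertools

def count_runs(sorted_lines):
    """Yield (key, count) for each run of equal adjacent lines.

    Equivalent to `uniq -c` on already-sorted input: groupby never
    materializes more than one run at a time.
    """
    for key, run in itertools.groupby(sorted_lines):
        yield key, sum(1 for _ in run)
```

For example, `list(count_runs(["a", "a", "b"]))` gives `[("a", 2), ("b", 1)]`; on unsorted input the same key would show up once per run, which is the behavior the thread is correcting.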