Talk: Python Tools, the UNIX Philosophy, and sort Tricks

Updated link.

I recently gave a talk at BayPiggies called "Python Tools, the UNIX Philosophy, and sort Tricks". Thanks go to Glen Jarvis for recording it. The "slides" are below:
This is a random collection of topics related to Python tools.

Talk about the UNIX philosophy:
Small tools.
My problems tend to be too large for RAM, but not too big for one machine.
UNIX and batch processing are a natural fit.
Multiple processes = multiple CPUs.
Multiple programming languages = more flexibility.
Pipes = concurrency without the pain.
Scales linearly and predictably, unlike databases.
UNIX tools that already exist are helpful and fast.
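
As a rough illustration of the "small tools" idea, here is a minimal sketch of a filter that streams lines from one file object to another, so memory use stays flat no matter how large the input is. The function names are hypothetical, chosen only for this example.

```python
import sys


def drop_blank_lines(lines):
    """Lazily yield only the lines that contain non-whitespace text."""
    for line in lines:
        if line.strip():
            yield line


def filter_stream(in_stream, out_stream):
    """Copy non-blank lines from in_stream to out_stream, one at a time.

    Returns the number of lines written. Only one line is held in memory
    at any moment, so the input may be far larger than RAM.
    """
    written = 0
    for line in drop_blank_lines(in_stream):
        out_stream.write(line)
        written += 1
    return written
```

Calling `filter_stream(sys.stdin, sys.stdout)` at the bottom of a script would turn this into a tool that can sit in the middle of a pipeline, with the shell running each stage as its own process.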

Use the optparse module to provide consistent command line APIs:
Here's an example of the setup from the docs:
: from optparse import OptionParser
: parser = OptionParser()
: parser.add_option("-f", "--file", dest="filename",
:                   help="write report to FILE", metavar="FILE")
: parser.add_option("-q", "--quiet",
:                   action="store_false", dest="verbose", default=True,
:                   help="don't print status messages to stdout")
: (options, args) = parser.parse_args()
Here's an example of my own help text:
: Usage: cleancuttsv.py [options]
:
: Options:
:   -h, --help            show this help message and exit
:   --assert-head=FIELD1\tFIELD2\t...
:                         assert that the first line of the file matches this
:   --delete-head         delete the first line of input
:   -n NUM, --num-fields=NUM
:                         assert that there are this many fields per line
:   --drop-blank-lines    delete blank lines instead of raising an error
:
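
Help text like the above could come from an optparse setup along these lines. This is only a guess at how such options might be declared, not the actual cleancuttsv.py source; the `dest` names and the `build_parser` helper are assumptions made for the example.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser whose options mirror the help text shown above."""
    parser = OptionParser(usage="usage: %prog [options]")
    parser.add_option(
        "--assert-head", dest="assert_head",
        metavar=r"FIELD1\tFIELD2\t...",
        help="assert that the first line of the file matches this")
    parser.add_option(
        "--delete-head", action="store_true", dest="delete_head",
        default=False, help="delete the first line of input")
    parser.add_option(
        "-n", "--num-fields", type="int", dest="num_fields", metavar="NUM",
        help="assert that there are this many fields per line")
    parser.add_option(
        "--drop-blank-lines", action="store_true", dest="drop_blank_lines",
        default=False,
        help="delete blank lines instead of raising an error")
    return parser
```

Passing an explicit list, as in `build_parser().parse_args(["-n", "3"])`, makes the parser easy to exercise without touching `sys.argv`.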

sort:
http://jjinux.blogspot.com/2008/08/python-sort-uniq-c-via-subprocess.html
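sort -S 20% -T /mnt/some_other_drive ...
-S sets the size of sort's in-memory buffer and -T sets the directory where it writes temporary files.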
http://jjinux.blogspot.com/2008/08/python-memory-conservation-tip-sort.html
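
One way to lean on sort from Python is sketched below. It hands the sorting to an external `sort` process and does the `uniq -c` style counting on the sorted stream with `itertools.groupby`. It assumes a `sort` binary that accepts `-S` and `-T` (GNU sort does); the function and parameter names are made up for this example and are not taken from the linked posts.

```python
import itertools
import os
import subprocess


def sort_uniq_count(lines, buffer_size=None, tmp_dir=None):
    """Yield (line, count) pairs in sorted order, like `sort | uniq -c`."""
    cmd = ["sort"]
    if buffer_size:
        cmd += ["-S", buffer_size]  # e.g. "20%"
    if tmp_dir:
        cmd += ["-T", tmp_dir]  # where sort spills temporary files
    # LC_ALL=C gives plain byte ordering, independent of the user's locale.
    env = dict(os.environ, LC_ALL="C")
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, env=env,
                            universal_newlines=True)
    # sort cannot emit anything until it has seen all of its input, so it
    # is safe to write everything first and only then start reading.
    for line in lines:
        proc.stdin.write(line.rstrip("\n") + "\n")
    proc.stdin.close()
    # Equal lines are adjacent after sorting, so groupby can count them
    # while holding only one group in memory at a time.
    for key, group in itertools.groupby(proc.stdout):
        yield key.rstrip("\n"), sum(1 for _ in group)
    proc.stdout.close()
    if proc.wait() != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

Because the heavy lifting happens in the sort process, the Python side never holds more than one line and one counter, which is the point of the memory tip.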

tsv:
You need a consistent format.
Downsides:
Most UNIX tools don't understand true TSV, but only an approximation thereof:
My own code raises an exception in cases where it would actually matter.
Many UNIX tools are ignorant of encoding issues:
Sometimes playing dumb works and sometimes it hurts.
Using the csv module:
: import csv
:
: DEFAULT_KARGS = dict(dialect='excel-tab', lineterminator='\n')
: MYSQL_LOAD_DATA_INFILE_DESC = """\
: FIELDS TERMINATED BY '\t'
: OPTIONALLY ENCLOSED BY '"'
: ESCAPED BY ''
: LINES TERMINATED BY '\n'"""
:
: def create_default_reader(iterable):
:     """Return a csv.reader with our default options."""
:     return csv.reader(iterable, **DEFAULT_KARGS)
: ...
Using mysqlimport:
: mysqlimport \
:     --user=$MYSQL_USERNAME \
:     --password=$MYSQL_PASSWORD \
:     --columns=id,name \
:     --fields-optionally-enclosed-by='"' \
:     --fields-terminated-by='\t' \
:     --fields-escaped-by='' \
:     --lines-terminated-by='\n' \
:     --local \
:     --lock-tables \
:     --replace \
:     --verbose \
:     $DATABASE ${BUILD}/sometable.tsv
To see warnings:
http://jjinux.blogspot.com/2009/03/mysql-encoding-hell.html
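
To make the "approximation" downside concrete, here is a small sketch that round-trips rows through the csv module using the same dialect options as in the tsv section, then shows what a tool that only splits on tabs would see. The helper names and the sample rows are invented for illustration.

```python
import csv
import io

# Same options as the csv settings in the tsv section.
TSV_OPTIONS = dict(dialect="excel-tab", lineterminator="\n")


def write_rows(rows):
    """Serialize rows to a TSV string using the csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, **TSV_OPTIONS).writerows(rows)
    return buffer.getvalue()


def read_rows(text):
    """Parse a TSV string back into a list of rows."""
    return list(csv.reader(io.StringIO(text), **TSV_OPTIONS))


rows = [["1", "plain"], ["2", "has\ttab"]]
text = write_rows(rows)

# The csv module quotes the field holding a tab, so it survives the trip.
round_tripped = read_rows(text)

# A tool that only splits on tabs sees three fields on the second line.
naive_fields = text.splitlines()[1].split("\t")
```

Here `round_tripped` equals the original rows, while `naive_fields` has three entries. That is the kind of input where a tab-splitting tool and true TSV disagree, and where raising an error is safer than guessing.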

Show pdb in the context of a web app:
: import pdb
: from pprint import pprint
: pdb.set_trace()
: pprint(request.environ)
http://localhost:5000/api/ratio
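
The same trick can be tried without any framework. The sketch below is a plain WSGI app (not the app from the talk) that drops into pdb only when a hypothetical `PDB_DEBUG` environment variable is set, so it never blocks on a debugger prompt by accident.

```python
import os
import pdb
from pprint import pformat


def debug_app(environ, start_response):
    """Tiny WSGI app that can pause in pdb before answering."""
    # PDB_DEBUG is a made-up switch for this example.
    if os.environ.get("PDB_DEBUG"):
        pdb.set_trace()  # at the (Pdb) prompt, try: pp environ
    # Echo the request environment back as plain text.
    body = pformat(environ).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Served from a terminal with `wsgiref.simple_server.make_server("localhost", 5000, debug_app).serve_forever()`, a request to the server stops at the breakpoint in that terminal, where the environment can be inspected interactively.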

Comments

Oinopion said…
The link doesn't work :(
Thanks for the heads up. I'll see what's up.
Ok, the link has been fixed. Thanks again.