Tuesday, October 28, 2008

Python: Some Notes on lxml

I wrote a web crawler that uses lxml, XPath, and Beautiful Soup to pull data from a set of poorly formatted Web pages. In summary, it works, and I'm quite happy :)

The script needs to pull data from hundreds of Web pages, but not millions, so I opted to use threads. The script actually takes the list of things to look for as a set of XPath expressions on the command line, which makes it super flexible. Let me give you some hints for the parts that I found difficult.
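To make the "XPath expressions on the command line" idea concrete, here is a minimal sketch rather than the actual script. The names extract and main, the argparse layout, and reading the page from a saved file instead of fetching it are all assumptions made for illustration.

```python
import argparse

from lxml import html


def extract(markup, expressions):
    """Apply each XPath expression to the parsed page; return {expression: result}."""
    tree = html.fromstring(markup)
    return {expr: tree.xpath(expr) for expr in expressions}


def main(argv=None):
    # Hypothetical command-line layout: one saved page, then any number of XPaths.
    parser = argparse.ArgumentParser(
        description="Apply XPath expressions to a saved HTML page.")
    parser.add_argument("page", help="path to a saved HTML file")
    parser.add_argument("xpaths", nargs="+", help="XPath expressions to evaluate")
    args = parser.parse_args(argv)

    with open(args.page, "rb") as f:
        results = extract(f.read(), args.xpaths)
    for expr, result in results.items():
        print(expr, "=>", result)
```

Calling main(["saved.html", "//title/text()", "//h1/text()"]) (saved.html being a placeholder file name) would print one line per expression. Since the expressions are just strings, pointing the script at a different site means changing the arguments, not the code.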

First of all, here's how to install it. If you're using Ubuntu, then:
apt-get install libxslt1-dev libxml2-dev
# I also have python-dev, build-essential, etc. installed.
easy_install lxml
easy_install BeautifulSoup
If you're using MacPorts, do
port install py25-lxml
easy_install BeautifulSoup
The FAQ states that if you use MacPorts, you may encounter difficulties because you will have multiple versions of libxml and libxslt installed. For instance, the following may segfault:
python -c "import webbrowser; from lxml import etree, html"
Whereas the following shouldn't:
env DYLD_LIBRARY_PATH=/opt/local/lib \
python -c "import webbrowser; from lxml import etree, html"
You also have to be careful of thread safety issues. I was sharing an effectively read-only instance of the etree.XPath class between multiple threads, but that ended up causing bus errors. Ah, the joys of extensions written in C! It's a good reminder that the safest way to do multithreaded programming is to have each thread live in its own process ;)
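One way to avoid sharing a compiled XPath between threads is to let each thread compile its own copy and keep it in thread-local storage. The sketch below shows that pattern; the helper names (get_title_xpath, extract_title, run_in_threads) are made up for illustration, and I'm assuming, not claiming, that this sidesteps the bus errors on any particular lxml version.

```python
import threading

from lxml import etree

# Each thread gets its own attribute namespace on this object.
_local = threading.local()


def get_title_xpath():
    """Return an XPath object compiled by, and private to, the calling thread."""
    xp = getattr(_local, "title_xpath", None)
    if xp is None:
        xp = etree.XPath("string(//title)")
        _local.title_xpath = xp
    return xp


def extract_title(markup):
    # Parse and evaluate in the same thread, so nothing is shared.
    tree = etree.fromstring(markup)
    return get_title_xpath()(tree)


def run_in_threads(pages):
    """Extract the title of each page in its own thread; results keep input order."""
    results = [None] * len(pages)

    def worker(index, markup):
        results[index] = extract_title(markup)

    threads = [threading.Thread(target=worker, args=(i, page))
               for i, page in enumerate(pages)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The cost is one extra compile per thread, which is negligible next to fetching and parsing hundreds of pages.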

lxml permits access to regular expressions from within XPath expressions. That's super useful. I had a hard time getting it working though. I forgot to pass in the right XML namespace in one part of the code. For some reason, I wasn't getting an error message. (As a general rule, I love it when software fails fast and complains loudly when I do something stupid.) Furthermore, my knowledge of XSLT was weak enough that I had a really hard time figuring out how to combine the XPath expression with the regex. Anyway, here's how to create an etree.XPath instance containing a regex:
from lxml import etree
XPATH_NAMESPACES = dict(re='http://exslt.org/regular-expressions')
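xpath = etree.XPath("re:replace(//title/text(), 'From', '', 'To')",
                    namespaces=XPATH_NAMESPACES)
# tree is a document you have already parsed, e.g. with etree.parse().
match = xpath(tree)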
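Here is a self-contained version of the same idea that you can paste into an interpreter. The sample document and the variable names are invented for illustration. The argument order for re:replace is (string, pattern, flags, replacement), so the empty third argument means "no flags", and without the 'g' flag only the first match is replaced. The second expression uses re:test as a predicate to filter nodes.

```python
from lxml import etree

# Namespace URI for the EXSLT regular-expressions extension functions.
XPATH_NAMESPACES = {"re": "http://exslt.org/regular-expressions"}

# A tiny stand-in document; the real crawler parsed fetched Web pages.
tree = etree.fromstring(
    "<html><head><title>From Python to lxml</title></head>"
    "<body>"
    "<a href='http://example.com/a'>absolute</a>"
    "<a href='/b'>relative</a>"
    "</body></html>"
)

# re:replace(string, pattern, flags, replacement): swap 'From' for 'To' once.
replace_title = etree.XPath(
    "re:replace(//title/text(), 'From', '', 'To')",
    namespaces=XPATH_NAMESPACES,
)
new_title = replace_title(tree)

# re:test(string, pattern) as a predicate: keep only links with absolute URLs.
absolute_links = etree.XPath(
    "//a[re:test(@href, '^https?://')]/@href",
    namespaces=XPATH_NAMESPACES,
)
links = absolute_links(tree)
```

Note that the namespaces argument has to be passed everywhere an expression with the re: prefix is compiled; that was the piece I had forgotten in one place.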
Anyway, lxml is frickin' awesome, and so is BeautifulSoup. With the two together, I can take really, really crappy HTML and access it seamlessly.


Shannon -jj Behrens said...

See also: http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

taocode said...

I had trouble getting libxml2/libxslt and lxml working on my Mac OS X (10.5). I wrote what worked for me at: taocode.blogspot.com

Amit said...

Thanks for the tip. Helped me parse XML produced by lshw. https://gist.github.com/4554484.