I wrote a web crawler that uses lxml, XPath, and Beautiful Soup to easily pull data from a set of poorly formatted Web pages. In short, it works, and I'm quite happy :)
The script needs to pull data from hundreds of Web pages, but not millions, so I opted to use threads. The script actually takes the list of things to look for as a set of XPath expressions on the command line, which makes it super flexible. Let me give you some hints for the parts that I found difficult.
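To give you a sense of the overall shape before diving into the hints, here's a minimal sketch of that design. This isn't my actual script; the worker/poison-pill structure, urls.txt, and NUM_THREADS are made up for illustration:

import sys
import threading
from queue import Queue
from urllib.request import urlopen

from lxml import etree, html

NUM_THREADS = 10

def worker(url_queue, xpath_exprs, results, lock):
    # Compile the XPath expressions per thread; sharing compiled
    # etree.XPath instances across threads is what bit me (see below).
    xpaths = [etree.XPath(expr) for expr in xpath_exprs]
    while True:
        url = url_queue.get()
        if url is None:  # poison pill: time to shut down
            break
        try:
            # Parse the page; real code would want error handling here.
            tree = html.parse(urlopen(url))
            with lock:
                results.append((url, [xp(tree) for xp in xpaths]))
        finally:
            url_queue.task_done()

def main():
    xpath_exprs = sys.argv[1:]  # XPath expressions from the command line
    urls = [line.strip() for line in open('urls.txt')]
    url_queue, results, lock = Queue(), [], threading.Lock()
    threads = [threading.Thread(target=worker,
                                args=(url_queue, xpath_exprs, results, lock))
               for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for url in urls:
        url_queue.put(url)
    url_queue.join()  # wait until every page has been processed
    for t in threads:
        url_queue.put(None)
    for t in threads:
        t.join()
    for url, values in results:
        print(url, values)

if __name__ == '__main__':
    main()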
First of all, here's how to install it. If you're using Ubuntu, then:
apt-get install libxslt1-dev libxml2-dev
# I also have python-dev, build-essentials, etc. installed.
easy_install lxml
easy_install BeautifulSoup

If you're using MacPorts, do:

port install py25-lxml
easy_install BeautifulSoup

The FAQ states that if you use MacPorts, you may encounter difficulties because you will have multiple versions of libxml and libxslt installed. For instance, the following may segfault:

python -c "import webbrowser; from lxml import etree, html"

Whereas the following shouldn't:

env DYLD_LIBRARY_PATH=/opt/local/lib \
python -c "import webbrowser; from lxml import etree, html"

You also have to be careful of thread safety issues. I was sharing an effectively read-only instance of the etree.XPath class between multiple threads, but that ended up causing bus errors. Ah, the joys of extensions written in C! It's a good reminder that the safest way to do multithreaded programming is to have each thread live in its own process ;)
lxml permits access to regular expressions from within XPath expressions. That's super useful. I had a hard time getting it working though. I forgot to pass in the right XML namespace in one part of the code. For some reason, I wasn't getting an error message. (As a general rule, I love it when software fails fast and complains loudly when I do something stupid.) Furthermore, my knowledge of XSLT was weak enough that I had a really hard time figuring out how to combine the XPath expression with the regex. Anyway, here's how to create an etree.XPath instance containing a regex:
from lxml import etree
XPATH_NAMESPACES = dict(re='http://exslt.org/regular-expressions')
xpath = etree.XPath("re:replace(//title/text(), 'From', '', 'To')",
                    namespaces=XPATH_NAMESPACES)
match = xpath(tree)  # tree is a previously parsed document
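And here's a self-contained example of that expression applied to a real tree. The BeautifulSoup-backed parser module (lxml.html.soupparser) and the sample HTML are my own illustration, assuming BeautifulSoup is installed:

from lxml import etree
from lxml.html import soupparser  # parses via BeautifulSoup

XPATH_NAMESPACES = dict(re='http://exslt.org/regular-expressions')
xpath = etree.XPath("re:replace(//title/text(), 'From', '', 'To')",
                    namespaces=XPATH_NAMESPACES)

# Deliberately crappy HTML: unclosed tags, no closing body or html.
crappy_html = "<html><title>From Here</title><body><p>hi<p>there"
tree = soupparser.fromstring(crappy_html)
print(xpath(tree))  # prints "To Here"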
Anyway, lxml is frickin' awesome, and so is BeautifulSoup. Together, I can take really, really crappy HTML, and access it seamlessly.