The HTML out in the wild may be messy, but it is of vital importance.
Don't use HTMLParser. minidom is horrible. Beautiful Soup is nicer. html5lib is theoretically fantastic, but it's very slow. lxml, the Python binding for libxml2, is really nice. It's similar to html5lib in how it handles real-world markup, but way, way faster.
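To make the contrast concrete, here is a minimal sketch of what goes wrong when a strict XML parser meets messy markup. It assumes that part of the complaint about minidom is that it rejects anything that isn't well-formed XML, and that the library meant by "libxml" is lxml, the Python binding for libxml2. The helper names `strict_parse_ok` and `lenient_text` and the sample string are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: typical wild HTML with an unclosed <br> and <p>.
MESSY = "<p>Hello<br>world"


def strict_parse_ok(markup):
    """Return True if minidom accepts the markup, False if it rejects it."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        # minidom is an XML parser: unclosed tags are a fatal error.
        return False
    return True


def lenient_text(markup):
    """Return the text content via lxml, or None if lxml isn't installed."""
    try:
        import lxml.html  # third-party; assumed to be the library meant above
    except ImportError:
        return None
    # lxml.html repairs the markup instead of raising on it.
    return lxml.html.fromstring(markup).text_content()


if __name__ == "__main__":
    print("minidom accepts well-formed markup:", strict_parse_ok("<p>ok</p>"))
    print("minidom accepts messy markup:", strict_parse_ok(MESSY))
    print("lxml text content:", lenient_text(MESSY))
```

The strict parser fails on the very first unclosed tag, while a lenient HTML parser recovers a usable tree from the same input. Beautiful Soup and html5lib are lenient in the same sense; the difference between them is mostly speed and how closely they follow browser behavior.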