html5lib 0.10 is now available for your HTML-parsing pleasure.
html5lib is an implementation of the HTML 5 parsing algorithm, available in both Python and Ruby flavours. The HTML 5 algorithm was reverse engineered from the behaviour of popular web browsers, so it copes with the wide variety of broken HTML encountered on the web.
Features in 0.10:
- Parse HTML to a variety of common tree formats, including minidom, ElementTree and BeautifulSoup (Python) and hpricot and rexml (Ruby), as well as a custom simpletree format
- Automatic detection of character encoding from meta elements and using frequency analysis (if chardet is available)
- Sanitization of markup and CSS using a whitelist approach
- Liberal XML parsing
- Conversion of trees to event streams and Genshi-inspired filters for those streams
- Flexible serializers for writing out streams in HTML and XHTML syntax
- A prototype HTML 5 validator
- A large test suite
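
To give a rough feel for what "sanitization using a whitelist approach" means, here is a minimal sketch written against Python's standard library only. It is not html5lib's sanitizer and does not use html5lib's API: the names WhitelistSanitizer, sanitize, ALLOWED_TAGS and ALLOWED_ATTRS are hypothetical, the whitelist is deliberately tiny, and the choice to escape disallowed tags rather than drop them is an assumption made for the example. It also does not check URL schemes or CSS, so it should not be treated as safe for real input.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration only; not html5lib's actual lists.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "strong"}
ALLOWED_ATTRS = {"href", "title"}


class WhitelistSanitizer(HTMLParser):
    """Keep whitelisted tags and attributes; escape everything else as text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Re-emit the tag with only the whitelisted attributes.
            kept = "".join(
                ' %s="%s"' % (name, escape(value or "", quote=True))
                for name, value in attrs
                if name in ALLOWED_ATTRS
            )
            self.out.append("<%s%s>" % (tag, kept))
        else:
            # Disallowed tag: show its source text instead of rendering it.
            self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)
        else:
            self.out.append(escape("</%s>" % tag))

    def handle_data(self, data):
        # Text content is always escaped on the way out.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup with anything outside the whitelist neutralised."""
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <script>alert(1)</script><b>there</b></p>'))
```

The point of the whitelist design is that anything not explicitly allowed is neutralised by default, so unknown or newly invented tags and attributes cannot slip through the way they can with a blacklist.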