The WHATWG Blog — html5lib

html5lib 0.10 Released

Tuesday, October 9th, 2007

html5lib 0.10 is now available for your HTML-parsing pleasure.

html5lib is an implementation of the HTML 5 parsing algorithm, available in both Python and Ruby flavours. The HTML 5 algorithm was derived by reverse engineering the behaviour of popular web browsers, so it handles the many kinds of broken HTML found on the web the same way those browsers do.

Features in 0.10:

Parse HTML to a variety of common tree formats, including minidom, ElementTree and BeautifulSoup (Python) and hpricot and rexml (Ruby), as well as a custom simpletree format
Automatic detection of character encoding from meta elements and, if chardet is available, by frequency analysis
Sanitization of markup and CSS using a whitelist approach
Liberal XML parsing
Conversion of trees to event streams and Genshi-inspired filters for those streams
Flexible serializers for writing out streams in HTML and XHTML syntax
A prototype HTML 5 validator
A large test suite
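
To give a flavour of what the whitelist approach to sanitization means, here is a minimal sketch in Python. It is not html5lib's sanitizer and does not use the html5lib API: it relies only on the standard library's html.parser, and the names ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_SCHEMES, WhitelistSanitizer and sanitize are all hypothetical, picked for this illustration. The idea is that anything not explicitly allowed is escaped or dropped, rather than trying to list everything that is dangerous.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists, chosen only for this illustration.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "strong"}
ALLOWED_ATTRS = {"href", "title"}
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class WhitelistSanitizer(HTMLParser):
    """Re-emit whitelisted tags and attributes; escape everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Disallowed tags are kept visible as escaped text.
            self.out.append(escape(self.get_starttag_text()))
            return
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS:
                continue  # e.g. onclick, style
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unlisted schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)
        else:
            self.out.append(escape("</%s>" % tag))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup with only whitelisted tags and attributes left live."""
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <script>alert(1)</script></p>'))
```

A real sanitizer has much more to handle (CSS, relative URLs, malformed markup that a browser would parse differently from html.parser), so treat this as an outline of the technique only.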

Download:
