The WHATWG Blog — html5lib 0.10 Released

html5lib 0.10 Released

October 9th, 2007 by James Graham in WHATWG

html5lib 0.10 is now available for your HTML-parsing pleasure.

html5lib is an implementation of the HTML 5 parsing algorithm, available in both Python and Ruby flavours. The HTML 5 algorithm is based on reverse engineering the behaviour of popular web browsers, and so is compatible with the wide variety of broken HTML encountered on the web.

Features in 0.10:

Parse HTML to a variety of common tree formats, including minidom, ElementTree and BeautifulSoup (Python) and hpricot and rexml (Ruby), as well as a custom simpletree format
Automatic detection of character encoding from meta elements and by frequency analysis (if chardet is available)
Sanitization of markup and CSS using a whitelist approach
Liberal XML parsing
Conversion of trees to event streams and Genshi-inspired filters for those streams
Flexible serializers for writing out streams in HTML and XHTML-syntax
A prototype HTML 5 validator
A large test suite
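
To give a rough feel for how the tree-to-event-stream, filter, whitelist and serializer features fit together, here is a minimal sketch using only the Python standard library. It is not html5lib's code or API: the names tree_to_events, whitelist_filter, serialize and sanitize, and the contents of the whitelist, are all made up for illustration. It also assumes well-formed input, because it uses the stdlib XML parser as a stand-in for the HTML 5 parser.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical whitelist, chosen only for this sketch.
ALLOWED_TAGS = {"p", "em", "strong", "a"}
ALLOWED_ATTRS = {"href", "title"}


def tree_to_events(element):
    """Walk an ElementTree element depth-first, yielding (kind, value, attrs) events."""
    yield ("start", element.tag, dict(element.attrib))
    if element.text:
        yield ("text", element.text, None)
    for child in element:
        yield from tree_to_events(child)
        # Text that follows a child element is stored on that child as .tail.
        if child.tail:
            yield ("text", child.tail, None)
    yield ("end", element.tag, None)


def whitelist_filter(events):
    """Pass text through; drop tags and attributes that are not whitelisted."""
    for kind, value, attrs in events:
        if kind == "text":
            yield (kind, value, attrs)
        elif value in ALLOWED_TAGS:
            if kind == "start":
                attrs = {k: v for k, v in attrs.items() if k in ALLOWED_ATTRS}
            yield (kind, value, attrs)
        # Non-whitelisted start/end tags are dropped; their text content still
        # flows through as escaped text.


def serialize(events):
    """Write an event stream back out as markup, escaping text and attribute values."""
    parts = []
    for kind, value, attrs in events:
        if kind == "text":
            parts.append(escape(value, quote=False))
        elif kind == "start":
            attr_text = "".join(
                f' {name}="{escape(val)}"' for name, val in sorted(attrs.items())
            )
            parts.append(f"<{value}{attr_text}>")
        else:
            parts.append(f"</{value}>")
    return "".join(parts)


def sanitize(markup):
    """Parse well-formed markup, filter the event stream, and serialize the result."""
    root = ET.fromstring(markup)
    return serialize(whitelist_filter(tree_to_events(root)))


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <script>alert(1)</script><em>there</em></p>'))
```

Because each stage is a generator over the same event tuples, further filters can be chained between tree_to_events and serialize without touching the tree. The sketch does not inspect attribute values (for example the scheme of an href), so it should not be treated as a real sanitizer.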

Download:


Cool! Thanks man

I suck at programming, but I downloaded the Python one yesterday to see what it was like. Wow, it really makes my stuff look like a dog’s breakfast! It’s just so neat and carefully crafted. And there’s a lot of intricacy in what it does.

Even though everything is written with enviable concision, there’s a lot of code which has gone into this. Awesome work! 🙂

(Just noticed that each time I previewed, backslashes were being added into my name.)

3 Responses to “html5lib 0.10 Released”