html5lib 0.10 Released
html5lib 0.10 is now available for your HTML-parsing pleasure.
html5lib is an implementation of the HTML 5 parsing algorithm, available in both Python and Ruby flavours. The HTML 5 algorithm is based on reverse engineering the behaviour of popular web browsers, so it is compatible with the wide variety of broken HTML encountered on the web.
Features in 0.10:
- Parse HTML to a variety of common tree formats including minidom, ElementTree and BeautifulSoup (Python), and hpricot and rexml (Ruby) as well as a custom simpletree format
- Automatic detection of character encoding from meta elements and using frequency analysis (if chardet is available)
- Sanitization of markup and CSS using a whitelist approach
- Liberal XML parsing
- Conversion of trees to event streams and Genshi-inspired filters for those streams
- Flexible serializers for writing out streams in HTML and XHTML-syntax
- A prototype HTML 5 validator
- A large test suite
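To give a feel for the "tree to event stream, filter, serialize" pipeline mentioned in the list above, here is a minimal sketch using only the Python standard library. It is not html5lib's own treewalker, filter or serializer code: the function names `walk`, `drop_attribute` and `serialize` are made up for illustration, the event tuples are an assumed format, and the input is well-formed markup parsed with ElementTree rather than real-world HTML parsed by html5lib.

```python
import xml.etree.ElementTree as ET
from html import escape


def walk(element):
    """Yield (kind, data) events for an ElementTree element, depth first.

    Hypothetical event format: ("start", (tag, attrs)), ("text", str),
    ("end", tag).
    """
    yield ("start", (element.tag, dict(element.attrib)))
    if element.text:
        yield ("text", element.text)
    for child in element:
        yield from walk(child)
        # Text following a child lives in that child's .tail in ElementTree.
        if child.tail:
            yield ("text", child.tail)
    yield ("end", element.tag)


def drop_attribute(events, name):
    """Filter: remove the named attribute from every start event."""
    for kind, data in events:
        if kind == "start":
            tag, attrs = data
            attrs = {k: v for k, v in attrs.items() if k != name}
            yield (kind, (tag, attrs))
        else:
            yield (kind, data)


def serialize(events):
    """Turn an event stream back into a markup string."""
    parts = []
    for kind, data in events:
        if kind == "start":
            tag, attrs = data
            attr_text = "".join(
                ' %s="%s"' % (k, escape(v, quote=True)) for k, v in attrs.items()
            )
            parts.append("<%s%s>" % (tag, attr_text))
        elif kind == "text":
            parts.append(escape(data, quote=False))
        else:  # "end"
            parts.append("</%s>" % data)
    return "".join(parts)


tree = ET.fromstring('<p class="intro" onclick="evil()">Hello <b>world</b>!</p>')
output = serialize(drop_attribute(walk(tree), "onclick"))
print(output)
```

Because each stage is a generator that consumes and produces events, filters can be chained in any order before the serializer, which is the appeal of the stream-based design.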
Download:
Cool! Thanks man
I suck at programming, but I downloaded the Python one yesterday to see what it was like. Wow, it really makes my stuff look like a dog’s breakfast! It’s just so neat and carefully crafted. And there’s a lot of intricacy in what it does.
Even though everything is written with enviable concision, there’s a lot of code which has gone into this. Awesome work! 🙂
(Just noticed that each time I previewed, backslashes were being added into my name.)