html5lib 0.10 Released

October 9th, 2007 by James Graham

html5lib 0.10 is now available for your HTML-parsing pleasure.

html5lib is an implementation of the HTML 5 parsing algorithm, available in both Python and Ruby flavours. The HTML 5 algorithm is based on reverse engineering the behaviour of popular web browsers, so it handles the many kinds of broken HTML encountered on the web in the same way those browsers do.

Features in 0.10:

  • Parse HTML to a variety of common tree formats including minidom, ElementTree and BeautifulSoup (Python), and hpricot and rexml (Ruby) as well as a custom simpletree format
  • Automatic detection of character encoding from meta elements and, if chardet is available, by frequency analysis
  • Sanitization of markup and CSS using a whitelist approach
  • Liberal XML parsing
  • Conversion of trees to event streams and Genshi-inspired filters for those streams
  • Flexible serializers for writing out streams in HTML or XHTML syntax
  • A prototype HTML 5 validator
  • A large test suite
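
To give a feel for what "a whitelist approach" to sanitization means, here is a minimal sketch in Python. It is not html5lib's sanitizer and does not use html5lib's API: it is built on the standard library's html.parser, and the names ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_URL_PREFIXES, WhitelistSanitizer and sanitize are all made up for illustration. The idea is simply that anything not explicitly listed as safe is dropped, rather than trying to list everything dangerous.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration only; html5lib's real lists
# are much longer and also cover CSS.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "strong"}
ALLOWED_ATTRS = {"href", "title"}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; any text inside is still emitted
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS:
                continue
            # Only keep links whose scheme is explicitly allowed.
            if name == "href" and not value.strip().lower().startswith(
                SAFE_URL_PREFIXES
            ):
                continue
            kept.append(' {}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is always escaped, so it can never be read back as markup.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="x()">Hi <b>there</b><script>alert(1)</script></p>'))
```

Note that the script element and the onclick attribute vanish without ever being named in the code: they are removed because they are absent from the whitelist. A real sanitizer, html5lib's included, sits on top of a proper HTML 5 parser so that it sees the same tree a browser would; the standard-library parser used here makes no such promise.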

Download:

3 Responses to “html5lib 0.10 Released”

  1. Pipo Lambert Says:

    Cool! Thanks man

  2. Ben 'Cerbera' Millard Says:

    I suck at programming, but I downloaded the Python one yesterday to see what it was like. Wow, it really makes my stuff look like a dog’s breakfast! It’s just so neat and carefully crafted. And there’s a lot of intricacy in what it does.

    Even though everything is written with enviable concision, there’s a lot of code which has gone into this. Awesome work! :)

  3. Ben 'Cerbera' Millard Says:

    (Just noticed that each time I previewed, backslashes were being added into my name.)