html5lib 0.9 Released

March 11th, 2007 by James Graham

html5lib 0.9 is now available for your parsing pleasure.

html5lib is an implementation of the WHATWG HTML parsing algorithm in Python, released under an MIT license. It enables malformed HTML to be parsed into standard minidom and ElementTree structures in a way that is highly compatible with the behavior of major desktop web browsers. As well as parsing to trees, html5lib contains a DOM-to-SAX converter; it is hoped that by supporting these standard APIs, toolchains based on draconian XML parsers can be repurposed to process HTML content with minimal effort.
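As a rough illustration of parsing malformed markup into both tree types, here is a minimal sketch. It assumes a recent html5lib release that provides the `html5lib.parse` convenience function with the `treebuilder` and `namespaceHTMLElements` arguments; the exact API of the 0.9 release may differ. The sample markup is made up for the example.

```python
import html5lib

# Malformed input (made up for this sketch): no <html> or <body> wrappers,
# and two <p> elements that are never closed.
markup = "<title>Test</title><p>One<p>Two"

# Parse to an ElementTree element. namespaceHTMLElements=False keeps plain
# tag names such as "p" instead of namespace-qualified ones.
root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
paragraphs = [p.text for p in root.find("body").findall("p")]
print(root.tag, paragraphs)

# Parse the same input to an xml.dom.minidom document.
dom = html5lib.parse(markup, treebuilder="dom")
print(len(dom.getElementsByTagName("p")))
```

The parser supplies the missing html, head and body elements and closes the first paragraph when the second one starts, much as a browser would.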

In addition to the HTML parsing capability, html5lib 0.9 contains an experimental liberal XML parser based on the WHATWG algorithm without the HTML-specific error handling. This is suitable for parsing XML from sources that cannot guarantee well-formedness, such as web feeds.

The 0.9 release is expected to be the last major release before 1.0, and no new features will be added before 1.0 is released. Instead we will work on any remaining correctness issues, other bugs, and on improving the messages reported when parse errors are encountered. Bug reports are very much appreciated. Users and people looking to get involved are encouraged to join the mailing list or visit the #WHATWG channel on freenode.net.

This entry was posted on Sunday, March 11th, 2007 at 22:04 and is filed under WHATWG.

2 Responses to “html5lib 0.9 Released”

  1. Devon Young says:

    Does the XML parsing kick in when there’s an XHTML MIME type? I’m also curious if you know of anyone porting this to PHP, Ruby, Perl, or any other server-side language. I’d be very interested in a PHP version.

  2. jgraham says:

    The XML parsing is only used when explicitly called (html5lib knows nothing about MIME types as it’s not a networking library).

    I believe gsnedders is working on a PHP implementation.