html5lib 0.9 Released
html5lib 0.9 is now available for your parsing pleasure.
html5lib is a Python implementation of the WHATWG HTML parsing algorithm, released under an MIT license. It enables malformed HTML to be parsed into standard minidom and ElementTree structures, in a way that is highly compatible with the behavior of major desktop web browsers. As well as parsing to trees, html5lib contains a DOM to SAX converter; it is hoped that by supporting these standard APIs, toolchains based on draconian XML parsers can be repurposed to process HTML content with minimal effort.
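As a rough illustration of parsing malformed HTML into the two tree types mentioned above, here is a minimal sketch. It assumes html5lib is installed and uses the `html5lib.parse` convenience function with its `treebuilder` and `namespaceHTMLElements` parameters as found in later html5lib releases; the exact API of the 0.9 release may differ. The sample markup is made up for the example.

```python
import html5lib

# Deliberately malformed input: no <html>, <head> or <body>, and neither
# paragraph is closed.
markup = "<p>one<p>two"

# Parse to an ElementTree structure. namespaceHTMLElements=False keeps tag
# names plain ("p") rather than qualified with the XHTML namespace.
root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
paragraphs = [p.text for p in root.iter("p")]
print(paragraphs)

# Parse the same input to a minidom document instead.
dom = html5lib.parse(markup, treebuilder="dom")
print(len(dom.getElementsByTagName("p")))
```

Both trees should contain two sibling paragraph elements, since the parsing algorithm closes the first paragraph implicitly when the second one opens.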
In addition to the HTML parsing capability, html5lib 0.9 contains an experimental liberal XML parser based on the WHATWG algorithm without the HTML-specific error handling. This is suitable for parsing XML from sources that cannot guarantee well-formedness, e.g. web feeds.
The 0.9 release is expected to be the last major release before 1.0, and no new features will be added before 1.0 is released. Instead we will work on any remaining correctness issues, other bugs, and on improving the messages reported when parse errors are encountered. Bug reports are very much appreciated. Users or people looking to get involved are encouraged to join the mailing list or visit the #WHATWG channel on freenode.net.
Does the XML parsing kick in when there’s an XHTML MIME type? I’m also curious if you know of anyone porting this to PHP, Ruby, Perl, or any other server-side language. I’d be very interested in a PHP version.
The XML parsing is only used when explicitly called (html5lib knows nothing about MIME types as it’s not a networking library).
I believe gsnedders is working on a PHP implementation.