Validator.nu HTML Parser 1.1.0
I have released a new version of the Validator.nu HTML Parser (an implementation of the HTML5 parsing algorithm in Java). The new release supports SVG and MathML subtrees, is faster than the old version, fixes bugs, is more portable and supports applications that want to do document.write().
The parser comes with a sample app that makes it possible to use XSLT programs written for XHTML5+SVG+MathML with text/html.
Warning! The internal APIs have changed. Please refer to the Upgrade Guide below.
Change Log
- Made the SAX, DOM and XOM parser entry point constructors default to altering the infoset instead of throwing when the input needs coercing to be an XML 1.0 4th ed. plus Namespaces infoset.
- Isolated Java IO dependent code from the parser core. The parser core now compiles on Google Web Toolkit.
- Refactored the tokenizer to use a switch branch per state instead of a method per state.
- Made various performance tweaks to the tokenizer.
- Implemented support for MathML and SVG foreign content. (Note that the SVG part is based on spec text that has been commented out from the spec at the request of the SVG WG.)
- Made the parser suspendable after any input character.
- Made it possible for custom TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
- Made changes to the parser core to make it more suitable for mechanical translation into other object-oriented programming languages that have C-like control structures but not necessarily a garbage collector (with focus on targeting C++). This work is not complete.
- Made the HTML serializer do the right thing when input represents a conforming XHTML+SVG+MathML tree. (Results may be bad for non-conforming input trees.)
- Developed sample programs for converting between HTML5 and XHTML5 when the input is known to be conforming.
- Provided an XML serializer so that the sample code no longer depends on the Xalan serializer.
- Improved API documentation.
- Fixed bugs in the tokenizer, tree builder and the input stream character encoding decoder.
- Made coercion to an XML infoset work according to the HTML5 spec.
- Added ID uniqueness checking.
- Various other fixes.
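To illustrate two of the tokenizer changes listed above (a switch branch per state, and suspension after any input character), here is a minimal sketch of the general technique. It is not the Validator.nu tokenizer: the class SwitchTokenizerSketch, its three states and its token strings are hypothetical and cover only plain text and simple tags such as <p> and </p>. The point is that when all tokenizer state lives in fields and each character goes through a single switch, the caller can stop feeding input at any character and resume later.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical toy tokenizer (not the Validator.nu implementation).
 * One switch branch per state; all state is kept in fields so that
 * tokenization can pause after any character and resume later.
 */
final class SwitchTokenizerSketch {
    // States are int constants so that they can drive a switch.
    private static final int DATA = 0;
    private static final int TAG_OPEN = 1;
    private static final int TAG_NAME = 2;

    private int state = DATA;
    private boolean endTag = false;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> tokens = new ArrayList<>();

    /** Feeds a chunk of input; may be called repeatedly with partial input. */
    void feed(String chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            switch (state) {
                case DATA:
                    if (c == '<') {
                        flushText();
                        state = TAG_OPEN;
                    } else {
                        buffer.append(c);
                    }
                    break;
                case TAG_OPEN:
                    if (c == '/') {
                        endTag = true;
                    } else {
                        i--; // reconsume this character in the TAG_NAME state
                    }
                    state = TAG_NAME;
                    break;
                case TAG_NAME:
                    if (c == '>') {
                        tokens.add((endTag ? "END:" : "START:") + buffer);
                        buffer.setLength(0);
                        endTag = false;
                        state = DATA;
                    } else {
                        buffer.append(c);
                    }
                    break;
                default:
                    throw new IllegalStateException("Unknown state " + state);
            }
        }
    }

    /** Signals end of input and returns the tokens collected so far. */
    List<String> finish() {
        if (state == DATA) {
            flushText();
        }
        return tokens;
    }

    // Emits pending character data as a TEXT token, if there is any.
    private void flushText() {
        if (buffer.length() > 0) {
            tokens.add("TEXT:" + buffer);
            buffer.setLength(0);
        }
    }
}
```

Calling feed() once with the whole input or once per character yields the same token list, which is the property an application needs if a script may inject more input in the middle of parsing.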
Upgrade Guide from 1.0.7 to 1.1.0
In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.
- If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:
  - If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.
  - If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.
- If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder: You do not need to change your code to upgrade.
- If you have your own subclass of TreeBuilder:
  - The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)
  - The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.
- If you have your own implementation of TokenHandler: Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.
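As a rough illustration of the interned-string remark above, the following sketch shows why == is safe for namespace URIs that are guaranteed to be interned. The class NamespaceDispatchSketch and its classify method are hypothetical and are not part of the parser's API; the real TreeBuilder method signatures are in the JavaDocs. The only assumption carried over from the guide is that the URI passed in is always an interned string.

```java
/**
 * Hypothetical stand-in for code inside a TreeBuilder subclass.
 * The names here are illustrative and not the library's API.
 */
final class NamespaceDispatchSketch {
    // String literals are interned by the JVM, so these constants are
    // the canonical instances for their values.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

    /**
     * Classifies an element by namespace using reference comparison.
     * This is only correct when ns is an interned string.
     */
    static String classify(String ns, String localName) {
        if (ns == XHTML_NS) {
            return "html:" + localName;
        } else if (ns == SVG_NS) {
            return "svg:" + localName;
        } else if (ns == MATHML_NS) {
            return "math:" + localName;
        }
        return "other:" + localName;
    }
}
```

If your own code builds namespace strings at run time (for example by concatenation or by reading them from a file), call intern() on them before comparing with ==, or use equals() instead. A non-interned copy of the SVG namespace URI would fall through to the "other" branch in this sketch.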