The WHATWG Blog — validator.nu

Validator.nu HTML Parser 1.1.0

Monday, August 25th, 2008

I have released a new version of the Validator.nu HTML Parser (an implementation of the HTML5 parsing algorithm in Java). The new release supports SVG and MathML subtrees, is faster than the old version, fixes bugs, is more portable and supports applications that want to do document.write().

The parser comes with a sample app that makes it possible to use XSLT programs written for XHTML5+SVG+MathML with text/html.

Warning! The internal APIs have changed. Please refer to the Upgrade Guide below.

Change Log

Made the SAX, DOM and XOM parser entry point constructors default to altering the infoset instead of throwing when the input needs coercing to be an XML 1.0 4th ed. plus Namespaces infoset.
Isolated Java IO dependent code from the parser core. The parser core now compiles on Google Web Toolkit.
Refactored the tokenizer to use a switch branch per state instead of method per state.
Made various performance tweaks to the tokenizer.
Implemented support for MathML and SVG foreign content. (Note that the SVG part is based on spec text that has been commented out from the spec at the request of the SVG WG.)
Made the parser suspendable after any input character.
Made it possible for custom TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
Made changes to the parser core to make it more suitable for mechanical translation into other object-oriented programming languages that have C-like control structures but not necessarily a garbage collector (with focus on targeting C++). This work is not complete.
Made the HTML serializer do the right thing when input represents a conforming XHTML+SVG+MathML tree. (Results may be bad for non-conforming input trees.)
Developed sample programs for converting between HTML5 and XHTML5 when the input is known to be conforming.
Provided an XML serializer so that the sample code no longer depends on the Xalan serializer.
Improved API documentation.
Fixed bugs in the tokenizer, tree builder and the input stream character encoding decoder.
Made coercion to an XML infoset work according to the HTML5 spec.
Added ID uniqueness checking.
Various other fixes.

Upgrade Guide from 1.0.7 to 1.1.0

In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.

If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.

If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.

If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

You do not need to change your code to upgrade.

If you have your own subclass of TreeBuilder:

The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)

The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.

If you have your own implementation of TokenHandler:

Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.

Posted in Syntax | Comments Off on Validator.nu HTML Parser 1.1.0

HTML5 conformance checking in Vim

Sunday, May 18th, 2008

Kai Hendry has written an HTML filetype plugin for Vim that allows you to use Henri Sivonen’s Validator.nu conformance checking (validation) service remotely to check the contents of any HTML document you edit in Vim and determine if the document is HTML5-conformant (valid).

The filetype plugin is also demo'ed in a screencast tutorial on editing Web applications that Kai has blogged about in a VIM IDE for Web applications posting on his blog (see the blog posting for a link to the video).

All that you need to do to install the Vim filetype plugin is to download the plugin source and save it into ~/.vim/ftplugin/html.vim. To use it to check a document, first do :make within Vim, then use :cope and :clist and :cnext and such to locate the errors (for more details, read the section of the Vim docs that relates to those commands.)

How and why it works

Vim has a set of “quickfix” commands that provide something that many development IDEs also have these days: A way to run a compiler or lint checker or other external tool on the contents of a file you are editing, and then to have any errors returned — along with the line and column numbers of the places in your file where the errors occur — as a list that you can then easily step through or jump through one-by-one and fix. It’s a very powerful feature.

Kai’s HTML filetype plugin provides a way to use Vim’s “quickfix” commands to do conformance checking of HTML5 files. The plugin is dead simple; it’s just two lines:

set makeprg=curl\ -s\ -F\ laxtype=yes\ -F\ parser=html5\
  \ -F\ level=error\ -F\ out=gnu\ -F\ doc=@%\
  http://validator.nu
set errorformat=\"%f\":%l.%c-%m

(Note that I've just wrapped the first line for the purpose of readability in this post.)

The makeprg option in the first line tells Vim what “make program” you want to use when checking HTML files. And the errorformat option in the second line tells Vim the expected format of error messages from that “make program” — so that it can parse the error messages to get the line and column numbers of the places in your file where the errors occur (the meanings of the various parts of the string used in that errorformat value are: %f, filename; %l, line number; %c, column number; %m, error message).

Interaction with Validator.nu

What Kai’s HTML filetype plugin does it to use as the “make program” the curl command-line HTTP client, and in turn, to have curl send a POST request to Validator.nu. The contents of that POST request are set by the parameters and values specified by the -F options passed to curl. Essentially what this does is to emulate what would happen if you used the form-based interface at the Validator.nu website to manually set the values of the various form fields in that interface. (Note that wget could be probably used here (with different options) to do the same thing.)

What Validator.nu does in return is to send a response with the list of errors — in a format that allows the list of errors to be easily parsed by tools that have built-in support (like Vim’s “quickfix”) for reading error lists that are in a regular format and doing something with them.

GNU-formatted error output

In this case, since the out=gnu parameter and value were passed to Validator.nu, the particular format in which Validator.nu returns the error list is the standard GNU error format that’s used by many applications (including that other editor, Emacs). This use case (enabling remote validation and error-evaluation with editing applications) is actually one of the main cases for which Henri added the GNU-formatted error-reporting option to Validator.nu.

Validator.nu + Vim = easy HTML5 conformance checking

The end result is that you get the error information back into Vim in a way that lets you more easily locate and fix the errors.

So setting just two options is all it takes in an editing application like Vim to enable Validator.nu to be used remotely like this (that is, to do integrated HTML5 conformance-checking and error-reporting within the editor). This seems to me to be a pretty good testament (another in a long list) to the utility of the Validator.nu service and to the foresight that’s gone into its design.

It guess it also says a lot about the utility of Vim and the foresight that’s gone into its design — but we all already know how great Vim is, right? 🙂

Posted in Conformance Checking | 5 Comments »