The WHATWG Blog

Please leave your sense of logic at the door, thanks!

Validating attribute values

by Henri Sivonen in Conformance Checking, Syntax

Traditionally, SGML-based HTML validation has treated most attribute values as “anything goes” strings. This has meant that all kinds of bogus values have passed as valid. W3C XML Schema added a fixed set of datatypes. The spec is mostly useless for HTML5 validation, since the HTML5 microsyntaxes do not match exactly the XSD datatypes for the same concepts. XSD regular expressions were suitable for representing the syntax of a number of HTML5 microsyntaxes, though. XSD datatypes can be used from RELAX NG and Validator.nu used to XSD regular expressions for many HTML5 microsyntaxes.

The problem with using XSD regular expressions has been that they are not user-friendly. When an attribute value did not match the required regular expression, the UI told the user that the attribute value was “bad”. Nothing else.

Fortunately, unlike XSD, RELAX NG allows pluggable datatype libraries. A datatype library is a library written in a general-purpose programming language. The RELAX NG engine calls the library to check if a string conforms to a named datatype. Validator.nu has used this approach for a long time for the more complex microsyntaxes in HTML5.

I have recently made an effort to move Validator.nu away from XSD regular expressions to a more comprehensive custom datatype library. Even though as a formalism regular expressions are sufficient for many syntaxes, writing checking by hand allows more useful error messages in cases of failure. Moreover, having identifiers for the datatypes makes it possible to tell which datatype failed as opposed to the UI being able to tell that some regular expression failed under the hood. This allows the UI to pull in per-datatype advice from a wiki page.

In addition to improving the user experience with previously supported microsyntaxes such as integers, I have implemented support for previously unsupported microsyntaxes such as MIME types and Media Queries.

There is still work to do. For example, the syntaxes for accept-charset and WF2 type=email are not done. data: and mailto: IRIs are not properly validated yet. The syntaxes for image map coordinates still use XSD regular expressions. The advice on the wiki page is far from complete. (You can help!)

For the parts already implemented, please try the new features out and let me know what needs improvement.

Comments are closed.