The WHATWG Blog — validator

This Week in HTML5 – Episode 35

Wednesday, September 16th, 2009

On August 7, 2009, Adrian Bateman did what no man or woman had ever done before: he gave substantive feedback on the current editor's draft of HTML5 on behalf of Microsoft. His feedback was detailed and well-reasoned, and it spawned much discussion. See also: Adrian's followup on <progress> (and more here); his followup on <canvas>; on <datagrid> (which was actually dropped from HTML5 just minutes after Adrian posted his initial feedback, for unrelated reasons); his discussion with a Mozilla developer about <bb> (which was also subsequently dropped); discussion about <dialog> (which has now been dropped -- more on that in just a minute); Adrian's followup on <keygen>, with additional concerns listed here; his followup on the new input types; and last but not least, his position on <video> and <audio> (and followup about best-choice algorithms).

As you might expect, much of the discussion since August 7 has been driven by Microsoft's feedback. After five years of virtual silence, nobody wants to miss the opportunity to engage with a representative of the world's still-dominant browser. I want to focus on two discussions that have led to recent spec changes.

The rise and fall of `<keygen>`

The <keygen> element was invented by Netscape and subsequently reverse-engineered by every other browser vendor except Microsoft. It had never been part of any HTML specification before; indeed, it wasn't well-documented anywhere. It was added to HTML5 earlier this year (and covered in episode 12 and episode 31). The spec text borrows heavily from this incredibly detailed documentation posted to the WHATWG mailing list last July.

Adrian, on behalf of Microsoft, has stated in no uncertain terms that Microsoft has no intention of ever implementing the <keygen> element, and they would like it to be removed from HTML5. But what does "implementing" <keygen> mean? Well, the point of <keygen> is to provide a cryptography API, but as Ian Hickson points out, the element itself "integrates tightly with the form submission model, it affects the DOM APIs of other elements, it affects the parser, it affects the form control validity model -- it's not a feature that can be sensibly considered 'optional' if our goal is cross-browser interoperability. However, there is an alternative that I think would still satisfy Microsoft's desires to not implement <keygen>'s cryptographic features while still bringing interoperability to the platform in every other respect: we could make the support of each individual signature algorithm optional."

r3843 does just that -- it makes the cryptography parts of <keygen> optional. It is important that the element itself be implemented (even without the crypto bits), because it interacts with the DOM and the parsing model in strange ways.

As an postscript of sorts, I should point out that a recent change to the keytype attribute (r3868) allows client-side script to detect whether the crypto bits are actually supported. Detecting features is important, and this subtle change will allow authors to write feature detection scripts instead of relying on browser sniffing to decide whether to use keygen-based cryptography.

The art of conversation is, like, dead and stuff

Another hot topic this week is the removal of the <dialog> element. As I mentioned at the beginning of this article, Microsoft questioned the wisdom of a specialized element for marking up dialog. Other people have suggested that the element does not actually go far enough -- it lets you mark up basic conversation (people talking), but provides no semantics for stage directions, actions, thoughts, voiceover narration, and so on. (See also: Unwebbable, "The screenplay problem.")

To decide this burning question, the <dialog> element was removed, and conversations are now listed in the section on Common idioms without dedicated elements, with an example of how one might mark up a conversation with more generic markup, if one were inclined to do so. Predictably, this has already caused some backlash from the pro-<dialog> camp.

The conversation about marking up conversations also intersects with another burning question: can you use the <cite> element to mark up a person's name? HTML 4 said yes, and even provided an example that used the <cite> element that way. Dan Connolly, who added the <cite> element to HTML 2 (yes, 2), says "I consider that a bug in the HTML 4 spec. I wish I had reviewed it more closely." Still, specs are normative, not what their authors say about them after-the-fact, and the web has collectively had 12 years of HTML 4 which explicitly blessed the technique. I've used <cite> to mark up people's names for years in my own markup, and I'm certainly not going to go back and change all those blog entries to conform to somebody's sense of purity.

New examples

Speaking of examples, the HTML5 spec just got a whole lot more of them. To wit:

r3784: <html> example
r3785: <head> example
r3786: <base> example
r3787: <link> example
r3789: <meta> example
r3790: <style> example
r3791: <script> example
r3793 and r3856: <noscript> example
r3796: <article> example
r3808: <iframe> example
r3809: <embed> example
r3810: <canvas> example
r3811: <math> example
r3814: <form> and <fieldset> examples
r3815: list="" example
r3816: readonly="" example
r3817: required="" example
r3818: multiple="" example
r3819: maxlength="" example
r3820: step, min, and max examples
r3821: <button> example
r3822: <select> example
r3823: <optgroup> example
r3824: <textarea> and <output> examples
r3825: <details> example
r3801 and r3806: <h1> example
r3804: <hr> and <span> examples
r3805: <del> example
r3863: example of how to get the filename out of input.value
r3799: example of how to mark up comments on a weblog using <article>, <section>, and <header>

Validating attribute values

Monday, November 26th, 2007

Traditionally, SGML-based HTML validation has treated most attribute values as “anything goes” strings. This has meant that all kinds of bogus values have passed as valid. W3C XML Schema added a fixed set of datatypes. The spec is mostly useless for HTML5 validation, since the HTML5 microsyntaxes do not match exactly the XSD datatypes for the same concepts. XSD regular expressions were suitable for representing the syntax of a number of HTML5 microsyntaxes, though. XSD datatypes can be used from RELAX NG and Validator.nu used to XSD regular expressions for many HTML5 microsyntaxes.

The problem with using XSD regular expressions has been that they are not user-friendly. When an attribute value did not match the required regular expression, the UI told the user that the attribute value was “bad”. Nothing else.

Fortunately, unlike XSD, RELAX NG allows pluggable datatype libraries. A datatype library is a library written in a general-purpose programming language. The RELAX NG engine calls the library to check if a string conforms to a named datatype. Validator.nu has used this approach for a long time for the more complex microsyntaxes in HTML5.

I have recently made an effort to move Validator.nu away from XSD regular expressions to a more comprehensive custom datatype library. Even though as a formalism regular expressions are sufficient for many syntaxes, writing checking by hand allows more useful error messages in cases of failure. Moreover, having identifiers for the datatypes makes it possible to tell which datatype failed as opposed to the UI being able to tell that some regular expression failed under the hood. This allows the UI to pull in per-datatype advice from a wiki page.

In addition to improving the user experience with previously supported microsyntaxes such as integers, I have implemented support for previously unsupported microsyntaxes such as MIME types and Media Queries.

There is still work to do. For example, the syntaxes for accept-charset and WF2 type=email are not done. data: and mailto: IRIs are not properly validated yet. The syntaxes for image map coordinates still use XSD regular expressions. The advice on the wiki page is far from complete. (You can help!)

For the parts already implemented, please try the new features out and let me know what needs improvement.

Posted in Conformance Checking, Syntax | Comments Off on Validating attribute values

This Week in HTML5 – Episode 35

The rise and fall of <keygen>

The art of conversation is, like, dead and stuff

New examples

Further Reading

Validating attribute values

The rise and fall of `<keygen>`