The WHATWG Blog — Response to “Notes on HTML 5”

Response to “Notes on HTML 5”

September 3rd, 2009 by Mark Pilgrim, Google in WHATWG

The W3C Technical Architecture Group (TAG) is in the process of reviewing HTML 5. Noah Mendelsohn recently posted his initial, personal, not-speaking-on-behalf-of-TAG notes on HTML 5. Here are my initial, personal, not-speaking-on-behalf-of-WHATWG responses.

Limitations of the XML serialization

This may be old news, but I was surprised to see that document.write() is not supported when parsing the XML serialization. This seems to put the nail in the coffin of XML as a serialization format for colloquial HTML. I understand that there are a variety of issues in making a sensible definition of how this would work, but my intuition is that it could be done reasonably cleanly (albeit not with most off-the-shelf XML parsers).

Many, many things helped drive the nail in the coffin of XML as a serialization format for colloquial HTML. This was probably one of them. Others that come to mind:

Draconian error handling enforced at runtime does not scale to the complexities of modern-day web applications. Ensuring well-formedness becomes increasingly difficult when content is dynamically cobbled together from multiple sources, some of which are beyond your control (user-generated content, third-party ad servers, and so on).
It provides no perceivable benefit to users. Draconianly handled content does not do more, does not download faster, and does not render faster than permissively handled content. Indeed, it is almost guaranteed to download slower, because it requires more bytes to express the same meaning -- in the form of end tags, self-closing tags, quoted attributes, and other markup which provides no end-user benefit but serves only to satisfy the artificial constraints of an intentionally restricted syntax.
IE never supported it, forcing you down the rabbit hole of polyglot documents.
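
To make the first point concrete, here is a minimal sketch using only Python's standard library. It feeds the same ill-formed fragment to an XML parser and to a permissive HTML tokenizer. The sample markup and the helper names (`parse_as_xml`, `parse_as_html`, `TextCollector`) are made up for illustration, and `html.parser` is not an HTML 5 parser; it merely stands in for "permissive handling" here.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: an unquoted attribute and an unclosed <p>.
# Common on the web, but not well-formed XML.
MARKUP = "<div class=note><p>Hello, world</div>"


class TextCollector(HTMLParser):
    """Collects text content and ignores markup errors."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_as_xml(markup):
    """Draconian handling: any well-formedness error is fatal."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError:
        return "fatal error"


def parse_as_html(markup):
    """Permissive handling: recover what text we can."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


print(parse_as_xml(MARKUP))
print(parse_as_html(MARKUP))
```

The XML path rejects the whole fragment, while the permissive path still yields the text the author meant the reader to see.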

Other applicable specifications

"Authors must not use elements, attributes, and attribute values for purposes other than their appropriate intended semantic purpose. Authors must not use elements, attributes, and attribute values that are not permitted by this specification or other applicable specifications." This is one of the most important sentences in the entire specification, but it's somewhat vague. If "other applicable specifications" means: any specification that anyone claims is applicable to HTML 5 extension, then we can extend the langauge with most any element without breaking conformance; if "applicable specifications" is a smaller (or empty) set, then this may be saying that HTML 5 has limited (or zero) extensibility.

The phrase "or other applicable specifications" is indeed quite important, and quite intentional. It explicitly allows something that previous HTML specifications only implicitly allowed, or disallowed-but-everyone-ignored-that-part: future specifications which extend the vocabulary of previous specifications. Ian Hickson uses RDFa as an example: "If an RDFa specification said that text/html could have arbitrary xmlns:* attributes, then the HTML5 specification would (by virtue of the above-quoted sentence) defer to it and thus it would be allowed. ... Of course, if a community doesn't acknowledge the authority of such a spec, and they _do_ acknowledge the authority of the HTML5 spec, then it would be (for them) as if that spec didn't exist. Similarly, there might be a community that only acknowledges the HTML4 spec and doesn't consider HTML5 to be relevant, in which case for them, HTML5 isn't relevant. This is how specs work."

In response to a question about how validators could possibly work with such flexible constraints, Ian Hickson writes, "The same way validators work now. The validator implementors decide which specs they think are relevant to their users. The W3C CSS validator, for instance, can be configured to check against CSS2.1 rules or against SVG rules about CSS. It can't be configured to check against CSS2.1 + the :-moz-any-link extension, because the CSS validator implementors have decided that CSS2.1 and SVG are relevant, but not the Mozilla CSS extensions. ... The W3C HTML validator, similarly, supports checking a document against XHTML 1.1, or XHTML + RDFa, or XHTML + SVG, but doesn't support checking it against XHTML + DOCBOOK. So its implementors have decided that RDFa, SVG, and XHTML are relevant, but DOCBOOK is not."

People who think that such lack of constraints can only lead to madness go strangely silent when it is pointed out that they have already violated such constraints, and the world failed to end as a result.

URL terminology

The HTML 5 draft uses the term URL, not URI.

I was under the impression that HTML 5 uses the term "URL" because that's what everyone else in the world uses (outside a few standards wonks who know the difference between URLs, URIs, IRIs, XRIs, LEIRIs, and so on). A few minutes of research supported my impression. Quoth Ian Hickson: "'URL' is what everyone outside the standards world calls them. The few people who understand what on earth IRI, URN, URI, and URL are supposed to mean and how to distinguish them have demonstrated that they are able to understand such complicated terminology and can deal with the reuse of the term 'URL'. Others, who think 'URL' mean exactly what the HTML5 spec defines it as, have not demonstrated an ability to understand these subtleties and are better off with us using the term they're familiar with. The real solution is for the URI and IRI specs to be merged, for the URI spec to change its definitions to match what 'URL' is defined as in HTML5 (e.g. finally defining error handling as part of the core spec), and for everyone to stop using terms other than 'URL'."

It should be noted that not everyone agrees with this. For example, Roy Fielding (who obviously understands the subtle differences between URLs and other things) recently stated: "Use of the term URL in a manner that directly contradicts an Internet standard is negligent and childish. HTML5 can rot until that is fixed." Maciej Stachowiak, recently appointed co-chair of the W3C HTML Working Group, recently stated: "We need to get the references in order first, because whether HTML5 references Web Address, or IRIbis, or something else, makes a difference to what we'll think about the naming issue. We need to decide as a Working Group if it's acceptable to use the term URL in a different way than RFC3986 (while making the difference clear). If it's unacceptable, then we need to propose an alternate term."
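
For readers who have not followed the terminology, one piece of it can be shown in a few lines. RFC 3986 reserves "URI" for strings that begin with a scheme and calls scheme-less strings such as "/" relative references, which only become URIs once resolved against a base. The sketch below uses Python's standard `urllib.parse`; the helper name `has_scheme` and the example addresses are mine, and this is only an illustration of the distinction, not the definition used by either specification.

```python
from urllib.parse import urljoin, urlsplit


def has_scheme(reference):
    """True if the string starts with a scheme such as 'http:'."""
    return urlsplit(reference).scheme != ""


BASE = "http://example.com/a/b"  # hypothetical document address

print(has_scheme("http://example.com/"))  # a full URI in RFC 3986 terms
print(has_scheme("/"))                    # a relative reference, no scheme
print(urljoin(BASE, "/"))                 # resolved against the base
```

Whether the scheme-less string deserves the name "URL" before resolution is exactly the kind of question the quotes above disagree on.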

As the old saying goes, "There are only two hard problems in Computer Science: cache invalidation and naming things." I personally don't care about the matter either way, but there is obviously a wide spectrum of opinion.

IRI-bis

It's unclear whether the factoring to reference WebAddr and/or IRI-bis will be retained.

"WebAddr" refers to Web Addresses, a now-defunct proposal to split out the definition of "URL" that HTML 5 uses (which intentionally differs from the "official" definition in order to handle existing web content). The work on the "Web Addresses" specification has now been rolled into IRI-bis; "bis" means "next," so "IRI-bis" means "the next version of the IRI specification." According to Ian Hickson, so important definitions were lost in the process of splitting out "Web Addresses" from HTML 5 and subsequently merging "Web Addresses" into IRI-bis. There is also some feedback about newlines within URLs.

Work on IRI-bis is ongoing. As it relates to HTML 5, it is tracked as HTML ISSUE-56.

Content sniffing

HTML 5 calls for user agents to ignore normative Content-type in certain cases.

HTML 5 calls for user agents to ignore normative Content-Type in certain cases because this is required to handle existing web content. Based on the research of Adam Barth and others into the content sniffing rules of a certain closed-source market-dominating browser, progress has been made towards reducing the amount of content sniffing on the public web. (Counterpoint: two steps forward, one step back.)

Personally, I have long been opposed to content sniffing, but if sniffing is going to occur, I would vastly prefer documented algorithms to undocumented ones. The "hope" behind documenting the sniffing rules now is that the web community can "freeze" the rules now and forever, i.e. not add any more complexity to an already complex world. I am personally skeptical, since HTML 5 also introduces (or at least promotes-to-their-own-elements) two new media families, audio and video, for which undocumented or underdocumented sniffing may already occur within proprietary browser plug-ins. And with @font-face support shipping or on the verge of shipping in multiple browsers, there may be new sniffing rules introduced there as well. I hope my concerns are unfounded.

Still, having sniffing rules documented in HTML 5 may -- someday soon -- reduce the complexity of a shipping product. And how often does that happen?
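
For anyone who has not seen content sniffing up close, here is a toy sketch of the idea: look at the first bytes of the response body and, if they match a known signature, prefer that over the declared Content-Type. The signature table and the `sniff` function are hypothetical and deliberately tiny. This is not the algorithm in the HTML 5 draft, which is far more careful about when sniffing is allowed to override the declared type.

```python
# Hypothetical, deliberately small signature table: (leading bytes, type).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff(body, declared_type):
    """Return the sniffed type if a signature matches, else the declared one."""
    for magic, mime in SIGNATURES:
        if body.startswith(magic):
            return mime
    return declared_type


# A PNG served with the wrong label is treated as a PNG anyway.
print(sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, "text/plain"))
# Plain text with no matching signature keeps its declared type.
print(sniff(b"hello, world", "text/plain"))
```

The point of documenting the rules is that a table like this becomes fixed and public, rather than something each browser has to reverse-engineer from another.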

Willful violations

HTML 5 acknowledges in several places that it is in "willful violation" of other specifications from the W3C and IETF.

As stated in §1.5.2 Compliance with other specifications, "This specification interacts with and relies on a wide variety of other specifications. In certain circumstances, unfortunately, the desire to be compatible with legacy content has led to this specification violating the requirements of these other specifications. Whenever this has occurred, the transgressions have been noted as 'willful violations'."

This is the complete list of "willful violations" in the August 25th W3C Editor's Draft of HTML 5. (The WHATWG draft changes almost daily, whenever a change is checked in.)

§2.5.1 Terminology: "The term "URL" in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term "URL" as used herein is really called something else altogether. This is a willful violation of RFC 3986."
§2.7 Character Encodings: "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification, motivated by a desire for compatibility with legacy content." Related bugs: Bug 7444, bug 7453, bug 7215, bug 7381. There was a recent discussion about character encoding, starting with Addison Phillips' feedback on behalf of the I18N working group, with several followups. Some salient quotes: Maciej (Apple, Safari): "Browsers for Latin-script locales pretty much universally use Windows-1252 as the default of last resort. This is necessary to be compatible with legacy content on the existing Web." Mark Davis (Google): "At Google, the encoding label is taken only as a weak signal (a small factor in the heuristic detection). It is completely overwhelmed by the byte content analysis. (There are too many unlabeled pages *and mislabeled pages* for the label to be used as is.)" Henri Sivonen (Mozilla contributor), in response to a question about what constitutes a "legacy environment" for the purposes of character encoding detection: "The Web is a legacy environment." (Ian Hickson echoes this sentiment.)
§2.7 Character Encodings: "The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content."
§3.4 Interactions with XPath and XSLT: "These requirements are a willful violation of the XPath 1.0 specification, motivated by desire to have implementations be compatible with legacy content while still supporting the changes that this specification introduces to HTML regarding which namespace is used for HTML elements."
§3.4 Interactions with XPath and XSLT: "If the transformation program outputs an element in no namespace, the processor must, prior to constructing the corresponding DOM element node, change the namespace of the element to the HTML namespace, ASCII-lowercase the element's local name, and ASCII-lowercase the names of any non-namespaced attributes on the element. This requirement is a willful violation of the XSLT 1.0 specification, required because this specification changes the namespaces and case-sensitivity rules of HTML in a manner that would otherwise be incompatible with DOM-based XSLT transformations. (Processors that serialize the output are unaffected.)" There is a long discussion of these two violations in the comments of bug 7059.
§4.10.16.3 Form submission algorithm: "Step 9: If action is the empty string, let action be the document's address. This step is a willful violation of RFC 3986, which would require base URL processing here. This violation is motivated by a desire for compatibility with legacy content."
§6.5.3 Script processing model: "If the script's global object is a Window object, then in JavaScript, the this keyword in the global scope must return the Window object's WindowProxy object. This is a willful violation of the JavaScript specification current at the time of writing (ECMAScript edition 3). The JavaScript specification requires that the this keyword in the global scope return the global object, but this is not compatible with the security design prevalent in implementations as specified herein."
§12.3.4 Other elements, attributes, and APIs: Regarding document.all, "These requirements are a willful violation of the JavaScript specification current at the time of writing (ECMAScript edition 3). The JavaScript specification requires that the ToBoolean() operator convert all objects to the true value, and does not have provisions for objects acting as if they were undefined for the purposes of certain operators. This violation is motivated by a desire for compatibility with two classes of legacy content: one that uses the presence of document.all as a way to detect legacy user agents, and one that only supports those legacy user agents and uses the document.all object without testing for its presence first."

As you can probably guess from these quotes, the HTML 5 community has decided that compatibility with existing web content trumps all other concerns. Other than the use of the term "URL," all of the "willful violations" are instances where other specifications do not adequately describe existing web content. I personally do not understand most of the issues listed here, so I have no opinion on whether the alleged benefit of violating existing standards is worth the alleged cost.
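
The character encoding item is perhaps the easiest of these to see in practice. The sketch below assumes one illustrative entry from the kind of table the spec describes, namely that content labeled ISO-8859-1 is decoded as Windows-1252. The `OVERRIDES` table and the `decode` helper are hypothetical names of mine, not anything defined by the spec.

```python
# Illustrative override table (assumed entry): label -> encoding actually used.
OVERRIDES = {"iso-8859-1": "windows-1252"}


def decode(body, label):
    """Decode bytes, applying the legacy override for the declared label."""
    name = label.strip().lower()
    return body.decode(OVERRIDES.get(name, name))


# Bytes 0x93 and 0x94 are curly quotes in Windows-1252, but invisible
# C1 control characters in real ISO-8859-1.
data = b"\x93quoted\x94"

strict = data.decode("iso-8859-1")
legacy = decode(data, "ISO-8859-1")

print(repr(strict))
print(repr(legacy))
```

Following the label to the letter yields control characters nobody intended; following the override yields the curly quotes the author typed, which is the "compatibility with legacy content" argument in miniature.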

Versioning

In-band global version identifiers, if new implementations handle them reasonably, may be useful for (a) authoring applications that want to track versions used for authoring (b) informative error handling when applications encounter constructs that are apparently 'in error'.

The idea of version identifiers has been hashed and rehashed throughout the 5+ year process of defining HTML 5. Most notably, Microsoft proposed a version identifier shortly after the W3C HTML Working Group reformed around HTML 5. Their proposal generated much discussion, but was not ultimately adopted. Several months later, Microsoft shipped Internet Explorer 8 with a "feature" called X-UA-Compatible, which serves as a kind of IE-specific version identifier. I am personally not a fan of this approach, in part because it adds a lot of complexity for web developers who want to figure out why the still-dominant browser doesn't render their content according to standards.

Versioning in HTML 5 is tracked as HTML ISSUE-4.


17 Responses to “Response to “Notes on HTML 5””

Rob says:

2009-09-04 at 01:47

It really bothers me that it seems XML is given the cold shoulder from the paragraphs above. What many in the whatwg like to call “draconion”, I call “programming”. It’s like saying all computer languages are draconion because they cannot be written without error.

Microsoft does support XML very well. They just don’t support XHTML at all, but is that the reason not to support XML? Should we not concern ourselves with SVG for the same reason?

I’m not Roy Fielding but I know the difference between a URL and a URI and it’s plain as day. Using URL because the masses erroneously do is no excuse to stoop to that level, too.

While I’ve been very excited about using HTML5, these sorts of things are making me question whether this group has an engineering point of view in mind at all or is it just going with popular thinking.

This is the first time I’ve been disgruntled with the information coming out. I wish I had the time to be more involved but, alas, I feel I don’t have the in-depth knowledge I see on the mailing list and IRC channels either, though I code backend ecommerce sites for a living.
Name says:

2009-09-04 at 02:18

“the HTML 5 community has decided” – it seems to me that some parts of the community are extremely divided. Could you abstain from declaring consensus where there is none? Sorry for the harsh tone. This rhetorical tactic is a major pet peeve of mine.
Benjamin Sergeant says:

2009-09-04 at 02:34

The cool thing with XML is that it’s easy to parse (and to scrape), and you don’t have to use SGML-lib to parse this with python … no ?
Anne van Kesteren says:

2009-09-04 at 08:34

Programming is different. You do not grab related bits of code from different servers together and then hand it over to the user to compile it and pray it works.
Walter McGrain says:

2009-09-04 at 12:12

I am left to wonder when the smug tone of Google employees exhibited here will finally manifest itself in the overall corporate posture of Google itself…
Rob says:

2009-09-04 at 15:02

@Anne,
That’s why you need this so-called “draconian” method to assure consistency and reliability in the data exchange instead of allowing browsers to guess at the meaning and display whatever.
Mark Pilgrim, Google says:

2009-09-04 at 15:13

> like saying all computer languages are draconion because they cannot be written without error

Virtually all computer languages *are* draconian. That is not a good argument that HTML should be.
Mark Pilgrim, Google says:

2009-09-04 at 15:13

The cool thing with XML is that it’s easy to parse

XML is actually quite difficult to parse “from scratch,” but easy to parse if you use the proper tools. But that is practically a tautology; all formats are easy to parse if you use pre-written tools that do the parsing for you. The proper tool to parse HTML 5 is an HTML 5 parser like html5lib.
Mark Pilgrim, Google says:

2009-09-04 at 15:16

I’m not Roy Fielding but I know the difference between a URL and a URI and it’s plain as day.

Perhaps you could explain it to me, because it’s always confused the bejeezus out of me.

Using URL because the masses erroneously do is no excuse to stoop to that level, too.

“The masses” of which you speak are the millions of people who are going to use HTML 5, while Roy has repeatedly stated that he never will.
Mark Pilgrim, Google says:

2009-09-04 at 15:21

it seems to me that some parts of the community are extremely divided

There are certainly people in the world who disagree with the principle that handling existing web content should trump all other concerns. There are people in the world who think that all existing web content should be discarded and a new web built from scratch. Those people are certainly entitled to their opinions. But defining, parsing, documenting, and otherwise handling existing web content is one of the fundamental principles on which HTML 5 was founded. See, for example, FAQ: why does HTML 5 legitimise tag soup and HTML Design Principles: Priority of Constituencies. “In case of conflict, consider users over authors over implementors over specifiers over theoretical purity.”
Mark Pilgrim, Google says:

2009-09-04 at 15:28

instead of allowing browsers to guess at the meaning and display whatever.

Your concerns would get a more sympathetic audience if you could show even a basic familiarity with HTML 5. The specification devotes hundreds of pages to parsing HTML in a deterministic manner. There is no “guessing” or non-determinism. You may wish to argue that draconian error handling allows the parsing rules to be simpler, and that is certainly true, but it does not follow that draconian error handling is the only way to make parsing deterministic.

Enforcing draconian error handling would make over 99% of web content disappear overnight. This is not a guess; it is an accurate statistic backed by multiple independent studies of different real-world samples of HTML content. I think most people involved in HTML 5 wish the parsing rules could be simpler. But no one has put forth a credible proposal for doing so that is compatible with existing web content.
Chester H. Oakley says:

2009-09-04 at 20:44

HTML is design, not programming. Web designers are never going to write valid XML. The end.
Roy T. Fielding says:

2009-09-04 at 22:14

Mark, do your research. Every single Web book on my bookshelf (well over 70) uses the term URL to mean exactly what is currently standardized as URI. Go ahead and try a search on Google Scholar if you don’t believe me. Whether we call that thing URL or URI is not important. The IETF and W3C have decided to call it URI. What I objected to is the definition of SOMETHING ELSE (an algorithm for parsing attributes) being assigned the name URL in HTML5. That isn’t just inconsistent — it is brain-numbing stupidity. If you can find anyone in the world who will honestly say that something like “/” is, in its entirety, a URL, then please do so.
Maciej Stachowiak says:

2009-09-05 at 05:44

Roy, “/” used to be, in its entirety, considered a valid relative URI/URL, at least as of RFC2396. RFC3986 renamed it to a “relative URI reference” and it remains valid as such. Colloquially, I’d expect most people would say it’s still ok to call it a relative URL in a context that expects such.
Roy T. Fielding says:

2009-09-08 at 17:13

Maciej, “/” is a valid relative URL (see RFC 1808). Before that spec it was called a partial URI (RFC 1630). That does not make “/” a URL, nor does it make the term URL any less specific in its definition as a Uniform Resource Locator (where uniform specifically refers to the syntax), nor does it change the URL syntax such that “/” would be a match or that non-ASCII characters are part of the URL.

All of the documentation on “what is a URL” states, unequivocally, that a URL starts with a scheme name and consists entirely of ASCII characters for portability and safe usage over email. All of it — every single document I know of that has ever been printed or found on the Web.

The WHATWG’s opinion stating that URL is the common term for whatever might appear in a reference is wrong. If you cannot accept that as a fact, then please step away from the Kool-Aid and find me an independent reference that matches how Ian has chosen to redefine the term.
Rob says:

2009-09-10 at 11:11

> Your concerns would get a more sympathetic audience if you could show even a basic familiarity with HTML 5. The specification devotes hundreds of pages to parsing HTML in a deterministic manner.

Of course I’m aware of this, but it’s only error handling, and an error in the code is an error, and any attempt to display its intention is a guess at what the author meant.

Isn’t this what software tools are for? Alert the author to errors in their writing and offer suggestions to fix them? Word processors attempt to do this and frequently guess my intentions wrong, too, so I’m glad they don’t display my final output.

> Enforcing draconian error handling would make over 99% of web content disappear overnight.

That’s probably not a bad thing.
Pijotre says:

2009-09-29 at 19:12

I ran into the following problem using X-UA-Compatible.

I render my page as XHTML1.0 Strict and it all validates fine and it all shows fine in IE8. The problem is about this line:

If its not there IE-8 always suggests the compatibility mode … this disappears if its set in the meta.
If it is there IE8 behaves well but I can’t validate it as “Tentatively passed” for html 5 anymore, because

Line 7, Column 55: Bad value X-UA-Compatible for attribute http-equiv on element meta.

So then I tried it in the .htaccess, but my hosting-company doesn’t support this line
Header set X-UA-Compatible “edge”
nor
Header set X-UA-Compatible IE=edge env=best-standards-support
(or any combination thereof)

So what’s the best way to go ahead? Tell Microsoft off for introducing yet another standard that’s rubbish? Or talk to the guy at the W3C who is responsible for allowing that value in http-equiv, or tell my hosting company to allow me to set X-UA-Compatible?

Any course of action would be appreciated.