Response to “Notes on HTML 5”
The W3C Technical Architecture Group (TAG) is in the process of reviewing HTML 5. Noah Mendelsohn recently posted his initial, personal, not-speaking-on-behalf-of-TAG notes on HTML 5. Here are my initial, personal, not-speaking-on-behalf-of-WHATWG responses.
Limitations of the XML serialization
This may be old news, but I was surprised to see that
document.write()is not supported when parsing the XML serialization. This seems to put the nail in the coffin of XML as a serialization format for colloquial HTML. I understand that there are a variety of issues in making a sensible definition of how this would work, but my intuition is that it could be done reasonably cleanly (albeit not with most off-the-shelf XML parsers).
Many, many things helped drive the nail in the coffin of XML as a serialization format for colloquial HTML. This was probably one of them. Others that come to mind:
- Draconian error handling enforced at runtime does not scale to the complexities of modern-day web applications. Ensuring well-formedness becomes increasingly difficult when content is dynamically cobbled together from multiple sources, some of which are beyond your control (user-generated content, third-party ad servers, and so on).
- It provides no perceivable benefit to users. Draconianly handled content does not do more, does not download faster, and does not render faster than permissively handled content. Indeed, it is almost guaranteed to download slower, because it requires more bytes to express the same meaning -- in the form of end tags, self-closing tags, quoted attributes, and other markup which provides no end-user benefit but serves only to satisfy the artifical constraints of an intentionally restricted syntax.
- IE never supported it, forcing you down the rabbit hole of polyglot documents.
Other applicable specifications
"Authors must not use elements, attributes, and attribute values for purposes other than their appropriate intended semantic purpose. Authors must not use elements, attributes, and attribute values that are not permitted by this specification or other applicable specifications." This is one of the most important sentences in the entire specification, but it's somewhat vague. If "other applicable specifications" means: any specification that anyone claims is applicable to HTML 5 extension, then we can extend the langauge with most any element without breaking conformance; if "applicable specifications" is a smaller (or empty) set, then this may be saying that HTML 5 has limited (or zero) extensibility.
The phrase "or other application specifications" is indeed quite important, and quite intentional. It explicitly allows something that previous HTML specifications only implicitly allowed, or disallowed-but-everyone-ignored-that-part: future specifications which extend the vocabulary of previous specifications. Ian Hickson uses RDFa as an example: "If an RDFa specification said that text/html could have arbitrary xmlns:* attributes, then the HTML5 specification would (by virtue of the above-quoted sentence) defer to it and thus it would be allowed. ... Of course, if a community doesn't acknowledge the authority of such a spec, and they _do_ acknowledge the authority of the HTML5 spec, then it would be (for them) as if that spec didn't exist. Similarly, there might be a community that only acknowledges the HTML4 spec and doesn't consider HTML5 to be relevant, in which case for them, HTML5 isn't relevant. This is how specs work."
In response to a question about how validators could possible work with such flexible constraints, Ian Hickson writes, "The same way validators work now. The validator implementors decide which specs they think are relevant to their users. The W3C CSS validator, for instance, can be configured to check against CSS2.1 rules or against SVG rules about CSS. It can't be configured to check against CSS2.1 + the :-moz-any-link extension, because the CSS validator implementors have decided that CSS2.1 and SVG are relevant, but not the Mozilla CSS extensions. ... The W3C HTML validator, similarly, supports checking a document against XHTML 1.1, or XHTML + RDFa, or XHTML + SVG, but doesn't support checking it against XHTML + DOCBOOK. So its implementors have decided that RDFa, SVG, and XHTML are relevant, but DOCBOOK is not."
People who think that such lack of constraints can only lead to madness go strangely silent when it is pointed out that they have already violated such constraints, and the world failed to end as a result.
The HTML 5 draft uses the term URL, not URI.
I was under the impression that HTML 5 uses the term "URL" because that's what everyone else in the world uses (outside a few standards wonks who know the difference between URLs, URIs, IRIs, XRIs, LEIRIs, and so on). A few minutes of research supported my impression. Quoth Ian Hickson: "'URL' is what everyone outside the standards world calls them. The few people who understand what on earth IRI, URN, URI, and URL are supposed to mean and how to distinguish them have demonstrated that they are able to understand such complicated terminology and can deal with the reuse of the term 'URL'. Others, who think 'URL' mean exactly what the HTML5 spec defines it as, have not demonstrated an ability to understand these subtleties and are better off with us using the term they're familiar with. The real solution is for the URI and IRI specs to be merged, for the URI spec to change its definitions to match what 'URL' is defined as in HTML5 (e.g. finally defining error handling as part of the core spec), and for everyone to stop using terms other than 'URL'."
It should be noted that not everyone agrees with this. For example, Roy Fielding (who obviously understands the subtle differences between URLs and other things) recently stated: "Use of the term URL in a manner that directly contradicts an Internet standard is negligent and childish. HTML5 can rot until that is fixed." Maciej Stachowiak, recently appointed co-chair of the W3C HTML Working Group, recently stated: "We need to get the references in order first, because whether HTML5 references Web Address, or IRIbis, or something else, makes a difference to what we'll think about the naming issue. We need to decide as a Working Group if it's acceptable to use the term URL in a different way than RFC3986 (while making the difference clear). If it's unacceptable, then we need to propose an alternate term."
As the old saying goes, "There are only two hard problems in Computer Science: cache invalidation and naming things." I personally don't care about the matter either way, but there is obviously a wide spectrum of opinion.
It's unclear whether the factoring to reference WebAddr and/or IRI-bis will be retained.
"WebAddr" refers to Web Addresses, a now-defunct proposal to split out the definition of "URL" that HTML 5 uses (which intentionally differs from the "official" definition in order to handle existing web content). The work on the "Web Addresses" specification has now been rolled into IRI-bis; "bis" means "next," so "IRI-bis" means "the next version of the IRI specification." According to Ian Hickson, so important definitions were lost in the process of splitting out "Web Addresses" from HTML 5 and subsequently merging "Web Addresses" into IRI-bis. There is also some feedback about newlines within URLs.
HTML 5 calls for user agents to ignore normative Content-type in certain cases.
HTML 5 calls for user agents to ignore normative Content-Type in certain cases because this is required to handle existing web content. Based on [PDF] the research of Adam Barth and others into the content sniffing rules of a certain closed-source market-dominating browser, progress has been made towards reducing the amount of content sniffing on the public web. (Counterpoint: two steps forward, one step back.)
Personally, I have long been opposed to content sniffing, but if sniffing is going to occur, I would vastly prefer documented algorithms to undocumented ones. The "hope" behind documenting the sniffing rules now is that the web community can "freeze" the rules now and forever, i.e. not add any more complexity to an already complex world. I am personally skeptical, since HTML 5 also introduces (or at least promotes-to-their-own-elements) two new media families, audio and video, for which undocumented or underdocumented sniffing may already occur within proprietary browser plug-ins. And with
@font-face support shipping or on the verge of shipping in multiple browsers, there may be new sniffing rules introduced there as well. I hope my concerns are unfounded.
Still, having sniffing rules documented in HTML 5 may -- someday soon -- reduce the complexity of a shipping product. And how often does that happen?
HTML 5 acknowledges in several places that it is in "willful violation" of other specifications from the W3C and IETF.
As stated in §1.5.2 Compliance with other specifications, "This specification interacts with and relies on a wide variety of other specifications. In certain circumstances, unfortunately, the desire to be compatible with legacy content has led to this specification violating the requirements of these other specifications. Whenever this has occurred, the transgressions have been noted as 'willful violations'."
- §2.5.1 Terminology: "The term "URL" in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term "URL" as used herein is really called something else altogether. This is a willful violation of RFC 3986."
- §2.7 Character Encodings: "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification, motivated by a desire for compatibility with legacy content." Related bugs: Bug 7444, bug 7453, bug 7215, bug 7381. There was a recent discussion about character encoding, starting with Addison Phillips' feedback on behalf of the I18N working group, with several followups. Some salient quotes: Maciej (Apple, Safari): "Browsers for Latin-script locales pretty much universally use Windows-1252 as the default of last resort. This is necessary to be compatible with legacy content on the existing Web." Mark Davis (Google): "At Google, the encoding label is taken only as a weak signal (a small factor in the heuristic detection). It is completely overwhelmed by the byte content analysis. (There are too many unlabeled pages *and mislabeled pages* for the label to be used as is.)" Henri Sivoven (Mozilla contributor), in response to a question about what constitutes a "legacy environment" for the purposes of character encoding detection: "The Web is a legacy environment." (Ian Hickson echoes this sentiment.)
- §2.7 Character Encodings: "The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content."
- §3.4 Interactions with XPath and XSLT: "These requirements are a willful violation of the XPath 1.0 specification, motivated by desire to have implementations be compatible with legacy content while still supporting the changes that this specification introduces to HTML regarding which namespace is used for HTML elements."
- §3.4 Interactions with XPath and XSLT: "If the transformation program outputs an element in no namespace, the processor must, prior to constructing the corresponding DOM element node, change the namespace of the element to the HTML namespace, ASCII-lowercase the element's local name, and ASCII-lowercase the names of any non-namespaced attributes on the element. This requirement is a willful violation of the XSLT 1.0 specification, required because this specification changes the namespaces and case-sensitivity rules of HTML in a manner that would otherwise be incompatible with DOM-based XSLT transformations. (Processors that serialize the output are unaffected.)" There is a long discussion of these two violations in the comments of bug 7059.
- §126.96.36.199 Form submission algorithm: "Step 9: If
actionis the empty string, let
actionbe the document's address. This step is a willful violation of RFC 3986, which would require base URL processing here. This violation is motivated by a desire for compatibility with legacy content."
- §6.5.3 Script processing model: "If the script's global object is a
thiskeyword in the global scope must return the
thiskeyword in the global scope return the global object, but this is not compatible with the security design prevalent in implementations as specified herein."
- §12.3.4 Other elements, attributes, and APIs: Regarding
ToBoolean()operator convert all objects to the true value, and does not have provisions for objects acting as if they were undefined for the purposes of certain operators. This violation is motivated by a desire for compatibility with two classes of legacy content: one that uses the presence of
document.allas a way to detect legacy user agents, and one that only supports those legacy user agents and uses the
document.allobject without testing for its presence first."
As you can probably guess from these quotes, the HTML 5 community has decided that compatibility with existing web content trumps all other concerns. Other than the use of the term "URL," all of the "willful violations" are instances where other specifications do not adequately describe existing web content. I personally do not understand most of the issues listed here, so I have no opinion on whether the alleged benefit of violating existing standards is worth the alleged cost.
In-band global version identifiers, if new implementations handle them reasonably, may be useful for (a) authoring applications that want to track versions used for authoring (b) informative error handling when applications encounter constructs that are apparently 'in error'.
The idea of version identifiers has been hashed and rehashed throughout the 5+ year process of defining HTML 5. Most notably, Microsoft proposed a version identifer shortly after the W3C HTML Working Group reformed around HTML 5. Their proposal generated much discussion, but was not ultimately adopted. Several months later, Microsoft shipped Internet Explorer 8 with a "feature" called X-UA-Compatible, which serves as a kind of IE-specific version identifier. I am personally not a fan of this approach, in part because [PNG] it adds a lot of complexity for web developers who want to figure out why the still-dominant browser doesn't render their content according to standards.
Versioning in HTML 5 is tracked as HTML ISSUE-4.