The WHATWG Blog

Sniffing for RSS 1.0 feeds served as text/html

September 29th, 2009 by Mark Pilgrim, Google

I recently found myself testing how browsers sniff for RSS 1.0 feeds that are served with an incorrect MIME type. (Yes, my life is full of delicious irony.) I thought I'd share my findings so far.

Firefox

Firefox's feed sniffing algorithm is located in nsFeedSniffer.cpp. As you can see, starting at line 353, it takes the first 512 bytes of the page and looks for a root tag called rss (for RSS 2.0), feed (for Atom 0.3 and 1.0), or rdf:RDF (for RSS 1.0). The RSS 1.0 marker is really a generic RDF marker, so it then does some additional checks for the two required namespaces of an RSS 1.0 feed, http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/rss/1.0/. This check is quite simple; it literally just checks for the presence of both strings, not caring whether they are the value of an xmlns attribute (or indeed any attribute at all).
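To make the checks concrete, here is a rough Python sketch of the behavior described above. It is not Mozilla's code: the function name is made up, the Atom root element is assumed to be feed (as in RFC 4287), and a plain substring search stands in for whatever nsFeedSniffer.cpp does to decide that a tag is really the root.

```python
RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = b"http://purl.org/rss/1.0/"
SNIFF_LIMIT = 512  # only the first 512 bytes are inspected


def sniff_feed_firefox_style(body: bytes) -> bool:
    """Approximate the described checks (hypothetical helper, not Mozilla's code)."""
    head = body[:SNIFF_LIMIT]
    # RSS 2.0 and Atom are recognised by their tag name alone.
    # Simplification: this does not verify that the tag is the root element.
    if b"<rss" in head or b"<feed" in head:
        return True
    # <rdf:RDF is only a generic RDF marker, so both namespace strings
    # must also be present somewhere in the sniffed bytes.
    if b"<rdf:RDF" in head:
        return RDF_NS in head and RSS_NS in head
    return False
```

Note that the namespace test is a bare substring match, which is exactly why the strings do not need to be the value of an xmlns attribute.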

Firefox has an additional feature which tripped up my testing until I understood it. IE and Safari both have a mode where they essentially say "I detected this page as a feed and tried to parse it, but I failed, so now I'm giving up, and here's an error message describing why I gave up." Firefox does not have a mode like this. As far as I can tell, if it decides that a resource is a feed but then fails to parse the resource as a feed, it reparses the resource with feed handling disabled. So a non-well-formed feed served as application/rss+xml will actually trigger a "Do you want to download this file" dialog, because Firefox tried to parse it as a feed, failed, then reparsed it as some-random-media-type-that-I-don't-handle. A non-well-formed feed served as text/html will actually render as HTML, but only after Firefox silently tries (and fails) to parse it as a feed.

There's nothing wrong with this approach; in fact, it seems much more end-user-friendly than throwing up an incomprehensible error message. I just mention it because it tripped me up while testing.

Internet Explorer

Internet Explorer's feed sniffing algorithm is documented by the Windows RSS team. About RSS 1.0, it states:

IE7 detects a RSS 1.0 feed using the content types application/xml or text/xml. ... The document is checked for the strings <rdf:RDF, http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/rss/1.0/. IE7 detects that it is a feed if all three strings are found within the first 512 bytes of the document. ... IE7 also supports other generic Content-Types by checking the document for specific Atom and RSS strings.

Now that I understand IE's algorithm, I have to concede that this documentation is 100% accurate. However, it doesn't tell the full story. Here's what actually happens. If the Content-Type is

application/xml
text/xml
application/octet-stream
text/plain
text/html
the empty string, or
missing altogether

...then IE will trigger its feed sniffing. Once IE triggers its feed sniffing, it will never change its mind (unlike Firefox). If feed parsing fails, IE will throw up an error message complaining of feed coding errors or an unsupported feed format. The presence or absence of a charset parameter in the Content-Type header made absolutely no difference in any of the cases I tested.

And how exactly does IE detect an RSS 1.0 feed, once it decides to sniff? The documentation on MSDN is literally true: "The document is checked for the strings <rdf:RDF, http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/rss/1.0/. IE7 detects that it is a feed if all three strings are found within the first 512 bytes of the document." Combined with our knowledge of which Content-Types IE considers "generic," we can conclude that the following page, served as text/html, will be treated as a feed in IE:

<!-- <rdf:RDF -->
<!-- http://www.w3.org/1999/02/22-rdf-syntax-ns# -->
<!-- http://purl.org/rss/1.0/ -->
<script>alert('Hi!');</script>

[live demonstration]
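Here is a small Python sketch of IE's behavior as I observed it, just to show why the page above qualifies. The function name and the use of None for a missing header are my own conventions, not anything from Microsoft; the list of Content-Types and the three-strings-in-512-bytes rule come from the observations and documentation quoted above.

```python
from typing import Optional

# Content-Types that were observed to trigger feed sniffing
# (the empty string is included; a missing header is modelled as None).
GENERIC_TYPES = {
    "application/xml",
    "text/xml",
    "application/octet-stream",
    "text/plain",
    "text/html",
    "",
}

RSS10_MARKERS = (
    b"<rdf:RDF",
    b"http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    b"http://purl.org/rss/1.0/",
)


def ie_style_is_rss10(content_type: Optional[str], body: bytes) -> bool:
    """Hypothetical model of the observed behavior, not IE's real code."""
    if content_type is not None:
        # Any charset parameter made no difference in testing, so drop it.
        media_type = content_type.split(";", 1)[0].strip()
        if media_type not in GENERIC_TYPES:
            return False
    head = body[:512]
    # All three strings must appear somewhere in the first 512 bytes.
    return all(marker in head for marker in RSS10_MARKERS)


page = (
    b"<!-- <rdf:RDF -->\n"
    b"<!-- http://www.w3.org/1999/02/22-rdf-syntax-ns# -->\n"
    b"<!-- http://purl.org/rss/1.0/ -->\n"
    b"<script>alert('Hi!');</script>\n"
)
print(ie_style_is_rss10("text/html", page))
```

The markers sit inside HTML comments, but a substring search neither knows nor cares.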

Why Bother?

I am working with Adam Barth and Ian Hickson to update draft-abarth-mime-sniff-01 (the content sniffing algorithm referenced by HTML5) to sniff RSS 1.0 feeds served as text/html. It is unlikely that we will adopt IE's algorithm, since it seems unnecessarily pathological. I am proposing the following change, which would bring the content sniffing specification in line with Firefox's sniffing algorithm:

In the "Feed or HTML" section, insert the following steps between step 10 and step 11:

10a. Initialize /RDF flag/ to 0.

10b. Initialize /RSS flag/ to 0.

10c. If the bytes with positions pos to pos+23 in s are exactly equal to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x70, 0x75, 0x72, 0x6C, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x72, 0x73, 0x73, 0x2F, 0x31, 0x2E, 0x30, 0x2F respectively (ASCII for "http://purl.org/rss/1.0/"), then:

Increase pos by 23.

Set /RSS flag/ to 1.

10d. If the bytes with positions pos to pos+42 in s are exactly equal to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x77, 0x77, 0x77, 0x2E, 0x77, 0x33, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x31, 0x39, 0x39, 0x39, 0x2F, 0x30, 0x32, 0x2F, 0x32, 0x32, 0x2D, 0x72, 0x64, 0x66, 0x2D, 0x73, 0x79, 0x6E, 0x74, 0x61, 0x78, 0x2D, 0x6E, 0x73, 0x23 respectively (ASCII for "http://www.w3.org/1999/02/22-rdf-syntax-ns#"), then:

Increase pos by 42.

Set /RDF flag/ to 1.

10e. Increase pos by 1.

10f. If /RDF flag/ is 1, and /RSS flag/ is 1, then the /sniffed type/ of the resource is "application/rss+xml". Abort these steps.

10g. If pos points beyond the end of the byte stream s, then continue to step 11 of this algorithm.

10h. Jump back to step 10c of this algorithm.
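For readers who find byte tables hard to follow, the proposed steps translate into Python roughly as follows. This is only a sketch: the function name is invented, the starting value of pos is taken as a parameter because it is set by earlier steps of the algorithm that are not reproduced here, and returning None stands in for "continue to step 11."

```python
from typing import Optional

RSS_NS = b"http://purl.org/rss/1.0/"                     # 24 bytes
RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # 43 bytes


def sniff_rss10(s: bytes, pos: int = 0) -> Optional[str]:
    """Sketch of proposed steps 10a-10h (hypothetical helper)."""
    rdf_flag = False  # 10a
    rss_flag = False  # 10b
    while True:
        # 10c: bytes pos..pos+23 equal the RSS 1.0 namespace
        if s[pos:pos + len(RSS_NS)] == RSS_NS:
            pos += len(RSS_NS) - 1  # increase pos by 23
            rss_flag = True
        # 10d: bytes pos..pos+42 equal the RDF namespace
        if s[pos:pos + len(RDF_NS)] == RDF_NS:
            pos += len(RDF_NS) - 1  # increase pos by 42
            rdf_flag = True
        pos += 1  # 10e
        if rdf_flag and rss_flag:  # 10f
            return "application/rss+xml"
        if pos >= len(s):  # 10g: ran off the end, fall through to step 11
            return None
        # 10h: loop back to 10c
```

As in Firefox, the two namespace strings may appear in either order and anywhere in the scanned bytes.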

This Week in HTML5 – Episode 36

September 27th, 2009 by Mark Pilgrim, Google

Since I started publishing these weekly summaries over a year ago, I've watched the HTML5 specification grow up. In episode 1, the big news of the week was the birth of an entirely new specification (Web Workers). Slowly, steadily, and sometimes painstakingly, the HTML5 specification has matured to the point where the hottest topic last week was the removal of a little-used element (<dialog>) and the struggle to find a suitable replacement for marking up conversations.

This week's changes are mundane, and I expect (and hope!) that future summaries will be even more mundane. That's a good thing; it tells me that, as implementors continue implementing and authors continue authoring, there are no show-stoppers and fewer and fewer "gotchas" to trip them up. Thus, the overarching theme this week -- and I use the term "theme" very loosely -- is "the never-ending struggle to get the details right."

Parsing

HTML5 is full of algorithms. Most of them are small parts of one mega-algorithm, called Parsing HTML Documents. Contrary to popular belief, the HTML parsing algorithm is deterministic: for any sequence of bytes, there is one (and only one) "correct" way to interpret it as HTML. Notice I said "any sequence of bytes," not just "any sequence of bytes that conforms to a specific DTD or schema." This is intentional; HTML5 not only defines what constitutes "valid" HTML markup (for the benefit of conformance checkers), it also defines how to parse "invalid" markup (for the benefit of browsers and other HTML consumers that take existing web content as input). And sweet honey on a stick, there sure is a lot of invalid markup out there.

r3896 tells parsers to ignore almost any end tags before the <html> tags. There are a few special end tags which cause the parser to start constructing a new document: </html>, </head>, </body>, and oddly enough, </br>. [Related: Bug 7672]
r3909 clarifies how user agents should parse the type attribute of a <script> tag. The type attribute is optional; authors can simply omit it if they're embedding JavaScript.
r3923 tweaks the algorithm for parsing the DOCTYPE declaration. This affects DOCTYPE sniffing.
r3967 clarifies the algorithm for ignoring the first newline or carriage return character at the beginning of a <pre> block. [Background: [whatwg] Initial carriage return in <pre> and <textarea>]
r3968 explains why the <embed> element can have an infinite number of attributes. (Answer: because they are passed directly to the third-party plugin that handles the embedded content, and there are no restrictions on what kind of plugins you can have or what attributes they can take as input.)
r3991 adds to the already-long list of legacy, non-conforming attributes that user agents may encounter in existing web content.
r3871 and r3982 tweak the handling of Unicode surrogates. [Background: [whatwg] Surrogate pairs and character references]

Accessibility

As with so many things in the accessibility world, all of this week's changes revolve around the thorny problem of focus. I previously explained why focus is so important in episode 24.

r3887 specifies that each <area> in a client-side image map should be focusable.
r3919 encourages browser vendors to expose tooltips to keyboard-only users. For example, in Firefox 3.5, if you hover your cursor over a hyperlink that defines a title attribute, you will see the title attribute as a tooltip. But if you tab to the same link with the keyboard, no tooltip appears. Now imagine that you're physically unable to use a mouse, and you begin to see the problem. [Background: Bug 7362 and Issue 80]
r3928 defines an intriguing proposal about canvas accessibility, which probably deserves its own article. Here's the short version: you can already define "fallback content" within a <canvas> element that is shown to browsers that don't support the canvas API. This change dictates that the "fallback content" should remain keyboard-focusable even in browsers that do support the canvas API. To quote the spec: "This allows authors to make an interactive canvas keyboard-focusable: authors should have a one-to-one mapping of interactive regions to focusable elements in the fallback content." This is a draft proposal; as far as I know, no browser actually supports it yet, and it may get reverted in the future. [Background: Bug 7404]
r3969 clarifies that browsers must do nothing when the user activates a label whose corresponding input control is hidden (in any manner, including a display: none CSS rule). [Background: Bug 7583]

Security

All of this week's security-related changes revolve around document.domain. As you might expect from its name, this property returns the domain name of the current document. Unfortunately (for security), the property is not read-only; you can also set document.domain to pretty much anything. This can cause all sorts of horrible side effects, since so many things (cookies, local storage, same-origin restrictions on XMLHttpRequest) rely on the domain of the document. This set of changes attempts to reduce the nasty side effects (and the possible attack surface) in case you absolutely must set document.domain to something other than its default calculated value.

r3875 states that setting document.domain should release the storage mutex. (The storage mutex is a global lock that is acquired when setting cookies and released immediately afterwards. Since cookies are domain-specific, changing the domain dynamically like this needs to release the lock in case the page wants to update the cookies on the new domain.) [Background: [whatwg] Storage mutex and cookies can lead to browser deadlock, [whatwg] RFC: Alternatives to storage mutex for cookies and localStorage, [whatwg] Application defined "locks"]
r3878 states that setting document.domain makes Web Storage unusable, to avoid deadlocks with Web Storage's own locking mechanism. [Background: [whatwg] localStorage, the storage mutex, document.domain, and workers]
r3879 warns against setting document.domain on web applications that are hosted on shared servers. The spec explains the problem: "If an untrusted third party is able to host an HTTP server at the same IP address but on a different port, then the same-origin protection that normally protects two different sites on the same host will fail, as the ports are ignored when comparing origins after the document.domain attribute has been used."
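
The shared-server warning in r3879 is easier to see with a toy model. The Python sketch below is a deliberately simplified illustration, not the spec's algorithm: the Origin class, its field names, and the rule that both sides must have set document.domain are assumptions made for the example. The one detail taken from the spec text is that ports are ignored once document.domain has been used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    scheme: str
    host: str
    port: int
    domain_was_set: bool = False  # True once script has assigned document.domain


def same_origin(a: Origin, b: Origin) -> bool:
    """Toy origin comparison (hypothetical; simplified for illustration)."""
    if a.domain_was_set and b.domain_was_set:
        # After document.domain has been used, ports are ignored.
        return (a.scheme, a.host) == (b.scheme, b.host)
    if a.domain_was_set != b.domain_was_set:
        # Assumption: a page that set document.domain never matches one that did not.
        return False
    return (a.scheme, a.host, a.port) == (b.scheme, b.host, b.port)


app = Origin("http", "example.com", 80)
other = Origin("http", "example.com", 8080)
print(same_origin(app, other))  # different ports, so the normal check fails

app_set = Origin("http", "example.com", 80, domain_was_set=True)
other_set = Origin("http", "example.com", 8080, domain_was_set=True)
print(same_origin(app_set, other_set))  # ports ignored once both set document.domain
```

In this model, a server on port 8080 that is normally isolated from the application on port 80 becomes same-origin as soon as both pages have set document.domain.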

Semantics

r3905, r3948, and r3966 clarify that the profile attribute (used by various microformats) takes a space-separated list of addresses, not just a single address. This has been the subject of heated debate for over 12 years, because HTML 4 claims that "the value of the profile attribute is a URI; user agents may use this URI in two ways..." while simultaneously claiming that "this attribute specifies the location of one or more meta data profiles, separated by white space." [Background: let's keep metadata profiles (head/@profile) in HTML for use in GRDDL etc., [whatwg] HTML4's profile="" attribute's absence in HTML5, Bug 7413, Bug 7484, Bug 7512, and Issue 55.]
r3869 tweaks the definitions of <section>, <article>, and <details> based on an informal study by Jeremy Keith. r3979 further tweaks the definition of <article>, and r3978 mentions that the <article> element is semantically similar to the <entry> element in RFC 4287 (Atom Syndication Format). [Background: [whatwg] article/section/details naming/definition problems. Related: Bug 7551]
r3954 further clarifies the definition of <footer>. [Background: Bug 7502]
r3907 clarifies the workings of the registries for the enumerated values of <link rel>, <meta name>, and <meta http-equiv> attributes.
r3904 tweaks the semantics of <link rel="up">.
r3962 modifies the outline algorithm (used to generate a kind of "table of contents" of an HTML document based on sections and headers) to handle an obscure edge case. [Background: IRC discussion of edge case, Bug 7527, and in particular this comment on Bug 7527]
r3987 gives an example to clarify that the <nav> element does not always need to be a child of a <header> element.

Video

As regular readers of this column are aware, one of the big new user-visible features of HTML5 is native video support without plugins. As video is incredibly complicated, so too is the video support in HTML5. (Although not related to this week's changes, you may be interested to read my series, A gentle introduction to video encoding.)

r3867 modifies the algorithm for sizing anamorphic video within a <video> element, and r3913 defines how to display a frame of anamorphic video in a canvas pattern. [Background: Re: video size when aspect ratio is not 1, What The Heck Is Anamorphic?]
r3924 defines what happens when you dynamically insert a <source> element as a child of a <video> element that also has a src attribute, and r3925 defines what happens when you dynamically remove a <video src> attribute. [Background: Bug 7631, Bug 7632]
r3927 gives advice on how browsers could render an <audio> element with a controls attribute.
r3992 makes further refinements to the play() and pause() algorithms.

Web Forms

Forms continue to be difficult.

r3874 allows browsers to reset the list of selected files of an <input type="file"> element by setting its value attribute to the empty string. [Background: [whatwg] Setting .value on <input type=file>]
r3922 clarifies that setting the disabled attribute of a <fieldset> element should not disable the children of the fieldset's <legend> element. [Background: Bug 7591]
r3934 defines that the maxLength property should return -1 on a <textarea> or <input> element that does not include a maxlength attribute. [Background: Bug 7427]
r3957 clarifies that implicit form submission should validate the form first. [Background: Bug 7511]

Interesting Discussion Threads This Week

I like this proposal for adding a document.head property. It would presumably be faster than document.getElementsByTagName('head')[0], and more reliable than document.documentElement.firstChild.
Character encoding on the web is even worse than you think.
Re: On testing HTML

Around the Web

Brad Neuberg: A video introduction to HTML5
Google (my employer) released an intriguing project called Google Chrome Frame. It's a plugin for Internet Explorer that enables a number of new HTML5 capabilities on an opt-in basis. Here are a few technical details. Reaction has ranged from impressed to unimpressed to positively ironic. Google Wave is one of the first web applications to opt in and suggest that Internet Explorer users download the plugin.
Steve Faulkner: HTML5 & WAI-ARIA: Happy Families, a slide deck about current HTML5 accessibility features, misfeatures, and support in browsers and assistive technologies.
Burst Engine is "an OpenSource vector animation engine for the HTML5 Canvas Element."
Peter-Paul Koch: The HTML5 drag and drop disaster

Tune in next week for another exciting edition of "This Week in HTML5."

Tags: accessibility, Parsing, security, semantics, thisweekinhtml5, video, webforms2
Posted in Weekly Review

This Week in HTML5 – Episode 35

September 16th, 2009 by Mark Pilgrim, Google

On August 7, 2009, Adrian Bateman did what no man or woman had ever done before: he gave substantive feedback on the current editor's draft of HTML5 on behalf of Microsoft. His feedback was detailed and well-reasoned, and it spawned much discussion. See also: Adrian's followup on <progress> (and more here); his followup on <canvas>; on <datagrid> (which was actually dropped from HTML5 just minutes after Adrian posted his initial feedback, for unrelated reasons); his discussion with a Mozilla developer about <bb> (which was also subsequently dropped); discussion about <dialog> (which has now been dropped -- more on that in just a minute); Adrian's followup on <keygen>, with additional concerns listed here; his followup on the new input types; and last but not least, his position on <video> and <audio> (and followup about best-choice algorithms).

As you might expect, much of the discussion since August 7 has been driven by Microsoft's feedback. After five years of virtual silence, nobody wants to miss the opportunity to engage with a representative of the world's still-dominant browser. I want to focus on two discussions that have led to recent spec changes.

The rise and fall of <keygen>

The <keygen> element was invented by Netscape and subsequently reverse-engineered by every other browser vendor except Microsoft. It had never been part of any HTML specification before; indeed, it wasn't well-documented anywhere. It was added to HTML5 earlier this year (and covered in episode 12 and episode 31). The spec text borrows heavily from this incredibly detailed documentation posted to the WHATWG mailing list last July.

Adrian, on behalf of Microsoft, has stated in no uncertain terms that Microsoft has no intention of ever implementing the <keygen> element, and they would like it to be removed from HTML5. But what does "implementing" <keygen> mean? Well, the point of <keygen> is to provide a cryptography API, but as Ian Hickson points out, the element itself "integrates tightly with the form submission model, it affects the DOM APIs of other elements, it affects the parser, it affects the form control validity model -- it's not a feature that can be sensibly considered 'optional' if our goal is cross-browser interoperability. However, there is an alternative that I think would still satisfy Microsoft's desires to not implement <keygen>'s cryptographic features while still bringing interoperability to the platform in every other respect: we could make the support of each individual signature algorithm optional."

r3843 does just that -- it makes the cryptography parts of <keygen> optional. It is important that the element itself be implemented (even without the crypto bits), because it interacts with the DOM and the parsing model in strange ways.

As a postscript of sorts, I should point out that a recent change to the keytype attribute (r3868) allows client-side script to detect whether the crypto bits are actually supported. Detecting features is important, and this subtle change will allow authors to write feature detection scripts instead of relying on browser sniffing to decide whether to use keygen-based cryptography.

The art of conversation is, like, dead and stuff

Another hot topic this week is the removal of the <dialog> element. As I mentioned at the beginning of this article, Microsoft questioned the wisdom of a specialized element for marking up dialog. Other people have suggested that the element does not actually go far enough -- it lets you mark up basic conversation (people talking), but provides no semantics for stage directions, actions, thoughts, voiceover narration, and so on. (See also: Unwebbable, "The screenplay problem.")

To decide this burning question, the <dialog> element was removed, and conversations are now listed in the section on Common idioms without dedicated elements, with an example of how one might mark up a conversation with more generic markup, if one were inclined to do so. Predictably, this has already caused some backlash from the pro-<dialog> camp.

The conversation about marking up conversations also intersects with another burning question: can you use the <cite> element to mark up a person's name? HTML 4 said yes, and even provided an example that used the <cite> element that way. Dan Connolly, who added the <cite> element to HTML 2 (yes, 2), says "I consider that a bug in the HTML 4 spec. I wish I had reviewed it more closely." Still, specs are normative, not what their authors say about them after-the-fact, and the web has collectively had 12 years of HTML 4 which explicitly blessed the technique. I've used <cite> to mark up people's names for years in my own markup, and I'm certainly not going to go back and change all those blog entries to conform to somebody's sense of purity.

New examples

Speaking of examples, the HTML5 spec just got a whole lot more of them. To wit:

r3784: <html> example
r3785: <head> example
r3786: <base> example
r3787: <link> example
r3789: <meta> example
r3790: <style> example
r3791: <script> example
r3793 and r3856: <noscript> example
r3796: <article> example
r3808: <iframe> example
r3809: <embed> example
r3810: <canvas> example
r3811: <math> example
r3814: <form> and <fieldset> examples
r3815: list="" example
r3816: readonly="" example
r3817: required="" example
r3818: multiple="" example
r3819: maxlength="" example
r3820: step, min, and max examples
r3821: <button> example
r3822: <select> example
r3823: <optgroup> example
r3824: <textarea> and <output> examples
r3825: <details> example
r3801 and r3806: <h1> example
r3804: <hr> and <span> examples
r3805: <del> example
r3863: example of how to get the filename out of input.value
r3799: example of how to mark up comments on a weblog using <article>, <section>, and <header>

Spelling HTML5

September 10th, 2009 by Henri Sivonen

What’s the right way to spell “HTML5”? The short answer is: “HTML5” (without a space).

People in the WHATWG community have commonly referred to HTML5 as “HTML5” for quite a while. However, when the W3C HTML WG voted on adopting “Web Applications 1.0” the question about the title said “HTML 5”. Thus, the W3C HTML WG voted to adopt “HTML 5” as the title, but it wasn’t a vote for or against the space but about “HTML” and “5” in contrast to e.g. “Web Applications 1.0”. Anyway, as a result, the spec was retitled literally “HTML 5”.

This led to inconsistency. Sometimes people kept writing “HTML5” and sometimes “HTML 5” (even on whatwg.org). This kind of inconsistency is bad for branding. It was the first issue the Super Friends pointed out.

Now both the WHATWG Draft Standard and W3C Editor’s Draft spell it “HTML5”.

Posted in WHATWG

Response to “Notes on HTML 5”

September 3rd, 2009 by Mark Pilgrim, Google

The W3C Technical Architecture Group (TAG) is in the process of reviewing HTML 5. Noah Mendelsohn recently posted his initial, personal, not-speaking-on-behalf-of-TAG notes on HTML 5. Here are my initial, personal, not-speaking-on-behalf-of-WHATWG responses.

Limitations of the XML serialization

This may be old news, but I was surprised to see that document.write() is not supported when parsing the XML serialization. This seems to put the nail in the coffin of XML as a serialization format for colloquial HTML. I understand that there are a variety of issues in making a sensible definition of how this would work, but my intuition is that it could be done reasonably cleanly (albeit not with most off-the-shelf XML parsers).

Many, many things helped drive the nail in the coffin of XML as a serialization format for colloquial HTML. This was probably one of them. Others that come to mind:

Draconian error handling enforced at runtime does not scale to the complexities of modern-day web applications. Ensuring well-formedness becomes increasingly difficult when content is dynamically cobbled together from multiple sources, some of which are beyond your control (user-generated content, third-party ad servers, and so on).
It provides no perceivable benefit to users. Draconianly handled content does not do more, does not download faster, and does not render faster than permissively handled content. Indeed, it is almost guaranteed to download slower, because it requires more bytes to express the same meaning -- in the form of end tags, self-closing tags, quoted attributes, and other markup which provides no end-user benefit but serves only to satisfy the artificial constraints of an intentionally restricted syntax.
IE never supported it, forcing you down the rabbit hole of polyglot documents.

Other applicable specifications

"Authors must not use elements, attributes, and attribute values for purposes other than their appropriate intended semantic purpose. Authors must not use elements, attributes, and attribute values that are not permitted by this specification or other applicable specifications." This is one of the most important sentences in the entire specification, but it's somewhat vague. If "other applicable specifications" means: any specification that anyone claims is applicable to HTML 5 extension, then we can extend the langauge with most any element without breaking conformance; if "applicable specifications" is a smaller (or empty) set, then this may be saying that HTML 5 has limited (or zero) extensibility.

The phrase "or other applicable specifications" is indeed quite important, and quite intentional. It explicitly allows something that previous HTML specifications only implicitly allowed, or disallowed-but-everyone-ignored-that-part: future specifications which extend the vocabulary of previous specifications. Ian Hickson uses RDFa as an example: "If an RDFa specification said that text/html could have arbitrary xmlns:* attributes, then the HTML5 specification would (by virtue of the above-quoted sentence) defer to it and thus it would be allowed. ... Of course, if a community doesn't acknowledge the authority of such a spec, and they _do_ acknowledge the authority of the HTML5 spec, then it would be (for them) as if that spec didn't exist. Similarly, there might be a community that only acknowledges the HTML4 spec and doesn't consider HTML5 to be relevant, in which case for them, HTML5 isn't relevant. This is how specs work."

In response to a question about how validators could possibly work with such flexible constraints, Ian Hickson writes, "The same way validators work now. The validator implementors decide which specs they think are relevant to their users. The W3C CSS validator, for instance, can be configured to check against CSS2.1 rules or against SVG rules about CSS. It can't be configured to check against CSS2.1 + the :-moz-any-link extension, because the CSS validator implementors have decided that CSS2.1 and SVG are relevant, but not the Mozilla CSS extensions. ... The W3C HTML validator, similarly, supports checking a document against XHTML 1.1, or XHTML + RDFa, or XHTML + SVG, but doesn't support checking it against XHTML + DOCBOOK. So its implementors have decided that RDFa, SVG, and XHTML are relevant, but DOCBOOK is not."

People who think that such lack of constraints can only lead to madness go strangely silent when it is pointed out that they have already violated such constraints, and the world failed to end as a result.

URL terminology

The HTML 5 draft uses the term URL, not URI.

I was under the impression that HTML 5 uses the term "URL" because that's what everyone else in the world uses (outside a few standards wonks who know the difference between URLs, URIs, IRIs, XRIs, LEIRIs, and so on). A few minutes of research supported my impression. Quoth Ian Hickson: "'URL' is what everyone outside the standards world calls them. The few people who understand what on earth IRI, URN, URI, and URL are supposed to mean and how to distinguish them have demonstrated that they are able to understand such complicated terminology and can deal with the reuse of the term 'URL'. Others, who think 'URL' mean exactly what the HTML5 spec defines it as, have not demonstrated an ability to understand these subtleties and are better off with us using the term they're familiar with. The real solution is for the URI and IRI specs to be merged, for the URI spec to change its definitions to match what 'URL' is defined as in HTML5 (e.g. finally defining error handling as part of the core spec), and for everyone to stop using terms other than 'URL'."

It should be noted that not everyone agrees with this. For example, Roy Fielding (who obviously understands the subtle differences between URLs and other things) recently stated: "Use of the term URL in a manner that directly contradicts an Internet standard is negligent and childish. HTML5 can rot until that is fixed." Maciej Stachowiak, recently appointed co-chair of the W3C HTML Working Group, recently stated: "We need to get the references in order first, because whether HTML5 references Web Address, or IRIbis, or something else, makes a difference to what we'll think about the naming issue. We need to decide as a Working Group if it's acceptable to use the term URL in a different way than RFC3986 (while making the difference clear). If it's unacceptable, then we need to propose an alternate term."

As the old saying goes, "There are only two hard problems in Computer Science: cache invalidation and naming things." I personally don't care about the matter either way, but there is obviously a wide spectrum of opinion.

IRI-bis

It's unclear whether the factoring to reference WebAddr and/or IRI-bis will be retained.

"WebAddr" refers to Web Addresses, a now-defunct proposal to split out the definition of "URL" that HTML 5 uses (which intentionally differs from the "official" definition in order to handle existing web content). The work on the "Web Addresses" specification has now been rolled into IRI-bis; "bis" means "next," so "IRI-bis" means "the next version of the IRI specification." According to Ian Hickson, some important definitions were lost in the process of splitting out "Web Addresses" from HTML 5 and subsequently merging "Web Addresses" into IRI-bis. There is also some feedback about newlines within URLs.

Work on IRI-bis is ongoing. As it relates to HTML 5, it is tracked as HTML ISSUE-56.

Content sniffing

HTML 5 calls for user agents to ignore normative Content-type in certain cases.

HTML 5 calls for user agents to ignore normative Content-Type in certain cases because this is required to handle existing web content. Based on the research of Adam Barth and others [PDF] into the content sniffing rules of a certain closed-source market-dominating browser, progress has been made towards reducing the amount of content sniffing on the public web. (Counterpoint: two steps forward, one step back.)

Personally, I have long been opposed to content sniffing, but if sniffing is going to occur, I would vastly prefer documented algorithms to undocumented ones. The "hope" behind documenting the sniffing rules now is that the web community can "freeze" the rules now and forever, i.e. not add any more complexity to an already complex world. I am personally skeptical, since HTML 5 also introduces (or at least promotes-to-their-own-elements) two new media families, audio and video, for which undocumented or underdocumented sniffing may already occur within proprietary browser plug-ins. And with @font-face support shipping or on the verge of shipping in multiple browsers, there may be new sniffing rules introduced there as well. I hope my concerns are unfounded.

Still, having sniffing rules documented in HTML 5 may -- someday soon -- reduce the complexity of a shipping product. And how often does that happen?

Willful violations

HTML 5 acknowledges in several places that it is in "willful violation" of other specifications from the W3C and IETF.

As stated in §1.5.2 Compliance with other specifications, "This specification interacts with and relies on a wide variety of other specifications. In certain circumstances, unfortunately, the desire to be compatible with legacy content has led to this specification violating the requirements of these other specifications. Whenever this has occurred, the transgressions have been noted as 'willful violations'."

This is the complete list of "willful violations" in the August 25th W3C Editor's Draft of HTML 5. (The WHATWG draft changes almost daily, whenever a change is checked in.)

§2.5.1 Terminology: "The term "URL" in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term "URL" as used herein is really called something else altogether. This is a willful violation of RFC 3986."
§2.7 Character Encodings: "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification, motivated by a desire for compatibility with legacy content." Related bugs: bug 7444, bug 7453, bug 7215, bug 7381. There was a recent discussion about character encoding, starting with Addison Phillips' feedback on behalf of the I18N working group, with several followups. Some salient quotes: Maciej (Apple, Safari): "Browsers for Latin-script locales pretty much universally use Windows-1252 as the default of last resort. This is necessary to be compatible with legacy content on the existing Web." Mark Davis (Google): "At Google, the encoding label is taken only as a weak signal (a small factor in the heuristic detection). It is completely overwhelmed by the byte content analysis. (There are too many unlabeled pages *and mislabeled pages* for the label to be used as is.)" Henri Sivonen (Mozilla contributor), in response to a question about what constitutes a "legacy environment" for the purposes of character encoding detection: "The Web is a legacy environment." (Ian Hickson echoes this sentiment.)
§2.7 Character Encodings: "The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content."
§3.4 Interactions with XPath and XSLT: "These requirements are a willful violation of the XPath 1.0 specification, motivated by desire to have implementations be compatible with legacy content while still supporting the changes that this specification introduces to HTML regarding which namespace is used for HTML elements."
§3.4 Interactions with XPath and XSLT: "If the transformation program outputs an element in no namespace, the processor must, prior to constructing the corresponding DOM element node, change the namespace of the element to the HTML namespace, ASCII-lowercase the element's local name, and ASCII-lowercase the names of any non-namespaced attributes on the element. This requirement is a willful violation of the XSLT 1.0 specification, required because this specification changes the namespaces and case-sensitivity rules of HTML in a manner that would otherwise be incompatible with DOM-based XSLT transformations. (Processors that serialize the output are unaffected.)" There is a long discussion of these two violations in the comments of bug 7059.
§4.10.16.3 Form submission algorithm: "Step 9: If action is the empty string, let action be the document's address. This step is a willful violation of RFC 3986, which would require base URL processing here. This violation is motivated by a desire for compatibility with legacy content."
§6.5.3 Script processing model: "If the script's global object is a Window object, then in JavaScript, the this keyword in the global scope must return the Window object's WindowProxy object. This is a willful violation of the JavaScript specification current at the time of writing (ECMAScript edition 3). The JavaScript specification requires that the this keyword in the global scope return the global object, but this is not compatible with the security design prevalent in implementations as specified herein."
§12.3.4 Other elements, attributes, and APIs: Regarding document.all, "These requirements are a willful violation of the JavaScript specification current at the time of writing (ECMAScript edition 3). The JavaScript specification requires that the ToBoolean() operator convert all objects to the true value, and does not have provisions for objects acting as if they were undefined for the purposes of certain operators. This violation is motivated by a desire for compatibility with two classes of legacy content: one that uses the presence of document.all as a way to detect legacy user agents, and one that only supports those legacy user agents and uses the document.all object without testing for its presence first."

As you can probably guess from these quotes, the HTML 5 community has decided that compatibility with existing web content trumps all other concerns. Other than the use of the term "URL," all of the "willful violations" are instances where other specifications do not adequately describe existing web content. I personally do not understand most of the issues listed here, so I have no opinion on whether the alleged benefit of violating existing standards is worth the alleged cost.
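Two of the simpler violations can at least be made concrete. The sketch below is my own illustration, not text from the spec: the URLs, the helper names rfc3986_action and html5_action, and the sample bytes are all made up. It assumes a document whose base URL (as if set by a base element) differs from its own address, and a two-byte UTF-16 stream with no byte order mark.

```python
from urllib.parse import urljoin

# --- Empty form action (section 4.10.16.3) ---
# Hypothetical page: its own address and its base URL differ.
document_address = "http://example.com/a/page.html"
base_url = "http://example.com/b/"


def rfc3986_action(action):
    # RFC 3986 style: always resolve the reference against the base URL.
    return urljoin(base_url, action)


def html5_action(action):
    # HTML 5 style: an empty action means the document's own address.
    if action == "":
        return document_address
    return urljoin(base_url, action)


# --- UTF-16 without a byte order mark (section 2.7) ---
raw = b"A\x00"  # two bytes, no BOM

as_little_endian = raw.decode("utf-16-le")  # the HTML 5 default
as_big_endian = raw.decode("utf-16-be")     # what RFC 2781 suggests
```

In this setup, rfc3986_action("") gives http://example.com/b/ while html5_action("") gives http://example.com/a/page.html, so the form would be submitted to a different place. Likewise the same two bytes decode to the letter "A" as little-endian and to the single character U+4100 as big-endian. In both cases the spec sides with whatever existing content apparently expects.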

Versioning

In-band global version identifiers, if new implementations handle them reasonably, may be useful for (a) authoring applications that want to track versions used for authoring (b) informative error handling when applications encounter constructs that are apparently 'in error'.

The idea of version identifiers has been hashed and rehashed throughout the 5+ year process of defining HTML 5. Most notably, Microsoft proposed a version identifier shortly after the W3C HTML Working Group reformed around HTML 5. Their proposal generated much discussion, but was not ultimately adopted. Several months later, Microsoft shipped Internet Explorer 8 with a "feature" called X-UA-Compatible, which serves as a kind of IE-specific version identifier. I am personally not a fan of this approach, in part because it adds a lot of complexity for web developers who want to figure out why the still-dominant browser doesn't render their content according to standards.

Versioning in HTML 5 is tracked as HTML ISSUE-4.




The rise and fall of `<keygen>`