The WHATWG Blog — encoding

This Week in HTML 5 – Episode 28

Thursday, April 2nd, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news for the week of March 23^rd is that SVG can once again be included directly in HTML 5 documents served as text/html:

I've made the following changes to HTML5:

Uncommented out the XXXSVG bits, reintroducing the ability to have SVG content in text/html.

Defined <script> processing for SVG <script> in text/html by deferring to the SVG Tiny 1.2 spec and blocking synchronous document.write(). The alternative to this is to integrate the SVG script processing model with the (pretty complicated) HTML script processing model, which would require changes to SVG and might result in a dependency from SVG to HTML5. Anne would like to do this, but I'm not convinced it's wise, and it certainly would be more complex than what we have now. If we ever want to add async="" or defer="" to SVG scripts, then this would probably be a necessary part of that process, though.

Added a paragraph suggesting: "To enable authors to use SVG tools that only accept SVG in its XML form, interactive HTML user agents are encouraged to provide a way to export any SVG fragment as a namespace-well-formed XML fragment."

Added a paragraph defining the allowed content model for SVG <title> elements in text/html documents.

r2904 (and, briefly, r2910) give all the details of this solution. There are still a number of differences between the text in HTML 5 and the proposal brought by the SVG working group. Some of these are addressed further down in the announcement:

SVG-in-XML is case-preserving; SVG-in-HTML is not.
SVG-in-XML requires quoted attribute values; SVG-in-HTML does not.
When SVG-in-XML uses CDATA blocks, they show up as CDATA nodes in the DOM; when SVG-in-HTML uses CDATA blocks, they show up in the DOM as conventional text nodes. [Clarified based on Henri's feedback]
The <svg> element can not be the root element of a text/html document.

Doug Schepers, who has been the SVG working group's HTML 5 liason, does not like this solution:

To be honest, I think it's not a good use of the SVG WG's time to provide feedback when Ian already has his mind made up, even if I don't believe that he is citing real evidence to back up his decision. What I see is this: one set of implementers and authors (the SVG WG) and the majority of the author and user community (in public comments) asking for some sort of preservation of SVG as an XML format, even if it's looser and error-corrected in practice, and a few implementers (Jonas and Lachy, most notably) disagreeing, and Ian giving preference to the minority opinion. Maybe there is sound technical rationale for doing so, but I haven't been satisfied on that score.

Turning to technical matters, one of the features of web forms in HTML 5 is allowing the attributes for form submission on either the <form> element (as in HTML 4) or on the submit button (new in HTML 5). Originally, the attributes for submit buttons were named action, enctype, method, novalidate, and target, which exactly mirrored the attribute names that could be declared on the <form> element.

However, in January 2008, Hallvord R. M. Steen (Opera developer) noted that "INPUT action [attribute] breaks web applications frequently. Both GMail and Yahoo mail (the new Oddpost-based version) use input/button.action and were seriously broken by WF2's action attribute."

Following up in November 2008, Ian Hickson replied, "I notice that Opera still supports 'action' and doesn't seem to have problems in GMail; is this still a problem?" to which Hallvord replied, "GMail fixed it on their side a while ago. It is still a problem with Yahoo mail, breaking most buttons in their UI for a browser that supports 'action'. We work around this with a browser.js hack. ('Still a problem' means 'I tested this again a couple of weeks ago and things were still broken without this patch'.)"

Ian replied, "This is certainly problematic. It's unclear what we should do. It's hard to use another attribute name, since the whole point is reusing existing ones... can we trigger this based on quirks mode, maybe? Though I hate to add new quirks." Hallvord did not like that idea: "In my personal opinion, I don't see why re-using attribute names is considered so important if we can find an alternative that feels memorable and usable. How does this look? <input type="submit" formaction="http://www.example.com/">"

Finally, in March 2009, Ian replied:

That seems reasonable. I've changed "action", "method", "target", "enctype" and "novalidate" attributes on <input> and <button> to start with "form" instead: "formaction", "formmethod", "formtarget", "formenctype" and "formnovalidate".

And thus we have r2890: Rename attributes for form submission to avoid clashes with existing usage.

Other interesting changes this week:

r2889 adds support for select.add(e) and select.options.add(e) with no second argument.
r2888 defines how to determine the character encoding of Web Worker scripts. Briefly, it says to look for a Byte Order Mark, then look at the Content-Type HTTP header, then fall back to UTF-8.
r2898, r2899, r2901, r2914, and r2916 define a locking mechanism to allow thread-safe read/write access to document.cookie and .localStorage. The lock is acquired during page fetching (which sets the cookie based on HTTP headers) and released once the cookie is set. It is also released automatically whenever something modal happens (such as window.alert()). (I first mentioned the discussion of this issue in episode 27. The problem is that Web Workers allows threaded client-side script execution, which means access to shared storage like document.cookie needs to be made explicitly thread-safe with some sort of locking mechanism.)

Tune in next week for another exciting episode of "This Week in HTML 5."

Posted in Weekly Review | 3 Comments »

This Week in HTML 5 – Episode 25

Tuesday, March 31st, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news for the week of February 23^rd (yes, I'm that far behind) is a collection of changes about how video is processed. The changes revolve around the resource selection algorithm to handle cases where the src attribute of a <video> element is set dynamically from script.

Background reading on the resource selection algorithm: Re: play() sometimes doesn't do anything now that load() is async, r2849: "Change the way resources are loaded for media elements to make it actually work", r2873: "make all invokations of the resource selection algorithm asynchronous."

This Week in HTML 5 – Episode 17

Tuesday, December 30th, 2008

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news this week is a major revamp of table headers, following up from the last major edits last March. Ian summarizes the most recent round of changes:

Header cells can now themselves have headers.

I have reversed the way the algorithm is presented, such that it starts from a cell and reports the headers rather than generating the list of headers for each cell on a header-by-header basis.

If headers="" points to a <td> element, the association is set up, but I have left this non-conforming to help authors catch mistakes.

Header cells that are automatically associating do not stop associating when they hit equivalent cells unless they have also hit a <td> first.

The "col" and "row" scope values now act like the implied auto value except that they force the direction.

Empty header cells don't get automatically associated.

I have removed the wide header cell heuristic.

I have made headers="" use the same ID discovery mechanism as getElementById(), to avoid implementations having to support multiple such mechanisms.

Finally, I have made the spec define if a header is a column header or a row header in the case where scope="" is omitted.

I haven't added summary="" on table; nothing particularly new has been raised on the topic since the last times I looked at this.

Accessibility advocates are disappointed by the continued non-inclusion of the summary attribute. Their reasoning is that "the summary attribute is a very, very practical and useful attribute," despite their own user testing that shows otherwise. As Ian put it, "I am hesitant to include a feature like summary="" when all evidence seems to point to it being widely misused by authors and ignored by the users it intends to help." As with all issues, this is not the final word on the matter, but it's where we stand today.

In other news, r2566 addresses a very subtle issue with fetching images. The problem stems from the following (arguably pointless) markup: <img src=""> A fair number of web pages actually try to declare an image with an empty src attribute. According to the HTTP and URL specifications, this markup means that there is an image at the same address as the HTML document -- a theoretically possible but highly unlikely scenario. Internet Explorer apparently catches this mistake and just silently drops the image. Other browsers do not; they will actually try to fetch the image, which results in a "duplicate" request for the page (once to successfully retrieve the page, and again to unsuccessfully retrieve the image).

Boris Zbarsky, a leading Mozilla developer, states

We (Gecko) have had 28 independent bug reports filed (with people bothering to create an account in the bug database, etc) about the behavior difference from IE here. That's a much larger number of bug reports than we usually get about a given issue. I can't tell you why this pattern is so common (e.g. whether some authoring frameworks produce it in some cases), but it seems that a number of web developers not only produce markup like this but notice the requests in their HTTP logs and file bugs about it.

r2566 addresses the issue by special-casing <img src> to allow browsers to ignore an image if its fetch request would result in fetching exactly the same URL as its HTML document:

When an img is created with a src attribute, and whenever the src attribute is set subsequently, the user agent must fetch the resource specifed by the src attribute's value, unless the user agent cannot support images, or its support for images has been disabled, or the user agent only fetches elements on demand, or the element's src attribute has a value that is an ignored self-reference.

The src attribute's value is an ignored self-reference if its value is the empty string, and the base URI of the element is the same as the document's address.

Other interesting tidbits this week:

r2568 adds a storageArea attribute to StorageEvent object. [StorageEvent deficiency]
r2556 changes the processing model of the <meta charset> attribute by requiring that it appear in the first 512 bytes of the document. For those of you playing along at home, <meta charset="..."> is the new <meta http-equiv="Content-Type" content="text/html; charset=...">. Both forms are fully supported in all major browsers. [Comparing conformance requirements against real-world docs]
r2557, r2559, r2560, r2562, r2563, and r2604 add a variety of common markup errors to the list of errors that HTML validators may treat as minor. [Re: comparing conformance requirements against real-world docs]
r2561 allows the height and width attributes on <input type="image">, a construct that is already supported by all major browsers. [Re: comparing conformance requirements against real-world docs]
r2601 adds an example of something that all browsers do anyway -- killing scripts that run too long.
r2597 removes the notification API, which was kicked around in 2006 but never saw significant interest from either authors or browser vendors. [Notifications API removed]
r2596 defines window.close(), window.focus(), and window.blur(). The focus() and blur() methods have historically been used to produce "pop-up" and "pop-under" windows containing advertisements. Most modern browsers now control how and whether scripts can do this, and the HTML 5 specification goes so far as to recommend that "[u]ser agents are encouraged to ignore calls to this blur() method entirely."
r2552 gives an example of embedding RDF metadata in XHTML. As the spec notes, this is not possible in HTML, although you could always use RDFa.
r2595 gives an example of marking up a tag cloud.

Tune in next week for another exciting episode of "This Week in HTML 5."

Posted in Weekly Review | 12 Comments »

This Week in HTML 5 – Episode 5

Monday, September 15th, 2008

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news this week is the merging of the Web Forms 2 specification into the HTML 5 specification, and updating it with the collected feedback of the last two years.

form [r2142]
fieldset [r2143]
input [r2144]
button [r2145]
label [r2146]
select [r2148]
datalist [r2150]
optgroup [r2151]
option [r2152]
textarea [r2153]
output [r2154]
The form attribute on various elements, which associates an element with its parent form, and the elements attribute of the form element, which associates the form with its elements. (Form-associated elements no longer need to be children of the <form> element itself within the DOM, so explicit association is required. Form-associated elements that are DOM children of the <form> element are implicitly associated, so your existing markup will continue to work the way you think it does.) [r2157]

Meanwhile, revisions 2160, 2161, 2163, 2164, and 2165 begin the long, hard process of defining when and how a form is submitted. This is one of those things that "everybody knows" but nobody has actually, you know, documented. For example, do you submit a form when you toggle a checkbox? Of course not, "everybody knows" that. Is an unchecked checkbox included in the form data when it is submitted? No, "everybody knows" that too. How do you submit to an ftp:// URL? A mailto:// URL? A data:// URL? What are the three values of the enctype attribute, and how do they affect the form data when it is submitted to a data:// URL with the PUT method?¹ Umm... How exactly do you construct the names of the X and Y coordinates to submit a server-side image map? (By the way, server-side image maps are inaccessible, so don't use them unless you provide an accessible fallback form with equivalent functionality.) Web Forms 2 (and now HTML 5) will tell you.

Another interesting set of changes revolves around character encoding. If you don't know anything about character encoding, I would strongly recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) Tim Bray's three-part series, On the Goodness of Unicode, On Character Strings, and Characters vs. Bytes, and anything written by Martin Dürst.

Now then: r2125 warns against using EBCDIC on public-facing web pages. For those of you under 30, EBCDIC is a character encoding invented by IBM in the 1960s for their System/360 mainframe. On non-IBM hardware, EBCDIC lost the encoding war to ASCII, and later Unicode, and it is rarely seen on the public web. r2131 says that browsers should ignore an out-of-band encoding definition that they do not support. For example, if a web page is served with an HTTP Content-Type header with a charset parameter that defines a character encoding the browser does not support, the browser should ignore it and continue the process of determining the character encoding by other means. And finally, r2137 says that browsers should treat US-ASCII as Windows-1252 when determining character encoding. As the HTML 5 specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."

Other interesting changes this week:

r2122 makes it non-conforming to have empty unquoted attribute values like <input value=>
r2130 provides a special case for the definitionURL attribute for those embedding MathML in HTML.
r2140 and r2141 allow a legacy DOCTYPE if and only if the HTML page is the result of an XSLT transform. The exact DOCTYPE string is <!DOCTYPE HTML PUBLIC 'XSLT-compat'>

Tune in next week for another exciting episode of "This Week in HTML 5."

Footnotes:

When submitting to a data:// URL with the PUT method, the three values of enctype are application/x-www-form-urlencoded, multipart/form-data, and text/plain. Amaze your friends at the next tech conference!

Posted in Weekly Review | 15 Comments »