The WHATWG Blog

Please leave your sense of logic at the door, thanks!

This Week in HTML 5 – Episode 5

by Mark Pilgrim, Google in Weekly Review

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news this week is the merging of the Web Forms 2 specification into the HTML 5 specification, and updating it with the collected feedback of the last two years.

Meanwhile, revisions 2160, 2161, 2163, 2164, and 2165 begin the long, hard process of defining when and how a form is submitted. This is one of those things that "everybody knows" but nobody has actually, you know, documented. For example, do you submit a form when you toggle a checkbox? Of course not, "everybody knows" that. Is an unchecked checkbox included in the form data when it is submitted? No, "everybody knows" that too. How do you submit to an ftp: URL? A mailto: URL? A data: URL? What are the three values of the enctype attribute, and how do they affect the form data when it is submitted to a data: URL with the PUT method?[1] Umm... How exactly do you construct the names of the X and Y coordinates to submit a server-side image map? (By the way, server-side image maps are inaccessible, so don't use them unless you provide an accessible fallback form with equivalent functionality.) Web Forms 2 (and now HTML 5) will tell you.
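To make a couple of those "everybody knows" rules concrete, here is a minimal Python sketch of building the name/value pairs for a form. It is an illustration under stated assumptions, not the algorithm from the specification: the function name `build_form_data_set`, the dictionary keys, and the `click` parameter are all hypothetical, and the sketch treats any image control as the one that was clicked. It shows only two behaviors: an unchecked checkbox contributes nothing, and an image button named `map` contributes `map.x` and `map.y`.

```python
from urllib.parse import urlencode


def build_form_data_set(controls, click=None):
    """Return (name, value) pairs for a simplified form.

    `controls` is a list of dicts with hypothetical keys
    'type', 'name', 'value', and 'checked'. `click` is an
    optional (x, y) tuple for an image button.
    """
    pairs = []
    for control in controls:
        kind = control.get("type", "text")
        name = control.get("name", "")
        if kind == "checkbox":
            # An unchecked checkbox is simply left out of the data.
            if not control.get("checked"):
                continue
            if name:
                # A checkbox with no value attribute submits "on".
                pairs.append((name, control.get("value", "on")))
        elif kind == "image":
            # Simplification: assume this image button was the one clicked.
            if click is None:
                continue
            x, y = click
            # Coordinates are submitted as name.x and name.y,
            # or plain x and y when the control has no name.
            prefix = name + "." if name else ""
            pairs.append((prefix + "x", str(x)))
            pairs.append((prefix + "y", str(y)))
        elif name:
            pairs.append((name, control.get("value", "")))
    return pairs


controls = [
    {"type": "text", "name": "q", "value": "html5"},
    {"type": "checkbox", "name": "subscribe", "checked": False},
    {"type": "checkbox", "name": "terms", "checked": True},
    {"type": "image", "name": "map"},
]
pairs = build_form_data_set(controls, click=(12, 34))
# Encode the pairs as application/x-www-form-urlencoded.
encoded = urlencode(pairs)
print(encoded)
```

The unchecked `subscribe` checkbox never appears in the output, while the checked `terms` checkbox falls back to the value `on` because it has no value of its own.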

Another interesting set of changes revolves around character encoding. If you don't know anything about character encoding, I would strongly recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), Tim Bray's three-part series (On the Goodness of Unicode, On Character Strings, and Characters vs. Bytes), and anything written by Martin Dürst.

Now then: r2125 warns against using EBCDIC on public-facing web pages. For those of you under 30, EBCDIC is a character encoding invented by IBM in the 1960s for their System/360 mainframe. On non-IBM hardware, EBCDIC lost the encoding war to ASCII, and later Unicode, and it is rarely seen on the public web. r2131 says that browsers should ignore an out-of-band encoding definition that they do not support. For example, if a web page is served with an HTTP Content-Type header with a charset parameter that defines a character encoding the browser does not support, the browser should ignore it and continue the process of determining the character encoding by other means. And finally, r2137 says that browsers should treat US-ASCII as Windows-1252 when determining character encoding. As the HTML 5 specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."
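The two rules from r2131 and r2137 can be sketched in a few lines of Python. This is a rough illustration, not the specification's encoding-sniffing algorithm: the function name `resolve_declared_charset` and the one-entry override table are assumptions made for this example (the specification's actual table has more rows), and "supported" here just means "Python's `codecs` module knows it."

```python
import codecs

# Hypothetical override table; this post only mentions the
# US-ASCII -> Windows-1252 entry.
ENCODING_OVERRIDES = {"us-ascii": "windows-1252"}


def resolve_declared_charset(label):
    """Return the encoding to use for a declared charset label.

    Returns None when the label is missing or unsupported, which
    signals the caller to keep determining the encoding by other means.
    """
    if not label:
        return None
    normalized = label.strip().lower()
    # Treat certain encodings as other encodings (r2137).
    normalized = ENCODING_OVERRIDES.get(normalized, normalized)
    try:
        codecs.lookup(normalized)
    except LookupError:
        # Unsupported out-of-band declaration: ignore it (r2131).
        return None
    return normalized


encoding = resolve_declared_charset("US-ASCII")
# Byte 0x93 is not valid ASCII, but it is a curly quote in Windows-1252.
text = b"\x93hello\x94".decode(encoding)
print(encoding, text)
```

The practical effect is that a page mislabeled as US-ASCII but containing Windows-1252 "smart quotes" still decodes to the characters its author intended.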

Other interesting changes this week:

Tune in next week for another exciting episode of "This Week in HTML 5."


Footnotes:

  1. When submitting to a data: URL with the PUT method, the three values of enctype are application/x-www-form-urlencoded, multipart/form-data, and text/plain. Amaze your friends at the next tech conference!

15 Responses to “This Week in HTML 5 – Episode 5”

  1. Wow, I’ve never seen a standard ‘in the making’ like this, complete with changesets. What a nice way to give people insight into the process!

    Keep doing this, it’s great.

    Now for the feedback. Why single out EBCDIC and why discourage any encoding?

  2. Steve, the point of specifying is that when new browsers come to the market, they know exactly what they have to implement in order to render Web pages rather than having to reverse engineer what other browsers are doing.

    Mike, some encodings are discouraged because they have led to security issues in the past. Also, keeping the number of encodings to a limited set is a good thing. We don’t want to keep adding new encoding converters just for the sake of it.

  3. Anne, I didn’t mean to imply it shouldn’t be specified, I was just curious about when this practice started/if any UAs never followed it.

  4. Steve, ah ok. I don’t really know when this started exactly. Most likely a long time ago, and Internet Explorer was probably first, given the encoding.

  5. Repetition templates are most likely dropped as they only address a small subset of the use cases. If someone provides more research we might include another mechanism that addresses more use cases.

  6. Mark – I appreciate these weekly HTML5 episodes. Please KUTGW!

    Anyhow, it’s good to see Web Forms 2 is moving forward.