The WHATWG Blog

Please leave your sense of logic at the door, thanks!

Archive for February, 2009

This Week Day in HTML 5 – Episode 23

Friday, February 27th, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group. The pace of HTML 5 changes has reached a fever pitch, so I'm going to split out these episodes into daily (!) rather than weekly summaries until things calm down.

The big news for February 12, 2009 is the minting of the the spellcheck attribute, which web authors can use to provide a hint about whether a particular form field expects the sort of input that would benefit from client-side spell checking. r2801 lays it out:

User agents can support the checking of spelling and grammar of editable text, either in form controls (such as the value of textarea elements), or in elements in an editing host (using contenteditable).

For each element, user agents must establish a default behavior, either through defaults or through preferences expressed by the user. There are three possible default behaviors for each element:

true-by-default
The element will be checked for spelling and grammar if its contents are editable.
false-by-default
The element will never be checked for spelling and grammar.
inherit-by-default
The element's default behavior is the same as its parent element's. Elements that have no parent element cannot have this as their default behavior.

The spellcheck attribute is an enumerated attribute whose keywords are true and false. The true keyword map to the true state. The false keyword maps to the false state. In addition, there is a third state, the inherit state, which is the missing value default (and the invalid value default).

Starting with version 2, Mozilla Firefox has offered built-in spell checking of <textarea> elements (on by default) and <input type=text> elements (off by default). You can change the default behavior by setting the spellcheck attribute. (test case)

The other big news of the day is the addition of the <form autocomplete> attribute, while lets web authors provide a hint about whether they would like browsers to save the form's contents and pre-fill the form the next time the user encounters it. r2798:

When an input element's resulting autocompletion state is on, the user agent may store the value entered by the user so that if the user returns to the page, the UA can prefill the form. Otherwise, the user agent should not remember the control's value.

... A user agent may allow the user to override the resulting autocompletion state and set it to always on, always allowing values to be remembered and prefilled), or always off, never remembering values. However, the ability to override the resulting autocompletion state to on should not be trivially accessible, as there are significant security implications for the user if all values are always remembered, regardless of the site's preferences.

<form autocomplete> is commonly used on sensitive login forms where the site does not want users to be able to store their password in their browser (which is generally done in an insecure way). Most browsers honor these hints by default, although there are ways to override them if you dislike the idea of web authors disabling useful bits of your browser's functionality.

Other interesting changes of the day:

Discussion of the day: Gregory J. Rosmaita gives details on report of PFWG HTML5 actions ("PFWG" = Protocols and Formats Working Group). The original post was about accessibility issues, specifically a response to the <image alt> attribute becoming optional and the omission of the headers and summary attributes in the HTML 5 table model. But the thread was quickly hijacked by a discussion of the fact that the W3C published another working draft of HTML 5 on February 12.

Wait... what? Oh yes, in true "burying the lede" fashion, I suppose I should mention that the biggest news of February 12th is that the W3C published another working draft of HTML 5. Except that readers of this series will find it uninteresting, since it's just a snapshot of the progress-to-date. (The spec is "published" on whatwg.org every time it changes anyway.) Working drafts have no formal status; they are merely intended to encourage early and wide review. Still, the rest of the world might think it's important, so be sure to bring it up at this weekend's cocktail parties.

Tune in... well, sometime soon-ish for another exciting episode of "This Week Day In HTML 5."

Posted in Weekly Review | 2 Comments »

This Week Day In HTML 5 – Episode 22

Wednesday, February 25th, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group. The pace of HTML 5 changes has reached a fever pitch, so I'm going to split out these episodes into daily (!) rather than weekly summaries until things calm down.

The big news for February 11, 2009 is the addition of an algorithm to parse a color in an IE-compatible way. r2776 lays it all out:

Some obsolete legacy attributes parse colors in a more complicated manner, using the rules for parsing a legacy color value, which are given in the following algorithm. When invoked, the steps must be followed in the order given, aborting at the first step that returns a value. This algorithm will either return a simple color or an error.

  1. Let input be the string being parsed.

  2. If input is the empty string, then return an error.

  3. If input is an ASCII case-insensitive match for the string "transparent", then return an error.

  4. If input is an ASCII case-insensitive match for one of the keywords listed in the SVG color keywords or CSS2 System Colors sections of the CSS3 Color specification, then return the simple color corresponding to that keyword. [CSS3COLOR]

  5. If input is four characters long, and the first character in input is a U+0023 NUMBER SIGN (#) character, and the the last three characters of input are all in the range U+0030 DIGIT ZERO (0) .. U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A .. U+0046 LATIN CAPITAL LETTER F, and U+0061 LATIN SMALL LETTER A .. U+0066 LATIN SMALL LETTER F, then run these substeps:

    1. Let result be a simple color.

    2. Interpret the second character of input as a hexadecimal digit; let the red component of result be the resulting number multiplied by 17.

    3. Interpret the third character of input as a hexadecimal digit; let the green component of result be the resulting number multiplied by 17.

    4. Interpret the fourth character of input as a hexadecimal digit; let the blue component of result be the resulting number multiplied by 17.

    5. Return result.

  6. Replace any characters in input that have a Unicode codepoint greater than U+FFFF (i.e. any characters that are not in the basic multilingual plane) with the two-character string "00".

  7. If input is longer than 128 characters, truncate input, leaving only the first 128 characters.

  8. If the first character in input is a U+0023 NUMBER SIGN character (#), remove it.

  9. Replace any character in input that is not in the range U+0030 DIGIT ZERO (0) .. U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A .. U+0046 LATIN CAPITAL LETTER F, and U+0061 LATIN SMALL LETTER A .. U+0066 LATIN SMALL LETTER F with the character U+0030 DIGIT ZERO (0).

  10. While input's length is zero or not a multiple of three, append a U+0030 DIGIT ZERO (0) character to input.

  11. Split input into three strings of equal length, to obtain three components. Let length be the length of those components (one third the length of input).

  12. If length is greater than 8, then remove the leading length-8 characters in each component, and let length be 8.

  13. While length is greater than two and the first character in each component is a U+0030 DIGIT ZERO (0) character, remove that character and reduce length by one.

  14. If length is still greater than two, truncate each component, leaving only the first two characters in each.

  15. Let result be a simple color.

  16. Interpret the first component as a hexadecimal number; let the red component of result be the resulting number.

  17. Interpret the second component as a hexadecimal number; let the green component of result be the resulting number.

  18. Interpret the third component as a hexadecimal number; let the blue component of result be the resulting number.

  19. Return result.

Information on exactly which attributes are subject to this algorithm is scattered throughout the spec. Here is the complete list:

The other big news today is the addition of a section on matching HTML elements using selectors. Some of these (:link, :visited, :active) will be familiar to anyone who has written a CSS stylesheet, but there are a number of new selectors that correspond to concepts introduced in HTML 5.

Other interesting changes of the day:

Discussion of the day: What's the problem? "Reuse of 1998 XHTML namespace is potentially misleading/wrong". Take it away, Lachlan:

I believe the issue is that the XHTML2 WG think they have change control over that namespace URI and that we shouldn't be using it. Additionally, the latest XHTML 2 editor's draft is now using the namespace.

This issue has been discussed in depth around mid 2007. The problem is that XHTML5 and XHTML2 are completely incompatible with each other and they cannot possibly use the same namespace as each other.

But XHTML2 also has several major incompatibilities with XHTML1, which would effectively make it impossible to implement both XHTML 1.x and 2 in the same implementation, if they share the same namespace. XHTML 5, on the other hand, has not only been designed with compatibility in mind, success is dependent upon continuing to use the same namespace.

Basically, the only solution to this issue that should be considered is that we continue using the namespace and the XHTML2 WG use a different namespace.

I'm sure that will go over well with the 12 people who are still working on XHTML 2.

Tune in... well, sometime soon-ish for another exciting episode of "This Week Day In HTML 5."

Posted in Weekly Review | 6 Comments »

The Road to HTML 5: character encoding

Friday, February 13th, 2009

Welcome back to my semi-regular column, "The Road to HTML 5," where I'll try to explain some of the new elements, attributes, and other features in the upcoming HTML 5 specification.

The feature of the day is character encoding, specifically how to determine the character encoding of an HTML document. I am never happier than when I am writing about character encoding. But first, here is my standard "elevator pitch" description of what character encoding is:

When you think of "text," you probably think of "characters and symbols I see on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text," you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).

source

And once again, I'll repeat my standard set of background links for those of you who don't know anything about character encoding. You must read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) You should read Tim Bray's three-part series, On the Goodness of Unicode, On Character Strings, and Characters vs. Bytes, and anything written by Martin Dürst.

I should also point out that you should always specify a character encoding on every HTML page you serve. Not specifying an encoding can lead to security vulnerabilities.

So, how does your browser actually determine the character encoding of the stream of bytes that a web server sends? If you're familiar with HTTP headers, you may have seen a header like this:

Content-Type: text/html; charset="utf-8"

Briefly, this says that the web server thinks it's sending you an HTML document, and that it thinks the document uses the UTF-8 character encoding. Unfortunately, in the whole magnificent soup of the world wide web, very few authors actually have control over their HTTP server. Think Blogger: the content is provided by individuals, but the servers are run by Google. So HTML 4 provided a way to specify the character encoding in the HTML document itself. You've probably seen this too:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Briefly, this says that the web author thinks they have authored an HTML document using the UTF-8 character encoding. Now, you could easily imagine a situation where both the server and the document provide encoding information. Furthermore, they might not match (especially if they're run by different people). So which one wins? Well, there's a precedence order in case the document is served with conflicting information.

This is what HTML 4.01 has to say about the precedence order for determining the character encoding:

  1. User override (e.g. the user picked an encoding from a menu in their browser).
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
  4. The charset attribute set on an element that designates an external resource.
  5. Unspecified heuristic analysis.

And this is what HTML 5 has to say about it. I won't quote the whole thing here, but suffice to say it's a 7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while. The gist of it is

  1. User override.
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A Byte Order Mark before any other data in the HTML document itself.
  4. A META declaration with a "charset" attribute.
  5. A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
  6. Unspecified heuristic analysis.

...and then...

  1. Normalize the given character encoding string according to the Charset Alias Matching rules defined in Unicode Technical Standard #22.
  2. Override some problematic encodings, i.e. intentionally treat some encodings as if they were different encodings. The most common override is treating US-ASCII and ISO-8859-1 as Windows-1252, but there are several other encoding overrides listed in this table. As the specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."

Two things should leap out at you here. First, WTF is a <meta charset> attribute? Well, it's exactly what it sounds like. It looks like this:

<meta charset=UTF-8>

I was able to find only scattered discussion about this attribute on the WHATWG mailing list.

The best explanation of the new <meta charset> attribute was given a few months later, in an unrelated thread, on a separate mailing list. Andrew Sidwell explains:

The rationale for the <meta charset=""> attribute combination is that UAs already implement it, because people tend to leave things unquoted, like:

<META HTTP-EQUIV=Content-Type CONTENT=text/html; charset=ISO-8859-1>

(There are even a few <meta charset> test cases if you don't believe that browsers already do this.)

Second, who the f— does the WHATWG think they are specifying "a willful violation of the W3C Character Model specification" This is a fair question. As with many such questions, the answer is that HTML 5 is only codifying what browsers do already. ISO-8859-1 and Windows-1252 are very similar encodings. One place they differ is in so-called "smart quotes" and "curly apostrophes" — the pretty little typographical flourishes that authors love and that Microsoft Word (and many other editors) output by default. Many authors specify a ISO-8559-1 or US-ASCII encoding (because they copied that part of their template from somewhere else), but then they use curly quotes from the Windows-1252 encoding. This mistake is so widespread that browsers already treat ISO-8859-1 as Windows-1252. HTML 5 is just "paving the cowpaths" here.

To sum up: character encoding is complicated, and it has not been made any easier by several decades of poorly written software used by copy-and-paste–educated authors. You should always specify a character encoding on every HTML document, or bad things will happen. You can do it the hard way (HTTP Content-Type header), the easy way (<meta http-equiv> declaration), or the new way (<meta charset> attribute), but please do it. The web thanks you.

Posted in Tutorials | 11 Comments »

This Week in HTML 5 – Episode 21

Tuesday, February 10th, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news this week is more major work on the non-normative section on rendering HTML documents, including a lot of reverse-engineered documentation of legacy (invalid) attributes that users expect browsers to support.

In addition, one major section was dropped from HTML 5 this week: an algorithm for determining what object is under the cursor (presuming, of course, that the cursor is within the region of the screen which contains an HTML document, and the current context has a screen, and the current context has a cursor). Ian Hickson has announced on www-style that, in accordance with that group's consensus, the algorithm would be better maintained in a future CSS specification.

Around the web:

Tune in next week for another exciting episode of "This Week in HTML 5."

Posted in Weekly Review | 4 Comments »

This Week in HTML 5 – Episode 20

Tuesday, February 3rd, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news this week is the beginning of the non-normative section on rendering HTML documents. For those of you not up on spec-writing lingo, "non-normative" means "you can ignore this and still claim to be in compliance with the specification." It's advice, not commands. On the other hand, it's generally useful advice, so ignoring it completely is probably not in your best interests.

Currently, the rendering section includes advice on

Scrolling through the rest of the (mostly empty) rendering section shows lots of potential for future advice on form controls, data grids, favicons, and even the <marquee> element.

Rendering-related revisions: r2734, r2735, r2736, r2737, r2738.

Switching back to the normative parts of the spec, we have r2720, which makes the outerHTML property and the insertAdjacentHTML() method work in XHTML. For the purposes of this discussion — indeed, for the purposes of the entire HTML 5 specification — "XHTML" means "content served with a Content-Type: application/xhtml+xml". In addition, the section The XHTML Syntax has been entirely reorganized and rewritten to consolidate the rules for parsing and serializing XHTML documents and fragments. [Background: Re: outerHTML/insertAdjacentHTML in XML mode]

Other interesting tidbits this week:

Tune in next week for another exciting episode of "This Week in HTML 5."

Posted in Weekly Review | 3 Comments »