The WHATWG Blog — Mark Pilgrim, Google


The Road to HTML 5: contentEditable

Friday, March 6th, 2009

Welcome back to my semi-regular column, "The Road to HTML 5," where I'll try to explain some of the new elements, attributes, and other features in the upcoming HTML 5 specification.

The feature of the day is contentEditable, by which I mean client-side in-browser "rich text" editing. All major browsers support this now, including Firefox 3, Safari 3, Opera 9, Google Chrome, and Internet Explorer (since 5.5). Of course, the devil is in the details.

In this article:

What is contentEditable?
How does it work?
A brief and extremely biased timeline of standardization
Conclusion
Further reading

What is `contentEditable`?

There are really two attributes involved, designMode and contentEditable. The designMode attribute governs the entire document (i.e. it makes the entire document editable, like a dedicated HTML editor). The contentEditable attribute governs just the element on which it appears, and that element's children -- like a rich text editor control within a page. In fact, that was the original use case: enabling web developers to build rich text editors. There are now a variety of such editors available under various licenses.
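To make the distinction concrete, here is a minimal sketch of switching on each kind of editing from script. The function name `enableEditing` is made up for illustration, and it assumes `target` is either a document object or an element.

```javascript
// Sketch only: enableEditing is a hypothetical helper, not part of any API.
// `target` is assumed to be either a Document or an HTMLElement.
function enableEditing(target) {
  if ('designMode' in target) {
    // Whole document: the designMode property takes "on" or "off".
    target.designMode = 'on';
    return 'document';
  }
  // Single element and its children: contentEditable takes
  // "true", "false", or "inherit".
  target.contentEditable = 'true';
  return 'element';
}
```

In a page, this might be called as `enableEditing(document)` for an HTML-editor-style experience, or as `enableEditing(document.getElementById('editor'))` for a rich text control, where `editor` is a hypothetical element id.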

Both of these attributes, designMode and contentEditable, were originally designed and implemented by Microsoft in Windows Internet Explorer (5.5, to be exact). There was some superficial documentation on how to use them (so developers could develop rich text editors), but little thought of interoperability. So there was no documentation of the nitty-gritty details: exactly what markup is generated when you press ENTER right here, or what the DOM looks like as you backspace your way through a start tag. Much of this sort of information was later reverse-engineered, and cross-browser support for basic operations is actually quite good. (Browsers still vary widely on the details.) The designMode and contentEditable attributes, and the APIs that drive rich text editors, are implemented in all major browsers, including Firefox, Opera, Safari, Google Chrome, and of course Internet Explorer.

How does it work?

Mark Finkle wrote a nice high-level summary of designMode, and later added a post about contentEditable once it appeared in the Firefox 3 alphas. (That was back in 2007.) Quoting Mark:

Mozilla has a rich text editing system (called Midas) and an API similar to Internet Explorer's. Mozilla, like Internet Explorer, supports the ability to make an entire document editable by setting the designMode property of the document object. Once in design mode, the document can be manipulated using various DHTML commands.

... Firefox 3 is expanding its rich WYSIWYG editing capabilities by adding support for the contentEditable attribute. Setting contentEditable to "true" allows you to make parts of a document editable. ...

The API for interacting with the document is:

document.execCommand
Executes the given command.

document.queryCommandEnabled
Determines whether the given command can be executed on the document in its current state.

document.queryCommandIndeterm
Determines whether the current selection is in an indetermined state.

document.queryCommandState
Determines whether the given command has been executed on the current selection.

document.queryCommandValue
Determines the current value of the document, range, or current selection for the given command.

Once you have an editable document (designMode) or element (contentEditable), you use this set of API calls to issue "commands" on the editable region, and to query the current state of the region. Commands are things like "bold," "italic," "underline," "create a link," "change foreground color," and so on -- all the commands you would expect from a rich text editor. Here's a test page with 36 commands.
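As a rough illustration of how a toolbar might drive these calls, here is a small sketch. The helper names `runCommand` and `activeCommands` are invented for this example; the `doc` argument is assumed to be a document that is editable or that contains a focused editable element.

```javascript
// Sketch only: runCommand and activeCommands are hypothetical helpers.

// Issue an editing command, but only if the browser reports that the
// command is available in the document's current state.
function runCommand(doc, command, value) {
  if (!doc.queryCommandEnabled(command)) {
    return false; // not applicable right now (e.g. nothing editable focused)
  }
  // The second argument (showDefaultUI) is false: no browser-supplied dialog.
  return doc.execCommand(command, false, value === undefined ? null : value);
}

// Report which of the given commands are "on" for the current selection,
// for example to highlight the matching toolbar buttons.
function activeCommands(doc, commands) {
  return commands.filter(function (command) {
    return doc.queryCommandState(command);
  });
}
```

A bold button might call `runCommand(document, 'bold')`, and a link button might call `runCommand(document, 'createLink', url)` with a URL collected by the page's own UI. Which commands actually work, and what markup they produce, still varies by browser.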

In other words, "supporting the contentEditable attribute" is really just the tip of the iceberg. The real compatibility story is written in the commands which are passed to the document.execCommand() function. So which browsers support which commands?

As you can see from Peter's chart, basic stuff like bold, italic, creating links, and changing colors is well-supported across browsers. After that, the compatibility story gets hairy.

A brief and extremely biased timeline of standardization

Reverse-engineering and standardizing contentEditable and its associated APIs was one of the original goals of the WHAT working group, as part of (what at the time was called) "Web Applications 1.0" and is now known as "HTML 5."

July 2000. Microsoft releases Internet Explorer 5.5, with support for designMode and contentEditable. Other than some sparse documentation on MSDN, no further details were given.
November 2004. Ian Hickson: [whatwg] Re: several messages. "One set of the ideas that was brought up in this forum was the ability to extend <textarea> to support syntax highlighting, or WYSIWYG editing of BB code markup, or just the ability to do rich text editing of any kind. Having considered all the suggestions, the only thing I could really see as being realistic would be to do something similar to (and ideally compatible with) IE's 'contentEditable' and 'designMode' attributes."
April 2005. Olav Junker Kjær: "I have been thinking about HTML editing, which I think is a critical feature."
July 2005. Anne van Kesteren: begins to reverse-engineer Microsoft's contentEditable implementation.
July 2005. Anne van Kesteren: [whatwg] [html5] contenteditable specification. The first message points to annevankesteren.nl/projects/whatwg/spec, which (somewhat surprisingly) still exists. An unfinished but fascinating piece of reverse engineering. (Most of this is now merged into the HTML 5 spec, in the Command APIs section.)
July 2005. Anne van Kesteren: [whatwg] [html5] contenteditable specification (again)
July 2005. Matthew Raymond: [whatwg] [WF2] Readonly and default pseudoclass matching. Discussion of how contentEditable interacts with proposed (and now standardized) pseudo-classes like :read-only.
August 2005. dolphinling: [whatwg] What exactly is contentEditable for?. Other than the obvious answer, "rich text editing," the discussion focused on how modified content is/could be/should be submitted to the server, and whether it could be implemented without scripting.
September 2005. Ian Hickson: [whatwg] Status update. "Web Apps 1, a.k.a. 'HTML5'. Still very much work in progress. ... The main things that need fleshing out are: (1) contentEditable: at least to a state comparable to WinIE, although we might want to add some of the new features being requested on this list like declarative association with a form."
September 2005. Matthew Raymond: [whatwg] Re: Simple template-based editing. Discussion of a (never-implemented) <htmlarea> element.
June 2006. Lachlan Hunt: [whatwg] Spellchecking proposal #2. Discussion of how contentEditable interacts with the spellcheck attribute. See also: Ian Hickson's feedback, with examples.
July 2006. Anne van Kesteren: [whatwg] contenteditable="" needs fixing. (This later led to these test cases.)
January 2007. Simon Pieters: [whatwg] contenteditable, <em> and <strong>
January 2007. Anne van Kesteren: [whatwg] contenteditable="" values and isContentEditable
February 2007. Ian Hickson: "I have added it to the now-written designMode section."
April 2007. David Hyatt: "The use case of being able to drop images into a contenteditable region and have them show up as <img /> elements at the appropriate place and then get automatically uploaded somewhere is a really compelling one." See also: Ian Hickson's feedback, 19 months later.
November 2008. Ian Hickson: "contentEditable is being made more consistent, but there's still no non-scripted solution, since nobody can agree on what elements to allow."
December 2008. Ian Hickson: [whatwg] <input placeholder="">. Discussion of whether the placeholder attribute should also apply to contentEditable regions and designMode documents in an <iframe>.

Conclusion

The original use case for contentEditable -- building rich text editors -- is alive and well on the web. Cross-browser compatibility can be charitably described as "evolving," but we are long past the point where "only IE can do that fancy rich-text stuff." Standardization through the WHATWG has shaken out numerous interoperability bugs and led to thoughtful consideration of a wide variety of edge cases. Most of these benefits had to be realized through reverse engineering, rather than cooperation, but the work has been done and the web is better for it.

The Road to HTML 5: spellchecking

Wednesday, March 4th, 2009

Welcome back to my semi-regular column, "The Road to HTML 5," where I'll try to explain some of the new elements, attributes, and other features in the upcoming HTML 5 specification.

The feature of the day is spell checking, by which I mean client-side in-browser checking of text in standard <textarea> and <input type=text> elements. Several browsers support this out-of-the-box, including Firefox 2 and 3, Safari 3, Opera 9, and Google Chrome. However, each browser has different defaults of which elements get spell-checked, and only a handful allow the web author to suggest whether browsers should offer checking on a particular element.

In this article:

A brief history of the spellcheck attribute
Examples
Browser support
Detecting support for the spellcheck attribute
Conclusion

A brief history of the `spellcheck` attribute

That last bit, by the way, is why this is relevant to HTML 5. Browser features are interesting, but are mostly outside the purview of spec-land. But the idea of a markup hint to suggest turning spell-checking on or off has been bounced around for years. To wit:

May 2006: Mozilla bug 339127 - Provide a way for a web page to enable/disable spell checking on a given field. Brett Wilson outlines the thinking, and a potential algorithm, for using the accept attribute to trigger spell-checking.
May 2006: Ian Hickson mentions <input type="text" accept=""> on the WHATWG mailing list, triggering a long discussion (continued in June archives). This discussion resulted in...
June 2006: Spellchecking proposal #2, which argued against the more-general accept attribute and in favor of a more-specific spellcheck attribute. More discussion ensued, which led to...
June 2006: Spellchecking mark III, which, unsurprisingly, led to even more discussion, but no resolution.
December 2008: "the [spellcheck] attribute has seen very little interest outside of Google ... I have therefore not added this feature to HTML5 for the time being. If there is more interest in this feature, please speak up." Anne van Kesteren (Opera) immediately replied, "Opera wants to support this feature as well in due course." Maciej Stachowiak (Apple) stated, "WebKit by default spellchecks (and grammar checks) all editable parts of the document, and it is not obvious to me why one would want to force it off for particular form controls or editable HTML areas." More vigorous discussion (continued in January 2009 archives).
February 2009: Ian Hickson announces, "I have added spellcheck="" to the spec." Followups mostly focus on what the attribute should be called, and what values it should take. (More on this in a minute.)
February 26, 2009: "Based on the interest (not uniform interest, but interest nonetheless) on this topic, I've left the feature in the spec." Yeah, February 26th -- that was last week. So don't in any way consider this the final word on the subject.

Examples

Getting down to the technical details, the spellcheck attribute is a bit of an oddball. Most boolean attributes (such as <option selected>) are false if they are absent, true if they are present, and true if they are present with a value the same as the attribute name (e.g. <option selected=selected>). The spellcheck attribute is not like that; instead, it requires an attribute value of either true or false.

So this is valid:

<textarea spellcheck="true">

And this is valid:

<textarea spellcheck="false">

But this is not valid:

<textarea spellcheck>

Browser support

Browser support is currently... limited.

Markup	Firefox 3.0.6	Google Chrome 1.0.154.48	Safari 3.2.1	Opera 9.62
`<input type=text>`	offer on right-click	no check	check as you type	offer on right-click
`<input type=text spellcheck=true>`	check as you type	no check	check as you type	offer on right-click
`<input type=text spellcheck=false>`	offer on right-click	no check	check as you type	offer on right-click
`<input type=text spellcheck>` invalid	offer on right-click	no check	check as you type	offer on right-click
`<input type=text spellcheck=spellcheck>` invalid	offer on right-click	no check	check as you type	offer on right-click
`<input type=text spellcheck=on>` invalid	offer on right-click	no check	check as you type	offer on right-click
`<input type=text spellcheck=off>` invalid	offer on right-click	no check	check as you type	offer on right-click
`<textarea>`	check as you type	check as you type	check as you type	offer on right-click
`<textarea spellcheck=true>`	check as you type	check as you type	check as you type	offer on right-click
`<textarea spellcheck=false>`	offer on right-click	check as you type	check as you type	offer on right-click
`<textarea spellcheck>` invalid	check as you type	check as you type	check as you type	offer on right-click
`<textarea spellcheck=spellcheck>` invalid	check as you type	check as you type	check as you type	offer on right-click
`<textarea spellcheck=on>` invalid	check as you type	check as you type	check as you type	offer on right-click
`<textarea spellcheck=off>` invalid	check as you type	check as you type	check as you type	offer on right-click

In other words:

In the absence of the spellcheck attribute, Firefox offers as-you-type spellcheck on <textarea> elements but not <input type=text> elements. It treats the spellcheck attribute with a true or false value as a signal to offer as-you-type spellcheck (or turn it off where it defaults to on). All invalid markup variations are ignored, in the sense that they do not change Firefox's per-element-type defaults. It lets the user turn spellcheck on and off on a per-element basis, which overrides both the spellcheck attribute and the browser's per-element-type defaults.
Google Chrome offers as-you-type spellcheck on <textarea> elements but not <input type=text> elements. It ignores the spellcheck attribute entirely. It does not offer the end user the option to change the default behavior or manually check individual fields.
Safari 3 offers as-you-type spellcheck on <textarea> and <input type=text> elements. It ignores the spellcheck attribute entirely. It allows the user to toggle as-you-type spellcheck globally, which immediately affects all elements of all types. It does not offer the end user the option to change the default behavior or manually check individual fields.
Opera 9 offers spellcheck from the context menu on <textarea> and <input type=text> elements. It ignores the spellcheck attribute entirely. It does not offer as-you-type spellcheck.

Detecting support for the `spellcheck` attribute

Browsers that support the spellcheck attribute will always reflect the attribute in the .spellcheck property of the element's DOM node, even if the spellcheck attribute does not appear in the page markup. You can use this to construct a simple test to check whether the browser supports the spellcheck attribute:

if ('spellcheck' in document.createElement('textarea')) {
  alert('browser supports spellcheck attribute');
} else {
  alert('browser does not support spellcheck attribute');
}

This will pop up an alert stating "browser supports spellcheck attribute" in Firefox 2 and 3, or an alert stating "browser does not support spellcheck attribute" in Safari 3, Opera 9, Google Chrome, and Internet Explorer.

Note: Internet Explorer will reflect any attribute present in the page markup. If you include a spellcheck attribute on an element and then test whether that element's DOM node contains a .spellcheck property, IE will always return true. The safest way to check is to create a new element in script, like the example above, instead of testing a pre-existing element on your page.

Conclusion

You can start using the spellcheck attribute today, but it only affects the behavior of Firefox. However, it has no adverse effects in other browsers. Be sure to use either spellcheck="true" or spellcheck="false", as these are the only values supported by Firefox (and the only valid values according to the HTML 5 spec as it stands today).


This ~~Week~~ Day in HTML 5 – Episode 23

Friday, February 27th, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group. The pace of HTML 5 changes has reached a fever pitch, so I'm going to split out these episodes into daily (!) rather than weekly summaries until things calm down.

The big news for February 12, 2009 is the minting of the spellcheck attribute, which web authors can use to provide a hint about whether a particular form field expects the sort of input that would benefit from client-side spell checking. r2801 lays it out:

User agents can support the checking of spelling and grammar of editable text, either in form controls (such as the value of textarea elements), or in elements in an editing host (using contenteditable).

For each element, user agents must establish a default behavior, either through defaults or through preferences expressed by the user. There are three possible default behaviors for each element:

true-by-default

The element will be checked for spelling and grammar if its contents are editable.

false-by-default

The element will never be checked for spelling and grammar.

inherit-by-default

The element's default behavior is the same as its parent element's. Elements that have no parent element cannot have this as their default behavior.

The spellcheck attribute is an enumerated attribute whose keywords are true and false. The true keyword maps to the true state. The false keyword maps to the false state. In addition, there is a third state, the inherit state, which is the missing value default (and the invalid value default).

Starting with version 2, Mozilla Firefox has offered built-in spell checking of <textarea> elements (on by default) and <input type=text> elements (off by default). You can change the default behavior by setting the spellcheck attribute. (test case)
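The three attribute states quoted from r2801 can be sketched as a small mapping function. This is an illustration under stated assumptions, not the spec's own algorithm: the name `spellcheckState` is invented, and the case-insensitive keyword comparison is an assumption that the quoted text does not spell out.

```javascript
// Sketch only: spellcheckState is a hypothetical helper.
// `value` is the raw attribute string, or null when the attribute is absent.
function spellcheckState(value) {
  if (value === null || value === undefined) {
    return 'inherit'; // missing value default
  }
  // Assumption: keywords are matched ASCII case-insensitively.
  var keyword = String(value).replace(/[A-Z]/g, function (c) {
    return c.toLowerCase();
  });
  if (keyword === 'true') return 'true';
  if (keyword === 'false') return 'false';
  return 'inherit'; // invalid value default: "", "on", "off", "spellcheck", ...
}
```

Under this reading, invalid values such as `spellcheck=on` land in the same inherit state as a missing attribute, so the browser's own defaults apply.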

The other big news of the day is the addition of the <form autocomplete> attribute, which lets web authors provide a hint about whether they would like browsers to save the form's contents and pre-fill the form the next time the user encounters it. r2798:

When an input element's resulting autocompletion state is on, the user agent may store the value entered by the user so that if the user returns to the page, the UA can prefill the form. Otherwise, the user agent should not remember the control's value.

... A user agent may allow the user to override the resulting autocompletion state and set it to always on, always allowing values to be remembered and prefilled, or always off, never remembering values. However, the ability to override the resulting autocompletion state to on should not be trivially accessible, as there are significant security implications for the user if all values are always remembered, regardless of the site's preferences.

<form autocomplete> is commonly used on sensitive login forms where the site does not want users to be able to store their password in their browser (which is generally done in an insecure way). Most browsers honor these hints by default, although there are ways to override them if you dislike the idea of web authors disabling useful bits of your browser's functionality.
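Here is a rough sketch of how a "resulting autocompletion state" might be computed. It rests on assumptions the quoted text does not state: that an individual field can carry its own autocomplete value which takes priority over the form's, and that the state is on when neither says otherwise. The function names are invented for illustration.

```javascript
// Sketch only: both helpers are hypothetical.

// Map a raw autocomplete attribute value to "on", "off", or "default".
// null/undefined means the attribute is absent.
function autocompleteKeyword(value) {
  if (value === null || value === undefined) return 'default';
  var keyword = String(value).replace(/[A-Z]/g, function (c) {
    return c.toLowerCase();
  });
  return keyword === 'on' || keyword === 'off' ? keyword : 'default';
}

// Assumption: the field's own value wins; otherwise the form's value
// applies; otherwise the state is "on".
function resultingAutocompletionState(inputValue, formValue) {
  var own = autocompleteKeyword(inputValue);
  if (own !== 'default') return own;
  return autocompleteKeyword(formValue) === 'off' ? 'off' : 'on';
}
```

With this sketch, a login form marked `autocomplete="off"` yields `'off'` for every field that does not set its own value, which is the case where the user agent should not remember what was typed.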

Other interesting changes of the day:

r2802 allows external JavaScript files to contain a BOM to facilitate identifying scripts in non-ASCII-compatible character encodings.
r2796 adds some examples of using the unloved <small> element.

Discussion of the day: Gregory J. Rosmaita gives details on report of PFWG HTML5 actions ("PFWG" = Protocols and Formats Working Group). The original post was about accessibility issues, specifically a response to the <img alt> attribute becoming optional and the omission of the headers and summary attributes in the HTML 5 table model. But the thread was quickly hijacked by a discussion of the fact that the W3C published another working draft of HTML 5 on February 12.

Wait... what? Oh yes, in true "burying the lede" fashion, I suppose I should mention that the biggest news of February 12th is that the W3C published another working draft of HTML 5. Except that readers of this series will find it uninteresting, since it's just a snapshot of the progress-to-date. (The spec is "published" on whatwg.org every time it changes anyway.) Working drafts have no formal status; they are merely intended to encourage early and wide review. Still, the rest of the world might think it's important, so be sure to bring it up at this weekend's cocktail parties.

Tune in... well, sometime soon-ish for another exciting episode of "This ~~Week~~ Day In HTML 5."


This ~~Week~~ Day In HTML 5 – Episode 22

Wednesday, February 25th, 2009

The big news for February 11, 2009 is the addition of an algorithm to parse a color in an IE-compatible way. r2776 lays it all out:

Some obsolete legacy attributes parse colors in a more complicated manner, using the rules for parsing a legacy color value, which are given in the following algorithm. When invoked, the steps must be followed in the order given, aborting at the first step that returns a value. This algorithm will either return a simple color or an error.

Let input be the string being parsed.

If input is the empty string, then return an error.

If input is an ASCII case-insensitive match for the string "transparent", then return an error.

If input is an ASCII case-insensitive match for one of the keywords listed in the SVG color keywords or CSS2 System Colors sections of the CSS3 Color specification, then return the simple color corresponding to that keyword. [CSS3COLOR]

If input is four characters long, and the first character in input is a U+0023 NUMBER SIGN (#) character, and the last three characters of input are all in the range U+0030 DIGIT ZERO (0) .. U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A .. U+0046 LATIN CAPITAL LETTER F, and U+0061 LATIN SMALL LETTER A .. U+0066 LATIN SMALL LETTER F, then run these substeps:

Let result be a simple color.

Interpret the second character of input as a hexadecimal digit; let the red component of result be the resulting number multiplied by 17.

Interpret the third character of input as a hexadecimal digit; let the green component of result be the resulting number multiplied by 17.

Interpret the fourth character of input as a hexadecimal digit; let the blue component of result be the resulting number multiplied by 17.

Return result.

Replace any characters in input that have a Unicode codepoint greater than U+FFFF (i.e. any characters that are not in the basic multilingual plane) with the two-character string "00".

If input is longer than 128 characters, truncate input, leaving only the first 128 characters.

If the first character in input is a U+0023 NUMBER SIGN character (#), remove it.

Replace any character in input that is not in the range U+0030 DIGIT ZERO (0) .. U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A .. U+0046 LATIN CAPITAL LETTER F, and U+0061 LATIN SMALL LETTER A .. U+0066 LATIN SMALL LETTER F with the character U+0030 DIGIT ZERO (0).

While input's length is zero or not a multiple of three, append a U+0030 DIGIT ZERO (0) character to input.

Split input into three strings of equal length, to obtain three components. Let length be the length of those components (one third the length of input).

If length is greater than 8, then remove the leading length-8 characters in each component, and let length be 8.

While length is greater than two and the first character in each component is a U+0030 DIGIT ZERO (0) character, remove that character and reduce length by one.

If length is still greater than two, truncate each component, leaving only the first two characters in each.

Let result be a simple color.

Interpret the first component as a hexadecimal number; let the red component of result be the resulting number.

Interpret the second component as a hexadecimal number; let the green component of result be the resulting number.

Interpret the third component as a hexadecimal number; let the blue component of result be the resulting number.

Return result.
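The steps quoted above can be sketched in script roughly as follows. This is an illustration, not a browser's implementation: the function name `parseLegacyColor` is invented, errors are represented as `null`, and the keyword table is a three-entry stand-in for the much longer SVG and CSS2 system color lists.

```javascript
// Assumption: a tiny stand-in for the real keyword tables.
var NAMED_COLORS = {
  black: { r: 0, g: 0, b: 0 },
  white: { r: 255, g: 255, b: 255 },
  red: { r: 255, g: 0, b: 0 }
};

// Returns { r, g, b }, or null where the algorithm says "return an error".
function parseLegacyColor(input) {
  if (input === '') return null;

  // ASCII-only lowercasing, so only A-Z are folded.
  var lower = input.replace(/[A-Z]/g, function (c) { return c.toLowerCase(); });
  if (lower === 'transparent') return null;
  if (Object.prototype.hasOwnProperty.call(NAMED_COLORS, lower)) {
    var named = NAMED_COLORS[lower];
    return { r: named.r, g: named.g, b: named.b };
  }

  // "#rgb" shorthand: each digit is multiplied by 17 (0xF * 17 = 255).
  if (/^#[0-9A-Fa-f]{3}$/.test(input)) {
    return {
      r: parseInt(input.charAt(1), 16) * 17,
      g: parseInt(input.charAt(2), 16) * 17,
      b: parseInt(input.charAt(3), 16) * 17
    };
  }

  // Characters outside the basic multilingual plane become "00".
  var s = Array.from(input, function (ch) {
    return ch.codePointAt(0) > 0xFFFF ? '00' : ch;
  }).join('');

  if (s.length > 128) s = s.slice(0, 128);
  if (s.charAt(0) === '#') s = s.slice(1);
  s = s.replace(/[^0-9A-Fa-f]/g, '0'); // non-hex characters become "0"
  while (s.length === 0 || s.length % 3 !== 0) s += '0';

  // Split into three equal components.
  var length = s.length / 3;
  var parts = [s.slice(0, length), s.slice(length, 2 * length), s.slice(2 * length)];

  if (length > 8) {
    // Keep only the last 8 characters of each component.
    parts = parts.map(function (p) { return p.slice(length - 8); });
    length = 8;
  }
  // Strip shared leading zeros, but never below two characters.
  while (length > 2 && parts.every(function (p) { return p.charAt(0) === '0'; })) {
    parts = parts.map(function (p) { return p.slice(1); });
    length -= 1;
  }
  if (length > 2) {
    parts = parts.map(function (p) { return p.slice(0, 2); });
  }

  return {
    r: parseInt(parts[0], 16),
    g: parseInt(parts[1], 16),
    b: parseInt(parts[2], 16)
  };
}
```

The replace-with-zero and truncate steps are what turn nonsense values into colors: under this sketch, `parseLegacyColor('chucknorris')` pads and splits the string into `c00c`, `0000`, `0000`, truncates each to two characters, and returns red 192, green 0, blue 0.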

Information on exactly which attributes are subject to this algorithm is scattered throughout the spec. Here is the complete list:

<font color>
<frame bordercolor>
<frameset bordercolor>
<hr color>
<table bgcolor>
<thead bgcolor>
<tfoot bgcolor>
<tbody bgcolor>
<tr bgcolor>
<td bgcolor>
<th bgcolor>
<body text>
<body link>
<body vlink>
<body alink>
<body bgcolor>

The other big news today is the addition of a section on matching HTML elements using selectors. Some of these (:link, :visited, :active) will be familiar to anyone who has written a CSS stylesheet, but there are a number of new selectors that correspond to concepts introduced in HTML 5.

:link and :visited match hyperlinks (<a>, <area>, and <link> elements with an href attribute).
:active matches certain elements while they are being activated, like a button between mousedown and mouseup (or keydown and keyup).
:enabled and :disabled match hyperlinks and certain other elements that can be disabled, like form fields.
:checked matches checkboxes and radio buttons.
:indeterminate matches checkboxes in the indeterminate state.
:default matches default buttons in forms.
:valid and :invalid match form fields that have constraints.
:in-range and :out-of-range match form fields that have range-based constraints (i.e. they can either overflow or underflow).
:required and :optional match certain form fields.
:read-write matches editable form fields and other editable elements, and :read-only matches any element that is not read-write.

Other interesting changes of the day:

r2784 defines the rendering rules for the <area> element.
r2789 changes the list of allowable separators of coordinates in the coords attribute of an <area> element. [discussion, test suite]
r2777, r2779, and r2781 make a variety of tweaks to the non-normative rendering section that I covered in episodes 20 and 21.

Discussion of the day: What's the problem? "Reuse of 1998 XHTML namespace is potentially misleading/wrong". Take it away, Lachlan:

I believe the issue is that the XHTML2 WG think they have change control over that namespace URI and that we shouldn't be using it. Additionally, the latest XHTML 2 editor's draft is now using the namespace.

This issue has been discussed in depth around mid 2007. The problem is that XHTML5 and XHTML2 are completely incompatible with each other and they cannot possibly use the same namespace as each other.

But XHTML2 also has several major incompatibilities with XHTML1, which would effectively make it impossible to implement both XHTML 1.x and 2 in the same implementation, if they share the same namespace. XHTML 5, on the other hand, has not only been designed with compatibility in mind, success is dependent upon continuing to use the same namespace.

Basically, the only solution to this issue that should be considered is that we continue using the namespace and the XHTML2 WG use a different namespace.

I'm sure that will go over well with the 12 people who are still working on XHTML 2.

Tune in... well, sometime soon-ish for another exciting episode of "This ~~Week~~ Day In HTML 5."


The Road to HTML 5: character encoding

Friday, February 13th, 2009

Welcome back to my semi-regular column, "The Road to HTML 5," where I'll try to explain some of the new elements, attributes, and other features in the upcoming HTML 5 specification.

The feature of the day is character encoding, specifically how to determine the character encoding of an HTML document. I am never happier than when I am writing about character encoding. But first, here is my standard "elevator pitch" description of what character encoding is:

When you think of "text," you probably think of "characters and symbols I see on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text," you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).

— source

And once again, I'll repeat my standard set of background links for those of you who don't know anything about character encoding. You must read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) You should read Tim Bray's three-part series, On the Goodness of Unicode, On Character Strings, and Characters vs. Bytes, and anything written by Martin Dürst.

I should also point out that you should always specify a character encoding on every HTML page you serve. Not specifying an encoding can lead to security vulnerabilities.

So, how does your browser actually determine the character encoding of the stream of bytes that a web server sends? If you're familiar with HTTP headers, you may have seen a header like this:

Content-Type: text/html; charset="utf-8"

Briefly, this says that the web server thinks it's sending you an HTML document, and that it thinks the document uses the UTF-8 character encoding. Unfortunately, in the whole magnificent soup of the world wide web, very few authors actually have control over their HTTP server. Think Blogger: the content is provided by individuals, but the servers are run by Google. So HTML 4 provided a way to specify the character encoding in the HTML document itself. You've probably seen this too:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Briefly, this says that the web author thinks they have authored an HTML document using the UTF-8 character encoding. Now, you could easily imagine a situation where both the server and the document provide encoding information. Furthermore, they might not match (especially if they're run by different people). So which one wins? Well, there's a precedence order in case the document is served with conflicting information.

This is what HTML 4.01 has to say about the precedence order for determining the character encoding:

User override (e.g. the user picked an encoding from a menu in their browser).
An HTTP "charset" parameter in a "Content-Type" field.
A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
The charset attribute set on an element that designates an external resource.
Unspecified heuristic analysis.

And this is what HTML 5 has to say about it. I won't quote the whole thing here, but suffice it to say it's a 7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while. The gist of it is:

User override.
An HTTP "charset" parameter in a "Content-Type" field.
A Byte Order Mark before any other data in the HTML document itself.
A META declaration with a "charset" attribute.
A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
Unspecified heuristic analysis.
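The list above can be sketched as a first-match-wins lookup. This is only an illustration of the gist, not the specification's actual algorithm: the function names, the parameters, and the windows-1252 fallback that stands in for "unspecified heuristic analysis" are all assumptions made for the example.

```python
import codecs

# Byte order marks this sketch recognizes, with the encoding each implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(data, user_override=None, http_charset=None,
                    meta_charset=None, meta_http_equiv=None,
                    fallback="windows-1252"):
    """Pick the first available source, in the order listed above."""
    candidates = [
        user_override,     # user picked an encoding from a menu
        http_charset,      # charset parameter of the HTTP Content-Type
        sniff_bom(data),   # BOM before any other data
        meta_charset,      # <meta charset=...>
        meta_http_equiv,   # <meta http-equiv="Content-Type" ...>
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    # Placeholder for heuristic analysis (an assumption of this sketch).
    return fallback


print(choose_encoding(b"\xef\xbb\xbf<html>", meta_charset="iso-8859-1"))
```

In that last call the BOM outranks the meta declaration, so the sketch reports utf-8.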

...and then...

Normalize the given character encoding string according to the Charset Alias Matching rules defined in Unicode Technical Standard #22.
Override some problematic encodings, i.e. intentionally treat some encodings as if they were different encodings. The most common override is treating US-ASCII and ISO-8859-1 as Windows-1252, but there are several other encoding overrides listed in this table. As the specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."
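Those two post-processing steps might look something like the following sketch. The normalization follows the three Charset Alias Matching rules of UTS #22 (drop everything except ASCII letters and digits, lowercase, then delete each 0 not preceded by a digit). The override table holds only the two overrides named above; the specification's table is longer, and the function names are hypothetical.

```python
import re


def normalize_label(label):
    """Apply the UTS #22 alias matching rules to an encoding label."""
    # Rules 1 and 2: keep only ASCII letters and digits, then lowercase.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Rule 3: left to right, delete each "0" not preceded by a digit.
    out = []
    for ch in cleaned:
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


# Partial table: only the overrides mentioned in the text.
OVERRIDES = {
    normalize_label("US-ASCII"): "windows-1252",
    normalize_label("ISO-8859-1"): "windows-1252",
}


def effective_encoding(declared):
    """Return the encoding to actually decode with."""
    return OVERRIDES.get(normalize_label(declared), declared)


print(effective_encoding("iso_8859-1"))
print(effective_encoding("UTF-8"))
```

Normalizing first means spelling variants like iso_8859-1 and ISO-8859-1 hit the same table entry, while labels with no override pass through unchanged.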

Two things should leap out at you here. First, WTF is a <meta charset> attribute? Well, it's exactly what it sounds like. It looks like this:

<meta charset=UTF-8>

I was able to find only scattered discussion about this attribute on the WHATWG mailing list.

March 2006: Internal character encoding declaration, specifically this post by Lachlan Hunt which laid out the requirements for "paving the cowpaths" of common authoring mistakes.
June 2007: Internal character encoding declaration, Drop UTF-32, and UTF and BOM terminology

The best explanation of the new <meta charset> attribute was given a few months later, in an unrelated thread, on a separate mailing list. Andrew Sidwell explains:

The rationale for the <meta charset=""> attribute combination is that UAs already implement it, because people tend to leave things unquoted, like:

<META HTTP-EQUIV=Content-Type CONTENT=text/html; charset=ISO-8859-1>

(There are even a few <meta charset> test cases if you don't believe that browsers already do this.)
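You can watch the same thing happen outside a browser. The sketch below feeds that unquoted tag to Python's standard html.parser module, which is not a browser parser but ends an unquoted attribute value at whitespace in the same way; the MetaCollector class is a name invented for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every meta start tag."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "meta":
            self.metas.append(dict(attrs))


parser = MetaCollector()
parser.feed("<META HTTP-EQUIV=Content-Type CONTENT=text/html; charset=ISO-8859-1>")

attrs = parser.metas[0]
print(sorted(attrs))         # attribute names the parser found
print(attrs.get("charset"))  # the accidental charset attribute
```

Because the space after the semicolon ends the unquoted content value, charset=ISO-8859-1 is parsed as an attribute of its own, which is the behavior the new attribute standardizes.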

Second, who the f*** does the WHATWG think they are, specifying "a willful violation of the W3C Character Model specification"‽ This is a fair question. As with many such questions, the answer is that HTML 5 is only codifying what browsers do already. ISO-8859-1 and Windows-1252 are very similar encodings. One place they differ is in so-called "smart quotes" and "curly apostrophes": the pretty little typographical flourishes that authors love and that Microsoft Word (and many other editors) output by default. Many authors specify an ISO-8859-1 or US-ASCII encoding (because they copied that part of their template from somewhere else), but then they use curly quotes from the Windows-1252 encoding. This mistake is so widespread that browsers already treat ISO-8859-1 as Windows-1252. HTML 5 is just "paving the cowpaths" here.
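If you want to see the difference for yourself, here is a small Python sketch. The byte string is a made-up example of what an editor saving in Windows-1252 would write for curly quotes and a curly apostrophe.

```python
# Bytes 0x93, 0x94, and 0x92 are curly punctuation in Windows-1252.
raw = b"\x93Hello\x94 it\x92s"

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("windows-1252")

# ISO-8859-1 maps those bytes to invisible C1 control characters.
print([hex(ord(ch)) for ch in as_latin1 if ord(ch) > 0x7F])
# Windows-1252 maps them to the quotes the author meant.
print([hex(ord(ch)) for ch in as_cp1252 if ord(ch) > 0x7F])
```

Decoded strictly as ISO-8859-1, the quotes turn into control characters that render as nothing (or as garbage); decoded as Windows-1252, they come out as U+201C, U+201D, and U+2019.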

To sum up: character encoding is complicated, and it has not been made any easier by several decades of poorly written software used by copy-and-paste–educated authors. You should always specify a character encoding on every HTML document, or bad things will happen. You can do it the hard way (HTTP Content-Type header), the easy way (<meta http-equiv> declaration), or the new way (<meta charset> attribute), but please do it. The web thanks you.

