
The Road to HTML 5: character encoding

by Mark Pilgrim, Google in Tutorials

Welcome back to my semi-regular column, "The Road to HTML 5," where I'll try to explain some of the new elements, attributes, and other features in the upcoming HTML 5 specification.

The feature of the day is character encoding, specifically how to determine the character encoding of an HTML document. I am never happier than when I am writing about character encoding. But first, here is my standard "elevator pitch" description of what character encoding is:

When you think of "text," you probably think of "characters and symbols I see on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text," you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
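To make that concrete, here is a minimal Python sketch (the sample character is just an illustration, not something from a real page) showing one character stored as two different byte sequences, and what happens when the bytes are decoded with the wrong "key":

```python
# The same character can be stored as different byte sequences,
# depending on the character encoding.
text = "é"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'

# Decoding with the encoding that was used to encode gets the text back.
assert utf8_bytes.decode("utf-8") == text

# Decoding the UTF-8 bytes as ISO-8859-1 yields the wrong characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # Ã©
```

Nothing in the bytes themselves says which decoding is right, which is why the encoding has to be declared somewhere.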


And once again, I'll repeat my standard set of background links for those of you who don't know anything about character encoding. You must read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). You should read Tim Bray's three-part series, On the Goodness of Unicode, On Character Strings, and Characters vs. Bytes, and anything written by Martin Dürst.

I should also point out that you should always specify a character encoding on every HTML page you serve. Not specifying an encoding can lead to security vulnerabilities.

So, how does your browser actually determine the character encoding of the stream of bytes that a web server sends? If you're familiar with HTTP headers, you may have seen a header like this:

Content-Type: text/html; charset="utf-8"

Briefly, this says that the web server thinks it's sending you an HTML document, and that it thinks the document uses the UTF-8 character encoding. Unfortunately, in the whole magnificent soup of the world wide web, very few authors actually have control over their HTTP server. Think Blogger: the content is provided by individuals, but the servers are run by Google. So HTML 4 provided a way to specify the character encoding in the HTML document itself. You've probably seen this too:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Briefly, this says that the web author thinks they have authored an HTML document using the UTF-8 character encoding. Now, you could easily imagine a situation where both the server and the document provide encoding information. Furthermore, they might not match (especially if they're run by different people). So which one wins? Well, there's a precedence order in case the document is served with conflicting information.
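As a rough illustration of such a mismatch, the sketch below pulls the charset parameter out of a Content-Type value using Python's standard email.message module. The same helper works for the HTTP header and for the content attribute of a <meta http-equiv> declaration, since both use the same syntax. The helper name and the ISO-8859-1 value are hypothetical, chosen only to show a conflict.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the value and drops any quotes.
    return msg.get_content_charset()


# What the server says (HTTP header) vs. what a hypothetical document says (META).
http_charset = charset_from_content_type('text/html; charset="utf-8"')
meta_charset = charset_from_content_type("text/html; charset=ISO-8859-1")

print(http_charset)                  # utf-8
print(meta_charset)                  # iso-8859-1
print(http_charset == meta_charset)  # False
```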

This is what HTML 4.01 has to say about the precedence order for determining the character encoding:

  1. User override (e.g. the user picked an encoding from a menu in their browser).
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
  4. The charset attribute set on an element that designates an external resource.
  5. Unspecified heuristic analysis.

And this is what HTML 5 has to say about it. I won't quote the whole thing here, but suffice it to say it's a 7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while. The gist of it is:

  1. User override.
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A Byte Order Mark before any other data in the HTML document itself.
  4. A META declaration with a "charset" attribute.
  5. A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
  6. Unspecified heuristic analysis.

...and then...

  1. Normalize the given character encoding string according to the Charset Alias Matching rules defined in Unicode Technical Standard #22.
  2. Override some problematic encodings, i.e. intentionally treat some encodings as if they were different encodings. The most common override is treating US-ASCII and ISO-8859-1 as Windows-1252, but there are several other encoding overrides listed in this table. As the specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."
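The gist above can be sketched in a few lines of Python. This is a deliberate simplification under stated assumptions, not the specification's algorithm: the function name is hypothetical, the META value is assumed to have been extracted already (steps 4 and 5 are collapsed into one argument), lowercasing stands in for the real alias-matching rules, the override table contains only the two entries named above, and heuristic analysis is left out entirely.

```python
import codecs

# Byte Order Marks that identify an encoding (a partial list, for illustration).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Partial override table: only the override mentioned in the text.
OVERRIDES = {"us-ascii": "windows-1252", "iso-8859-1": "windows-1252"}


def choose_encoding(body, user_override=None, http_charset=None, meta_charset=None):
    """Pick an encoding label using the simplified precedence order."""
    bom_charset = next((name for bom, name in BOMS if body.startswith(bom)), None)
    # The first source that supplies a value wins.
    for candidate in (user_override, http_charset, bom_charset, meta_charset):
        if candidate:
            # Crude stand-in for the real normalization rules.
            label = candidate.strip().lower()
            return OVERRIDES.get(label, label)
    return None  # a real browser would fall back to heuristic analysis


print(choose_encoding(b"<p>hi</p>", http_charset="ISO-8859-1"))
# windows-1252
print(choose_encoding(codecs.BOM_UTF8 + b"<p>hi</p>", meta_charset="koi8-r"))
# utf-8
```

In the second call the Byte Order Mark outranks the META declaration, so the document is treated as UTF-8 regardless of what the META value claims.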

Two things should leap out at you here. First, WTF is a <meta charset> attribute? Well, it's exactly what it sounds like. It looks like this:

<meta charset=UTF-8>

I was able to find only scattered discussion about this attribute on the WHATWG mailing list.

The best explanation of the new <meta charset> attribute was given a few months later, in an unrelated thread, on a separate mailing list. Andrew Sidwell explains:

The rationale for the <meta charset=""> attribute combination is that UAs already implement it, because people tend to leave things unquoted, like:

<META HTTP-EQUIV=Content-Type CONTENT=text/html; charset=ISO-8859-1>

(There are even a few <meta charset> test cases if you don't believe that browsers already do this.)
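You can watch this happen with a tolerant HTML tokenizer. The sketch below feeds the unquoted tag from the example above to Python's standard html.parser; the collector class is a hypothetical helper, and Python's parser is only a stand-in for what browsers do, not a copy of any browser's code.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record the attributes of the first <meta> tag seen."""

    def __init__(self):
        super().__init__()
        self.meta_attrs = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "meta" and self.meta_attrs is None:
            self.meta_attrs = dict(attrs)


parser = AttributeCollector()
parser.feed("<META HTTP-EQUIV=Content-Type CONTENT=text/html; charset=ISO-8859-1>")

for name, value in parser.meta_attrs.items():
    print(name, "=", value)
```

Because the content value is unquoted, it ends at the first space, so the tokenizer sees content as just "text/html;" and charset=ISO-8859-1 as a separate attribute of its own. A parser that wants to honor the author's intent has to look for a charset attribute on the META element.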

Second, who the f--- does the WHATWG think they are, specifying "a willful violation of the W3C Character Model specification"? This is a fair question. As with many such questions, the answer is that HTML 5 is only codifying what browsers do already. ISO-8859-1 and Windows-1252 are very similar encodings. One place they differ is in so-called "smart quotes" and "curly apostrophes": the pretty little typographical flourishes that authors love and that Microsoft Word (and many other editors) output by default. Many authors specify an ISO-8859-1 or US-ASCII encoding (because they copied that part of their template from somewhere else), but then they use curly quotes from the Windows-1252 encoding. This mistake is so widespread that browsers already treat ISO-8859-1 as Windows-1252. HTML 5 is just "paving the cowpaths" here.
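The difference is easy to see in Python. In this small sketch the sample bytes are made up for illustration; they are the Windows-1252 byte values for curly double quotes and a curly apostrophe.

```python
# Bytes that "smart quotes" occupy in Windows-1252.
raw = b"\x93Hello\x94 it\x92s"

as_windows_1252 = raw.decode("windows-1252")
as_iso_8859_1 = raw.decode("iso-8859-1")

print(as_windows_1252)       # “Hello” it’s
print(ascii(as_iso_8859_1))  # '\x93Hello\x94 it\x92s'
```

Decoded strictly as ISO-8859-1, those bytes become invisible control characters rather than quotation marks, which is almost never what the author meant.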

To sum up: character encoding is complicated, and it has not been made any easier by several decades of poorly written software used by copy-and-paste–educated authors. You should always specify a character encoding on every HTML document, or bad things will happen. You can do it the hard way (HTTP Content-Type header), the easy way (<meta http-equiv> declaration), or the new way (<meta charset> attribute), but please do it. The web thanks you.

11 Responses to “The Road to HTML 5: character encoding”

  1. I wonder why the HTTP header is preferred over the META tag. Just as you mentioned, many webmasters have the ability to change the second one, but not the first one.

    Even in shared hosting situations .htaccess is mostly taboo.

  2. Yay for subtle use of interrobang!

    Seriously though, good post. Character encoding is probably my most ignored web feature, so thanks for reminding me of its importance.

  3. I find the HTTP header approach is by far the easiest to set up. It’s a single line per text-based format in my Apache Server configuration file:

    AddType 'text/css; charset=utf-8' .css
    AddType 'text/html; charset=utf-8' .html .htm .phps

    What’s so hard about that? Compare it to me adding a <meta> element to every HTML page of my website and adding @charset to every stylesheet. HTTP ftw!

    Using the HTTP header is also the most straightforward for browsers to process. It tells them the charset before they start consuming the body of the response.

    Surely the HTTP header approach is what articles and examples should encourage when it comes to encoding documents on the web?

  4. Ben Millard said:

    AddType 'text/css; charset=utf-8' .css
    AddType 'text/html; charset=utf-8' .html .htm .phps

    What’s so hard about that?

    The ‘AddType’ configuration is not hard, but it requires access to the server configuration, which is, as vbence stated, not always an option.

    Furthermore it prevents you from using different charsets for different files. The line “AddType 'text/html; charset=utf-8' .html” requires you to encode each and every HTML document in UTF-8 without exception.

  5. [The ‘AddType’ configuration] prevents you from using different charsets for different files. The line “AddType 'text/html; charset=utf-8' .html” requires you to encode each and every HTML document in UTF-8 without exception.

    Not true: Apache offers powerful fine-grained control over which configuration directives apply to which files. See the Apache docs about <Directory>, <Files>, <Location> and their regular expression variants.

    But in fact, configuring Apache to add charset to its Content-Type headers is even easier with the AddCharset directive (part of Apache’s mod_mime):

    AddCharset UTF-8 .utf8

    With that line in your Apache config, every file with a name containing .utf8 (for example, article.html.utf8) will be served with the appropriate Content-Type header for UTF-8 character encoding. And since your @href URIs already don’t use names with any type extension (courtesy of mod_negotiation) – they don’t, do they..? –, you really don’t have to change anything anywhere except the file names and that one config line. Fire and forget.

    And what’s best, many Apache installations come pre-configured this way for many common character encodings. So it’s entirely possible that renaming your files is all you need to do!

  6. I have never understood why, when the document declares a different charset than the web server sets in the document’s header, any web browser would choose to believe the web server and not the document author. Using .htaccess files is not a best practice on a production web server (they carry a large performance penalty), and modifying the httpd.conf file (or equivalent) for every document added or updated seems ridiculous to me.

    The charset declared in the HTTP header should be a fall back default, used only if the document doesn’t supply a valid one.

    The proper place for defining this, it seems to me, would be the DOCTYPE declaration.

    The issue I have with charset is that you can only set one for an entire document. HTML 5 allows a charset attribute on tags that “import” content. But many single documents have parts that are editable or come from multiple sources. (Think of forms, blogs and forums, and mash-ups produced by, say, PHP, from multiple sources.) It may not be technically feasible, but charset should be an allowed attribute on EVERY block-level element. At the very least, on ARTICLE, DIV, SECTION, and the like.