Declaring the Character Encoding
HTML requires that authors declare the character encoding of the file, either using HTTP headers (when served over HTTP) or metadata in the file. In previous versions of HTML, authors could specify the character encoding using a relatively complex meta element like this:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
The idea of the http-equiv attribute was that it would act as a substitute for real HTTP headers; in this case it mimics a Content-Type: text/html; charset=UTF-8 response header. However, in practice, that is not entirely true: only a few headers actually have any effect in browsers. In fact, HTML4 even suggested that servers use this attribute to gather information for HTTP response message headers, but in reality, no known server ever did this.
Although the MIME type is included in the value for the Content-Type header above, it has no effect in browsers. The only useful and practical piece of information in that element is: charset=UTF-8.
In order to simplify the meta element and remove unnecessary markup, HTML5 has changed it slightly. The new way to declare the character encoding in the file will be to use the following:
<meta charset="UTF-8">
Obviously, that is much shorter and easier to remember. Luckily, due to the way encoding detection has been implemented by browsers, it is backwards compatible and believed to be supported by all known browsers.
Along with this, the spec has recently defined how encoding detection must be implemented by browsers and imposed a few additional restrictions for documents to be considered conforming.
- When serialised, the charset attribute and its value must be contained completely in the first 512 bytes of the file.
- The attribute value must be serialised without the use of character entity references of any kind. e.g. You cannot use <meta charset="&#85;TF-8"> (where &#85; is a numeric character reference for the letter U) to declare UTF-8. This is because the encoding detection algorithm does not decode character references; it runs before the actual parsing begins.
- The character encoding used must be a rough superset of US-ASCII. e.g. You can’t use this for EBCDIC-encoded files.
- User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
If the encoding is either UTF-8, UTF-16 or UTF-32, then authors can use a BOM at the start of the file to indicate the character encoding.
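To make that more concrete, here is a rough sketch, in Python, of the kind of pre-scan a browser performs over the start of the file. It is a simplified illustration, not the spec’s actual algorithm, but it shows why the 512-byte limit and the ban on character references exist, and why the new charset attribute is picked up by sniffers written for the old syntax:

import re

# A rough illustration only; the real HTML5 pre-scan algorithm is more involved.
def sniff_encoding(head):
    """Guess the character encoding from the first 512 bytes of an HTML file."""
    head = head[:512]
    # A byte-order mark at the very start identifies the Unicode encodings directly.
    boms = ((b'\xef\xbb\xbf', 'utf-8'),
            (b'\xff\xfe\x00\x00', 'utf-32-le'),
            (b'\x00\x00\xfe\xff', 'utf-32-be'),
            (b'\xff\xfe', 'utf-16-le'),
            (b'\xfe\xff', 'utf-16-be'))
    for bom, name in boms:
        if head.startswith(bom):
            return name
    # Otherwise search the raw bytes for "charset=". This matches both the old
    # content="text/html;charset=UTF-8" form and the new charset="UTF-8"
    # attribute, which is why the shorter syntax is backwards compatible.
    match = re.search(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', head)
    if match:
        return match.group(1).decode('ascii').lower()
    return None  # fall back to the transport-level charset or a browser default

Run over either of the two meta elements shown earlier, this returns utf-8; run over a value written with a character reference such as &#85;TF-8, it finds nothing, because character references are never decoded before the scan.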
This seems to complicate things.
Instead of creating a whole new attribute, why not just require HTML5-conforming UAs to parse the file as UTF-8 (or whatever charset you want) by default if there’s no encoding found? The only time one would need to use the meta element then would be if they used a charset different from the default and needed to specify it directly in the file (in case someone saves the file offline) as well as in the HTTP header. Then people who don’t know what they’re doing will get it right more often than not. People who do know what they’re doing usually get it right anyway, even if it costs a few extra characters in a line to do it.
Most authors don’t care about character encoding and don’t know anything about it. They shouldn’t need to, unless they’re doing something unusual.
Devon, most HTML pages on the Web that don’t declare their encoding are Windows-1252, so requiring browsers to default to UTF-8 would break a lot of pages. Windows-1252 is also what most browsers use by default for HTML.
You can change the default in your browser though; try changing it to utf-8.
What do you mean by “no encoding found”? Do you mean “found at the transport level, i.e. in the HTTP headers”? That would be incompatible with legacy browsers and hence incompatible with the design of HTML5. It would also, IMHO, be poor design, as the file itself should be able to carry around its own encoding metadata; according to Ruby’s Postulate, encoding metadata in the file is significantly more likely to be correct than that at the transport layer. I already worry about the 512-byte limit causing problems, not to mention the possibility the algorithm leaves for the detected encoding to depend on the connection speed.
Yes, I meant ISO-8859-1 is the default HTTP encoding. Keep in mind that the default for application/xhtml+xml is UTF-8. Since ISO-8859-1 is a subset of UTF-8, there shouldn’t be a serious issue with it and it will encourage internationalization.
In addition, it would also minimize any complications between using the HTML5 and XHTML5 syntaxes, so it doesn’t become another text/xml vs application/xml type of issue.
If a file is Windows-1252, which isn’t any spec’s default that I’m aware of, the encoding should be declared in a meta tag. But is there a real need to add a whole new attribute just for one value and one value only? I doubt that. It only complicates the markup language. Not much of a complication, but a little here and a little there… next thing you know, you’ve got a lot of split ends you need to tidy up. An http-equiv attribute makes perfect sense for anything that is, well, a markup equivalent to an HTTP header.
I already have my browser’s default set to UTF-8, and there’s rarely ever any problem on any webpage. It’s not really an issue.
How will HTML5-compliant browsers handle charset (or, character encoding) mismatches, e.g., server=iso-8859-1 but meta element=UTF-8? Which takes precedence?
When served by a web server, HTTP headers would take precedence over meta elements. In the absence of either of those, there should be some default, or else it’ll be like HTML 4.01, where the UA itself defines the default, and that’s just messy. That’s what I’m thinking would make sense. If someone wants something other than the default, they should define it. And I’m not convinced there should be an extra attribute just to define a character encoding. It doesn’t quite make enough sense. I’d have pointed this out on the mailing list, but my e-mails don’t seem to go through (and I am subscribed).
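To illustrate that order of precedence, here is a tiny Python sketch; the function name and the windows-1252 fallback are hypothetical choices for illustration, not something taken from the spec:

# A sketch of the precedence described above; pick_encoding() and its
# default are made up for illustration only.
def pick_encoding(http_charset=None, meta_charset=None, default='windows-1252'):
    """Return the encoding to use: HTTP header first, then meta element, then a default."""
    if http_charset:      # 1. transport level: the HTTP Content-Type header
        return http_charset
    if meta_charset:      # 2. in-document metadata, e.g. <meta charset="...">
        return meta_charset
    return default        # 3. an agreed fallback default

# The mismatch asked about above: server says ISO-8859-1, document says UTF-8.
print(pick_encoding(http_charset='iso-8859-1', meta_charset='utf-8'))  # iso-8859-1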
That’s not exactly true… The ISO-8859-1 repertoire is a subset of Unicode, sure, but if I wanted to write out, say, lower-case thorn followed by lower-case y-umlaut in ISO-8859-1, that would be the bytes 0xFE 0xFF, which happens to be a byte-order mark (the UTF-16 one) and isn’t even a valid UTF-8 sequence. In UTF-8, those two characters are represented by 0xC3 0xBE 0xC3 0xBF (I think, assuming my math is correct).
Now, ASCII, being 7-bit, is a subset of both UTF-8 and ISO-8859-1. Was that what you meant to say?
Later,
Blake.
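For anyone who wants to check those byte values, a couple of lines of Python will do it (þ is U+00FE and ÿ is U+00FF):

# þ (U+00FE) and ÿ (U+00FF) in each encoding.
print('þÿ'.encode('iso-8859-1'))  # b'\xfe\xff'  (also the UTF-16 big-endian byte-order mark)
print('þÿ'.encode('utf-8'))       # b'\xc3\xbe\xc3\xbf'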
Devon Young, you seem to be missing the fact that this attribute is already widely supported (and many sites rely on it!). Besides that, it obviously decreases the complexity of authoring, although there may be a small learning curve for old timers.
Devon, the ISO-8859-1 encoding is not a fully compatible subset of the UTF-8 encoding. Although the ISO-8859-1 repertoire and code positions are the same as the first 256 characters in Unicode, only the US-ASCII subset of both is fully compatible. Characters in the range from 128 to 255 are encoded differently. e.g. A copyright symbol (©, U+00A9) is encoded with a single octet in ISO-8859-1 (0xA9), and 2 octets in UTF-8 (0xC2 0xA9). Thus, if your file were saved as ISO-8859-1, but a browser read it as UTF-8, it would be an error and the browser would render a replacement character: � (U+FFFD). Conversely, if your file were saved as UTF-8, but read as ISO-8859-1, the browser would display 2 characters: Â©. (Try it now: change your browser to read this page as ISO-8859-1 instead of UTF-8 and see how those characters get rendered. But remember to change it back to UTF-8 before posting another comment.)
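The mismatch is easy to reproduce in a short Python session; these lines print the byte sequences and the garbled results described above:

# One octet in ISO-8859-1, two octets in UTF-8.
print('©'.encode('iso-8859-1'))                   # b'\xa9'
print('©'.encode('utf-8'))                        # b'\xc2\xa9'

# ISO-8859-1 bytes misread as UTF-8: an invalid sequence, rendered as U+FFFD.
print(b'\xa9'.decode('utf-8', errors='replace'))  # �

# UTF-8 bytes misread as ISO-8859-1: two characters appear instead of one.
print(b'\xc2\xa9'.decode('iso-8859-1'))           # Â©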
Windows-1252 is a superset of ISO-8859-1, and both are supersets of US-ASCII. Both are single-octet encodings and encode their shared characters in the same way. Windows-1252 defines extra printable characters in the range from 128 to 159, whereas in ISO-8859-1 those positions are C1 control characters.
For more information, see my Guide to Unicode.
Windows-1252 is not the only other encoding used, although it is the most common in western cultures. There’s also ISO-8859-2 to -15, Windows-1250 to -1258, Shift_JIS, GB2312, and many, many more. Just see how many choices your browser offers to manually set the encoding.
Why invent new attributes? Only slightly longer, but fully backwards compatible would be:
<meta name="charset" content="utf-8">
Siegfried, your proposal may be backwards compatible with the HTML4 specification of the meta element, but it’s not compatible with browsers. By that I mean browsers won’t understand that this meta element is describing the character encoding. On the contrary, as Lachlan Hunt wrote above, the new charset attribute is backwards compatible and believed to be supported by all known browsers.
Do major (and minor) search engines support this?