Declaring the Character Encoding
HTML requires that authors declare the character encoding of the file either
using HTTP headers (when served over HTTP) or metadata in the file. In previous
versions of HTML, authors could specify the character encoding using a relatively
long meta element like this:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
The idea of the
http-equiv attribute was that it would act as
a substitute for real HTTP headers. However, in practice, that is not entirely
true. Only a few headers actually have any effect in browsers. In fact,
HTML4 even suggested that servers use this attribute to gather information
for HTTP response message headers; but in reality, no known server ever did.
Although the MIME type is included in the value of the content attribute
above, it has no effect in browsers. The only useful and practical piece of
information in that element is:
charset=UTF-8
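As a rough illustration of how the charset parameter sits alongside the MIME type in that value, the following Python sketch splits the two apart using the standard library's email.message module. The helper name split_content_type is hypothetical and exists only for this example.

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (MIME type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the charset and returns None if absent.
    return msg.get_content_type(), msg.get_content_charset()


mime_type, charset = split_content_type("text/html;charset=UTF-8")
print(mime_type, charset)  # text/html utf-8
```

Only the second item of the returned pair matters for encoding detection; the MIME type is carried along but ignored.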
In order to simplify the
meta element and remove unnecessary markup, HTML5
has changed it slightly. The new way to declare the character encoding in the
file is to use the following:
<meta charset="UTF-8">
Obviously, that is much shorter and easier to remember. Luckily, due to the way encoding detection has been implemented by browsers, it is backwards compatible and believed to be supported by all known browsers.
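One way to picture the backwards compatibility is a scanner that simply looks for "charset=" inside a meta tag, without caring which attribute it appears in. The Python sketch below is a deliberately simplified stand-in for what browsers do, not the algorithm defined by the spec; the names META_CHARSET and sniff_meta_charset are assumptions made for this example.

```python
import re
from typing import Optional

# Simplified pattern: a meta tag whose attributes contain "charset=" somewhere,
# followed by an optionally quoted encoding label.
META_CHARSET = re.compile(
    rb"""<meta\b[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(head: bytes) -> Optional[str]:
    """Return the lowercased encoding label from the first matching meta tag."""
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


old_form = b'<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
new_form = b'<meta charset="UTF-8">'
print(sniff_meta_charset(old_form), sniff_meta_charset(new_form))
```

Under this simplification, both the long and the short form yield the same label, which is consistent with the short form working in browsers that predate it.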
Along with this, the spec has recently defined how encoding detection must be implemented by browsers and imposed a few additional restrictions for documents to be considered conforming.
- When serialised, the charset attribute and its value must be contained completely in the first 512 bytes of the file.
- The attribute value must be serialised without the use of character entity
references of any kind. For example, you cannot use
<meta charset="&#x55;TF-8"> to declare UTF-8. This is because the encoding detection algorithm runs before the actual parsing begins, so it does not decode character references.
- The character encoding used must be a rough superset of US-ASCII; for example, you cannot use this for EBCDIC-encoded files.
- User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
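The first two restrictions in the list above lend themselves to a mechanical check. The following Python sketch is illustrative only: check_charset_declaration and its regular expression are hypothetical, the pattern is far looser than a real HTML tokenizer, and the 512-byte limit is taken from the text above and exposed as a parameter.

```python
import re
from typing import List

# Loose pattern for a charset declaration inside a meta tag. Group 1 is the
# raw attribute value as serialised, including any surrounding quotes.
META_CHARSET_ATTR = re.compile(
    rb"""<meta\b[^>]*?\bcharset\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def check_charset_declaration(document: bytes, limit: int = 512) -> List[str]:
    """Return a list of problems found; an empty list means none were detected."""
    match = META_CHARSET_ATTR.search(document)
    if match is None:
        return ["no charset declaration found"]
    problems = []
    # The attribute and its value must end within the first `limit` bytes.
    if match.end(1) > limit:
        problems.append("declaration extends past the first %d bytes" % limit)
    # Character references are not decoded before encoding detection runs.
    if b"&" in match.group(1):
        problems.append("attribute value contains a character reference")
    return problems


print(check_charset_declaration(b'<!DOCTYPE html><meta charset="UTF-8">'))  # []
```

The check works on raw bytes on purpose, since the restrictions apply to the serialised file before any decoding has happened.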
If the encoding is either UTF-8, UTF-16 or UTF-32, then authors can use a BOM at the start of the file to indicate the character encoding.
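A BOM can be recognised by comparing the first few bytes of the file against the known signatures. The sketch below uses the BOM constants from Python's codecs module; the function name encoding_from_bom and the returned labels are assumptions for this example rather than anything mandated by the spec.

```python
import codecs
from typing import Optional

# Order matters: the UTF-32LE BOM begins with the same two bytes as the
# UTF-16LE BOM, so the longer signatures are tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(encoding_from_bom(codecs.BOM_UTF8 + b"<!DOCTYPE html>"))  # UTF-8
```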