Archive for February, 2007
Declaring the Character Encoding
HTML requires that authors declare the character encoding of the file either
using HTTP headers (when served over HTTP) or metadata in the file. In previous
versions of HTML, authors could specify the character encoding using a relatively
complex meta element like this:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
The idea of the http-equiv attribute was that it would act as
a substitute for real HTTP headers. However, in practice, that is not entirely
true. Only a few headers actually have any effect in browsers. In fact,
HTML4 even suggested that servers use this attribute to gather information
for HTTP response message headers; but in reality, no known server ever did
this.
Although the MIME type is included in the value for the Content-Type header
above, it has no effect in browsers. The only useful and practical piece of
information in that element is: charset=UTF-8.
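To make the two parts of that value concrete, here is a minimal Python sketch that splits a Content-Type value into its MIME type and its charset parameter. It relies on the standard library's email.message.Message; the helper name split_content_type is made up for this illustration, and nothing here is meant to describe how a browser does it.

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (MIME type, charset or None).

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_type() returns the lower-cased MIME type;
    # get_content_charset() returns the lower-cased charset parameter,
    # or None when the value has no charset parameter.
    return msg.get_content_type(), msg.get_content_charset()


mime_type, charset = split_content_type("text/html;charset=UTF-8")
print(mime_type)  # text/html
print(charset)    # utf-8
```

Of the two values returned, only the second one matters for encoding detection.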
In order to simplify the meta element and remove unnecessary markup, HTML5
has changed it slightly. The new way to declare the character encoding in the
file will be to use the following:
<meta charset="UTF-8">
Obviously, that is much shorter and easier to remember. Luckily, due to the way encoding detection has been implemented by browsers, it is backwards compatible and believed to be supported by all known browsers.
Along with this, the spec has recently defined how encoding detection must be implemented by browsers and imposed a few additional restrictions for documents to be considered conforming.
- When serialised, the charset attribute and its value must be contained completely in the first 512 bytes of the file.
- The attribute value must be serialised without the use of character entity
references of any kind, e.g. you cannot use
<meta charset=" UTF-8">
to declare UTF-8. This is because the encoding detection algorithm runs before the actual parsing begins, so it does not decode character references.
- The character encoding used must be a rough superset of US-ASCII, e.g. you can’t use this for EBCDIC encoded files.
- User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
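As a rough illustration of the first two restrictions, here is a simplified Python sketch of a prescan that looks for a charset declaration in the raw bytes. It is not the algorithm from the spec: the regular expression, the name sniff_meta_charset and the sample inputs (including the character reference &#x55; standing in for "U") are assumptions made for this example. It does show why both the old and the new meta syntax are found, and why a character reference in the value is not.

```python
import re
from typing import Optional

PRESCAN_LIMIT = 512  # bytes, the limit given in the list above

# Simplified, hypothetical pattern: "charset", "=", an optional quote,
# then an encoding label made of plain ASCII characters.
_CHARSET_RE = re.compile(
    rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def sniff_meta_charset(data: bytes) -> Optional[str]:
    """Return the encoding label declared in the first 512 bytes, if any.

    The bytes are searched as-is; nothing is parsed or decoded first,
    so character references in the value are never resolved.
    """
    match = _CHARSET_RE.search(data[:PRESCAN_LIMIT])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


new_style = b'<meta charset="UTF-8">'
old_style = b'<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
print(sniff_meta_charset(new_style))                       # utf-8
print(sniff_meta_charset(old_style))                       # utf-8
print(sniff_meta_charset(b'<meta charset="&#x55;TF-8">'))  # None
print(sniff_meta_charset(b" " * 600 + new_style))          # None
```

The last two calls return None because the value starts with an undecoded character reference in one case and because the declaration sits beyond the first 512 bytes in the other.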
If the encoding is either UTF-8, UTF-16 or UTF-32, then authors can use a BOM at the start of the file to indicate the character encoding.
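A BOM check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name sniff_bom and the returned labels are made up here, while the BOM byte sequences come from the standard codecs module.

```python
import codecs
from typing import Optional

# The UTF-32 BOMs are listed first because the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding label if the data starts with a known BOM."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


print(sniff_bom(b"\xef\xbb\xbf<!DOCTYPE html>"))  # utf-8
print(sniff_bom(b"\xff\xfe<\x00!\x00"))           # utf-16-le
print(sniff_bom(b"<!DOCTYPE html>"))              # None
```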
What Makes the Application of HTML 5 Different?
All markup languages have three aspects: theory, application and philosophy.
Most web developers do not concern themselves with the theory of markup languages, e.g., HTML, XHTML and XML. That is for those who write the specifications and UAs, or User Agents. UAs make things work. Most web developers are not interested in how things work as long as they work.
Most web developers, however, are concerned with the practical application of markup languages in websites they construct. Specification requirements are easier to understand.
HTML 5 isn't any different. It has its theory but that is not what this article is about. (The theory of HTML 5 can be saved for a later day.) The following addresses the application of HTML 5 by web developers while an attempt is made to understand those reasons which make HTML 5 different.
Three fundamental considerations are made by web developers. They are:
1. Document Type Declaration
The W3C DTD, or Document Type Definition, specifies either “Standards Mode” or “Quirks Mode” for UAs parsing Cascading Style Sheets (CSS). (“Standards Mode” is the default for XHTML [except when an XML declaration has been included above the DTD, which will then trigger “Quirks Mode”].) Web developers who have chosen HTML 4.01 use DTDs which trigger “Standards Mode”. HTML5 specifies a Document Type Declaration, i.e., <!DOCTYPE html>, which triggers “Standards Mode”. (This DocType was not invented by the WHATWG; it existed previously.) Further, DTDs are unnecessary for elements and attributes. All elements and attributes are recognized by UAs, e.g., browsers, with this DocType.
2. MIME Type
HTML 4.01 is primarily sent as “text/html”. XHTML 1.0 is primarily sent as “text/html”. Web Applications 1.0, section 1.4.1 (HTML vs XHTML), states that all documents sent as “text/html” are HTML5.
3. Well-Formedness
XHTML introduced the concept of “well-formedness”. (See XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition), Appendix C.) “Well-formedness” was simple. However, these days, “well-formedness” has come to include all of the requirements set forth in Appendix C and in Section 4, “Differences with HTML 4”, too. It is one of the cardinal principles of web standards. “Well-formed” sites define web standards.
Conclusion
Most web developers want their CSS displayed in a “standards-compliant mode”; most web developers will continue sending documents as “text/html”; and most will not veer from writing “well-formed” code.
It then seems that all one needs to do to use HTML 5 is:
- Replace the W3C Document Type Definition with <!DOCTYPE html>.
- Continue sending documents as “text/html”.
- Do not alter “well-formed” source code.
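The first of those steps is mechanical enough to sketch in Python. The snippet below is only an illustration: the function name use_html5_doctype and the sample page are made up for this example, and the pattern assumes the existing declaration contains no ">" before its end (that is, no internal DTD subset).

```python
import re

# Matches the first doctype declaration, for example an HTML 4.01 or
# XHTML 1.0 one. Simplification: no ">" is expected inside it.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)


def use_html5_doctype(markup: str) -> str:
    """Replace the first doctype with <!DOCTYPE html>; leave the rest alone."""
    return _DOCTYPE_RE.sub("<!DOCTYPE html>", markup, count=1)


old_page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    "<html><head><title>Example</title></head><body></body></html>"
)
print(use_html5_doctype(old_page).splitlines()[0])  # <!DOCTYPE html>
```

The remaining two steps need no code at all, since they amount to leaving the MIME type and the markup as they are.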
Nevertheless, all website pages remain (X)HTML but with a different DocType.
So, theory aside, the application of HTML 5 isn’t any different from the common, existing practices of authors who use web standards.
It’s philosophically different. That's all.
WHATWG’s development model
You may be wondering just how things work around here:
The process of writing the specs boils down to two general areas: defining the existing HTML language better and defining new features.
For defining the existing language better, the goal is to take the existing specs and reverse engineer the way existing browsers work and to do extensive research on existing documents to work out what the specs should say.
There are multiple aspects and phases to defining new features. Based on feedback from authors about what they need that the existing specs lack, and on evidence from forums, browser bug systems, common class names and the features supported by libraries such as toolkits, the group tries to make a semantically clean, accessible, backwards-compatible solution, all while trying to remain consistent with the existing language.
You (yes, you!) are welcome to raise questions or concerns, or to leave comments on future WHATWG proposals, by joining the mailing list (documented and with a wider reach), by joining the live community in the #whatwg channel on the freenode IRC network (hack and chat), or by using the forums (perhaps less chaotic than the mailing list).
Just to give an idea of how all this works, we'll use the mailing list as an example of how a proposal is worked into a spec. Simply bring up an issue for a collective brainstorming session. It may seem confusing at first since there is no set structure in which these discussions take place, but note that the purpose of these open debates is to generate enough feedback for the community to refine the issue. The discussions on the mailing list are not meant to make a final decision on the issue but to clarify it. The editor then looks at all the feedback and makes a decision on the spec.
If you keep this model in mind, it is not so chaotic after all.
The WHATWG Forums
Let me introduce you to The WHATWG Forums! If you don't feel comfortable with mailing lists but still want to discuss the future of HTML, how to use HTML5, or ask for help, then these forums are for you.
More forums or sub-forums will be added as needed. Contact me (zcorpan) for administrative stuff (becoming moderator, etc).
Enjoy!