HTML vs. XHTML
One of the predominant issues raised in the recent call for comments concerned the whole HTML vs. XHTML debate, with many people suggesting that we should not be extending HTML, but rather focussing on XHTML only.
However, what many people have failed to realise is that HTML5 resolves this issue: The (X)HTML5 specification is in fact specifying extensions to both HTML and XHTML simultaneously, and the choice of using either is no longer dependent upon the DOCTYPE.
Many authors use an XHTML 1.0 DOCTYPE and then proceed to claim they’re using XHTML, but browsers make the decision whether to treat a document as HTML or XHTML based on the MIME type. (X)HTML5 endorses dispatching on MIME type: If the document is served as text/html, it gets parsed as HTML; but if it is served with an XML MIME type, like application/xml or application/xhtml+xml, it gets parsed as XHTML.
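The dispatch rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how any browser implements it: the function name `choose_parser` and the `XML_MIME_TYPES` set are hypothetical, and the set lists only the two XML MIME types mentioned above.

```python
# Hypothetical helper: pick a parser family from a Content-Type header value.
# Only the MIME types named in the article are recognised here.
XML_MIME_TYPES = {"application/xml", "application/xhtml+xml"}


def choose_parser(content_type: str) -> str:
    """Return "html" or "xml" for the given Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    if mime in XML_MIME_TYPES:
        return "xml"
    raise ValueError(f"not a MIME type this sketch handles: {mime!r}")


print(choose_parser("text/html; charset=utf-8"))
print(choose_parser("application/xhtml+xml"))
```

Note that the DOCTYPE never enters into the decision; only the MIME type does.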
Document Serialisations
HTML5 introduces the concept of serialisations for an HTML document. A serialisation in this context refers to the physical representation of the essence of the document—the document tree. (X)HTML5 requires user agents that support scripting to expose the document tree to the script using the DOM API and data model. HTML5 uses the HTML serialisation and XHTML5 uses the XML serialisation. Because of this, the distinction between an HTML and XHTML document is reduced.
In most cases, either serialisation can be used to represent exactly the same document. The main differences are that the HTML serialisation, for backwards-compatibility reasons, cannot represent structured inline-level elements (e.g. <ol>, <ul>, etc.) as children of the <p> element, and the XML serialisation cannot represent all possible document trees that may be created as a result of error recovery in the HTML parsing algorithm. Also, in browsers, some scripting API features and CSS layout details work differently depending on the serialisation of the document, due to backwards-compatibility considerations.
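To make the first difference concrete, here is a small sketch using Python's standard XML parser as a stand-in for a browser's XML parser (an assumption made purely for illustration). It shows that in the XML serialisation a list really can be a child of a paragraph; an HTML parser would instead end the paragraph before the list starts.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A paragraph that contains a list, written in the XML serialisation.
source = (
    f'<p xmlns="{XHTML_NS}">Shopping list:'
    "<ul><li>milk</li><li>bread</li></ul>"
    "</p>"
)

p = ET.fromstring(source)

# ElementTree reports tags as "{namespace}name"; strip the namespace part.
child_tags = [child.tag.split("}", 1)[1] for child in p]
print(child_tags)  # ['ul']
```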
The XML serialisation used by XHTML must be a well-formed XML 1.0 document. However, unlike previous versions of HTML, the HTML serialisation is no longer considered an application of SGML, but instead defines its own syntax. While the syntax is inspired by SGML, it is being defined in a way that more closely resembles the way browsers actually handle HTML in the real world, particularly with regard to error handling.
The HTML5 serialisation and the accompanying parsing algorithm are needed for three reasons:
- The browser that currently holds the majority market share doesn’t support XHTML (that is actually served and processed as XHTML).
- The legacy text/html content out there needs a well-defined parsing algorithm, something that SGML-based HTML specifications haven’t been able to provide.
- There are content management systems, Web applications and workflows that are not based on XML tools and cannot produce well-formed output reliably. These systems can benefit from new features even though they wouldn’t work reliably with the XHTML serialisation.
On the other hand, thanks to the HTML/XHTML duality, new systems can be built on solid off-the-shelf XML tools internally and convert to and from the HTML5 serialisation at input/output boundaries. Once the installed base of browsers supports application/xhtml+xml properly, these systems can swap the output serialiser and start using XHTML-only features such as lists inside paragraphs.
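As a rough sketch of what swapping the output serialiser might look like, the following builds one tree with Python's standard `xml.etree.ElementTree` and writes it out twice. The `method="html"` output of that library is a generic HTML writer, not a conforming HTML5 serialiser, and the tree is left without a namespace to keep the output short; both are simplifications for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Build a small document tree with XML tooling.
p = ET.Element("p")
p.text = "line one"
br = ET.SubElement(p, "br")
br.tail = "line two"

# The same tree, written with two different output serialisers.
as_xml = ET.tostring(p, encoding="unicode", method="xml")
as_html = ET.tostring(p, encoding="unicode", method="html")

print(as_xml)   # <p>line one<br />line two</p>
print(as_html)  # <p>line one<br>line two</p>
```

The system's internals only ever deal with the tree; the choice of serialisation is confined to the last step.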
The New DOCTYPE
In practice, the DOCTYPE serves two purposes: DTD-based validation and (for HTML only) DOCTYPE sniffing. Since HTML is no longer considered an application of SGML and because there are many limitations with DTD-based validation, there will not be any official DTDs for (X)HTML5.
As a result, in the HTML serialisation, the only purpose for even having a DOCTYPE is to trigger standards mode in browsers. Thus, because it doesn’t need to refer to a DTD at all, the DOCTYPE is simply this:
<!DOCTYPE html>
I’m sure you would agree that that is about as simple and easy to remember as possible. But, for XHTML, it’s even simpler. There isn’t one! Since browsers have not introduced (and will not introduce) DOCTYPE sniffing for XML, there is little need for a DOCTYPE.
However, I should point out that there is one other minor practical issue with DTDlessness in XML: entity references that are declared in the XHTML 1.0 DTD cannot be used. Since browsers don't use validating parsers and do not read DTDs from the network anyway, the use of entity references is not recommended. Instead, it is recommended to use character references or a good character encoding (UTF-8) that supports the characters natively.
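The effect is easy to reproduce with any non-validating XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in (browsers use their own parsers, so treat this as an analogy); the helper name `parses` is made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A named entity from the XHTML 1.0 DTD, with no DTD available: rejected.
print(parses("<p>caf&eacute;</p>"))  # False
# A numeric character reference: accepted.
print(parses("<p>caf&#233;</p>"))    # True
# The character itself, relying on the document's encoding: accepted.
print(parses("<p>café</p>"))         # True
```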
Conformance Checking
You’re no doubt wondering how, if there are no DTDs, one will go about validating markup. Well, that’s simple. There are in fact other, more robust methods available for checking document conformance. There are several different schema languages that can be used, including RELAX NG and Schematron. However, even they cannot fully express the machine-checkable conformance requirements of (X)HTML5.
Henri Sivonen is in the process of developing a conformance checker for HTML5, which is being designed to report much more useful error messages than those that are possible using just a DTD-based approach. For example, the table integrity checker discussed previously is one feature that is impossible to implement using DTDs.
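To give a feel for the kind of check a DTD cannot express, here is a toy Python sketch that adds up colspan values across rows. It is not the conformance checker's code, and the rule it applies (every row spans the same number of columns, ignoring rowspan) is a stand-in chosen for illustration rather than a rule taken from the specification. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def row_widths(table_markup: str) -> list[int]:
    """Return how many columns each <tr> spans, counting colspan."""
    table = ET.fromstring(table_markup)
    widths = []
    for row in table.iter("tr"):
        cells = [cell for cell in row if cell.tag in ("td", "th")]
        widths.append(sum(int(cell.get("colspan", "1")) for cell in cells))
    return widths


def is_rectangular(table_markup: str) -> bool:
    """Illustrative rule only: all rows span the same number of columns."""
    return len(set(row_widths(table_markup))) <= 1


good = "<table><tr><td colspan='2'>a</td></tr><tr><td>b</td><td>c</td></tr></table>"
bad = "<table><tr><td>a</td></tr><tr><td>b</td><td>c</td></tr></table>"

print(is_rectangular(good))  # True
print(is_rectangular(bad))   # False
```

A DTD can say which elements may appear inside a row, but it has no way to do arithmetic on attribute values across rows, which is why checks of this sort need real code.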
Seeing as HTML5 will define its own syntax, will XML-style empty element syntax be supported, or would this cause significant problems?
Imon, no it won’t. In fact, it expressly forbids it and defines that it must be ignored by browsers. We have to retain compatibility with the way existing browsers handle it, or many sites would break.
Update: Due to recent discussion on the mailing list, the trailing slash is now permitted on empty elements (now known as “void elements” in the spec). But the syntax is still completely meaningless in HTML; it serves no purpose whatsoever and has only been permitted because it’s both harmless and widely used.
One of the advantages of XHTML is that a valid XHTML document is automatically a valid XML document. This allows for easy non-human parsing of XHTML using standard XML tools. Is this going to be true with (X)HTML5?
All I see following the conformance checker link is a text field for a document. How do I check my document if your site is down or I do not have access to the internet? What if your site is compromised? Will you have (X)HTML5 specifications readable by current XML/HTML/SGML validation tools? If yes, in what format? If you will still need to create validation specifications, why do you want to get rid of DTD in the DOCTYPE?
You need an HTML5 parser instead of an XML parser at the start of your XML tool pipeline. You can use XML tools except the XML parser itself.
The software is Open Source; you are free to run your own copy.
No. The current validation tools cannot check for all HTML5 conformance requirements. They cannot check for all HTML 4.01 conformance requirements, either.
Please define “valid specification” and explain the need.
DTDs are woefully inadequate and broken for their purpose.