The WHATWG Blog — Conformance Checking

Archive for the ‘Conformance Checking’ Category

HTML vs. XHTML

Wednesday, November 15th, 2006

One of the predominate issues raised in the recent call for comments was surrounding the whole HTML vs. XHTML debate, with many people suggesting that we should not be extending HTML, but rather focussing on XHTML only.

However, what many people have failed to realise is that HTML5 resolves this issue: The (X)HTML5 specification is in fact specifying extensions to both HTML and XHTML simultaneously, and the choice of using either is no longer dependent upon the DOCTYPE.

Many authors use an XHTML 1.0 DOCTYPE and then proceed to claim they’re using XHTML, but browsers make the decision whether to treat a document as HTML or XHTML based on the MIME type. (X)HTML5 endorses dispatching on MIME type: If the document is served as text/html, it gets parsed as HTML; but if it is served with an XML MIME type, like application/xml or application/xhtml+xml, it gets parsed as XHTML.

Document Serializations

HTML5 introduces the concept of serialisations for an HTML document. A serialisation in this context refers to the physical representation of the essence of the document—the document tree. (X)HTML5 requires user agents that support scripting to expose the document tree to the script using the DOM API and data model. HTML5 uses the HTML serialisation and XHTML5 uses the XML serialisation. Because of this, the distinction between an HTML and XHTML document is reduced.

In most cases, either serialisation can be used to represent exactly the same document. The main differences are that the HTML serialisation, due to backwards-compatibility reasons cannot represent structured inline-level elements (e.g. <ol>, <ul>, etc.) as children of the <p> element and the XML serialization cannot represent all possible document trees that may be created as a result of error recovery in the HTML parsing algorithm. Also, in browsers, some scripting API features and CSS layout details work differently depending on the serialisation of the document due to backwards compatibility considerations.

The XML serialisation used by XHTML must be a well-formed XML 1.0 document. However, unlike previous versions of HTML, the HTML serialisation is no longer considered an application of SGML, but instead defines its own syntax. While the syntax is inspired by SGML, it is being defined in a way that more closely resembles the way browsers actually handle HTML in the real world, particularly in regards to error handling.

The HTML5 serialisation and the accompanying parsing algorithm are needed for three reasons:

The browser that currently holds the majority market share doesn’t support XHTML (that is actually served and processed as XHTML).
The legacy text/html content out there needs a well-defined parsing algorithm—something that SGML-based HTML specifications haven’t been able to provide.
There are content management systems, Web applications and workflows that are not based on XML tools and cannot produce well-formed output reliably. These systems can benefit from new features even though they wouldn’t work reliably with the XHTML serialisation.

On the other hand, thanks to the HTML/XHTML duality, new systems can be built on solid off-the-shelf XML tools internally and convert to and from the HTML5 serialisation at input/output boundaries. Once the installed base of browsers supports application/xhtml+xml properly, these systems can swap the output serialiser and start using XHTML-only features such as lists inside paragraphs.

The New `DOCTYPE`

In practice, the DOCTYPE serves two purposes: DTD based validation and (for HTML only) DOCTYPE sniffing. Since HTML is no longer considered an application of SGML and because there are many limitations with DTD based validation, there will not be any official DTDs for (X)HTML5.

As a result, in the HTML serialisation, the only purpose for even having a DOCTYPE is to trigger standards mode in browsers. Thus, because it doesn’t need to refer to a DTD at all, the DOCTYPE is simply this:

<!DOCTYPE html>

I’m sure you would agree that that is about as simple and easy to remember as possible. But, for XHTML, it’s even simpler. There isn’t one! Since browsers have not (and will not) introduce DOCTYPE sniffing for XML, there is little need for a DOCTYPE.

However, I should point out that there is one other minor practical issue with DTDlessness in XML. Entity references which are declared in the XHTML 1.0 DTD will not be able to be used. However, since browsers don't use validating parsers and do not read DTDs from the network anyway, the use of entity references is not recommended. Instead, it is recommended to use character references or a good character encoding (UTF-8) that supports the characters natively.

Conformance Checking

You’re no doubt wondering, if there are no DTDs, how will one go about validating their markup. Well, that’s simple. There are in fact other, more robust methods available for checking document conformance. There are several different schema languages that can be used, including RELAX NG and Schematron. However, even they cannot fully express the machine-checkable conformance requirements of (X)HTML5.

Henri Sivonen is in the process of developing a conformance checker for HTML5, which is being designed to report much more useful error messages beyond those that are possible using just a DTD based approach. For example, the table integrity checker discussed previously is one feature that is impossible to implement using DTDs.

Posted in Conformance Checking, Syntax | 6 Comments »

Table Integrity Checker

Tuesday, November 14th, 2006

I am working on a conformance checking service for (X)HTML5. The service is grammar-based for the most part with RELAX NG as the schema language. Some extra-grammatical constraints are expressed as Schematron assertions. Currently, as a Mozilla Foundation grantee, I am working on writing checkers (in Java) for spec features that cannot (practically or at all) be checked using RELAX NG or Schematron.

In a Web two-point-ohey perpetual beta fashion, I am deploying the new prototype features early to allow testing.

The first non-schema checker prototype is a table integrity checker. Since the table model for (X)HTML5 is now being specified, the prototype is speculatively based on the HTML 4.01 table model and browser behavior. The differences from HTML 4.01 are that colspan='0' is treated as colspan='1' and that headers must refer to th cells. The top left corner of cells is placed in the first available slot on the row, which is browser-compatible but different from what the CSS2 spec says.

The checker emits both warnings and errors. Depending on how the spec turns out, errors may become warnings or vice versa.

Currently, the errors are:

Table cell is overlapped by later table cell.
Table cell overlaps an earlier table cell. (Single overlap gets reported in both directions to show source location for both cells.)
Table cell spans past the end of its row group.
Row has no cells starting on it.
Table row column count is greater than the column count established by cols/colgroups.
Table row column count is less than the column count established by cols/colgroups.
The headers attribute doesn’t point to th elements in the same table.
Column has no cells starting on it. (Contiguous cell ranges established by a single element are coalesced to a single error to protect against denial of service attacks.)

Currently, the warnings are:

colspan exceeds 1000, which is a magic number in Gecko (and according to comments in Gecko source, in IE and Opera, too)
rowspan exceeds 8190, which is a magic number in Gecko
Table row column count is greater than the column count established by the first row in the absence of cols/colgroups.
Table row column count is less than the column count established by the first row in the absence of cols/colgroups.
A col element causes a span attribute to be ignored on the parent colgroup. (Conforming in HTML 4 / XHTML 1.0; non-conforming in (X)HTML5. With (X)HTML5 there’s also a schema-level error.)

The table integrity checker only sees a projection of the document tree that contains nothing but table-significant elements and crazy subtrees of table-significant elements in wrong places are silently pruned. These are dealt with on the RELAX NG level. The table integrity checker assumes that it is being used together with a reasonable schema.

The table integrity checker is also enabled for the HTML 4.01 / XHTML 1.0 presets on the generic side of the service, so testing with today’s content is possible.

There’s a pseudo-schema called http://hsivonen.iki.fi/checkers/table/ which isn’t a schema but a magic URL that causes the system to instantiate the table integrity checker. There’s a pseudo-pseudo-schema called http://hsivonen.iki.fi/checkers/all/ which expands to all pseudo-schemas, but at the moment, there’s only one.

Please let me know if the table integrity checker does not work as advertised.

Posted in Conformance Checking, Processing Model | Comments Off on Table Integrity Checker