The WHATWG Blog

Please leave your sense of logic at the door, thanks!

Charmod Norm Checking

Charmod Norm is still in the Working Draft state, but if it were to become a normative part of (X)HTML5, it would belong to the area of the conformance checking service that I am working on now, so I prototyped Charmod Norm enforcement as well.

The checker outsources most work to ICU4J. Most complexity in my code is due to trying to avoid buffering as much as possible while still using the ICU4J API unmodified—and due to dealing with the halves of a surrogate pair falling into different UTF-16 code unit buffers. On the spec reading front, I couldn’t map “the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15]” to the ICU4J API on my own. Fortunately, I got excellent help on the icu-support mailing list.
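The surrogate-pair problem can be illustrated with a minimal Java sketch. The class `CodePointAssembler` and its methods are hypothetical names made up for this illustration; they are not taken from the checker or from ICU4J. The sketch only shows the idea of carrying a pending high surrogate from one UTF-16 buffer to the next.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reassembles code points from consecutive UTF-16 buffers. */
final class CodePointAssembler {
    private char pendingHigh;    // high surrogate seen at the end of the previous buffer
    private boolean hasPending;  // true while pendingHigh is waiting for its low half
    private final List<Integer> codePoints = new ArrayList<>();

    /** Consumes one buffer; a high surrogate at the very end is kept for the next call. */
    void feed(char[] buf, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = buf[i];
            if (hasPending) {
                hasPending = false;
                if (Character.isLowSurrogate(c)) {
                    codePoints.add(Character.toCodePoint(pendingHigh, c));
                    continue;
                }
                codePoints.add((int) pendingHigh); // unpaired high surrogate, passed through as-is
            }
            if (Character.isHighSurrogate(c)) {
                pendingHigh = c;
                hasPending = true;
            } else {
                codePoints.add((int) c);
            }
        }
    }

    /** Flushes a dangling high surrogate at the end of the stream. */
    void end() {
        if (hasPending) {
            codePoints.add((int) pendingHigh);
            hasPending = false;
        }
    }

    List<Integer> codePoints() {
        return codePoints;
    }
}
```

Feeding the buffers `{'a', 0xD834}` and `{0xDD1E, 'b'}` in sequence yields the three code points U+0061, U+1D11E and U+0062, even though the pair is split across the buffer boundary.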

It turned out that the most time-consuming part was not writing the normalization checker but reworking how Ælfred2 deals with US-ASCII, ISO-8859-1, UTF-8, UTF-16 and UTF-32. In HS Ælfred2, all character encodings are now decoded using the java.nio.charset framework.

Requirements

The definition for Fully-normalized Text involves checking normalization before and after parsing. That is, the source text is required to be NFC-normalized and after parsing the constructs parsed out of the source are required to be NFC-normalized and are required not to start with a “composing character” (which is not exactly the same as a “combining character”).
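As a rough illustration of the NFC part of this requirement, here is a minimal sketch that uses the JDK's own `java.text.Normalizer` instead of ICU4J. The class name `NfcCheck` is hypothetical, the Unicode version applied depends on the JDK rather than being the one the checker uses, and the "composing character" check is deliberately left out because the JDK does not expose it directly.

```java
import java.text.Normalizer;

/** Hypothetical helper: tests whether a parsed construct is already in NFC. */
final class NfcCheck {
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(isNfc("\u00E9"));  // precomposed e-acute: true
        System.out.println(isNfc("e\u0301")); // e followed by combining acute: false
    }
}
```

The same predicate would have to be applied both to the source text and to each construct reported by the parser to approximate the two-level check described above.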

I don’t really like the way the definition involves peeking underneath the parser, but it does have the benefit that if the source is in NFC, you won’t accidentally break the document by editing the source in an NFC-normalizing text editor.

Interpretation

Charmod Norm does not define what “constructs” are in the context of XML 1.0 or HTML5.

However, XML 1.1 does define what “relevant constructs” are, so that definition might be generalizable to XML 1.0 and HTML5. Unfortunately, XML 1.1 defines relevant constructs in terms of the grammar productions of XML itself instead of the significant information items that an XML processor reports to the application.

Personally, I think the XML 1.1 definition is neither practically useful nor something for which I’d be motivated to write an implementation. So for the purpose of prototyping, I made up a definition on my own. Web Applications 1.0 just might get away with making my definition normative for XHTML5 considering that XML 1.0 doesn’t have a definition.

I consider the SAX2 ContentHandler (excluding qNames) and DTDHandler benchmarks of cluefulness when it comes to XML-related spec, API and application design. In general, if your application isn’t an editor that needs to reconstruct the XML source from parsed data, your application is most likely broken if it needs to know something about an XML document being parsed that is not exposed through those two interfaces. On the other hand, a spec that can’t be conformed to by viewing XML through only those two interfaces is broken. Moreover, DTDHandler is about notations, which are pretty much obsolete, so that leaves only ContentHandler.

This gives the following definition of constructs:

Implementation

There is a new pseudo-schema called http://hsivonen.iki.fi/checkers/nfc/. It is enabled for all (X)HTML presets. When this pseudo-schema is in use or when the schema selection is in the automatic mode, the normalization checking of source text underneath the parser is enabled as well.

The following checks are made:

The version of Unicode that is used is 5.0.0.

The column and line numbers reported on errors are very inaccurate due to buffering.

I have not tested whether all the character encoding decoders that I have installed are normalizing transcoders. If you have Windows-1258 Vietnamese test cases, please try them out and let me know what happens. Also, please let me know if the issue applies to something other than legacy Vietnamese encodings.

As usual, the new code is enabled for testing. Please let me know if it doesn’t work as described.

Posted in Conformance Checking, Syntax

Proposing features

I'm happy to see the increased interest in HTML5 recently — especially with the amazing work Lachlan, Henri, and others are doing with this blog, the validator, feedback, and so forth.

Since the volume of feature requests is only going to increase in the near future, I thought I'd list some things that would make evaluating proposals easier. Here are some key things that any proposal should include:

Obviously, we want to keep the language as simple as possible. That means not everyone will get what they want. Having clear answers to the questions above will help all of us work out what is most important.

Posted in WHATWG

Charmod Checking

Web Forms 2.0 requires documents to conform to Charmod. The current Web Applications 1.0 draft does not mention Charmod, but since (X)HTML5 includes both Web Applications 1.0 and Web Forms 2.0, my working assumption is that (X)HTML5 documents are required to conform to Charmod.

It turns out that the best opportunity for checking whether a document conforms to Charmod is in the parser. Hence, I added the checks to my special-purpose HTML parser and to HS Ælfred2—my fork of GNU Ælfred2.

Charmod says:

NOTE: RFC 2119 makes it clear that requirements that use SHOULD are not optional and must be complied with unless there are specific reasons not to: “This word, or the adjective ‘RECOMMENDED’, mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.”

Further, Charmod says: “A specification conforms to this document if it——documents the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED——”. I have an implementation, but I’m documenting my decisions not to enforce some SHOULDs anyway.

Here’s how I have addressed the requirements of Charmod that apply to content (marked as [C] in Charmod). Disclaimer: The implementation decisions I have taken with prototype software are not endorsed by the WHAT WG or anyone else.

C001

Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language.

This requirement is not machine-checkable and, hence, is not enforced by the software.

C002

Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text.

This requirement is not machine-checkable and, hence, is not enforced by the software.

C003

Protocols, data formats and APIs MUST store, interchange or process text data in logical order.

HTML5 as a data format uses logical order. It is not practical to try to figure out in software if the author is trying to subvert the nature of the format on this point. Currently, the software doesn’t enforce this at all. However, it might be useful to catch encoding labels that are used for visual Hebrew or Arabic.

C013

Textual data objects defined by protocol or format specifications MUST be in a single character encoding.

A single character encoding decoder is instantiated per HTTP resource. Encoding violations are treated as fatal. However, some mixed encodings are not caught by this and need human judgment. For example, software can’t tell if ISO-8859-1 and ISO-8859-2 bytes are mixed in one HTTP resource.
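The following sketch shows one way to get "fatal on violation" behavior from the `java.nio.charset` framework, and why single-byte encodings escape it. The class and method names are hypothetical and this is not the checker's actual code; it also assumes that the JDK in use supports ISO-8859-2.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes a whole resource with a single strict decoder. */
final class StrictDecoding {
    /** Any malformed or unmappable byte sequence surfaces as an exception. */
    static String decodeStrictly(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    static boolean decodes(byte[] bytes, Charset charset) {
        try {
            decodeStrictly(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xB1};
        // A lone continuation byte is malformed UTF-8, so the strict decoder rejects it.
        System.out.println(decodes(bytes, StandardCharsets.UTF_8));
        // The same byte is a legal character in both single-byte encodings,
        // so no decoder error can reveal which of the two the author meant.
        System.out.println(decodes(bytes, StandardCharsets.ISO_8859_1));
        System.out.println(decodes(bytes, Charset.forName("ISO-8859-2")));
    }
}
```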

C022

Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement.

An error is reported.

C023

If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed.

An error is reported.
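A check along the lines of C022 and C023 could be sketched as follows. `EncodingNameCheck` and `Verdict` are hypothetical names. The sketch leans on `Charset.isRegistered()`, which reflects what the JDK claims about a charset and is only an assumed stand-in for a lookup in the actual IANA registry; matching the 'x-' prefix case-insensitively is also an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper: classifies an encoding label for C022/C023-style reporting. */
final class EncodingNameCheck {
    enum Verdict { REGISTERED, UNREGISTERED_X_PREFIXED, UNREGISTERED_NOT_PREFIXED, UNKNOWN }

    static Verdict check(String label) {
        Charset charset;
        try {
            charset = Charset.forName(label);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return Verdict.UNKNOWN; // this JDK cannot say anything about the label
        }
        if (charset.isRegistered()) {
            return Verdict.REGISTERED;
        }
        // Unregistered: the label itself is expected to start with "x-".
        boolean prefixed = label.regionMatches(true, 0, "x-", 0, 2);
        return prefixed ? Verdict.UNREGISTERED_X_PREFIXED : Verdict.UNREGISTERED_NOT_PREFIXED;
    }
}
```

A caller could then map `UNREGISTERED_X_PREFIXED` to the C022 message and `UNREGISTERED_NOT_PREFIXED` to both the C022 and C023 messages.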

C049

The character encoding of content SHOULD be chosen so that it maximizes the opportunity to directly represent characters (ie. minimizes the need to represent characters by markup means such as character escapes) while avoiding obscure encodings that are unlikely to be understood by recipients.

UTF-8 maximizes the opportunity to directly represent characters. A warning is issued if the document uses an encoding that is not supported “everywhere”. For XHTML5 the non-obscure encodings are US-ASCII, ISO-8859-1, UTF-8 and UTF-16. For HTML5, the non-obscure encodings are currently the intersection of IANA-registered encodings supported by Sun JDK 1.4.2_8 and Python 2.4.3. (The service supports a wider set of encodings.) The character spectrum use of the document is not analyzed, because I think it wouldn’t be a useful way to use my time considering that using UTF-8 always satisfies this requirement.

C034

If facilities are offered for identifying character encoding, content MUST make use of them; where the facilities offered for character encoding identification include defaults (e.g. in XML 1.0 [XML 1.0]), relying on such defaults is sufficient to satisfy this identification requirement.

An error is reported if an HTML5 document does not have an explicit character encoding declaration (either internal or external).

C024

Content and software that label text data MUST use one of the names required by the appropriate specification (e.g. the XML specification when editing XML text) and SHOULD use the MIME preferred name of a character encoding to label data in that character encoding.

An error is reported if an encoding label is not the MIME preferred name.
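A preferred-name check might look like the sketch below. It assumes that the canonical name reported by the JDK is the MIME preferred name, which is what the `java.nio.charset.Charset` documentation promises for registered charsets with several registry names, and it compares case-insensitively. `PreferredNameCheck` is a hypothetical name.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper: tests whether a label is the JDK's canonical charset name. */
final class PreferredNameCheck {
    static boolean isPreferredName(String label) {
        try {
            // name() returns the canonical name; aliases such as "latin1" resolve to it.
            return Charset.forName(label).name().equalsIgnoreCase(label);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return false; // unknown labels are never the preferred name
        }
    }
}
```

With this sketch, `isPreferredName("ISO-8859-1")` and `isPreferredName("utf-8")` hold, while the aliases `latin1` and `UTF8` are flagged.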

C025

An IANA-registered charset name MUST NOT be used to label text data in a character encoding other than the one identified in the IANA registration of that name.

Encoding violations are treated as fatal. However, this doesn’t catch cases where the document byte sequence is legal in the declared encoding. For example, ISO-8859-2 labeled as ISO-8859-1 is not conclusively machine-detectable.

C073

Publicly interchanged content SHOULD NOT use codepoints in the private use area.

Charmod does allow the use of the private use area for scripts that have not yet been encoded. Since human judgment is needed, the software only emits a warning. Moreover, C040 forbids disallowing the use of the PUA.
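Detecting private use code points is straightforward with the JDK's character database. The sketch below is only an illustration with a hypothetical class name; it counts occurrences so that a caller can decide whether to emit a warning.

```java
/** Hypothetical helper: counts private use code points in a piece of text. */
final class PrivateUseScan {
    static int countPrivateUse(CharSequence text) {
        // codePoints() pairs up surrogates, so the supplementary private use planes are covered.
        return (int) text.codePoints()
                .filter(cp -> Character.getType(cp) == Character.PRIVATE_USE)
                .count();
    }

    public static void main(String[] args) {
        String bmpPua = "a\uE000b";
        String planeFifteen = new String(Character.toChars(0xF0000));
        System.out.println(countPrivateUse(bmpPua));       // 1
        System.out.println(countPrivateUse(planeFifteen)); // 1
    }
}
```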

C076

Content MUST NOT use a code point for any purpose other than that defined by its coded character set.

This requirement is not machine-checkable and, hence, is not enforced by the software.

C047

Escapes SHOULD only be used when the characters to be expressed are not directly representable in the format or the character encoding of the document, or when the visual representation of the character is unclear.

This requirement is not enforced, not even as a warning. Using the five pre-defined entities in XML, using the HTML5 entities from the specification or using numeric character references is harmless when it comes to the parsed document tree. Enforcing this requirement would mean proclaiming a prevalent authoring practice non-conforming on the grounds of the aesthetics of view source. Moreover, Charmod doesn’t give a solid machine-checkable definition for characters whose visual representation is unclear.

C048

Content SHOULD use the hexadecimal form of character escapes rather than the decimal form when there are both.

This requirement is not enforced, not even as a warning. Using the five pre-defined entities in XML, using the HTML5 entities from the specification or using numeric character references is harmless when it comes to the parsed document tree. Enforcing this requirement would mean proclaiming a prevalent authoring practice non-conforming on the grounds of the aesthetics of view source.

C054

Users of specifications (software developers, content developers) SHOULD whenever possible prefer ways other than string indexing to identify substrings or point within a string.

This requirement is not machine-checkable to the extent it might apply to the (X)HTML5 layer and, hence, is not enforced by the software.

In the spirit of perpetual beta, the new code is enabled for all (X)HTML presets in the generic UI. Please let me know if it doesn’t work as described.

Posted in Conformance Checking

HTML vs. XHTML

One of the predominant issues raised in the recent call for comments was the whole HTML vs. XHTML debate, with many people suggesting that we should not be extending HTML, but rather focussing on XHTML only.

However, what many people have failed to realise is that HTML5 resolves this issue: The (X)HTML5 specification is in fact specifying extensions to both HTML and XHTML simultaneously, and the choice of using either is no longer dependent upon the DOCTYPE.

Many authors use an XHTML 1.0 DOCTYPE and then proceed to claim they’re using XHTML, but browsers make the decision whether to treat a document as HTML or XHTML based on the MIME type. (X)HTML5 endorses dispatching on MIME type: If the document is served as text/html, it gets parsed as HTML; but if it is served with an XML MIME type, like application/xml or application/xhtml+xml, it gets parsed as XHTML.
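The dispatch rule can be sketched in a few lines of Java. The names `MimeDispatch` and `Syntax` are hypothetical, and the sketch only recognises the MIME types named in this post; a real user agent would recognise further XML MIME types.

```java
import java.util.Locale;

/** Hypothetical helper: picks a parser family from the Content-Type, never from the DOCTYPE. */
final class MimeDispatch {
    enum Syntax { HTML, XML, UNSUPPORTED }

    static Syntax syntaxFor(String contentType) {
        // Drop parameters such as "; charset=utf-8" and ignore case.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (type.equals("text/html")) {
            return Syntax.HTML;
        }
        if (type.equals("application/xml") || type.equals("application/xhtml+xml")) {
            return Syntax.XML;
        }
        return Syntax.UNSUPPORTED;
    }
}
```

Note that the document content is never inspected: a file with an XHTML 1.0 DOCTYPE served as `text/html` still maps to `Syntax.HTML`.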

Document Serializations

HTML5 introduces the concept of serialisations for an HTML document. A serialisation in this context refers to the physical representation of the essence of the document—the document tree. (X)HTML5 requires user agents that support scripting to expose the document tree to the script using the DOM API and data model. HTML5 uses the HTML serialisation and XHTML5 uses the XML serialisation. Because of this, the distinction between an HTML and XHTML document is reduced.

In most cases, either serialisation can be used to represent exactly the same document. The main differences are that the HTML serialisation, for backwards-compatibility reasons, cannot represent structured inline-level elements (e.g. <ol> and <ul>) as children of the <p> element, and that the XML serialisation cannot represent all possible document trees that may be created as a result of error recovery in the HTML parsing algorithm. Also, in browsers, some scripting API features and CSS layout details work differently depending on the serialisation of the document due to backwards compatibility considerations.

The XML serialisation used by XHTML must be a well-formed XML 1.0 document. However, unlike previous versions of HTML, the HTML serialisation is no longer considered an application of SGML, but instead defines its own syntax. While the syntax is inspired by SGML, it is being defined in a way that more closely resembles the way browsers actually handle HTML in the real world, particularly with regard to error handling.

The HTML5 serialisation and the accompanying parsing algorithm are needed for three reasons:

  1. The browser that currently holds the majority market share doesn’t support XHTML (that is actually served and processed as XHTML).
  2. The legacy text/html content out there needs a well-defined parsing algorithm—something that SGML-based HTML specifications haven’t been able to provide.
  3. There are content management systems, Web applications and workflows that are not based on XML tools and cannot produce well-formed output reliably. These systems can benefit from new features even though they wouldn’t work reliably with the XHTML serialisation.

On the other hand, thanks to the HTML/XHTML duality, new systems can be built on solid off-the-shelf XML tools internally and convert to and from the HTML5 serialisation at input/output boundaries. Once the installed base of browsers supports application/xhtml+xml properly, these systems can swap the output serialiser and start using XHTML-only features such as lists inside paragraphs.

The New DOCTYPE

In practice, the DOCTYPE serves two purposes: DTD-based validation and (for HTML only) DOCTYPE sniffing. Since HTML is no longer considered an application of SGML and because there are many limitations with DTD-based validation, there will not be any official DTDs for (X)HTML5.

As a result, in the HTML serialisation, the only purpose for even having a DOCTYPE is to trigger standards mode in browsers. Thus, because it doesn’t need to refer to a DTD at all, the DOCTYPE is simply this:

<!DOCTYPE html>

I’m sure you would agree that that is about as simple and easy to remember as possible. But, for XHTML, it’s even simpler. There isn’t one! Since browsers have not introduced (and will not introduce) DOCTYPE sniffing for XML, there is little need for a DOCTYPE.

However, I should point out that there is one other minor practical issue with DTDlessness in XML: entity references that are declared in the XHTML 1.0 DTD cannot be used. Since browsers don't use validating parsers and do not read DTDs from the network anyway, the use of entity references is not recommended. Instead, it is recommended to use character references or a good character encoding (UTF-8) that supports the characters natively.
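For output that cannot be UTF-8, a serialiser could emit character references instead of DTD-declared entities. The following is a minimal sketch with a hypothetical class name; it assumes the input is plain text in which markup-significant characters such as `&` and `<` are handled elsewhere.

```java
import java.util.Locale;

/** Hypothetical helper: writes non-ASCII characters as hexadecimal character references. */
final class CharRefs {
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp); // ASCII is representable in every encoding considered here
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            }
        });
        return out.toString();
    }
}
```

For example, a no-break space, which XHTML 1.0 authors would typically write as `&nbsp;`, comes out as `&#xA0;` and therefore needs no DTD.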

Conformance Checking

You’re no doubt wondering: if there are no DTDs, how will one go about validating their markup? Well, that’s simple. There are in fact other, more robust methods available for checking document conformance. There are several different schema languages that can be used, including RELAX NG and Schematron. However, even they cannot fully express the machine-checkable conformance requirements of (X)HTML5.

Henri Sivonen is in the process of developing a conformance checker for HTML5, which is being designed to report much more useful error messages than are possible with a purely DTD-based approach. For example, the table integrity checker discussed previously is one feature that is impossible to implement using DTDs.

Posted in Conformance Checking, Syntax

Table Integrity Checker

I am working on a conformance checking service for (X)HTML5. The service is grammar-based for the most part with RELAX NG as the schema language. Some extra-grammatical constraints are expressed as Schematron assertions. Currently, as a Mozilla Foundation grantee, I am working on writing checkers (in Java) for spec features that cannot (practically or at all) be checked using RELAX NG or Schematron.

In a Web two-point-ohey perpetual beta fashion, I am deploying the new prototype features early to allow testing.

The first non-schema checker prototype is a table integrity checker. Since the table model for (X)HTML5 is now being specified, the prototype is speculatively based on the HTML 4.01 table model and browser behavior. The differences from HTML 4.01 are that colspan='0' is treated as colspan='1' and that headers must refer to th cells. The top left corner of cells is placed in the first available slot on the row, which is browser-compatible but different from what the CSS2 spec says.
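The slot placement rule can be sketched as follows. This is not the checker's code: the class name and the input representation (each cell given as a `{colspan, rowspan}` pair) are hypothetical, overlapping cells are not detected, and a rowspan below 1 is simply treated as 1 to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: computes the starting column of every cell in a table. */
final class TableSlots {
    /** rows[r][i] holds {colspan, rowspan} of the i-th cell in row r. */
    static int[][] place(int[][][] rows) {
        // remaining.get(c): number of rows, counting the current one, for which column c is taken
        List<Integer> remaining = new ArrayList<>();
        int[][] starts = new int[rows.length][];
        for (int r = 0; r < rows.length; r++) {
            starts[r] = new int[rows[r].length];
            int col = 0;
            for (int i = 0; i < rows[r].length; i++) {
                int colspan = Math.max(1, rows[r][i][0]); // colspan='0' is treated as colspan='1'
                int rowspan = Math.max(1, rows[r][i][1]); // simplification, see lead-in
                // The top left corner goes into the first available slot on the row.
                while (col < remaining.size() && remaining.get(col) > 0) {
                    col++;
                }
                starts[r][i] = col;
                for (int c = col; c < col + colspan; c++) {
                    while (remaining.size() <= c) {
                        remaining.add(0);
                    }
                    remaining.set(c, Math.max(remaining.get(c), rowspan));
                }
                col += colspan;
            }
            // Moving to the next row uses up one row of every active span.
            for (int c = 0; c < remaining.size(); c++) {
                remaining.set(c, Math.max(0, remaining.get(c) - 1));
            }
        }
        return starts;
    }
}
```

For a first row with cells `{1, 2}` and `{2, 1}` followed by a second row with cells `{0, 1}` and `{1, 1}`, the cells of the second row start in columns 1 and 2, because column 0 is still occupied by the cell spanning two rows.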

The checker emits both warnings and errors. Depending on how the spec turns out, errors may become warnings or vice versa.

Currently, the errors are:

Currently, the warnings are:

The table integrity checker only sees a projection of the document tree that contains nothing but table-significant elements; crazy subtrees of table-significant elements in wrong places are silently pruned. Those are dealt with on the RELAX NG level. The table integrity checker assumes that it is being used together with a reasonable schema.

The table integrity checker is also enabled for the HTML 4.01 / XHTML 1.0 presets on the generic side of the service, so testing with today’s content is possible.

There’s a pseudo-schema called http://hsivonen.iki.fi/checkers/table/ which isn’t a schema but a magic URL that causes the system to instantiate the table integrity checker. There’s a pseudo-pseudo-schema called http://hsivonen.iki.fi/checkers/all/ which expands to all pseudo-schemas, but at the moment, there’s only one.

Please let me know if the table integrity checker does not work as advertised.

Posted in Conformance Checking, Processing Model