Wednesday, November 15th, 2006
Charmod Norm is
still in the Working Draft state, but if it were to become a
normative part of (X)HTML5, it would belong to the area of the
conformance checking service that I am working on now, so I
prototyped Charmod Norm enforcement as well.
The checker outsources most work to ICU4J.
Most complexity in my code is due to trying to avoid buffering as
much as possible while still using the ICU4J API unmodified—and due
to dealing with the halves of a surrogate pair falling into different
UTF-16 code unit buffers. On the spec reading front, I couldn’t map
“the second character in the canonical decomposition mapping of
some character that is not listed in the Composition Exclusion Table
defined in [UTR #15]” to the ICU4J API on my own. Fortunately, I
got excellent help on the icu-support mailing list.
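The surrogate-pair problem mentioned above can be illustrated with a small sketch. This is not the checker's actual code: the class name SurrogateSafeChunker and its methods are hypothetical, and unlike the real implementation it copies each buffer instead of trying to avoid buffering. It only shows the idea of holding back a trailing high surrogate until the next buffer arrives.

```java
/**
 * Hypothetical helper: re-chunks UTF-16 buffers so that no chunk handed
 * to later processing ends in the middle of a surrogate pair.
 */
final class SurrogateSafeChunker {
    private char pendingHigh;        // high surrogate held back from the previous buffer
    private boolean hasPending;

    /** Returns the text from buf[start, start + length) that is safe to process now. */
    String accept(char[] buf, int start, int length) {
        StringBuilder sb = new StringBuilder(length + 1);
        if (hasPending) {
            // Prepend the code unit that was cut off at the end of the last buffer.
            sb.append(pendingHigh);
            hasPending = false;
        }
        sb.append(buf, start, length);
        int n = sb.length();
        if (n > 0 && Character.isHighSurrogate(sb.charAt(n - 1))) {
            // The matching low surrogate may be the first code unit of the next buffer.
            pendingHigh = sb.charAt(n - 1);
            hasPending = true;
            sb.setLength(n - 1);
        }
        return sb.toString();
    }

    /** Call at end of input; returns a held-back unpaired high surrogate, if any. */
    String finish() {
        if (!hasPending) {
            return "";
        }
        hasPending = false;
        return String.valueOf(pendingHigh);
    }
}
```

Each string returned by accept() can then be passed to a normalization check without a code point being split between two calls.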
It turned out that the most time-consuming part was not writing
the normalization checker but reworking how Ælfred2 deals with
US-ASCII, ISO-8859-1, UTF-8, UTF-16 and UTF-32. In HS Ælfred2,
all character encodings are now decoded using the java.nio.charset
framework.
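For readers unfamiliar with java.nio.charset, here is a minimal sketch of incremental decoding with a CharsetDecoder. It is not the HS Ælfred2 code; the class name ChunkedDecoder is made up for illustration, and the buffer handling is deliberately simple. The point it shows is that bytes of an incomplete multi-byte sequence have to be carried over to the next call.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

/** Hypothetical wrapper that decodes a byte stream delivered in arbitrary chunks. */
final class ChunkedDecoder {
    private final CharsetDecoder decoder;
    // Bytes left undecoded by the previous call (position..limit is the unread part).
    private ByteBuffer pending = ByteBuffer.allocate(0);

    ChunkedDecoder(String charsetName) {
        decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    /** Decodes one chunk; pass endOfInput = true for the last chunk. */
    String decode(byte[] chunk, boolean endOfInput) {
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.length);
        in.put(pending).put(chunk);
        in.flip();
        int capacity = (int) Math.ceil(in.remaining() * decoder.maxCharsPerByte()) + 4;
        CharBuffer out = CharBuffer.allocate(capacity);
        try {
            CoderResult result = decoder.decode(in, out, endOfInput);
            if (result.isError()) {
                result.throwException();
            }
            if (result.isOverflow()) {
                throw new IllegalStateException("output buffer too small");
            }
            if (endOfInput) {
                result = decoder.flush(out);
                if (result.isError()) {
                    result.throwException();
                }
            }
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        pending = in; // any incomplete trailing sequence stays here for the next call
        out.flip();
        return out.toString();
    }
}
```

With REPORT as the error action, malformed or truncated input surfaces as an exception rather than being silently replaced, which is what a conformance checker would presumably want.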
Requirements
The definition of Fully-normalized Text involves checking
normalization both before and after parsing. That is, the source text
is required to be NFC-normalized, and after parsing, each construct
parsed out of the source is required to be NFC-normalized and must
not start with a “composing character” (which is not exactly the
same as a “combining character”).
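As a rough illustration of the two kinds of check, here is a sketch that uses the JDK's java.text.Normalizer instead of ICU4J. The class and method names are hypothetical. Note the loud caveat: the second method tests for combining marks (general categories Mn, Mc and Me), which is only an approximation of Charmod's “composing character” and will disagree with it for some characters.

```java
import java.text.Normalizer;

/** Hypothetical helper; uses the JDK's Unicode data, not ICU4J's. */
final class NfcCheck {
    /** True if the text is already in Unicode Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    /**
     * Approximation only: reports whether the first code point is a
     * combining mark. Charmod's "composing character" is a different set.
     */
    static boolean startsWithCombiningMark(CharSequence text) {
        if (text.length() == 0) {
            return false;
        }
        int type = Character.getType(Character.codePointAt(text, 0));
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }
}
```

For example, the precomposed é (U+00E9) passes isNfc, while the sequence e followed by U+0301 does not.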
I don’t really like the way the definition involves peeking
underneath the parser, but it does have the benefit that if the
source is in NFC, you won’t accidentally break the document by
editing the source in an NFC-normalizing text editor.
Interpretation
Charmod Norm does not define what “constructs” are in the
context of XML 1.0 or HTML5.
However, XML 1.1 does define what “relevant
constructs” are, so that definition might be generalizable to
XML 1.0 and HTML5. Unfortunately, XML 1.1 defines relevant constructs
in terms of the grammar productions of XML itself instead of the
significant information items that an XML processor reports to the
application.
Personally, I think the XML 1.1 definition is neither practically
useful nor something for which I’d be motivated to write an
implementation. So for the purpose of prototyping, I made up a
definition on my own. Web Applications 1.0 just might get away with
making my definition normative for XHTML5 considering that XML 1.0
doesn’t have a definition.
I consider the SAX2 ContentHandler (excluding qNames) and DTDHandler
benchmarks of cluefulness when it comes to XML-related spec, API and
application design. In general, if your application isn’t an editor
that needs to reconstruct the XML source from parsed data, your
application is most likely broken if it needs to know something about
an XML document being parsed that is not exposed through those two
interfaces. On the other hand, a spec that can’t be conformed to by
viewing XML through only those two interfaces is broken. Moreover,
DTDHandler is about notations, which are pretty much obsolete, so
that leaves only ContentHandler.
This gives the following definition of constructs:
- Local names of elements
- Local names of attributes
- Attribute values
- Declared namespace prefixes
- Declared namespace URIs
- PI targets
- PI data
- Concatenations of consecutive character data between element boundaries and PIs ignoring comments and CDATA section boundaries.
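The list above maps fairly directly onto ContentHandler callbacks. The following is a sketch under assumptions, not the checker's code: ConstructCollector and its methods are hypothetical names, it uses the JDK's built-in SAX parser, and it checks NFC with java.text.Normalizer rather than ICU4J. Because ContentHandler never reports comments or CDATA section boundaries, concatenating characters() calls until the next element boundary or PI gives the last kind of construct for free.

```java
import java.io.StringReader;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that gathers the constructs listed above, in document order. */
final class ConstructCollector extends DefaultHandler {
    final List<String> constructs = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    /** Ends the current run of character data, if there is one. */
    private void flushText() {
        if (text.length() > 0) {
            constructs.add(text.toString());
            text.setLength(0);
        }
    }

    @Override public void startPrefixMapping(String prefix, String uri) {
        flushText(); // fires just before startElement, so this is an element boundary
        if (!prefix.isEmpty()) {
            constructs.add(prefix);
        }
        constructs.add(uri);
    }

    @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        constructs.add(localName);
        for (int i = 0; i < atts.getLength(); i++) {
            constructs.add(atts.getLocalName(i));
            constructs.add(atts.getValue(i));
        }
    }

    @Override public void endElement(String uri, String localName, String qName) {
        flushText();
    }

    @Override public void processingInstruction(String target, String data) {
        flushText();
        constructs.add(target);
        if (data != null && !data.isEmpty()) {
            constructs.add(data);
        }
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // comments and CDATA markers never show up here
    }

    @Override public void endDocument() {
        flushText();
    }

    /** Parses the given XML text and returns its constructs. */
    static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ConstructCollector handler = new ConstructCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.constructs;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the constructs that are not in NFC. */
    static List<String> nonNfcConstructs(String xml) {
        List<String> bad = new ArrayList<>();
        for (String construct : collect(xml)) {
            if (!Normalizer.isNormalized(construct, Normalizer.Form.NFC)) {
                bad.add(construct);
            }
        }
        return bad;
    }
}
```

One consequence of this definition is that an e followed by a comment and then U+0301 counts as a single non-NFC construct, since the comment does not break the character data run.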
Implementation
There is a new pseudo-schema called
http://hsivonen.iki.fi/checkers/nfc/. It is enabled for
all (X)HTML presets. When this pseudo-schema is in use or when the
schema selection is in the automatic mode, the normalization checking
of source text underneath the parser is enabled as well.
The following checks are made:
- Whether the source text is in Unicode Normalization Form C.
- Whether each construct is in Unicode Normalization Form C.
- Whether the first character of a construct is a composing character.
The version of Unicode that is used is 5.0.0.
The column and line numbers reported on errors are very inaccurate
due to buffering.
I have not tested whether all the character encoding decoders that
I have installed are normalizing transcoders. If you have
Windows-1258 Vietnamese test cases, please try them out and let me
know what happens. Also, please let me know if the issue applies to
something other than legacy Vietnamese encodings.
As usual, the new code is enabled for testing.
Please let me know if it doesn’t
work as described.
Wednesday, November 15th, 2006
One of the predominant issues raised in the recent call for comments concerned
the whole HTML vs. XHTML debate, with many people suggesting that we should
not be extending HTML, but rather focussing on XHTML only.
However, what many people have failed to realise is that HTML5 resolves this issue: the (X)HTML5 specification
is in fact specifying extensions to both HTML and XHTML simultaneously, and
the choice of using either is no longer dependent upon the DOCTYPE.
Many authors use an XHTML 1.0 DOCTYPE and then proceed
to claim they’re using XHTML, but browsers make the decision whether to
treat a document as HTML or XHTML based on the MIME type. (X)HTML5 endorses
dispatching on MIME type: If the document is served as text/html,
it gets parsed as HTML; but if it is served with an XML MIME type, like
application/xml or application/xhtml+xml,
it gets parsed as XHTML.
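The dispatch rule can be sketched in a few lines. This is only an illustration, not how any particular browser does it: the ParserDispatch class and its Mode enum are hypothetical, and treating text/xml and any type ending in +xml as XML is an assumption that goes beyond the two XML types named above.

```java
import java.util.Locale;

/** Hypothetical illustration of choosing a parser from the Content-Type value. */
final class ParserDispatch {
    enum Mode { HTML, XML, UNKNOWN }

    static Mode modeFor(String contentType) {
        if (contentType == null) {
            return Mode.UNKNOWN;
        }
        // Drop parameters such as "; charset=utf-8" and normalize case.
        int semicolon = contentType.indexOf(';');
        String type = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (type.equals("text/html")) {
            return Mode.HTML;
        }
        // Assumption: text/xml and other +xml types are handled like application/xml.
        if (type.equals("application/xml") || type.equals("text/xml") || type.endsWith("+xml")) {
            return Mode.XML;
        }
        return Mode.UNKNOWN;
    }
}
```

Note that the DOCTYPE never enters into the decision: a document with an XHTML 1.0 DOCTYPE served as text/html still goes to the HTML parser.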
Document Serializations
HTML5 introduces the concept of serialisations for an HTML document. A serialisation
in this context refers to the physical representation of the essence of the
document—the document tree. (X)HTML5 requires user agents that support scripting
to expose the document tree to the script using the DOM API and data model.
HTML5 uses the HTML serialisation and XHTML5 uses the XML serialisation.
Because of this, the distinction between an HTML and XHTML document is reduced.
In most cases, either serialisation can be used to represent exactly
the same document. The main differences are that the HTML serialisation, for
backwards-compatibility reasons, cannot represent structured inline-level elements
(e.g. <ol>, <ul>, etc.) as children of the <p> element, and the XML serialisation cannot
represent all possible document trees that may be created as a result of error
recovery in the HTML parsing algorithm. Also, in browsers, some scripting API features and
CSS layout details work differently depending on the serialisation of the document due to
backwards compatibility considerations.
The XML serialisation used by XHTML must be a well-formed XML 1.0 document.
The HTML serialisation, unlike previous versions of HTML, is no longer
considered an application of SGML, but instead defines its own syntax. While
the syntax is inspired by SGML, it is being defined in a way that more closely
resembles the way browsers actually handle HTML in the real world, particularly
with regard to error handling.
The HTML5 serialisation and the accompanying parsing algorithm are needed for three reasons:
- The browser that currently holds the majority market share doesn’t support XHTML (that is actually served and processed as XHTML).
- The legacy text/html content out there needs a well-defined parsing algorithm, something that SGML-based HTML specifications haven’t been able to provide.
- There are content management systems, Web applications and workflows that are not based on XML tools and cannot produce well-formed output reliably. These systems can benefit from new features even though they wouldn’t work reliably with the XHTML serialisation.
On the other hand, thanks to the HTML/XHTML duality, new systems can be built on solid off-the-shelf XML tools internally and convert to and from the HTML5 serialisation at input/output boundaries. Once the installed base of browsers supports application/xhtml+xml properly, these systems can swap the output serialiser and start using XHTML-only features such as lists inside paragraphs.
The New DOCTYPE
In practice, the DOCTYPE serves two purposes: DTD-based validation
and (for HTML only) DOCTYPE sniffing. Since HTML is no longer considered
an application of SGML and because there are many limitations with DTD-based
validation, there will not be any official DTDs for (X)HTML5.
As a result, in the HTML serialisation, the only purpose for even having a
DOCTYPE is to trigger standards mode in browsers. Thus, because it doesn’t
need to refer to a DTD at all, the DOCTYPE is simply this:
<!DOCTYPE html>
I’m sure you would agree that that is about as simple and easy to remember
as possible. But, for XHTML, it’s even simpler. There isn’t one! Since
browsers have not introduced (and will not introduce) DOCTYPE sniffing for
XML, there is little need for a DOCTYPE.
However, I should point
out that there is one other minor practical issue with DTDlessness in
XML: entity references that are declared in the XHTML 1.0 DTD
cannot be used. Since browsers don’t use validating parsers
and do not read DTDs from the network anyway, the use of entity references
is not recommended. Instead, it is recommended to use character references
or a good character encoding (UTF-8) that supports the characters natively.
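For output that cannot be UTF-8, a serialiser can emit numeric character references instead of named entities. The sketch below is a hypothetical helper (CharRefs.toNumericRefs is not from any library) that rewrites every non-ASCII code point as a hexadecimal character reference. It assumes the input is text whose markup-significant characters (&, <, >) have already been escaped.

```java
import java.util.Locale;

/** Hypothetical helper: replaces non-ASCII characters with numeric character references. */
final class CharRefs {
    static String toNumericRefs(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Iterate by code point so that surrogate pairs become a single reference.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            }
        }
        return sb.toString();
    }
}
```

A no-break space (U+00A0), for instance, comes out as a hexadecimal reference that works in both serialisations without any DTD.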
Conformance Checking
You’re no doubt wondering how, if there are no DTDs, one will go about validating
markup. Well, that’s simple. There are in fact other, more robust methods
available for checking document conformance. There are several different schema
languages that can be used, including RELAX NG and Schematron. However, even
they cannot fully express the machine-checkable conformance requirements of
(X)HTML5.
Henri Sivonen is in the process of developing a conformance checker for HTML5,
which is being designed to report much more useful error messages than
are possible with a DTD-based approach alone. For example, the table
integrity checker discussed previously is one feature that is impossible
to implement using DTDs.