Archive for November, 2006
Wednesday, November 15th, 2006
Web Forms 2.0 requires documents to conform to Charmod.
The current Web Applications 1.0 draft does not mention Charmod, but since
(X)HTML5 includes both Web Applications 1.0 and Web Forms 2.0, my
working assumption is that (X)HTML5 documents are required to conform
to Charmod.
It turns out that the best opportunity for checking whether a
document conforms to Charmod is in the parser. Hence, I added the
checks to my special-purpose HTML parser and to HS Ælfred2—my
fork of GNU Ælfred2.
Charmod says:
NOTE: RFC 2119 makes it clear that requirements that use
SHOULD are not optional and must be complied with unless there are
specific reasons not to: “This word, or the adjective
‘RECOMMENDED’, mean that there may exist valid reasons in
particular circumstances to ignore a particular item, but the full
implications must be understood and carefully weighed before choosing
a different course.”
Further, Charmod says: “A specification conforms to this
document if it […] documents the reason for any deviation from
criteria where the imperative is SHOULD, SHOULD NOT, or
RECOMMENDED […]”. I have an implementation, not a specification, but I’m
documenting my decisions not to enforce some SHOULDs anyway.
Here’s how I have addressed the requirements of Charmod that
apply to content (marked as [C] in Charmod). Disclaimer: The
implementation decisions I have taken with prototype software are not
endorsed by the WHAT WG or anyone else.
C001 |
Specifications, software and content MUST NOT require or depend
on a one-to-one correspondence between characters and the sounds
of a language.
This requirement is not machine-checkable and, hence, is
not enforced by the software.
|
C002 |
Specifications, software and content MUST NOT require or depend
on a one-to-one mapping between characters and units of displayed
text.
This requirement is not machine-checkable and, hence, is
not enforced by the software.
|
C003 |
Protocols, data formats and APIs MUST store, interchange or
process text data in logical order.
HTML5 as a data format uses logical order. It is not
practical to try to figure out in software if the author is trying
to subvert the nature of the format on this point. Currently, the
software doesn’t enforce this at all. However, it might be
useful to catch encoding labels that are used for visual Hebrew or
Arabic.
|
C013 |
Textual data objects defined by protocol or format
specifications MUST be in a single character encoding.
A single character encoding decoder is instantiated per
HTTP resource. Encoding violations are treated as fatal. However,
some mixed encodings are not caught by this and need human
judgment. For example, software can’t tell if ISO-8859-1 and
ISO-8859-2 bytes are mixed in one HTTP resource.
|
C022 |
Character encodings that are not in the IANA registry SHOULD
NOT be used, except by private agreement.
An error is reported.
|
C023 |
If an unregistered character encoding is used, the convention
of using 'x-' at the beginning of the name MUST be followed.
An error is reported.
|
C049 |
The character encoding of content SHOULD be chosen so that it
maximizes the opportunity to directly represent characters (ie.
minimizes the need to represent characters by markup
means such as character
escapes) while avoiding obscure encodings that are unlikely to
be understood by recipients.
UTF-8 maximizes the opportunity to directly represent characters. A warning
is issued if the document uses an encoding that is not supported
“everywhere”. For XHTML5 the non-obscure encodings are
US-ASCII, ISO-8859-1, UTF-8 and UTF-16. For HTML5, the non-obscure
encodings are currently the intersection of IANA-registered
encodings supported by Sun JDK 1.4.2_8 and Python 2.4.3. (The
service supports a wider set of encodings.) The range of characters
used in the document is not analyzed, because I think it wouldn’t
be a useful way to spend my time, considering that using UTF-8 always
satisfies this requirement.
|
C034 |
If facilities are offered for identifying character encoding,
content MUST make use of them; where the facilities offered for
character encoding identification include defaults (e.g. in XML
1.0 [XML 1.0]),
relying on such defaults is sufficient to satisfy this
identification requirement.
An error is reported if an HTML5 document does not have an
explicit character encoding declaration (either internal or
external).
|
C024 |
Content and software that label text data MUST use one of the
names required by the appropriate specification (e.g. the XML
specification when editing XML text) and SHOULD use the MIME
preferred name of a character encoding to label data in that
character encoding.
An error is reported if an encoding label is not the MIME
preferred name.
|
C025 |
An IANA-registered charset name MUST NOT be used
to label text data in a character encoding other than the one
identified in the IANA registration of that name.
Encoding violations are treated as fatal. However, this
doesn’t catch cases where the document byte sequence is legal in
the declared encoding. For example, ISO-8859-2 labeled as
ISO-8859-1 is not conclusively machine-detectable.
|
C073 |
Publicly interchanged content SHOULD NOT use codepoints in the
private use area.
Charmod does allow the use of the private use area for scripts
that have not yet been encoded. Since human judgment is needed,
the software only emits a warning. Moreover, C040
forbids disallowing the use of the PUA.
|
C076 |
Content MUST NOT use a code point for any purpose other than
that defined by its coded character set.
This requirement is not machine-checkable and, hence, is
not enforced by the software.
|
C047 |
Escapes SHOULD only be used when the characters to be expressed
are not directly representable in the format or the character
encoding of the document, or when the visual representation of the
character is unclear.
This requirement is not enforced—not even as a warning.
Using the five pre-defined entities in XML, using the HTML5
entities from the specification or using numeric character
references is harmless when it comes to the parsed document tree.
Enforcing this requirement would mean proclaiming a prevalent
authoring practice non-conforming on the grounds of the aesthetics
of view source. Moreover, Charmod doesn’t give a solid
machine-checkable definition for characters whose visual
representation is unclear.
|
C048 |
Content SHOULD use the hexadecimal form of character escapes
rather than the decimal form when there are both.
This requirement is not enforced—not even as a warning.
Using the five pre-defined entities in XML, using the HTML5
entities from the specification or using numeric character
references is harmless when it comes to the parsed document tree.
Enforcing this requirement would mean proclaiming a prevalent
authoring practice non-conforming on the grounds of the aesthetics
of view source.
|
C054 |
Users of specifications (software developers, content
developers) SHOULD whenever possible prefer ways other than string
indexing to identify substrings or point within a string.
This requirement is not machine-checkable to the extent it
might apply to the (X)HTML5 layer and, hence, is not enforced by
the software.
|
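To make two of the machine-checkable rows above more concrete (C013/C025 and C073), here is a minimal Python sketch. It is not the checker's actual code, and the names decode_strict, PUA_RANGES and private_use_positions are made up for illustration. It shows a single strict decoder that treats encoding violations as fatal, why mislabeled ISO-8859-2 cannot be caught that way, and a scan for private use code points that would only justify a warning.

```python
# Hypothetical helper names; a sketch, not the checker's implementation.

def decode_strict(data, label):
    """Decode bytes with one decoder; a violation raises UnicodeDecodeError."""
    return data.decode(label, errors="strict")


# Private use ranges in Unicode: the BMP PUA plus planes 15 and 16.
PUA_RANGES = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))


def private_use_positions(text):
    """Return the indexes of characters that are private use code points."""
    return [
        i
        for i, ch in enumerate(text)
        if any(lo <= ord(ch) <= hi for lo, hi in PUA_RANGES)
    ]


if __name__ == "__main__":
    # A byte that is illegal in UTF-8 is a fatal error.
    try:
        decode_strict(b"\xff", "utf-8")
    except UnicodeDecodeError as exc:
        print("fatal:", exc)

    # ISO-8859-2 bytes labeled as ISO-8859-1 decode without any error,
    # so software alone cannot tell that the label is wrong.
    mislabeled = "Łódź".encode("iso-8859-2")
    print(decode_strict(mislabeled, "iso-8859-1"))

    # PUA use needs human judgment, so a checker would only warn here.
    print(private_use_positions("a\ue000b"))
```

Every byte sequence is legal in ISO-8859-1, which is why a strict decoder never complains about that label.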
In the spirit of perpetual beta, the new code is enabled for all
(X)HTML presets in the generic
UI. Please let me know if it doesn’t work as described.
Posted in Conformance Checking | Comments Off on Charmod Checking
Wednesday, November 15th, 2006
One of the predominant issues raised in the recent call for comments concerned
the whole HTML vs. XHTML debate, with many people suggesting that we should
not be extending HTML, but rather focussing on XHTML only.
However, what many people have failed to realise is that HTML5 resolves this issue: The (X)HTML5 specification
is in fact specifying extensions to both HTML and XHTML simultaneously, and
the choice of using either is no longer dependent upon the DOCTYPE.
Many authors use an XHTML 1.0 DOCTYPE and then proceed
to claim they’re using XHTML, but browsers make the decision whether to
treat a document as HTML or XHTML based on the MIME type. (X)HTML5 endorses
dispatching on MIME type: If the document is served as text/html,
it gets parsed as HTML; but if it is served with an XML MIME type, like
application/xml or application/xhtml+xml, it gets parsed as XHTML.
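The dispatch rule can be sketched in a few lines of Python. This is only an illustration: the function name serialisation_for is made up, and treating text/xml and any type ending in +xml as XML MIME types is an assumption based on common convention, not something stated above.

```python
def serialisation_for(content_type):
    """Return 'html' or 'xml' for a Content-Type value, or None for neither.

    Hypothetical helper: the DOCTYPE plays no part in the decision.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    # Assumption: text/xml and "+xml" suffixed types also count as XML.
    if mime in ("application/xml", "text/xml") or mime.endswith("+xml"):
        return "xml"
    return None


if __name__ == "__main__":
    for value in ("text/html; charset=utf-8", "application/xhtml+xml", "image/png"):
        print(value, "->", serialisation_for(value))
```

A document with an XHTML 1.0 DOCTYPE served as text/html therefore still takes the HTML path.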
Document Serializations
HTML5 introduces the concept of serialisations for an HTML document. A serialisation
in this context refers to the physical representation of the essence of the
document—the document tree. (X)HTML5 requires user agents that support scripting
to expose the document tree to the script using the DOM API and data model.
HTML5 uses the HTML serialisation and XHTML5 uses the XML serialisation.
Because of this, the distinction between an HTML and XHTML document is reduced.
In most cases, either serialisation can be used to represent exactly
the same document. The main differences are that the HTML serialisation, for
backwards-compatibility reasons, cannot represent structured inline-level elements (e.g. <ol>, <ul>, etc.)
as children of the <p> element, and that the XML serialisation cannot
represent all possible document trees that may be created as a result of error
recovery in the HTML parsing algorithm. Also, in browsers, some scripting API features and
CSS layout details work differently depending on the serialisation of the document due to
backwards compatibility considerations.
The XML serialisation used by XHTML must be a well-formed XML 1.0 document.
However, unlike previous versions of HTML, the HTML serialisation is no longer
considered an application of SGML, but instead defines its own syntax. While
the syntax is inspired by SGML, it is being defined in a way that more closely
resembles the way browsers actually handle HTML in the real world, particularly
with regard to error handling.
The HTML5 serialisation and the accompanying parsing algorithm are needed for three reasons:
- The browser that currently holds the majority market share doesn’t support XHTML (that is actually served and processed as XHTML).
- The legacy text/html content out there needs a well-defined parsing algorithm, something that SGML-based HTML specifications haven’t been able to provide.
- There are content management systems, Web applications and workflows that are not based on XML tools and cannot produce well-formed output reliably. These systems can benefit from new features even though they wouldn’t work reliably with the XHTML serialisation.
On the other hand, thanks to the HTML/XHTML duality, new systems can be built on solid off-the-shelf XML tools internally and convert to and from the HTML5 serialisation at input/output boundaries. Once the installed base of browsers supports application/xhtml+xml properly, these systems can swap the output serialiser and start using XHTML-only features such as lists inside paragraphs.
The New DOCTYPE
In practice, the DOCTYPE serves two purposes: DTD-based validation
and (for HTML only) DOCTYPE sniffing. Since HTML is no longer considered
an application of SGML and because there are many limitations with DTD-based
validation, there will not be any official DTDs for (X)HTML5.
As a result, in the HTML serialisation, the only purpose for even having a
DOCTYPE is to trigger standards mode in browsers. Thus, because it doesn’t
need to refer to a DTD at all, the DOCTYPE is simply this:
<!DOCTYPE html>
I’m sure you would agree that that is about as simple and easy to remember
as possible. But, for XHTML, it’s even simpler. There isn’t one! Since
browsers have not introduced (and will not introduce) DOCTYPE sniffing for
XML, there is little need for a DOCTYPE.
However, I should point
out that there is one other minor practical issue with DTDlessness in
XML. Entity references that are declared in the XHTML 1.0 DTD cannot
be used. Since browsers don't use validating parsers
and do not read DTDs from the network anyway, the use of entity references
is not recommended in any case. Instead, it is recommended to use character references
or a good character encoding (UTF-8) that supports the characters natively.
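The effect is easy to reproduce with any non-validating XML parser. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in for a browser's XML parser (an assumption for illustration, since browsers use their own parsers), and the helper name parse_or_error is made up.

```python
import xml.etree.ElementTree as ET


def parse_or_error(markup):
    """Return the root element's text, or an 'error: ...' string on failure."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Without a DTD, &nbsp; is an undefined entity and parsing fails.
    print(parse_or_error("<p>a&nbsp;b</p>"))
    # A numeric character reference needs no DTD.
    print(repr(parse_or_error("<p>a&#xA0;b</p>")))
    # The five entities predefined by XML itself keep working.
    print(parse_or_error("<p>a&amp;b</p>"))
```

Writing the character directly in a UTF-8 encoded document gives the same tree as the character reference.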
Conformance Checking
You’re no doubt wondering: if there are no DTDs, how will one go about validating
their markup? Well, that’s simple. There are in fact other, more robust methods
available for checking document conformance. There are several different schema
languages that can be used, including RELAX NG and Schematron. However, even
they cannot fully express the machine-checkable conformance requirements of
(X)HTML5.
Henri Sivonen is in the process of developing a conformance checker for HTML5,
which is being designed to report error messages that are much more useful than
those possible with a DTD-based approach alone. For example, the table
integrity checker discussed previously is one feature that is impossible
to implement using DTDs.
Posted in Conformance Checking, Syntax | 6 Comments »
Tuesday, November 14th, 2006
I am working on a conformance checking service for (X)HTML5. The service is grammar-based for the most part with RELAX NG as the schema language. Some extra-grammatical constraints are expressed as Schematron assertions. Currently, as a Mozilla Foundation grantee, I am working on writing checkers (in Java) for spec features that cannot (practically or at all) be checked using RELAX NG or Schematron.
In a Web two-point-ohey perpetual beta fashion, I am deploying the new prototype features early to allow testing.
The first non-schema checker prototype is a table integrity checker. Since the table model for (X)HTML5 is now being specified, the prototype is speculatively based on the HTML 4.01 table model and browser behavior. The differences from HTML 4.01 are that colspan='0' is treated as colspan='1' and that headers must refer to th cells. The top left corner of each cell is placed in the first available slot on the row, which is browser-compatible but different from what the CSS2 spec says.
The checker emits both warnings and errors. Depending on how the spec turns out, errors may become warnings or vice versa.
Currently, the errors are:
- Table cell is overlapped by later table cell.
- Table cell overlaps an earlier table cell. (Single overlap gets reported in both directions to show source location for both cells.)
- Table cell spans past the end of its row group.
- Row has no cells starting on it.
- Table row column count is greater than the column count established by cols/colgroups.
- Table row column count is less than the column count established by cols/colgroups.
- The headers attribute doesn’t point to th elements in the same table.
- Column has no cells starting on it. (Contiguous cell ranges established by a single element are coalesced to a single error to protect against denial of service attacks.)
Currently, the warnings are:
- colspan exceeds 1000, which is a magic number in Gecko (and according to comments in Gecko source, in IE and Opera, too)
- rowspan exceeds 8190, which is a magic number in Gecko
- Table row column count is greater than the column count established by the first row in the absence of cols/colgroups.
- Table row column count is less than the column count established by the first row in the absence of cols/colgroups.
- A col element causes a span attribute to be ignored on the parent colgroup. (Conforming in HTML 4 / XHTML 1.0; non-conforming in (X)HTML5. With (X)HTML5 there’s also a schema-level error.)
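The slot allocation and a few of the errors listed above can be sketched as follows. This is a simplified Python illustration, not the checker's Java code: the function name check_table and the input shape (a list of rows, each a list of (colspan, rowspan) pairs) are assumptions, the whole table is treated as a single row group, and an overlap is reported in one direction only.

```python
def check_table(rows):
    """Return error strings for a table given as rows of (colspan, rowspan)."""
    errors = []
    occupied = {}  # (row, col) -> (row, index) of the cell owning that slot
    for r, row in enumerate(rows):
        if not row:
            errors.append(f"row {r}: no cells start on this row")
        col = 0
        for i, (colspan, rowspan) in enumerate(row):
            # The top left corner goes in the first available slot on the row.
            while (r, col) in occupied:
                col += 1
            if r + rowspan > len(rows):
                errors.append(
                    f"row {r} cell {i}: spans past the end of its row group"
                )
            clashes = set()
            for dr in range(rowspan):
                for dc in range(colspan):
                    slot = (r + dr, col + dc)
                    if slot in occupied:
                        clashes.add(occupied[slot])
                    else:
                        occupied[slot] = (r, i)
            for prev_row, prev_index in sorted(clashes):
                errors.append(
                    f"row {r} cell {i}: overlaps earlier cell "
                    f"{prev_index} from row {prev_row}"
                )
            col += colspan
    return errors


if __name__ == "__main__":
    # The second cell of row 0 spans two rows; the colspan=2 cell on row 1
    # starts in the first free slot (column 0) and runs into it.
    for message in check_table([[(1, 1), (1, 2)], [(2, 1)]]):
        print(message)
```

The column-count errors and all of the warnings are left out to keep the sketch short.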
The table integrity checker only sees a projection of the document tree that contains nothing but table-significant elements; crazy subtrees of table-significant elements in wrong places are silently pruned. These are dealt with on the RELAX NG level. The table integrity checker assumes that it is being used together with a reasonable schema.
The table integrity checker is also enabled for the HTML 4.01 / XHTML 1.0 presets on the generic side of the service, so testing with today’s content is possible.
There’s a pseudo-schema called http://hsivonen.iki.fi/checkers/table/ which isn’t a schema but a magic URL that causes the system to instantiate the table integrity checker. There’s a pseudo-pseudo-schema called http://hsivonen.iki.fi/checkers/all/ which expands to all pseudo-schemas, but at the moment, there’s only one.
Please let me know if the table integrity checker does not work as advertised.
Posted in Conformance Checking, Processing Model | Comments Off on Table Integrity Checker
Monday, November 13th, 2006
This is the Web Hypertext Application Technology Working Group community blog, run by some of the more active members of the community. The aim of this blog is to keep the public informed about the development of (X)HTML 5, the WHATWG, and related topics, and get feedback from those who do not wish to participate directly in the mailing list.
This follows the success of the recent campaign to gather feedback from the wider community, via the announcements that were cross-posted on The Web Standards Project, Lachy’s Log, Molly.com and 456 Berea Street. Work is currently underway to develop an FAQ based on that feedback.
As always, any questions or comments are welcome. We’ll do our best to listen to and respond whenever necessary.
Posted in WHATWG | 1 Comment »