I have added checking of the textContent of the progress and t elements to the conformance checking service technology preview. The textContent is checked if progress does not have an attribute called value or if t lacks the corresponding attribute. I took liberties with date formats. Also, I am assuming that it is an error if the algorithm for finding a ratio fails.
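As a loose illustration of what "finding a ratio" in textContent means — this is my own sketch, not the algorithm the spec defines, and the class and method names are made up — consider:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RatioSketch {
    /** Returns a ratio in [0, 1], or -1 if no ratio can be found (an error). */
    public static double findRatio(String textContent) {
        // Accept "n%" and "n/m" forms anywhere in the text; anything else fails.
        Matcher m = Pattern
                .compile("(\\d+(?:\\.\\d+)?)\\s*(%|/\\s*(\\d+(?:\\.\\d+)?))?")
                .matcher(textContent);
        if (!m.find()) {
            return -1; // the algorithm failed: treated as a conformance error
        }
        double value = Double.parseDouble(m.group(1));
        if ("%".equals(m.group(2))) {
            return value / 100.0;
        }
        if (m.group(3) != null) {
            return value / Double.parseDouble(m.group(3));
        }
        return -1; // a bare number with no denominator: no ratio here either
    }
}
```

For example, `findRatio("3/4")` and `findRatio("75%")` both yield 0.75, while text without numbers fails and would be reported as an error.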
I have also made the error messages prettier. Additionally, there is now a pseudo-schema called
http://hsivonen.iki.fi/checkers/debug/ which dumps the parse events as warnings. (It makes the most sense when used as the first schema URI on the list of schema URIs.)
Recently, I was informed that the XHTML5 facet of the (X)HTML5-specific interface was crashing due to a null dereference. When I fixed it, I somehow managed to disconnect the back end from the parser for the HTML5 facet. That one is now fixed, too.
On the generic side, the XSLT schema kept crashing the engine. The problem was infinite recursion in the JDK regular expression engine that caused the runtime stack to overflow. I have now changed the system so that regular expressions in the XSD datatype library for RELAX NG are backed by the Xerces 2 regular expression engine rather than the JDK regular expression engine.
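Swapping engines is straightforward when the datatype library reaches regular expressions through a seam of its own. The sketch below is under assumed names (CompiledRegex and compileWithJdk are mine, not the real code's); in the actual change, an implementation backed by Xerces 2's org.apache.xerces.impl.xpath.regex.RegularExpression would stand behind the same seam in place of the JDK-backed one shown here.

```java
import java.util.regex.Pattern;

public class RegexSeam {
    /** The one interface the XSD datatype code would depend on. */
    public interface CompiledRegex {
        boolean matches(String input);
    }

    /** JDK-backed implementation — the one whose recursion can overflow the stack. */
    public static CompiledRegex compileWithJdk(String regex) {
        final Pattern p = Pattern.compile(regex);
        return new CompiledRegex() {
            public boolean matches(String input) {
                return p.matcher(input).matches();
            }
        };
    }
}
```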
The nice thing about a managed runtime is that stack overflows and null dereferences don’t bring the whole app down. In fact, they don’t even crash the thread; the front end can still show an error to the user. The problem is that previously the errors were logged to a file that I didn’t read until someone reported a problem and most of the time people don’t report the problems when they are told that the error was logged. Now the system sends me the stack trace by email if the back end crashes. (And I have fixed all known crashers in advertised features.)
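To illustrate the point about managed runtimes, here is a minimal sketch (class and method names are made up, and the real service emails the stack trace rather than returning a string) of catching a back-end crash, StackOverflowError included, at the boundary between front end and back end:

```java
public class CrashReporter {
    /** Runs the back end; a crash becomes a report instead of a dead process. */
    public static String runReportingCrashes(Runnable backEnd) {
        try {
            backEnd.run();
            return "ok";
        } catch (Throwable t) {
            // Throwable also covers StackOverflowError and NullPointerException,
            // so neither brings down the thread, let alone the whole app.
            // The real service would email the stack trace from here; this
            // sketch just returns the crash's class name.
            return t.getClass().getSimpleName();
        }
    }

    public static void recurseForever() {
        recurseForever(); // overflows the stack; Java has no tail-call elimination
    }
}
```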
I have also polished the Jing error messages a bit.
Charmod Norm is
still in the Working Draft state, but if it were to become a
normative part of (X)HTML5, it would belong to the area of the
conformance checking service that I am working on now, so I
prototyped Charmod Norm enforcement as well.
The checker outsources most work to ICU4J.
Most complexity in my code is due to trying to avoid buffering as
much as possible while still using the ICU4J API unmodified—and due
to dealing with the halves of a surrogate pair falling into different
UTF-16 code unit buffers. On the spec reading front, I couldn’t map
“the second character in the canonical decomposition mapping of
some character that is not listed in the Composition Exclusion Table
defined in [UTR #15]” to the ICU4J API on my own. Fortunately, I
got excellent help on the icu-support mailing list.
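The surrogate-pair problem is easy to state but fiddly to handle: the two UTF-16 code units of a supplementary character can land in different buffers, so a checker has to carry a pending high surrogate from one buffer to the next. A self-contained sketch of that bookkeeping (my own illustration, not the service's code):

```java
public class SurrogateJoiner {
    private int pendingHigh = -1;
    private final StringBuilder out = new StringBuilder();

    /** Feed one buffer of UTF-16 code units; code points may straddle buffers. */
    public void feed(char[] buf, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = buf[i];
            if (pendingHigh != -1) {
                char high = (char) pendingHigh;
                pendingHigh = -1;
                if (Character.isLowSurrogate(c)) {
                    out.appendCodePoint(Character.toCodePoint(high, c));
                    continue;
                }
                out.append('\uFFFD'); // unpaired high surrogate: a checker would report this
            }
            if (Character.isHighSurrogate(c)) {
                pendingHigh = c; // its low half may arrive in the next buffer
            } else if (Character.isLowSurrogate(c)) {
                out.append('\uFFFD'); // unpaired low surrogate
            } else {
                out.append(c);
            }
        }
    }

    public String result() {
        return out.toString();
    }
}
```

Feeding "a" plus a high surrogate in one buffer and the matching low surrogate plus "b" in the next reassembles the original text intact.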
It turned out that the most time-consuming part was not writing
the normalization checker but reworking how Ælfred2 deals with
US-ASCII, ISO-8859-1, UTF-8, UTF-16 and UTF-32. In HS Ælfred2,
all character encodings are now decoded using the java.nio.charset API.
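As a rough sketch of what decoding through java.nio.charset looks like when malformed input must be treated as an error rather than silently replaced — the names here are mine, not Ælfred2's:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecoder {
    /** Decodes bytes strictly; returns null on malformed or unmappable input. */
    public static String decode(byte[] bytes, String encoding) {
        CharsetDecoder dec = Charset.forName(encoding)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // error out instead of
                .onUnmappableCharacter(CodingErrorAction.REPORT); // emitting U+FFFD quietly
        try {
            return dec.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null; // a conformance checker would report this to the user
        }
    }
}
```

Decoding the valid UTF-8 sequence C3 A9 yields "é", while a truncated C3 on its own is rejected rather than replaced.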
The definition for Fully-normalized
Text involves checking normalization before and after parsing.
That is, the source text is required to be NFC-normalized and after
parsing the constructs parsed out of the source are required to be
NFC-normalized and are required not to start with a “composing
character” (which is not exactly the same as a “combining character”).
I don’t really like the way the definition involves peeking
underneath the parser, but it does have the benefit that if the
source is in NFC, you won’t accidentally break the document by
editing the source in an NFC-normalizing text editor.
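As a sketch of the two checks, here using the JDK's java.text.Normalizer in place of ICU4J, and approximating “composing character” with a leading combining mark even though, as noted above, the two notions are not the same — the real check needs normalization data of the kind ICU4J exposes:

```java
import java.text.Normalizer;

public class NfcCheck {
    /** Is the text in Unicode Normalization Form C? */
    public static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    /**
     * Rough stand-in for the "starts with a composing character" check:
     * treats any leading combining mark (general categories Mn, Mc, Me)
     * as suspect. This is only an approximation of what Charmod Norm means.
     */
    public static boolean startsWithCombiningMark(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int type = Character.getType(s.codePointAt(0));
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }
}
```

For example, "café" with a precomposed é passes the NFC check, while the same word with e followed by U+0301 fails it; a string that begins with U+0301 trips the leading-mark check.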
Charmod Norm does not define what “constructs” are in the
context of XML 1.0 or HTML5.
However, XML 1.1 does define what “relevant
constructs” are, so that definition might be generalizable to
XML 1.0 and HTML5. Unfortunately, XML 1.1 defines relevant constructs
in terms of the grammar productions of XML itself instead of the
significant information items that an XML processor reports to the application.
Personally, I think the XML 1.1 definition is neither practically
useful nor something for which I’d be motivated to write an
implementation. So for the purpose of prototyping, I made up a
definition on my own. Web Applications 1.0 just might get away with
making my definition normative for XHTML5 considering that XML 1.0
doesn’t have a definition.
I consider the SAX2 ContentHandler and DTDHandler interfaces
benchmarks of cluefulness when it comes to XML-related spec, API and
application design. In general, if your application isn’t an editor
that needs to reconstruct the XML source from parsed data, your
application is most likely broken if it needs to know something about
an XML document being parsed that is not exposed through those two
interfaces. On the other hand, a spec that can’t be conformed to by
viewing XML through only those two interfaces is broken. Moreover,
DTDHandler is about notations, which are pretty much
obsolete, so that leaves only ContentHandler.
This gives the following definition of constructs:

- Local names of elements
- Local names of attributes
- Declared namespace prefixes
- Declared namespace URIs
- Concatenations of consecutive character data between element boundaries and PIs, ignoring comments and CDATA section boundaries.
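Under this made-up definition, the constructs map directly onto SAX2 ContentHandler callbacks. A sketch of a handler that collects them — my own illustration, not the service's code — flushing buffered character data at element boundaries and PIs but not at comments:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ConstructCollector extends DefaultHandler {
    private final List<String> constructs = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    private void flushText() {
        if (text.length() > 0) {
            constructs.add(text.toString()); // one concatenated character-data construct
            text.setLength(0);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText(); // an element boundary ends a character-data construct
        constructs.add(localName);
        for (int i = 0; i < atts.getLength(); i++) {
            constructs.add(atts.getLocalName(i));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        constructs.add(prefix); // declared namespace prefix
        constructs.add(uri);    // declared namespace URI
    }

    @Override
    public void processingInstruction(String target, String data) {
        flushText(); // a PI also ends a character-data construct
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // comments and CDATA boundaries don't flush
    }

    public static List<String> collect(String xml) {
        try {
            SAXParserFactory f = SAXParserFactory.newInstance();
            f.setNamespaceAware(true);
            ConstructCollector handler = new ConstructCollector();
            f.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.constructs;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Notably, character data interrupted only by a comment still yields a single construct, because comments are not reported through ContentHandler at all.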
There is a new pseudo-schema called
http://hsivonen.iki.fi/checkers/nfc/. It is enabled for
all (X)HTML presets. When this pseudo-schema is in use or when the
schema selection is in the automatic mode, the normalization checking
of source text underneath the parser is enabled as well.
The following checks are made:

- Whether the source text is in Unicode Normalization Form C.
- Whether each construct is in Unicode Normalization Form C.
- Whether the first character of a construct is a composing character.
The version of Unicode that is used is 5.0.0.
The column and line numbers reported on errors are very inaccurate
due to buffering.
I have not tested whether all the character encoding decoders that
I have installed are normalizing
transcoders. If you have
test cases, please try them out and let me know what happens. Also,
please let me know if the issue applies to something other than
legacy Vietnamese encodings.
As usual, the new code is enabled for testing.
Please let me know if it doesn’t
work as described.
I'm happy to see the increased interest in HTML5 recently — especially with the amazing work Lachlan, Henri, and others are doing with this blog, the validator, feedback, and so forth.
Since the volume of feature requests is only going to increase in the near future, I thought I'd list some things that would make evaluating proposals easier. Here are some key things that any proposal should include:
- What is the problem you are trying to solve?
- What is the feature you are suggesting to help solve it?
- What is the processing model for that feature, including error handling? This should be very clear, including things such as event timing if the feature involves events, how to create graphs representing the data in the case of semantic proposals, etc.
- Why do you think browsers would implement this feature?
- Why do you think authors would use this feature?
- What evidence is there that this feature is desperately needed?
Obviously, we want to keep the language as simple as possible. That means not everyone will get what they want. Having clear answers to the questions above will help all of us work out what is most important.