Charmod Norm Checking
Charmod Norm is still in the Working Draft state, but if it were to become a normative part of (X)HTML5, it would belong to the area of the conformance checking service that I am working on now, so I prototyped Charmod Norm enforcement as well.
The checker outsources most work to ICU4J. Most of the complexity in my code comes from two things: avoiding buffering as much as possible while still using the ICU4J API unmodified, and handling the halves of a surrogate pair that fall into different UTF-16 code unit buffers. On the spec reading front, I couldn’t map “the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15]” to the ICU4J API on my own. Fortunately, I got excellent help on the icu-support mailing list.
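The surrogate pair problem can be illustrated with a small sketch. This is not the checker's actual code: the class name CodePointFeeder and its methods are hypothetical, and it only shows one way to carry a dangling high surrogate from one UTF-16 buffer over to the next so that downstream code always sees whole code points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reassembles code points from UTF-16 buffers that may
// split a surrogate pair between two consecutive calls to feed().
final class CodePointFeeder {
    // High surrogate left over from the end of the previous buffer (0 = none).
    private char pendingHigh = 0;
    private final List<Integer> out = new ArrayList<>();

    public void feed(char[] buf, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = buf[i];
            if (pendingHigh != 0) {
                if (Character.isLowSurrogate(c)) {
                    // The pair is complete, possibly across a buffer boundary.
                    out.add(Character.toCodePoint(pendingHigh, c));
                    pendingHigh = 0;
                    continue;
                }
                // Unpaired high surrogate: pass it through unchanged.
                out.add((int) pendingHigh);
                pendingHigh = 0;
            }
            if (Character.isHighSurrogate(c)) {
                pendingHigh = c; // wait for the next code unit
            } else {
                out.add((int) c);
            }
        }
    }

    // Call once after the last buffer to flush a trailing unpaired surrogate.
    public void end() {
        if (pendingHigh != 0) {
            out.add((int) pendingHigh);
            pendingHigh = 0;
        }
    }

    public List<Integer> codePoints() {
        return out;
    }
}
```

Feeding the two buffers "a\uD83D" and "\uDE00b" in sequence yields the code points U+0061, U+1F600 and U+0062, even though the pair for U+1F600 straddles the buffer boundary.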
It turned out that the most time-consuming part was not writing the normalization checker but reworking how Ælfred2 deals with US-ASCII, ISO-8859-1, UTF-8, UTF-16 and UTF-32. In HS Ælfred2, all character encodings are now decoded using the java.nio.charset framework.
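For readers unfamiliar with java.nio.charset, decoding through that framework looks roughly like the following. This is a generic sketch and not Ælfred2's code; the class name StrictDecodeSketch and the choice to report (rather than replace) malformed input are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: decodes a whole byte array with a named encoding and
// fails loudly on malformed or unmappable input instead of substituting U+FFFD.
final class StrictDecodeSketch {
    public static String decode(byte[] bytes, String encodingName) {
        CharsetDecoder decoder = Charset.forName(encodingName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Wrapped in an unchecked exception to keep the sketch short.
            throw new IllegalArgumentException("Input is not valid " + encodingName, e);
        }
    }
}
```

The same call path serves every encoding the JDK knows about, which is the appeal of the framework: decode(new byte[] {(byte) 0xC3, (byte) 0xA9}, "UTF-8") and decode(new byte[] {(byte) 0xE9}, "ISO-8859-1") both produce "é". A real parser would decode incrementally from a stream rather than from one array.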
Requirements
The definition of Fully-normalized Text involves checking normalization both before and after parsing. That is, the source text is required to be NFC-normalized, and after parsing, the constructs parsed out of the source are required to be NFC-normalized and are required not to start with a “composing character” (which is not exactly the same as a “combining character”).
I don’t really like the way the definition involves peeking underneath the parser, but it does have the benefit that if the source is in NFC, you won’t accidentally break the document by editing the source in an NFC-normalizing text editor.
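To make the requirement concrete, here is a minimal sketch of the NFC part of the check. It uses java.text.Normalizer from the JDK as a stand-in for ICU4J (so the Unicode version is whatever the JDK ships, not necessarily the one the checker uses), and the class name NfcCheckSketch is hypothetical. It does not implement the composing character test.

```java
import java.text.Normalizer;

// Hypothetical helper: tests whether text is already in Normalization Form C.
final class NfcCheckSketch {
    public static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // U+00E9 as a single precomposed code point: already NFC.
        System.out.println(isNfc("\u00E9"));          // true
        // "e" followed by U+0301 COMBINING ACUTE ACCENT: not NFC.
        System.out.println(isNfc("e\u0301"));         // false
        // Markup separates the base letter from the accent, so the source
        // text passes, although the parsed text of <b> starts with U+0301.
        System.out.println(isNfc("e<b>\u0301</b>"));  // true
    }
}
```

The third case shows why checking the source alone is not enough: the source is in NFC, yet the character data inside the element begins with a character that would combine with whatever ends up in front of it.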
Interpretation
Charmod Norm does not define what “constructs” are in the context of XML 1.0 or HTML5.
However, XML 1.1 does define what “relevant constructs” are, so that definition might be generalizable to XML 1.0 and HTML5. Unfortunately, XML 1.1 defines relevant constructs in terms of the grammar productions of XML itself instead of the significant information items that an XML processor reports to the application.
Personally, I think the XML 1.1 definition is neither practically useful nor something for which I’d be motivated to write an implementation. So for the purpose of prototyping, I made up a definition on my own. Web Applications 1.0 just might get away with making my definition normative for XHTML5 considering that XML 1.0 doesn’t have a definition.
I consider the SAX2 ContentHandler (excluding qNames) and DTDHandler benchmarks of cluefulness when it comes to XML-related spec, API and application design. In general, if your application isn’t an editor that needs to reconstruct the XML source from parsed data, your application is most likely broken if it needs to know something about an XML document being parsed that is not exposed through those two interfaces. On the other hand, a spec that can’t be conformed to by viewing XML through only those two interfaces is broken. Moreover, DTDHandler is about notations, which are pretty much obsolete, so that leaves only ContentHandler.
This gives the following definition of constructs:
- Local names of elements
- Local names of attributes
- Attribute values
- Declared namespace prefixes
- Declared namespace URIs
- PI targets
- PI data
- Concatenations of consecutive character data between element boundaries and PIs ignoring comments and CDATA section boundaries.
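The list above maps quite directly onto ContentHandler callbacks. The following is a sketch of that mapping, not the checker's actual code: the class name ConstructCollector is hypothetical, java.text.Normalizer stands in for ICU4J, and only the NFC test is applied to each construct (the composing character test is left out).

```java
import java.io.StringReader;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects every construct that is not in NFC.
final class ConstructCollector extends DefaultHandler {
    private final List<String> nonNfc = new ArrayList<>();
    // Character data accumulated since the last element boundary or PI.
    private final StringBuilder text = new StringBuilder();

    private void check(String construct) {
        if (!Normalizer.isNormalized(construct, Normalizer.Form.NFC)) {
            nonNfc.add(construct);
        }
    }

    private void flushText() {
        if (text.length() > 0) {
            check(text.toString());
            text.setLength(0);
        }
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        check(prefix);
        check(uri);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        check(localName);
        for (int i = 0; i < atts.getLength(); i++) {
            check(atts.getLocalName(i));
            check(atts.getValue(i));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Comments and CDATA boundaries are not reported to ContentHandler,
        // so text on both sides of them is concatenated here.
        text.append(ch, start, length);
    }

    @Override
    public void processingInstruction(String target, String data) {
        flushText();
        check(target);
        if (data != null) {
            check(data);
        }
    }

    public static List<String> findNonNfc(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            ConstructCollector handler = new ConstructCollector();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.nonNfc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, in the document "<p>e<!-- c -->\u0301</p>" the letter and the combining accent are separated by a comment in the source, but the handler sees them as one run of character data, so findNonNfc reports the concatenation "e\u0301".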
Implementation
There is a new pseudo-schema called http://hsivonen.iki.fi/checkers/nfc/. It is enabled for all (X)HTML presets. When this pseudo-schema is in use or when the schema selection is in the automatic mode, the normalization checking of source text underneath the parser is enabled as well.
The following checks are made:
- Whether the source text is in Unicode Normalization Form C.
- Whether each construct is in Unicode Normalization Form C.
- Whether the first character of a construct is a composing character.
The version of Unicode that is used is 5.0.0.
The column and line numbers reported on errors are very inaccurate due to buffering.
I have not tested whether all the character encoding decoders that I have installed are normalizing transcoders. If you have Windows-1258 Vietnamese test cases, please try them out and let me know what happens. Also, please let me know if the issue applies to something other than legacy Vietnamese encodings.
As usual, the new code is enabled for testing. Please let me know if it doesn’t work as described.