Wednesday, November 15th, 2006
Charmod Norm is
still in the Working Draft state, but if it were to become a
normative part of (X)HTML5, it would belong to the area of the
conformance checking service that I am working on now, so I
prototyped Charmod Norm enforcement as well.
The checker outsources most work to ICU4J.
Most complexity in my code is due to trying to avoid buffering as
much as possible while still using the ICU4J API unmodified—and due
to dealing with the halves of a surrogate pair falling into different
UTF-16 code unit buffers. On the spec reading front, I couldn’t map
“the second character in the canonical decomposition mapping of
some character that is not listed in the Composition Exclusion Table
defined in [UTR #15]” to the ICU4J API on my own. Fortunately, I
got excellent help on the icu-support mailing list.
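The surrogate pair problem can be sketched as follows. This is a minimal illustration and not the checker's actual code: the class name `ChunkedCodePoints` and its methods are hypothetical. It carries a trailing high surrogate over from one UTF-16 buffer to the next so that the pair is still seen as a single code point.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns a sequence of UTF-16 buffers into code points. */
final class ChunkedCodePoints {
    private final List<Integer> codePoints = new ArrayList<>();
    private char pendingHigh;   // high surrogate left over from an earlier code unit
    private boolean hasPending; // true while pendingHigh awaits its low surrogate

    /** Feeds one buffer of UTF-16 code units. */
    void feed(char[] buf, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = buf[i];
            if (hasPending) {
                hasPending = false;
                if (Character.isLowSurrogate(c)) {
                    // The pair may have been split across two buffers.
                    codePoints.add(Character.toCodePoint(pendingHigh, c));
                    continue;
                }
                // Unpaired high surrogate: pass it through unchanged.
                codePoints.add((int) pendingHigh);
            }
            if (Character.isHighSurrogate(c)) {
                pendingHigh = c;
                hasPending = true;
            } else {
                codePoints.add((int) c);
            }
        }
    }

    /** Call at end of input; flushes an unpaired trailing high surrogate. */
    List<Integer> finish() {
        if (hasPending) {
            codePoints.add((int) pendingHigh);
            hasPending = false;
        }
        return codePoints;
    }
}
```

Feeding `{'a', '\uD834'}` and then `{'\uDD1E', 'b'}` yields the code points U+0061, U+1D11E and U+0062, even though the surrogate pair straddles the buffer boundary.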
It turned out that the most time-consuming part was not writing
the normalization checker but reworking how Ælfred2 deals with
US-ASCII, ISO-8859-1, UTF-8, UTF-16 and UTF-32. In HS Ælfred2,
all character encodings are now decoded using the java.nio.charset
framework.
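For readers unfamiliar with java.nio.charset, here is a minimal sketch of strict decoding with that framework. It is not the code in HS Ælfred2; the class and method names are made up for illustration. The point is that a `CharsetDecoder` can be told to report malformed input instead of silently substituting U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper showing strict decoding via java.nio.charset. */
final class StrictDecode {
    /** Decodes the bytes; throws if they are not valid in the named encoding. */
    static String decode(byte[] bytes, String charsetName) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Convenience wrapper that turns the exception into a boolean. */
    static boolean isWellFormed(byte[] bytes, String charsetName) {
        try {
            decode(bytes, charsetName);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this, the bytes `63 C3 A9` decode as UTF-8, whereas a truncated sequence such as a lone `C3` is rejected.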
Requirements
The definition of Fully-normalized Text involves checking
normalization before and after parsing.
That is, the source text is required to be NFC-normalized and after
parsing the constructs parsed out of the source are required to be
NFC-normalized and are required not to start with a “composing
character” (which is not exactly the same as a “combining
character”).
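To see why both checks matter, consider a character reference. The sketch below uses the JDK's `java.text.Normalizer` rather than ICU4J, so it follows whatever Unicode version the JDK ships; the class name `NfcDemo` is made up for illustration.

```java
import java.text.Normalizer;

/** Hypothetical demo: NFC status of source text versus parsed text. */
final class NfcDemo {
    /** Returns true if the text is in Unicode Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }
}
```

The source text `<p>e&#x301;</p>` is pure ASCII, so `isNfc` returns true for it. After parsing, the character data is `e` followed by U+0301 COMBINING ACUTE ACCENT, for which `isNfc` returns false, because NFC would compose the two into U+00E9.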
I don’t really like the way the definition involves peeking
underneath the parser, but it does have the benefit that if the
source is in NFC, you won’t accidentally break the document by
editing the source in an NFC-normalizing text editor.
Interpretation
Charmod Norm does not define what “constructs” are in the
context of XML 1.0 or HTML5.
However, XML 1.1 does define what “relevant
constructs” are, so that definition might be generalizable to
XML 1.0 and HTML5. Unfortunately, XML 1.1 defines relevant constructs
in terms of the grammar productions of XML itself instead of the
significant information items that an XML processor reports to the
application.
Personally, I think the XML 1.1 definition is neither practically
useful nor something for which I’d be motivated to write an
implementation. So for the purpose of prototyping, I made up a
definition on my own. Web Applications 1.0 just might get away with
making my definition normative for XHTML5 considering that XML 1.0
doesn’t have a definition.
I consider the SAX2 ContentHandler (excluding qNames) and DTDHandler
benchmarks of cluefulness when it comes to XML-related spec, API and
application design. In general, if your application isn’t an editor
that needs to reconstruct the XML source from parsed data, your
application is most likely broken if it needs to know something about
an XML document being parsed that is not exposed through those two
interfaces. On the other hand, a spec that can’t be conformed to by
viewing XML through only those two interfaces is broken. Moreover,
DTDHandler is about notations, which are pretty much obsolete, so
that leaves only ContentHandler.
This gives the following definition of constructs:
- Local names of elements
- Local names of attributes
- Attribute values
- Declared namespace prefixes
- Declared namespace URIs
- PI targets
- PI data
- Concatenations of consecutive character data between element
  boundaries and PIs, ignoring comments and CDATA section boundaries.
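The list above maps fairly directly onto ContentHandler callbacks. Here is a minimal sketch of that mapping, not the checker's actual code: it uses the JDK's `java.text.Normalizer` and built-in SAX parser instead of ICU4J and HS Ælfred2, the class name and error strings are invented, and the composing-character check is left out.

```java
import java.io.StringReader;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that checks each construct for NFC. */
final class ConstructNfcChecker extends DefaultHandler {
    final List<String> errors = new ArrayList<>();
    // Character data is buffered until an element boundary or a PI.
    private final StringBuilder text = new StringBuilder();

    private void check(String kind, String construct) {
        if (construct != null && !Normalizer.isNormalized(construct, Normalizer.Form.NFC)) {
            errors.add(kind + " is not in NFC");
        }
    }

    private void flushText() {
        if (text.length() > 0) {
            check("character data", text.toString());
            text.setLength(0);
        }
    }

    @Override public void startPrefixMapping(String prefix, String uri) {
        check("namespace prefix", prefix);
        check("namespace URI", uri);
    }

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        flushText();
        check("element local name", local);
        for (int i = 0; i < atts.getLength(); i++) {
            check("attribute local name", atts.getLocalName(i));
            check("attribute value", atts.getValue(i));
        }
    }

    @Override public void endElement(String uri, String local, String qName) {
        flushText();
    }

    @Override public void processingInstruction(String target, String data) {
        flushText();
        check("PI target", target);
        check("PI data", data);
    }

    // Comments and CDATA boundaries are not reported to ContentHandler,
    // so consecutive calls are simply concatenated.
    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override public void endDocument() {
        flushText();
    }

    /** Parses the document with a namespace-aware parser and returns the errors. */
    static List<String> run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        ConstructNfcChecker handler = new ConstructNfcChecker();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.errors;
    }
}
```

Running it on `<p>e<!-- x -->&#x301;</p>` reports the character data, because the comment is invisible at this layer and the `e` and the combining accent end up in the same construct.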
Implementation
There is a new pseudo-schema called
http://hsivonen.iki.fi/checkers/nfc/. It is enabled for all (X)HTML
presets. When this pseudo-schema is in use or when the schema
selection is in the automatic mode, the normalization checking of
source text underneath the parser is enabled as well.
The following checks are made:
- Whether the source text is in Unicode Normalization Form C.
- Whether each construct is in Unicode Normalization Form C.
- Whether the first character of a construct is a composing
  character.
The version of Unicode that is used is 5.0.0.
The column and line numbers reported on errors are very inaccurate
due to buffering.
I have not tested whether all the character encoding decoders that
I have installed are normalizing transcoders. If you have
Windows-1258 Vietnamese test cases, please try them out and let me
know what happens. Also, please let me know if the issue applies to
something other than legacy Vietnamese encodings.
As usual, the new code is enabled for testing.
Please let me know if it doesn’t
work as described.
Posted in Conformance Checking, Syntax
Wednesday, November 15th, 2006
Web Forms 2.0 requires documents to conform to Charmod. The current
Web Applications 1.0 draft does not mention Charmod, but since
(X)HTML5 includes both Web Applications 1.0 and Web Forms 2.0, my
working assumption is that (X)HTML5 documents are required to conform
to Charmod.
It turns out that the best opportunity for checking whether a
document conforms to Charmod is in the parser. Hence, I added the
checks to my special-purpose HTML parser and to HS Ælfred2—my
fork of GNU Ælfred2.
Charmod says:
NOTE: RFC 2119 makes it clear that requirements that use
SHOULD are not optional and must be complied with unless there are
specific reasons not to: “This word, or the adjective
‘RECOMMENDED’, mean that there may exist valid reasons in
particular circumstances to ignore a particular item, but the full
implications must be understood and carefully weighed before choosing
a different course.”
Further, Charmod says: “A specification conforms to this
document if it [...] documents the reason for any deviation from
criteria where the imperative is SHOULD, SHOULD NOT, or
RECOMMENDED [...]”. I have an implementation, not a specification,
but I’m documenting my decisions not to enforce some SHOULDs anyway.
Here’s how I have addressed the requirements of Charmod that
apply to content (marked as [C] in Charmod). Disclaimer: The
implementation decisions I have taken with prototype software are not
endorsed by the WHAT WG or anyone else.
C001 |
Specifications, software and content MUST NOT require or depend
on a one-to-one correspondence between characters and the sounds
of a language.
This requirement is not machine-checkable and, hence, is
not enforced by the software.
|
C002 |
Specifications, software and content MUST NOT require or depend
on a one-to-one mapping between characters and units of displayed
text.
This requirement is not machine-checkable and, hence, is
not enforced by the software.
|
C003 |
Protocols, data formats and APIs MUST store, interchange or
process text data in logical order.
HTML5 as a data format uses logical order. It is not
practical to try to figure out in software if the author is trying
to subvert the nature of the format on this point. Currently, the
software doesn’t enforce this at all. However, it might be
useful to catch encoding labels that are used for visual Hebrew or
Arabic.
|
C013 |
Textual data objects defined by protocol or format
specifications MUST be in a single character encoding.
A single character encoding decoder is instantiated per
HTTP resource. Encoding violations are treated as fatal. However,
some mixed encodings are not caught by this and need human
judgment. For example, software can’t tell if ISO-8859-1 and
ISO-8859-2 bytes are mixed in one HTTP resource.
|
C022 |
Character encodings that are not in the IANA registry SHOULD
NOT be used, except by private agreement.
An error is reported.
|
C023 |
If an unregistered character encoding is used, the convention
of using 'x-' at the beginning of the name MUST be followed.
An error is reported.
|
C049 |
The character encoding of content SHOULD be chosen so that it
maximizes the opportunity to directly represent characters (ie.
minimizes the need to represent characters by markup
means such as character
escapes) while avoiding obscure encodings that are unlikely to
be understood by recipients.
UTF-8 maximizes the opportunity to directly represent characters. A warning
is issued if the document uses an encoding that is not supported
“everywhere”. For XHTML5 the non-obscure encodings are
US-ASCII, ISO-8859-1, UTF-8 and UTF-16. For HTML5, the non-obscure
encodings are currently the intersection of IANA-registered
encodings supported by Sun JDK 1.4.2_8 and Python 2.4.3. (The
service supports a wider set of encodings.) The character repertoire
actually used by the document is not analyzed, because I don’t think
it would be a useful way to spend my time, considering that using
UTF-8 always satisfies this requirement.
|
C034 |
If facilities are offered for identifying character encoding,
content MUST make use of them; where the facilities offered for
character encoding identification include defaults (e.g. in XML
1.0 [XML 1.0]),
relying on such defaults is sufficient to satisfy this
identification requirement.
An error is reported if an HTML5 document does not have an
explicit character encoding declaration (either internal or
external).
|
C024 |
Content and software that label text data MUST use one of the
names required by the appropriate specification (e.g. the XML
specification when editing XML text) and SHOULD use the MIME
preferred name of a character encoding to label data in that
character encoding.
An error is reported if an encoding label is not the MIME
preferred name.
|
C025 |
An IANA-registered charset name MUST NOT be used
to label text data in a character encoding other than the one
identified in the IANA registration of that name.
Encoding violations are treated as fatal. However, this
doesn’t catch cases where the document byte sequence is legal in
the declared encoding. For example, ISO-8859-2 labeled as
ISO-8859-1 is not conclusively machine-detectable.
|
C073 |
Publicly interchanged content SHOULD NOT use codepoints in the
private use area.
Charmod does allow the use of the private use area for scripts
that have not yet been encoded. Since human judgment is needed,
the software only emits a warning. Moreover, C040
prohibits disallowing the use of the PUA.
|
C076 |
Content MUST NOT use a code point for any purpose other than
that defined by its coded character set.
This requirement is not machine-checkable and, hence, is
not enforced by the software.
|
C047 |
Escapes SHOULD only be used when the characters to be expressed
are not directly representable in the format or the character
encoding of the document, or when the visual representation of the
character is unclear.
This requirement is not enforced—not even as a warning.
Using the five pre-defined entities in XML, using the HTML5
entities from the specification or using numeric character
references is harmless when it comes to the parsed document tree.
Enforcing this requirement would mean proclaiming a prevalent
authoring practice non-conforming on the grounds of the aesthetics
of view source. Moreover, Charmod doesn’t give a solid
machine-checkable definition for characters whose visual
representation is unclear.
|
C048 |
Content SHOULD use the hexadecimal form of character escapes
rather than the decimal form when there are both.
This requirement is not enforced—not even as a warning.
Using the five pre-defined entities in XML, using the HTML5
entities from the specification or using numeric character
references is harmless when it comes to the parsed document tree.
Enforcing this requirement would mean proclaiming a prevalent
authoring practice non-conforming on the grounds of the aesthetics
of view source.
|
C054 |
Users of specifications (software developers, content
developers) SHOULD whenever possible prefer ways other than string
indexing to identify substrings or point within a string.
This requirement is not machine-checkable to the extent it
might apply to the (X)HTML5 layer and, hence, is not enforced by
the software.
|
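The encoding label rows above (C022, C023 and C024) can be approximated with java.nio.charset alone. The following is a minimal sketch, not the service's actual code: the class, the enum and its constant names are invented. It relies on two documented properties of `Charset`: `isRegistered()` tells whether the charset is in the IANA registry, and the canonical name of a registered charset is its MIME-preferred name.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical classifier for encoding labels. */
final class EncodingLabelCheck {
    enum Result { OK, NOT_PREFERRED_NAME, UNREGISTERED, UNKNOWN }

    static Result classify(String label) {
        final Charset cs;
        try {
            cs = Charset.forName(label);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // This JVM cannot decode the encoding at all.
            return Result.UNKNOWN;
        }
        if (!cs.isRegistered()) {
            // Not in the IANA registry (C022); such names start with x- (C023).
            return Result.UNREGISTERED;
        }
        // The label is an alias rather than the MIME-preferred name (C024).
        if (!cs.name().equalsIgnoreCase(label)) {
            return Result.NOT_PREFERRED_NAME;
        }
        return Result.OK;
    }
}
```

For example, `UTF-8` is classified as `OK`, while `latin1` resolves to ISO-8859-1 and is therefore classified as `NOT_PREFERRED_NAME`. The result depends on the set of charsets installed in the JVM, which is an assumption of this sketch rather than something Charmod defines.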
In the spirit of perpetual beta, the new code is enabled for all
(X)HTML presets in the generic UI. Please let me know if it doesn’t
work as described.
Posted in Conformance Checking | Comments Off on Charmod Checking