The WHATWG Blog — Henri Sivonen

Author Archive

Experience the HTML5 parsing algorithm in the Live DOM Viewer

Monday, June 30th, 2008

If you’ve investigated how browsers parse HTML, you’ve probably used Hixie’s Live DOM Viewer to see what happens. Wouldn’t it be cool, though, if you could experiment with the HTML5 parsing algorithm in the same UI? Well, now you can.

I was looking for a way to experiment with document.write() in the code base of the Validator.nu HTML Parser and for a way to let people see the parse tree output of the HTML5 parsing algorithm more easily. Instead of writing a test harness fully in Java, I thought it would be better to use the Live DOM Viewer and a browser engine as the test harness. The good news is that Google Web Toolkit makes it possible to put these pieces together, and the trunk of the Validator.nu HTML Parser now comes with a document.write()-aware tokenizer driver and a tree builder subclass for GWT.

The bad news is that the Java-to-JavaScript compiler of GWT has a bug that blocks me from putting the result online as JavaScript. The Hosted Mode of GWT works, though.

Here’s how you can run the Validator.nu HTML Parser in the Live DOM Viewer locally in the Hosted Mode of GWT (on Mac or Linux):

Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
Download and untar GWT 1.5 RC1
On Linux, install libstdc++5 and a JDK (Ubuntu's OpenJDK-based package worked for me).
Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.
Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux)

Known problems:

The Linux version of GWT runs an outdated version of Gecko, and the rendered view doesn't work. The DOM view does.
The Mac version of GWT runs a Web Inspector-enabled version of WebKit, but SVG does not draw.
document.write() semantics are right only for inline scripts.
Copying and pasting using keyboard shortcuts doesn’t work. (Use the context menu.)
On Linux, GWT prints a lot of harmless warnings about not finding annotations. (I don’t know why that happens. The annotations should be among translatables.)
Gecko (used by GWT on Linux) doesn't allow the creation of xmlns attributes in no namespace, so things stop working if you try to put an attribute called xmlns on HTML elements.
The DOM view on Linux doesn't report names with colons in them per the HTML5 spec.

(Aside: This code could have applicability beyond testing the parser. If the compiler bug were fixed or worked around, a script could document.write() a math element and an svg element to sniff whether they are parsed according to HTML5 and, if they aren't, move aside load event handlers, document.write() <plaintext style='display:none'>, wait until DOMContentLoaded, load the already created html, head and body elements onto the tree builder stack and head pointer of the HTML5 parser, reparse the content of the plaintext element as HTML5 and call the load event handlers. See Philip Taylor’s proof of concept with S-expressions.)

Posted in Syntax | 1 Comment »

New Image Report Feature in Validator.nu

Friday, April 18th, 2008

There has been lots and lots of e-mail on the public-html mailing list about making the alt attribute syntactically required in HTML5. At the core of this debate is, on one hand, using HTML5 validators to send a strong message about accessibility and, on the other hand, avoiding a situation where a simplified and idealistic strong message leads to behavior that is counterproductive considering the goal of making the Web accessible. As a policy debate, it is similar to abstinence-only sex education debates.

A validator is a computer program and cannot tell if a textual alternative is appropriate for a given image in a given context. That's why accessibility checking needs to be done by a person. A person may use a software tool to make the checking easier, but trusting fully automated software to determine whether a page is accessible is misguided.

Given this basic problem, a policy that insists on the alt attribute always being present doesn’t necessarily lead to accessibility. In fact, considering that syntactic correctness and accessibility are different evaluation axes both in terms of computability and in terms of how HTML authors (other than accessibility advocates) tend to view things (judging from observations about the behavior of HTML authors who use validators), a policy that insists on the alt attribute being always present will likely cause people to put the attribute in there but with inappropriate content. In particular, putting an empty alt on images whose presence is important for understanding the context of other content is bad, because in that case the presence of those images is concealed from a non-graphical user. Also, a textual alternative that just says “image” is not an improvement over what, for example, Safari with VoiceOver says in the absence of alt, but would be worse than a smarter client-side heuristic.

Furthermore, there is a very real case where a textual alternative simply isn’t available to the HTML generator: a user uploads photos to a content management system and refuses to supply textual alternatives at the same moment. HTML 4 didn’t account for this case. In fact, requiring alt under all circumstances assumes that markup is written by a person who knows what the images are at the time of writing markup. It doesn’t make sense to pretend that the case where the markup generator doesn’t have textual alternatives available doesn’t exist. The HTML 5 syntax needs to account for all use cases.

Expecting markup generators to knowingly emit markup that is not valid is not a winning proposition. Quoting me from 2006:

Authoring tools are judged by taking a page authored using the tool and running it through the W3C Validator or, presumably in the future, through an HTML5 conformance checker. Authoring tool makers who are capable of making their tool produce syntactically conforming documents will want to do so and minimize the chance that the users of their software tarnish the reputation of the tool in the eyes of people who use an automated test as a litmus test of authoring tool bogosity. (People who test tools that way will outnumber the people who make a more profound analysis due to the "validate, validate, validate" propaganda.)

To summarize: As a matter of principle, subjective checking or checking that is not applicable for all pages does not belong in the validation function. Practice is more important than principle, though. Baking the alt requirement into the validation function would be bad when the user of the validation function wants a clean report on syntax but isn’t as concerned with accessibility. It is bad for accessibility when authors put the simplest value that silences the validator into the attribute in order to make the validation report look clean, since doing so gives user agents like Safari with VoiceOver less information to work with. That's why I think the requirement to have an alt attribute present doesn’t belong in the validation function as a practical matter, either.

It turns out, though, that some people think of validation as a first step toward accessibility, even though syntactic correctness and accessibility really are different evaluation axes. They expect a validator to help them flag images that are lacking a textual alternative. Moreover, the alt issue seems to be taken as the single most important web accessibility issue, with the rest of the issues somewhere in the long tail. When there is a demand for validators to flag images without alt, validators probably should meet the demand.

To this end, I have developed a new feature for Validator.nu: Image Report. This new feature is not part of the validation function. It also doesn’t do exactly what people are asking of the syntax definition in the long e-mail thread. (It is not a new idea for a validator user interface to offer tools that help a human perform an assessment about the page outside the validation function. For example, the W3C Validator has offered a “Show Document Outline” feature, which is also on file as a request for enhancement for Validator.nu.)

The new feature tries to address the issue of finding missing textual alternatives but it also seeks to address the issue of faulty textual alternatives. Furthermore, it seeks to address these in a way that doesn’t induce people to write bad textual alternatives in order to make the report look cleaner.

When you turn the feature on, it always lists all the images. There is no textual alternative you can fake to make the list look shorter. Instead, there are four categories and you can only change the category in which an image appears.

This has the benefit of removing the badge hunting problem: people trying to silence the validator without actually raising the quality of their page. However, it also has the benefit that the user can review the textual alternatives for appropriateness and the user can review that the right images have been marked as omitted from non-graphical presentation. Since this tool addresses more problems than simply making alt required on the syntax level, I believe this solution is much better than furiously staying entrenched in the status quo of HTML 4 validation, fearing a step backwards so much as to be too afraid to explore steps forward.
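The "list everything, only the category changes" idea can be sketched in a few lines. This is a minimal illustration, not the Validator.nu implementation: it uses Python's standard html.parser rather than the Validator.nu HTML Parser, and the four category names are assumptions made up for the example, since the post does not name the actual categories. Treating every a element as a link is also a simplification.

```python
from html.parser import HTMLParser

# Hypothetical category names; the post does not list the real four.
WITH_TEXT = "has textual alternative"
EMPTY_ALT = "empty alt (omitted from non-graphical presentation)"
NO_ALT = "no alt attribute"
NO_ALT_IN_LINK = "no alt attribute, inside an a element"


class ImageReport(HTMLParser):
    """Files every img element under exactly one category."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0  # number of currently open a elements
        self.report = {WITH_TEXT: [], EMPTY_ALT: [], NO_ALT: [], NO_ALT_IN_LINK: []}

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag == "img":
            attributes = dict(attrs)
            src = attributes.get("src", "")
            if "alt" not in attributes:
                category = NO_ALT_IN_LINK if self.link_depth else NO_ALT
            elif attributes["alt"]:
                category = WITH_TEXT
            else:
                # alt="" and a valueless alt both land here.
                category = EMPTY_ALT
            self.report[category].append(src)

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1


def image_report(html):
    """Return a dict mapping each category to the src values filed under it."""
    parser = ImageReport()
    parser.feed(html)
    parser.close()
    return parser.report
```

Whatever value an author puts in alt, the image still shows up in the report; a dummy value such as "image" only moves it into the category where a person reviewing the list will see the text next to the picture.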

Finally, it should be noted that this feature is, by necessity, itself inaccessible to people who cannot view bitmap images. Yet, I think it is legitimate for this feature to be implemented with an HTML user interface. Also, this feature itself is a case where the generator of the user interface markup has no knowledge of the content of the images it is presenting to the user. Hence, it is itself an example of omitting the alt attribute. It would be truly ironic if the syntax definition of HTML5 prevented Validator.nu from being self-validating.

Posted in Conformance Checking, Syntax | 4 Comments »

Validator.nu HTML Parser 1.0.7 Released

Saturday, April 5th, 2008

There is now a new release of the Validator.nu HTML Parser. Change highlights:

Adds optional support for heuristic encoding sniffing using the ICU4J sniffer, jchardet or both.
Adds support for rewinding and reparsing when becoming confident about the character encoding and the tentative encoding was wrong.
Performs encoding name matching per spec instead of using the JDK mechanism.
Implements spec changes up until just before SVG and MathML support. (Those will merit 1.1 or something.)
Warning: The semantics of the doctype token have changed in case you have your own token handler (unlikely).
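
To make the rewinding item concrete, here is a rough Python sketch of the general idea: decode with a tentative encoding, and if an internal declaration names a different one, throw the first pass away and decode the buffered bytes again from the start. It is an illustration under stated assumptions, not the parser's code. The function name decode_with_rewind, the windows-1252 default and the regular expression are invented for the example, the real parser finds the declaration while tokenizing rather than with a regex, and Python's codecs.lookup alias table stands in for the spec's encoding name matching.

```python
import codecs
import re

# Simplified stand-in for finding an internal encoding declaration.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def decode_with_rewind(data, tentative="windows-1252"):
    """Return (text, encoding_used, rewound) for a byte buffer."""
    # First pass: decode with the tentative encoding.
    text = data.decode(tentative, errors="replace")
    match = META_CHARSET.search(data)
    if match is None:
        return text, tentative, False
    declared = match.group(1).decode("ascii")
    try:
        # Compare canonical codec names so that aliases count as equal.
        same = codecs.lookup(declared).name == codecs.lookup(tentative).name
    except LookupError:
        # Unknown label: keep the tentative decoding.
        return text, tentative, False
    if same:
        return text, tentative, False
    # Rewind: discard the first pass and decode the same bytes again.
    return data.decode(declared, errors="replace"), declared, True
```

The sketch only works because the whole input is held in memory; a streaming parser has to buffer the bytes it has already consumed in order to be able to rewind.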

Posted in Processing Model, Syntax | Comments Off on Validator.nu HTML Parser 1.0.7 Released

Validator.nu now more useful when migrating existing designs

Saturday, February 2nd, 2008

Due to implementation details, the HTML5 facet of Validator.nu used to ignore the content of obsolete elements such as center, because obsolete elements were simply unknown. This wasn’t particularly useful when assessing the HTML5-upgradeability of an existing design that wrapped everything in center, for example.

The HTML5 facet of Validator.nu now knows about obsolete container elements that existed as deprecated in HTML 4.01. This means that center is still an error, but the contents are now checked as HTML5.
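
The difference between the old and new behavior can be sketched as follows. This is a toy model in Python using the standard html.parser, not how Validator.nu is implemented; the element sets are tiny made-up samples, and the check function and its obsolete_containers parameter are hypothetical names for the example.

```python
from html.parser import HTMLParser

# Tiny illustrative vocabulary, not the real HTML5 element list.
KNOWN = {"html", "head", "title", "body", "p", "em", "img"}


class Checker(HTMLParser):
    def __init__(self, obsolete_containers):
        super().__init__()
        self.obsolete_containers = obsolete_containers
        self.errors = []
        self.ignoring = None  # name of the unknown element being skipped
        self.depth = 0        # nesting depth of that same element name

    def handle_starttag(self, tag, attrs):
        if self.ignoring is not None:
            # Inside an unknown element: its content is not checked.
            if tag == self.ignoring:
                self.depth += 1
            return
        if tag in self.obsolete_containers:
            # Still an error, but the contents keep being checked.
            self.errors.append(f"obsolete element: {tag}")
        elif tag not in KNOWN:
            self.errors.append(f"unknown element: {tag}")
            self.ignoring = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if self.ignoring == tag:
            self.depth -= 1
            if self.depth == 0:
                self.ignoring = None


def check(html, obsolete_containers=frozenset({"center"})):
    """Return the list of error messages for a document."""
    checker = Checker(obsolete_containers)
    checker.feed(html)
    checker.close()
    return checker.errors
```

With center in the obsolete set, a document wrapped in center reports the center error and also any problems inside it. Passing an empty set models the earlier behavior, where center was simply unknown and everything inside it went unchecked.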

Also, Validator.nu now allows legacy-style internal encoding declarations per the latest Editor’s Draft.

Posted in Conformance Checking, Syntax | Comments Off on Validator.nu now more useful when migrating existing designs

Validator.nu HTML Parser 1.0.6 Released

Tuesday, January 22nd, 2008

Version 1.0.6 of the Validator.nu HTML Parser has been released. The new version fixes a crasher bug in bytes to characters conversion, works around a crash when the ICU4J 3.8.1 UTF-7 decoder is in the classpath, improves error message wording and brings errors and warnings pertaining to legacy encodings up-to-date per the current HTML 5 draft.

This update is highly recommended for all applications that use the parser by giving it a URI or an InputStream. For applications that give the parser a Reader, the update is not necessary.

Posted in Processing Model, Syntax | Comments Off on Validator.nu HTML Parser 1.0.6 Released