The WHATWG Blog

Please leave your sense of logic at the door, thanks!

Validator.nu HTML Parser 1.1.0

I have released a new version of the Validator.nu HTML Parser (an implementation of the HTML5 parsing algorithm in Java). The new release supports SVG and MathML subtrees, is faster than the old version, fixes bugs, is more portable and supports applications that want to do document.write().

The parser comes with a sample app that makes it possible to use XSLT programs written for XHTML5+SVG+MathML with text/html.

Warning! The internal APIs have changed. Please refer to the Upgrade Guide below.

Change Log

Upgrade Guide from 1.0.7 to 1.1.0

In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.

If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.

If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.

If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

You do not need to change your code to upgrade.

If you have your own subclass of TreeBuilder:

The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)

The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.

If you have your own implementation of TokenHandler:

Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.

Tags: , , , , , ,
Posted in Syntax | Comments Off on Validator.nu HTML Parser 1.1.0

This Week in HTML 5 – Episode 3

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The biggest news this week is the birth of the event loop.

To coordinate events, user interaction, scripts, rendering, networking, and so forth, user agents must use event loops as described in this section.

... An event loop has one or more task queues. A task queue is an ordered list of tasks, which can be:

Events

Asynchronously dispatching an Event object at a particular EventTarget object is a task.

Parsing

The HTML parser tokenising a single byte, and then processing any resulting tokens, is a task.

Callbacks

Calling a callback asynchronously is a task.

Using a resource

When an algorithm fetches a resource, if the fetching occurs asynchronously then the processing of the resource once some or all of the resource is available is a task.

Reacting to DOM manipulation

Some elements have tasks that trigger in response to DOM manipulation, e.g. when that element is inserted into the document.

The purpose of defining an event loop is to unify the definition of things that happen asychronously. (I want to avoid saying "events" since that term is already overloaded.) For example, if an image defines an onload callback function, exactly when does it get called? Questions like this are now answered in terms of adding tasks to a queue and processing them in an event loop.

The other major news this week is the addition of the hashchange event, which occurs when the user clicks an in-page link that goes somewhere else on the same page, or when a script programmatically sets the location.hash property. This is primarily useful for AJAX applications that wish to maintain a history of user actions while remaining on the same page. As a concrete example, executing a search of your messages in GMail takes you to a list of search results, but does not change the base URL, just the hash; clicking the Back button takes you back to the previous view within GMail (such as your inbox), again without changing the base URL (just the hash). GMail employs some nasty hacks to make this work in all browsers; the hashchange event is designed to make those hacks slightly less nasty. Microsoft Internet Explorer 8 pioneered the hashchange event, and its definition in HTML 5 is designed to match Internet Explorer's behavior.

Other interesting changes this week:

Tune in next week for another exciting episode of "This Week in HTML 5."

Tags:
Posted in Weekly Review | 4 Comments »

This Week in HTML 5 – Episode 2

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The biggest news this week is revision 2020, which standardizes the navigator object:

The navigator attribute of the Window interface must return an instance of the Navigator interface, which represents the identity and state of the user agent (the client), and allows Web pages to register themselves as potential protocol and content handlers.

Currently, HTML 5 defines four properties and two methods:

This is only a subset of navigator properties and methods that browsers already support. See Navigator Object on Google Doctype for complete browser compatibility information.

Next up: Content-Language. No, not the HTTP header, not even the <html lang> attribute, but the <meta> tag! As reported by Henri Sivonen,

It seems that some authoring tools and authors use <meta http-equiv='content-language' content='languagetag'> instead of <html lang='languagetag'>.

This led to revision 2057, which defines the <meta> http-equiv="Content-Language"> directive and its relationship with lang, xml:lang, and the Content-Language HTTP header.

In the continuing saga of the alt attribute, the new syntax for alternate text of auto-generated images (which I covered in last week's episode) has generated some followup discussion. Philip Taylor is concerned that it will increase complexity for authoring tools; others feel the complexity is worth the cost. James Graham suggested a no-text-equivalent attribute; similar proposals have been discussed before and rejected.

Switching to the new Web Workers specification (which I also covered last week), Aaron Boodman (one of the developers of Google Gears) posted his initial feedback. This kicked off a long discussion and led to the creation of the Worker object.

Other interesting changes this week:

Administrivia: "This Week in HTML 5" now has its own feed.

Tune in next week for another exciting episode of "This Week in HTML 5."

Tags:
Posted in Weekly Review | 3 Comments »

HTML5 Live DOM Viewer—Now in Your Browser

Earlier, I blogged about running the Validator.nu HTML Parser inside Hixie’s Live DOM Viewer using the magic of the hosted mode of the Google Web Toolkit. Back then, a compiler bug in GTW 1.5 RC1 prevented the parser from running as JavaScript in the Web mode. Google has now released GWT 1.5 RC2, which contains a fix for the bug.

So without further ado, here’s Live DOM Viewer with an HTML5 parser running as JavaScript in your browser.

Try pasting in the SVG lion or some MathML in Firefox 3 and Opera 9.5.

Known problems:

A big thanks for the GWT team for making this work!

Tags:
Posted in DOM, Syntax | Comments Off on HTML5 Live DOM Viewer—Now in Your Browser

This Week in HTML 5 – Episode 1

Welcome to a new semi-regular column, "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The biggest news is the birth of the Web Workers draft specification. Quoting the spec, "This specification defines an API that allows Web application authors to spawn background workers running scripts in parallel to their main page. This allows for thread-like operation with message-passing as the coordination mechanism." This is the standardization of the API that Google Gears pioneered last year. See also: initial Workers thread, announcement of new spec, response to Workers feedback.

Also notable this week: even more additions to the Requirements for providing text to act as an alternative for images. 4 new cases were added:

  1. A link containing nothing but an image
  2. A group of images that form a single larger image
  3. An image not intended for the user (such as a "web bug" tracking image)
  4. Text that has been rendered to a graphic for typographical effect

Additionally, the spec now tries to define what authors should do if they know they have an image but don't know what it is. Quoting again from the spec:

If the src attribute is set and the alt attribute is set to a string whose first character is a U+007B LEFT CURLY BRACKET character ({) and whose last character is a U+007D RIGHT CURLY BRACKET character (}), the image is a key part of the content, and there is no textual equivalent of the image available. The string consisting of all the characters between the first and the last character of the value of the alt attribute gives the kind of image (e.g. photo, diagram, user-uploaded image). If that value is the empty string (i.e. the attribute is just "{}"), then even the kind of image being shown is not known.

  • If the image is available, the element represents the image specified by the src attribute.
  • If the image is not available or if the user agent is not configured to display the image, then the user agent should display some sort of indicator that the image is not being rendered, and, if possible, provide to the user the information regarding the kind of image that is (as derived from the alt attribute).

See also: revision 1972, revision 1976, revision 1978, revision 1979, Images and alternate text.

Other interesting changes this week:

Tune in next week for another exciting episode of "This Week in HTML 5."

Tags:
Posted in Processing Model, Weekly Review, WHATWG | 21 Comments »