The WHATWG Blog

This Week in HTML 5 – Episode 28

Thursday, April 2nd, 2009

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

The big news for the week of March 23rd is that SVG can once again be included directly in HTML 5 documents served as text/html:

I've made the following changes to HTML5:

Uncommented out the XXXSVG bits, reintroducing the ability to have SVG content in text/html.

Defined <script> processing for SVG <script> in text/html by deferring to the SVG Tiny 1.2 spec and blocking synchronous document.write(). The alternative to this is to integrate the SVG script processing model with the (pretty complicated) HTML script processing model, which would require changes to SVG and might result in a dependency from SVG to HTML5. Anne would like to do this, but I'm not convinced it's wise, and it certainly would be more complex than what we have now. If we ever want to add async="" or defer="" to SVG scripts, then this would probably be a necessary part of that process, though.

Added a paragraph suggesting: "To enable authors to use SVG tools that only accept SVG in its XML form, interactive HTML user agents are encouraged to provide a way to export any SVG fragment as a namespace-well-formed XML fragment."

Added a paragraph defining the allowed content model for SVG <title> elements in text/html documents.

r2904 (and, briefly, r2910) give all the details of this solution. There are still a number of differences between the text in HTML 5 and the proposal brought by the SVG working group. Some of these are addressed further down in the announcement:

SVG-in-XML is case-preserving; SVG-in-HTML is not.
SVG-in-XML requires quoted attribute values; SVG-in-HTML does not.
When SVG-in-XML uses CDATA blocks, they show up as CDATA nodes in the DOM; when SVG-in-HTML uses CDATA blocks, they show up in the DOM as conventional text nodes. [Clarified based on Henri's feedback]
The <svg> element cannot be the root element of a text/html document.

Doug Schepers, who has been the SVG working group's HTML 5 liaison, does not like this solution:

To be honest, I think it's not a good use of the SVG WG's time to provide feedback when Ian already has his mind made up, even if I don't believe that he is citing real evidence to back up his decision. What I see is this: one set of implementers and authors (the SVG WG) and the majority of the author and user community (in public comments) asking for some sort of preservation of SVG as an XML format, even if it's looser and error-corrected in practice, and a few implementers (Jonas and Lachy, most notably) disagreeing, and Ian giving preference to the minority opinion. Maybe there is sound technical rationale for doing so, but I haven't been satisfied on that score.

Turning to technical matters, one of the features of web forms in HTML 5 is allowing the attributes for form submission on either the <form> element (as in HTML 4) or on the submit button (new in HTML 5). Originally, the attributes for submit buttons were named action, enctype, method, novalidate, and target, which exactly mirrored the attribute names that could be declared on the <form> element.

However, in January 2008, Hallvord R. M. Steen (Opera developer) noted that "INPUT action [attribute] breaks web applications frequently. Both GMail and Yahoo mail (the new Oddpost-based version) use input/button.action and were seriously broken by WF2's action attribute."

Following up in November 2008, Ian Hickson replied, "I notice that Opera still supports 'action' and doesn't seem to have problems in GMail; is this still a problem?" to which Hallvord replied, "GMail fixed it on their side a while ago. It is still a problem with Yahoo mail, breaking most buttons in their UI for a browser that supports 'action'. We work around this with a browser.js hack. ('Still a problem' means 'I tested this again a couple of weeks ago and things were still broken without this patch'.)"

Ian replied, "This is certainly problematic. It's unclear what we should do. It's hard to use another attribute name, since the whole point is reusing existing ones... can we trigger this based on quirks mode, maybe? Though I hate to add new quirks." Hallvord did not like that idea: "In my personal opinion, I don't see why re-using attribute names is considered so important if we can find an alternative that feels memorable and usable. How does this look? <input type="submit" formaction="http://www.example.com/">"

Finally, in March 2009, Ian replied:

That seems reasonable. I've changed "action", "method", "target", "enctype" and "novalidate" attributes on <input> and <button> to start with "form" instead: "formaction", "formmethod", "formtarget", "formenctype" and "formnovalidate".

And thus we have r2890: Rename attributes for form submission to avoid clashes with existing usage.
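
To make the effect of the rename concrete, here is a minimal sketch of how a submission could pick its effective attributes once the button-level names carry the "form" prefix. It assumes that a prefixed attribute on the submit button takes precedence over the matching attribute on the <form>. Plain objects stand in for elements, and resolveSubmission is a hypothetical helper written for this illustration, not a DOM API.

```javascript
// Maps each <form> attribute to the renamed submit-button attribute from r2890.
const OVERRIDES = {
  action: "formaction",
  enctype: "formenctype",
  method: "formmethod",
  novalidate: "formnovalidate",
  target: "formtarget",
};

// Hypothetical helper: formAttrs and submitterAttrs are plain objects that
// stand in for the attributes of a <form> and of its submit button.
function resolveSubmission(formAttrs, submitterAttrs = {}) {
  const resolved = {};
  for (const [formName, buttonName] of Object.entries(OVERRIDES)) {
    if (buttonName in submitterAttrs) {
      // Assumption of this sketch: the button's prefixed attribute wins.
      resolved[formName] = submitterAttrs[buttonName];
    } else if (formName in formAttrs) {
      resolved[formName] = formAttrs[formName];
    }
  }
  return resolved;
}

// A legacy script property named "action" on the button is never consulted,
// which is the clash the rename was meant to avoid.
console.log(
  resolveSubmission(
    { action: "/inbox", method: "post" },
    { action: "archive", formaction: "http://www.example.com/" }
  )
);
// → { action: 'http://www.example.com/', method: 'post' }
```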

Other interesting changes this week:

r2889 adds support for select.add(e) and select.options.add(e) with no second argument.
r2888 defines how to determine the character encoding of Web Worker scripts. Briefly, it says to look for a Byte Order Mark, then look at the Content-Type HTTP header, then fall back to UTF-8.
r2898, r2899, r2901, r2914, and r2916 define a locking mechanism to allow thread-safe read/write access to document.cookie and .localStorage. The lock is acquired during page fetching (which sets the cookie based on HTTP headers) and released once the cookie is set. It is also released automatically whenever something modal happens (such as window.alert()). (I first mentioned the discussion of this issue in episode 27. The problem is that Web Workers allows threaded client-side script execution, which means access to shared storage like document.cookie needs to be made explicitly thread-safe with some sort of locking mechanism.)
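
The three-step encoding lookup described for r2888 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name sniffWorkerEncoding, the particular Byte Order Marks it recognizes, and the regular expression used to read the charset parameter are choices made for this sketch, not text taken from the spec.

```javascript
// Hypothetical helper: bytes is an array (or Uint8Array) holding the start of
// the script; contentType is the value of the Content-Type header, or null.
function sniffWorkerEncoding(bytes, contentType) {
  // Step 1: look for a Byte Order Mark (the set checked here is an assumption).
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return "utf-16be";
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf-16le";
  }

  // Step 2: look at the charset parameter of the Content-Type header.
  if (contentType) {
    const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
    if (match) {
      return match[1].toLowerCase();
    }
  }

  // Step 3: fall back to UTF-8.
  return "utf-8";
}

console.log(sniffWorkerEncoding([0x61], "text/javascript; charset=ISO-8859-1"));
// → iso-8859-1
```

Note that in this sketch a BOM takes precedence even when the header names a different encoding, which matches the order given in the summary above.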

Tune in next week for another exciting episode of "This Week in HTML 5."


This Week in HTML 5 – Episode 9

Tuesday, October 14th, 2008

Welcome back to "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.

Most of the changes in the spec this week revolve around the <textarea> element.

r2305 covers editing a <textarea>.
r2309 defines the rows and cols attributes.
r2310 defines the wrap attribute. (Long supported by Netscape and still supported in Internet Explorer, the wrap attribute has never been standardized until now.)
r2311 defines the maxlength attribute.
r2312 defines the required attribute, also new in HTML 5.
r2313 removes support for the accept attribute, which has always been problematic and whose (limited) potential has never been realized in any implementation. This only affects the <textarea> element; <input type=file> elements still have an accept attribute that controls what types of files may be uploaded.

Shelley Powers pointed out that I haven't mentioned the issue of distributed extensibility yet. (The clearest description of the issue is Sam Ruby's message from last year, which spawned a long discussion.) The short version: XHTML (served with the proper MIME type, application/xhtml+xml) supports embedding foreign data in arbitrary namespaces, including SVG and MathML. None of these technologies (XHTML, SVG, or MathML) have had much success on the public web. Despite Chris Wilson's assertion that "we cannot definitively say why XHTML has not been successful on the Web," I think it's pretty clear that Internet Explorer's complete lack of support for the application/xhtml+xml MIME type has something to do with it. (Chris is the project lead on Internet Explorer 8.)

Still, it is true that XHTML does support distributed extensibility, and many people believe that the web would be richer if SVG and MathML (and other as-yet-unknown technologies) could be embedded and rendered in HTML pages. The key phrase here is "as-yet-unknown technologies." In that light, the recent SVG-in-HTML proposal (which I mentioned several weeks ago) is beside the point. The point of distributed extensibility is that it does not require approval from a standards body. "Let a thousand flowers bloom" and all that, where by "flowers," I mean "namespaces." This is an unresolved issue.

Other interesting changes this week:

r2314 ensures that the required attribute only applies to form controls whose value can change.
r2316 defines the name attribute for form controls.
r2317 defines the disabled attribute for form controls.
r2320 defines all the different ways that a form control can fail to satisfy its constraints. For example, an <input maxlength=20> element with a 21-character value.
r2322 defines exactly how form data should be encoded before being submitted to the server. I've previously mentioned character encoding in this series; this revision marks the first time that an HTML specification has acknowledged the existence of the <input type=hidden name=_charset_> method of specifying the character encoding of submitted form data.
r2319 removes support for data templates and repetition templates. These were inventions in the original Web Forms 2 specification, but they were never picked up by any major browser.
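
As a rough illustration of the constraint checks mentioned for r2320 (together with the required attribute), the sketch below reports which constraints a control fails. The control is modeled as a plain object, checkConstraints is a hypothetical helper, and the failure labels are used here only as illustrative names. Measuring length with the JavaScript string length is also an assumption of this sketch.

```javascript
// Hypothetical helper: control is a plain object such as
// { required: true, maxlength: 20, value: "..." }, not a DOM element.
function checkConstraints(control) {
  const failures = [];
  const value = control.value || "";

  // A required control with an empty value fails its constraint.
  if (control.required && value === "") {
    failures.push("valueMissing");
  }

  // A value longer than maxlength fails, e.g. 21 characters with maxlength=20.
  if (typeof control.maxlength === "number" && value.length > control.maxlength) {
    failures.push("tooLong");
  }

  return failures;
}

console.log(checkConstraints({ maxlength: 20, value: "x".repeat(21) }));
// → [ 'tooLong' ]
```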

Around the web:

Anne van Kesteren gave an interview on the state of several bleeding edge web standards.
Simon Pieters ponders what to do about nested <h1> elements.
In response to Olivier Gendrin, Anne van Kesteren points out that the CSSOM standard defines a window.media attribute. Unlike CSS media types, window.media would be directly queryable from JavaScript.

Tune in next week for another exciting episode of "This Week in HTML 5."


Validator.nu HTML Parser 1.1.0

Monday, August 25th, 2008

I have released a new version of the Validator.nu HTML Parser (an implementation of the HTML5 parsing algorithm in Java). The new release supports SVG and MathML subtrees, is faster than the old version, fixes bugs, is more portable and supports applications that want to do document.write().

The parser comes with a sample app that makes it possible to use XSLT programs written for XHTML5+SVG+MathML with text/html.

Warning! The internal APIs have changed. Please refer to the Upgrade Guide below.

Change Log

Made the SAX, DOM and XOM parser entry point constructors default to altering the infoset instead of throwing when the input needs coercing to be an XML 1.0 4th ed. plus Namespaces infoset.
Isolated Java IO dependent code from the parser core. The parser core now compiles on Google Web Toolkit.
Refactored the tokenizer to use a switch branch per state instead of a method per state.
Made various performance tweaks to the tokenizer.
Implemented support for MathML and SVG foreign content. (Note that the SVG part is based on spec text that has been commented out from the spec at the request of the SVG WG.)
Made the parser suspendable after any input character.
Made it possible for custom TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
Made changes to the parser core to make it more suitable for mechanical translation into other object-oriented programming languages that have C-like control structures but not necessarily a garbage collector (with focus on targeting C++). This work is not complete.
Made the HTML serializer do the right thing when input represents a conforming XHTML+SVG+MathML tree. (Results may be bad for non-conforming input trees.)
Developed sample programs for converting between HTML5 and XHTML5 when the input is known to be conforming.
Provided an XML serializer so that the sample code no longer depends on the Xalan serializer.
Improved API documentation.
Fixed bugs in the tokenizer, tree builder and the input stream character encoding decoder.
Made coercion to an XML infoset work according to the HTML5 spec.
Added ID uniqueness checking.
Various other fixes.

Upgrade Guide from 1.0.7 to 1.1.0

In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.

If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.

If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.

If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

You do not need to change your code to upgrade.

If you have your own subclass of TreeBuilder:

The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)

The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.

If you have your own implementation of TokenHandler:

Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.
