The WHATWG Blog

Please leave your sense of logic at the door, thanks!

This Summer in HTML 5 – Episode 33

I hope you enjoyed your summer. My oldest son started kindergarten today. Let's talk about HTML 5.

When last we checked, HTML 5 was humming along towards Last Call in October. Much has been made of this date; I won't bore you with the details, except to say that HTML 5 is very close to entering the next phase of its existence. Regular readers of this blog already know that parts of HTML 5 are already shipping in major browsers. The recently-released Firefox 3.5 supports <audio> and <video>, offline web applications, the drag-and-drop API, and the <canvas> text API. (Technically Firefox 3.0 supported the <canvas> text API too, properly cordoned off in its own vendor-specific functions because the API was not finalized at the time. You can paper over the differences fairly easily.)

So what new and exciting stuff has been added to HTML 5 this summer?

Microdata

At the table in the kitchen, there were three bowls of porridge. Goldilocks was hungry. She tasted the porridge from the first bowl. "This porridge is too hot!" she exclaimed.

So, she tasted the porridge from the second bowl. "This porridge is too cold," she said.

So, she tasted the last bowl of porridge. "Ahhh, this porridge is just right," she said happily and she ate it all up.

— The Story of Goldilocks and the Three Bears

r3074 introduces the concept of microdata. Microdata is designed to allow authors to include additional semantics in their pages for which there is no appropriate HTML element or attribute. For example, HTML is not expressive enough to mark up a contact in an address book (complete with individual fields for name, street address, email, and phone number) or an event on a calendar (complete with start date, end date, and location). Instead of creating new elements and attributes for every possible vocabulary, you can use the microdata attributes to enhance existing elements.

There are a number of other technologies with goals similar to microdata, including microformats and RDFa. As Ian Hickson explained in the message "Annotating structured data that HTML has no semantics for" that introduced microdata, microformats are fine for specific formats but are not flexible enough to be parseable by a generic parser, while RDFa relies on CURIEs and XML namespaces in a way that would require changes to HTML parsing algorithms to work interoperably between text/html and application/xhtml+xml. (Forgive me if I didn't explain that very well. There was a lot of yelling and very little explaining once it became clear that RDFa was not going to be included in HTML 5, so I probably missed some of the nuances.) Work is ongoing to create an RDFa-in-HTML specification.

ARIA

ARIA stands for "Accessible Rich Internet Applications." It is an emerging standard for making web applications more accessible to people using assistive technologies (including, but not limited to, blind people who browse the web with the help of screenreaders). The basic technique is for authors to define "roles" and "states" on individual elements to indicate what sort of control the element represents. For example, HTML has no "treeview" control, but JavaScript libraries like Dojo let you include a treeview in your web-based application with a combination of generic HTML elements, a few images, and a whole lotta JavaScript. ARIA gives you a way to say that the "treeview" HTML element (which is probably just a <div>) is acting as a treeview (that's its "role"). Each item in the treeview can be in the "expanded" or "collapsed" state, and the state changes as the user interacts with the control. Major browsers, including Microsoft Internet Explorer (8) and Firefox (2+) will notice the custom role on the element and announce to assistive technologies that this <div> element is acting as a treeview. (In fact, Dojo already supports these roles and states, due to work funded by IBM.)

r3657 adds the section Annotations for assistive technology products to HTML 5. There are still a number of unanswered questions about how the custom semantics defined by ARIA interact with the native semantics defined by HTML 5.

Everything Old is New Again

As regular readers of this blog already know, HTML 5 goes to great lengths to specify existing browser behavior, even to the point of "willfully violating" other specifications. Vast stretches of the HTML 5 specification are devoted to elements, attributes, and scripting features that nobody likes but everyone is required to support. To that end, r3502 defines the <listing>, <plaintext>, <acronym>, <xmp>, and <dir> elements; r3133 and r3141 define the <marquee> element; r3155, r3403, r3409, and r3410 define document.all.

Other important changes include the location.reload() method (r3220), the textarea.textLength property (r3177), a new rollback() method for synchronous SQL transactions r3210), and the ability to upload multiple files at a time from a web form (r3544 and r3545).

Features Removed

"The food here is terrible!"

"I know, and such small portions!"

(variously attributed)

Everyone complains that HTML 5 is too big, but nobody has any reasonable solution for making it smaller. (Splitting it into multiple specifications to make it "smaller" is like cutting a pie into slices to give it fewer calories.) However, based on implementor feedback, HTML 5 has shed a few poundsfeatures this summer. To wit:

Administrative Stuff

"Man didn't the right form."

"What man?"

"The man from the cat detector van."

"The loony detector van, you mean."

"Look, it's people like you what cause unrest."

Monty Python's "Fish License"

When web servers send you HTML, they are supposed to label it as such with the HTTP Content-Type header. Each content type (an HTML page, a JPEG image, an MPEG-4 video) has its own "MIME type." MIME types must be registered with the IANA.

r3552 adds the registration information for text/html, application/xhtml+xml, text/event-stream, text/cache-manifest, and application/microdata+json. r3582 adds the registration information for text/ping.

Standards frequently include references to other standards. References can be "normative" or "informative." To quote RFC 3967 (a standard about creating standards), "a normative reference specifies a document that must be read to fully understand or implement the subject matter in the new [standard], or whose contents are effectively part of the new [standard], as its omission would leave the new [standard] incompletely specified. An informative reference is not normative; rather, it provides only additional background information." r3580 adds a list of references to HTML 5.

Tune in next week as we return to our regular weekly schedule of "This Week in HTML 5."

Tags: , , , ,
Posted in Weekly Review | 7 Comments »

Microdata (part 1)

One of the features we've added in HTML5 is a way to include machine-readable annotations that people can scrape in a simple and well-defined way. This means that if a site wants to make the information available, you don't have to rely on brittle screen-scraping to get the information out.

This is easiest to understand with an example.

Suppose that you had an issue tracking database like Bugzilla, and that you wanted other tools to be able to pull information about issues in that database.

Today, Bugzilla exposes an XML file for each bug, but this means maintaining two parallel formats for the bug page. Instead of providing such a separate interface, you can use microdata, the new attributes in HTML5. That way, even as your issue tracker changes its interface from version to version, the underlying data can still be reliably readable from the same HTML page.

Imagine the markup today looks like this:

<body>
 <h1>Issue 12941: Too many pies in the pie factory</h1>
 <dl>
  <dt>Reporter</dt>
  <dd>[email protected]</dd>
  <dt>Priority</dt>
  <dd>AAA</dd>
  ...

To annotate this with microdata, we just mint some names, and then label each field with those names. The names are in "reverse-DNS" form; if the bug system was at "example.net", then the names would be "net.example.bug", "net.example.number", and so on. Thus we get:

<body item="net.example.bug">
 <h1>Issue <span itemprop="net.example.number">12941</span>:
  <span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
 <dl>
  <dt>Reporter</dt>
  <dd itemprop="net.example.reporter">[email protected]</dd>
  <dt>Priority</dt>
  <dd itemprop="net.example.priority">AAA</dd>
  ...

The item="net.example.bug" attribute says "here is a bug". The various itemprop attributes provide name/value pairs for the bug. The snippet above would result in the following tree of data:

net.example.bug:
  net.example.number = "12941"
  net.example.title = "Too many pies in the pie factory"
  net.example.reporter = "[email protected]"
  net.example.priority = "AAA"

Now it doesn't matter if the page is dramatically changed, the same data can still be made unambiguously available:

<body>
 <h1>Example.Net Bugs Database</h1>
 <section item="net.example.bug">
  <h1 itemprop="net.example.title">Too many pies in the pie factory</span></h1>
  <p>#<span itemprop="net.example.number">12941</span>; reported
  by <span itemprop="net.example.reporter">[email protected]</span>.</p>
  <p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
  ...

This concludes this brief introduction to microdata! Some future blog posts will introduce a few aspects of microdata that I didn't discuss here:

Posted in WHATWG | 19 Comments »

Quality Assurance tools for HTML5

I see more and more people switch over to HTML5 these days, and to help you make sure you did things correctly, there are some tools at your disposal that might be good to know about.

HTML5 validator

To make sure you didn't misspell any tag or nest elements in a way that is not allowed, or find similar mistakes in your markup, you can use Validator.nu.

Alt text for images

The above-mentioned validator has a feature to help you quality-check your alternative text for your img elements. Check the Show Image Report checkbox.

You can also disable images in your browser or try to use a text-only browser — the information that the images convey should still be available (but in text form). Sometimes an image doesn't convey any further information than what the surrounding text already says, and in such cases you should use the empty value: alt="".

For further advice and examples on how to use the alt attribute, the HTML 5 spec has lots of information on the topic. If you're not going to read it all, just read the section called General guidelines.

Document outline

The document outline is the structure of sections in the document, built from the h1-h6 elements as well as the new sectioning elements (section, article, aside, nav). The document outline is more commonly known as the Table of Contents.

To make sure that you have used the new sectioning elements correctly, you can check that the resulting outline makes sense with the HTML5 Outliner.

If you see "Untitled Section" and didn't expect them, chances are that you should have used div instead of section.

If you have a subtitle of a heading that shouldn't be part of the document outline, you should use the hgroup element:

<hgroup>
 <h1>The World Wide Web Consortium</h1>
 <h2>Leading the Web to Its Full Potential...</h2>
</hgroup>

In this example, only the h1 will show up in the document outline.

Table inspector

(This only applies to table elements used for tabular data — not for layout.)

HTML tables have two types of cells: header cells (th elements) and data cells (td elements). These cells are associated together in the table: a data cell in the middle of the table can have associated header cells, typically in the first row and/or the first column of the table. To a user who can see, this association seems obvious, but users who cannot see need some help from the computer to understand which cells are associated with which.

You should mark up your header cells with the th element and check that your cells get associated as you intended using the Table Inspector. If it isn't as you intended, you can consider simplifying or rearranging your table, or you can override the default association using scope or headers attributes.

Other tools?

If you know about other tools for helping with quality assurance of HTML5, or if you have made your own, please share!

Tags: , ,
Posted in Conformance Checking | 4 Comments »

Help Test HTML5 Parsing in Gecko

The HTML5 parsing algorithm is meant to demystify HTML parsing and make it uniform across implementations in a backwards-compatible way. The algorithm has had “in the lab” testing, but so far it hasn’t been tested inside a browser by a large number of people. You can help change that now!

A while ago, an implementation of the HTML5 parsing algorithm landed on mozilla-central preffed off. Anyone who is testing Firefox nightly builds can now opt to turn on the HTML5 parser and test it.

How to Participate?

First, this isn’t release-quality software. Testing the HTML5 parser carries all the same risks as testing a nightly build in general, and then some. It may crash, it may corrupt your Firefox profile, etc. If you aren’t comfortable with taking the risks associated with running nighly builds, you shouldn’t participate.

If you are still comfortable with testing, download a trunk nightly build, run it, navigate to about:config and flip the preference named html5.enable to true. This makes Gecko use the HTML5 parser when loading pages into the content area and when setting innerHTML. The HTML5 parser is not used for HTML embedded in feeds, Netscape bookmark import, View Source, etc., yet.

The html5.enable preference doesn’t require a restart to take effect. It takes effect the next time you load a page.

What to Test?

The main thing is getting the HTML5 parser exposed to a wide range of real Web content that people browse. This may turn up crashes or compatibility problems.

So the way to help is to use nightly builds with the HTML5 parser for browsing as usual. If you see no difference, things are going well! If you see a page misbehaving—or, worse, crashing—with the HTML5 parser turned on but not with it turned off, please report the problem.

Reporting Bugs

Please file bugs in the “Core” product under “HTML: Parser” component with “[HTML5] ” at the start of the summary.

Known Problems

First and foremost, please refer to the list of known bugs.

However, I’d like to highlight a particular issue: Support for comments ending with --!> is in the spec, but the patch hasn’t landed, yet. Support for similar endings of pseudo-comment escapes within script element content is not in the spec yet. The practical effect is that the rest of the page may end up being swallowed up inside a comment or a script element.

Another issue is that the new parser doesn’t yet inhibit document.write() in places where it shouldn’t be allowed per spec but where the old parser allowed it.

Is There Anything New?

So what’s fun if success is that you notice no change? There are important technical things under the hood—like TCP packet boundaries not affecting the parse result and there never being unnotified nodes in the tree when the event loop spins—but you aren’t supposed to notice.

However, there is a major new visible feature, too. With the HTML5 parser, you can use SVG and MathML in text/html pages. This means that you can:

And yes, you can even put SVG inside MathML <annotation-xml> or MathML inside <foreignObject>. The mixing you’ve seen in XML is now supported in HTML, too.

If you aren’t concerned with taking the steps to make things degrade nicely in browsers that don’t support SVG and MathML in HTML, you can simply copy and paste XML output from your favorite SVG or MathML editor into your HTML source as long as the editor doesn’t use namespace prefixes for elements and uses the prefix xlink for XLink attributes.

If you don’t use the XML empty element syntax and you put you SVG text nodes in CDATA sections, the page will degrade gracefully in older HTML browser so that the image simply disappears but the rest of the page is intact. You can even put a fallback bitmap as <img> inside <desc>. Unfortunately, there isn’t a similar technique for MathML, though if you want to develop one, I suggest experimenting with the <annotation> as your <desc>-like container.

There are known issues with matching camelCase names with Selectors or getElementByTagName, though.

Posted in Browsers, Processing Model, Syntax | 8 Comments »

Validator.nu HTML Parser 1.2.1

Version 1.2.1 of the Validator.nu HTML Parser is now available. It fixes an incompatibility with the DOM implementation of the latest Xerces.

Posted in DOM, Processing Model, Syntax | Comments Off on Validator.nu HTML Parser 1.2.1