The WHATWG Blog — 2009

Archive for July, 2009

Microdata (part 1)

Friday, July 31st, 2009

One of the features we've added in HTML5 is a way to include machine-readable annotations that people can scrape in a simple and well-defined way. This means that if a site wants to make the information available, you don't have to rely on brittle screen-scraping to get the information out.

This is easiest to understand with an example.

Suppose that you had an issue tracking database like Bugzilla, and that you wanted other tools to be able to pull information about issues in that database.

Today, Bugzilla exposes an XML file for each bug, but this means maintaining two parallel formats for the bug page. Instead of providing such a separate interface, you can use microdata, the new attributes in HTML5. That way, even as your issue tracker changes its interface from version to version, the underlying data can still be reliably readable from the same HTML page.

Imagine the markup today looks like this:

<body>
 <h1>Issue 12941: Too many pies in the pie factory</h1>
 <dl>
  <dt>Reporter</dt>
  <dd>[email protected]</dd>
  <dt>Priority</dt>
  <dd>AAA</dd>
  ...

To annotate this with microdata, we just mint some names, and then label each field with those names. The names are in "reverse-DNS" form; if the bug system was at "example.net", then the names would be "net.example.bug", "net.example.number", and so on. Thus we get:

<body item="net.example.bug">
 <h1>Issue <span itemprop="net.example.number">12941</span>:
  <span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
 <dl>
  <dt>Reporter</dt>
  <dd itemprop="net.example.reporter">[email protected]</dd>
  <dt>Priority</dt>
  <dd itemprop="net.example.priority">AAA</dd>
  ...

The item="net.example.bug" attribute says "here is a bug". The various itemprop attributes provide name/value pairs for the bug. The snippet above would result in the following tree of data:

net.example.bug:
  net.example.number = "12941"
  net.example.title = "Too many pies in the pie factory"
  net.example.reporter = "[email protected]"
  net.example.priority = "AAA"

Now it doesn't matter if the page changes dramatically; the same data can still be made unambiguously available:

<body>
 <h1>Example.Net Bugs Database</h1>
 <section item="net.example.bug">
  <h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
  <p>#<span itemprop="net.example.number">12941</span>; reported
  by <span itemprop="net.example.reporter">[email protected]</span>.</p>
  <p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
  ...
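
To make the point concrete, here is a minimal sketch of a consumer, written in Python with only the standard library. It is not the extraction algorithm from the spec: it assumes one non-nested item per page, reads every value from the element's text content, and does not check that a property actually sits inside the item element. The names MicrodataSketch and extract, the closing tags that complete the snippets, and the reporter address are made up for illustration.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collects itemprop name/value pairs for the first item="" on a page."""

    def __init__(self):
        super().__init__()
        self.item_type = None   # value of the first item="" attribute seen
        self.props = {}         # property name -> text value
        self._open = []         # stack of [tag, name, text_parts, depth]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "item" in attrs and self.item_type is None:
            self.item_type = attrs["item"]
        if "itemprop" in attrs:
            self._open.append([tag, attrs["itemprop"], [], 0])
        elif self._open and self._open[-1][0] == tag:
            # Same tag name nested inside a property element: remember it
            # so that its end tag does not close the property early.
            self._open[-1][3] += 1

    def handle_data(self, data):
        # Text counts towards every property element that is still open.
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if not self._open or self._open[-1][0] != tag:
            return
        if self._open[-1][3] > 0:
            self._open[-1][3] -= 1
            return
        _, name, parts, _ = self._open.pop()
        self.props[name] = " ".join("".join(parts).split())


def extract(html):
    """Return (item type, {property name: value}) for one page."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.item_type, parser.props


ORIGINAL = """
<body item="net.example.bug">
 <h1>Issue <span itemprop="net.example.number">12941</span>:
  <span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
 <dl>
  <dt>Reporter</dt>
  <dd itemprop="net.example.reporter">reporter@example.net</dd>
  <dt>Priority</dt>
  <dd itemprop="net.example.priority">AAA</dd>
 </dl>
</body>
"""

REDESIGN = """
<body>
 <h1>Example.Net Bugs Database</h1>
 <section item="net.example.bug">
  <h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
  <p>#<span itemprop="net.example.number">12941</span>; reported
  by <span itemprop="net.example.reporter">reporter@example.net</span>.</p>
  <p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
 </section>
</body>
"""

# Both layouts should yield the same item type and the same properties.
print(extract(ORIGINAL) == extract(REDESIGN))
print(extract(REDESIGN))
```

The consumer never looks at headings, paragraphs or definition lists, only at the item and itemprop attributes, which is why the redesign does not affect it.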

This concludes this brief introduction to microdata! Some future blog posts will introduce a few aspects of microdata that I didn't discuss here:

How to annotate URIs, dates and times, and hidden data using microdata.
How to nest items within each other.
How to annotate an item with more than one type, or how to give a single value multiple names.
The predefined vocabularies.
How to add annotations outside of an item="", using subject="".

Posted in WHATWG

Quality Assurance tools for HTML5

Sunday, July 19th, 2009

I see more and more people switching over to HTML5 these days. To help you make sure you did things correctly, here are some tools that are good to know about.

Validator.nu (and its Show Image Report feature)
HTML5 Outliner
Table Inspector

HTML5 validator

To make sure you didn't misspell a tag, nest elements in a way that is not allowed, or make similar mistakes in your markup, you can use Validator.nu.

Alt text for images

The above-mentioned validator has a feature to help you quality-check the alternative text of your img elements. To use it, check the Show Image Report checkbox.

You can also disable images in your browser or try to use a text-only browser — the information that the images convey should still be available (but in text form). Sometimes an image doesn't convey any further information than what the surrounding text already says, and in such cases you should use the empty value: alt="".
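
If you want a quick local check before reaching for the validator, the three cases above (text alternative, deliberately empty alt, and no alt at all) can be told apart with a few lines of Python. This is only a sketch and not how the Image Report works; the class name AltReport and the file names are invented for the example. It cannot judge whether the text is any good, it only sorts the images into piles for a human to review.

```python
from html.parser import HTMLParser


class AltReport(HTMLParser):
    """Sorts img elements by the state of their alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []     # src of images with no alt attribute at all
        self.empty = []       # src of images with alt="" (no extra information)
        self.described = {}   # src -> alternative text

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src") or "(no src)"
        if "alt" not in attrs:
            self.missing.append(src)
        elif not attrs["alt"]:
            self.empty.append(src)
        else:
            self.described[src] = attrs["alt"]


SAMPLE = """
<p><img src="chart.png" alt="Open bugs fell from 40 to 12 in July"></p>
<p><img src="divider.png" alt=""></p>
<p><img src="logo.png"></p>
"""

report = AltReport()
report.feed(SAMPLE)
report.close()
print("missing alt:", report.missing)
print("empty alt:", report.empty)
print("described:", report.described)
```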

For further advice and examples on how to use the alt attribute, the HTML 5 spec has lots of information on the topic. If you're not going to read it all, just read the section called General guidelines.

Document outline

The document outline is the structure of sections in the document, built from the h1-h6 elements as well as the new sectioning elements (section, article, aside, nav). The document outline is more commonly known as the Table of Contents.

To make sure that you have used the new sectioning elements correctly, you can check that the resulting outline makes sense with the HTML5 Outliner.

If you see "Untitled Section" entries that you didn't expect, chances are that you should have used div instead of section.

If you have a subtitle of a heading that shouldn't be part of the document outline, you should use the hgroup element:

<hgroup>
 <h1>The World Wide Web Consortium</h1>
 <h2>Leading the Web to Its Full Potential...</h2>
</hgroup>

In this example, only the h1 will show up in the document outline.
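
The effect of hgroup can be sketched in Python as follows. This is a deliberately flat approximation, not the outline algorithm from the spec: it ignores sectioning elements entirely and only shows that a group contributes its highest-ranked heading. The class name HeadingList and the extra "About" heading are assumptions made for the example.

```python
from html.parser import HTMLParser

RANK = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingList(HTMLParser):
    """Lists the headings that count, treating each hgroup as one heading."""

    def __init__(self):
        super().__init__()
        self.outline = []      # (rank, text) for each heading that counts
        self._group = None     # headings collected inside the current hgroup
        self._current = None   # [rank, text_parts] while inside h1-h6

    def handle_starttag(self, tag, attrs):
        if tag == "hgroup":
            self._group = []
        elif tag in RANK:
            self._current = [RANK[tag], []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if tag in RANK and self._current is not None:
            rank, parts = self._current
            heading = (rank, " ".join("".join(parts).split()))
            self._current = None
            if self._group is not None:
                self._group.append(heading)
            else:
                self.outline.append(heading)
        elif tag == "hgroup" and self._group is not None:
            if self._group:
                # Only the highest-ranked heading represents the group.
                self.outline.append(min(self._group, key=lambda h: h[0]))
            self._group = None


SAMPLE = """
<hgroup>
 <h1>The World Wide Web Consortium</h1>
 <h2>Leading the Web to Its Full Potential...</h2>
</hgroup>
<h2>About</h2>
"""

headings = HeadingList()
headings.feed(SAMPLE)
headings.close()
print(headings.outline)
```

The subtitle is dropped, while the h2 outside the group is kept.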

Table inspector

(This only applies to table elements used for tabular data — not for layout.)

HTML tables have two types of cells: header cells (th elements) and data cells (td elements). These cells are associated together in the table: a data cell in the middle of the table can have associated header cells, typically in the first row and/or the first column of the table. To a user who can see, this association seems obvious, but users who cannot see need some help from the computer to understand which cells are associated with which.

You should mark up your header cells with the th element and check that your cells get associated as you intended using the Table Inspector. If it isn't as you intended, you can consider simplifying or rearranging your table, or you can override the default association using scope or headers attributes.
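
To get a feel for what "associated" means, here is a rough Python sketch of the default case for a simple table. It is an assumption-laden simplification, not the algorithm the Table Inspector or the spec uses: no rowspan or colspan, no scope or headers attributes, no nested tables. A data cell is simply associated with every th above it in its column and every th to its left in its row. The names TableGrid and headers_for and the sample figures are invented.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Reads a simple table into rows of (tag, text) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # each row is a list of ("th" | "td", text)
        self._cell = None   # [tag, text_parts] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag not in ("tr", "th", "td"):
            return
        self._close_cell()   # end tags may have been omitted
        if tag == "tr" or not self.rows:
            self.rows.append([])
        if tag != "tr":
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td", "tr", "table"):
            self._close_cell()

    def _close_cell(self):
        if self._cell is not None:
            tag, parts = self._cell
            self.rows[-1].append((tag, " ".join("".join(parts).split())))
            self._cell = None


def headers_for(rows, r, c):
    """Header texts for the cell at row r, column c (both zero-based)."""
    above = [rows[i][c][1] for i in range(r)
             if c < len(rows[i]) and rows[i][c][0] == "th"]
    left = [text for tag, text in rows[r][:c] if tag == "th"]
    return above + left


SAMPLE = """
<table>
 <tr><th></th><th>Open</th><th>Closed</th></tr>
 <tr><th>Parser</th><td>12</td><td>30</td></tr>
 <tr><th>Layout</th><td>7</td><td>41</td></tr>
</table>
"""

grid = TableGrid()
grid.feed(SAMPLE)
grid.close()
print(headers_for(grid.rows, 1, 1))   # headers for the cell containing 12
print(headers_for(grid.rows, 2, 2))   # headers for the cell containing 41
```

A user who cannot see the table would hear something like "Open, Parser: 12" instead of a bare number, which is the help the paragraph above refers to.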

Other tools?

If you know about other tools for helping with quality assurance of HTML5, or if you have made your own, please share!

Posted in Conformance Checking

Help Test HTML5 Parsing in Gecko

Wednesday, July 8th, 2009

The HTML5 parsing algorithm is meant to demystify HTML parsing and make it uniform across implementations in a backwards-compatible way. The algorithm has had “in the lab” testing, but so far it hasn’t been tested inside a browser by a large number of people. You can help change that now!

A while ago, an implementation of the HTML5 parsing algorithm landed on mozilla-central preffed off. Anyone who is testing Firefox nightly builds can now opt to turn on the HTML5 parser and test it.

How to Participate?

First, this isn’t release-quality software. Testing the HTML5 parser carries all the same risks as testing a nightly build in general, and then some. It may crash, it may corrupt your Firefox profile, etc. If you aren’t comfortable with taking the risks associated with running nightly builds, you shouldn’t participate.

If you are still comfortable with testing, download a trunk nightly build, run it, navigate to about:config and flip the preference named html5.enable to true. This makes Gecko use the HTML5 parser when loading pages into the content area and when setting innerHTML. The HTML5 parser is not used for HTML embedded in feeds, Netscape bookmark import, View Source, etc., yet.

The html5.enable preference doesn’t require a restart to take effect. It takes effect the next time you load a page.

What to Test?

The main thing is getting the HTML5 parser exposed to a wide range of real Web content that people browse. This may turn up crashes or compatibility problems.

So the way to help is to use nightly builds with the HTML5 parser for browsing as usual. If you see no difference, things are going well! If you see a page misbehaving—or, worse, crashing—with the HTML5 parser turned on but not with it turned off, please report the problem.

Reporting Bugs

Please file bugs in the “Core” product under “HTML: Parser” component with “[HTML5] ” at the start of the summary.

Known Problems

First and foremost, please refer to the list of known bugs.

However, I’d like to highlight a particular issue: Support for comments ending with --!> is in the spec, but the patch hasn’t landed yet. Support for similar endings of pseudo-comment escapes within script element content is not in the spec yet. The practical effect is that the rest of the page may end up being swallowed up inside a comment or a script element.

Another issue is that the new parser doesn’t yet inhibit document.write() in places where it shouldn’t be allowed per spec but where the old parser allowed it.

Is There Anything New?

So what’s fun if success is that you notice no change? There are important technical things under the hood—like TCP packet boundaries not affecting the parse result and there never being unnotified nodes in the tree when the event loop spins—but you aren’t supposed to notice.

However, there is a major new visible feature, too. With the HTML5 parser, you can use SVG and MathML in text/html pages. This means that you can:

Use SVG graphics inline without having to change your HTML content to work with XML parsing and without having to develop an alternative page for IE.
Use properly laid out math without having to change your HTML content to work with XML parsing.
Use SVG effects without external files.

And yes, you can even put SVG inside MathML <annotation-xml> or MathML inside <foreignObject>. The mixing you’ve seen in XML is now supported in HTML, too.

If you aren’t concerned with taking the steps to make things degrade nicely in browsers that don’t support SVG and MathML in HTML, you can simply copy and paste XML output from your favorite SVG or MathML editor into your HTML source as long as the editor doesn’t use namespace prefixes for elements and uses the prefix xlink for XLink attributes.
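
Whether an editor's output meets those two conditions can be checked before pasting. The following Python sketch uses the standard library's expat bindings with namespace processing left off, so element and attribute names arrive exactly as written in the file. The function name paste_problems and the sample markup are made up for illustration, and the check covers only the two conditions named above, nothing else about the file.

```python
import xml.parsers.expat

XLINK_NS = "http://www.w3.org/1999/xlink"


def paste_problems(xml_text):
    """List reasons why editor output may not paste cleanly into text/html."""
    problems = []

    def start(name, attrs):
        if ":" in name:
            problems.append(f"element <{name}> uses a namespace prefix")
        for attr, value in attrs.items():
            bound_elsewhere = (
                attr.startswith("xmlns:")
                and value == XLINK_NS
                and attr != "xmlns:xlink"
            )
            if bound_elsewhere:
                prefix = attr[len("xmlns:"):]
                problems.append(
                    f"XLink namespace is bound to prefix '{prefix}' instead of 'xlink'"
                )

    # No namespace separator is given, so names are reported as written.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return problems


PLAIN = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<a xlink:href="#pie"><circle id="pie" r="5"/></a></svg>'
)

PREFIXED = (
    '<svg:svg xmlns:svg="http://www.w3.org/2000/svg" '
    'xmlns:xl="http://www.w3.org/1999/xlink">'
    '<svg:a xl:href="#pie"/></svg:svg>'
)

print(paste_problems(PLAIN))
for problem in paste_problems(PREFIXED):
    print(problem)
```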

If you don’t use the XML empty element syntax and you put your SVG text nodes in CDATA sections, the page will degrade gracefully in older HTML browsers: the image simply disappears but the rest of the page stays intact. You can even put a fallback bitmap as <img> inside <desc>. Unfortunately, there isn’t a similar technique for MathML, though if you want to develop one, I suggest experimenting with <annotation> as your <desc>-like container.

There are known issues with matching camelCase names with Selectors or getElementsByTagName, though.

Posted in Browsers, Processing Model, Syntax