Archive for July, 2009
One of the features we've added in HTML5 is a way to include
machine-readable annotations that people can scrape in a simple and
well-defined way. This means that if a site wants to make the
information available, you don't have to rely on brittle
screen-scraping to get the information out.
This is easiest to understand with an example.
Suppose that you had an issue tracking database like Bugzilla, and that you wanted
other tools to be able to pull information about issues in that database.
Today, Bugzilla exposes an XML file for each bug, but this means
maintaining two parallel formats for the bug page. Instead of
providing such a separate interface, you can use microdata, the
new attributes in HTML5. That way, even as your issue tracker changes
its interface from version to version, the underlying data can still
be reliably readable from the same HTML page.
Imagine the markup today looks like this:
<h1>Issue 12941: Too many pies in the pie factory</h1>
To annotate this with microdata, we just mint some names, and then
label each field with those names. The names are in "reverse-DNS"
form; if the bug system was at "example.net", then the names would be
"net.example.bug", "net.example.number", and so on. Thus we get:
<h1>Issue <span itemprop="net.example.number">12941</span>:
<span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
The item="net.example.bug" attribute says "here is a
bug". The various
itemprop attributes provide name/value
pairs for the bug. The snippet above would result in the following
tree of data:
net.example.number = "12941"
net.example.title = "Too many pies in the pie factory"
net.example.reporter = "firstname.lastname@example.org"
net.example.priority = "AAA"
Now, even if the page is dramatically changed, the same
data can still be made unambiguously available:
<h1>Example.Net Bugs Database</h1>
<h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
<p>#<span itemprop="net.example.number">12941</span>; reported
by <span itemprop="net.example.reporter">email@example.com</span>.</p>
<p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
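To make the "same data, different page" point concrete, here is a minimal sketch of how a consuming tool might pull the name/value pairs out of either version of the page. It uses Python's standard html.parser module and is only an illustration: the ItempropCollector class and extract_itemprops function are hypothetical names, the text content of an element is assumed to be its value, and items, nesting, and the other features listed below are ignored.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ItempropCollector(HTMLParser):
    """Collect itemprop name/value pairs, using element text as the value."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._stack = []  # entries: (tag, itemprop name or None, text parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._stack.append((tag, dict(attrs).get("itemprop"), []))

    def handle_data(self, data):
        # Text belongs to every open element that carries an itemprop.
        for _tag, name, parts in self._stack:
            if name is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        # Ignore a stray end tag that matches nothing currently open.
        if not any(open_tag == tag for open_tag, _, _ in self._stack):
            return
        while self._stack:
            open_tag, name, parts = self._stack.pop()
            if name is not None:
                self.pairs[name] = "".join(parts).strip()
            if open_tag == tag:
                break


def extract_itemprops(markup):
    """Return a dict mapping each itemprop name to its text value."""
    collector = ItempropCollector()
    collector.feed(markup)
    collector.close()
    return collector.pairs


ORIGINAL_PAGE = """
<h1>Issue <span itemprop="net.example.number">12941</span>:
<span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
"""

REDESIGNED_PAGE = """
<h1>Example.Net Bugs Database</h1>
<h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
<p>#<span itemprop="net.example.number">12941</span>; reported
by <span itemprop="net.example.reporter">email@example.com</span>.</p>
<p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
"""

for name, value in extract_itemprops(REDESIGNED_PAGE).items():
    print(f"{name} = {value!r}")
```

The number and title come out the same from both layouts, which is the property that makes this more robust than scraping by position in the page.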
This concludes this brief introduction to microdata! Some future blog posts will introduce a few aspects of microdata that I didn't discuss here:
- How to annotate URIs, dates and times, and hidden data using microdata.
- How to nest items within each other.
- How to annotate an item with more than one type, or how to give a single value multiple names.
- The predefined vocabularies.
- How to add annotations outside of an
I see more and more people switch over to HTML5 these days, and to help you make sure you did things correctly, there are some tools at your disposal that might be good to know about.
To make sure you didn't misspell any tag or nest elements in a way that is not allowed, or find similar mistakes in your markup, you can use Validator.nu.
Alt text for images
The above-mentioned validator has a feature to help you quality-check your alternative text for your
img elements. Check the Show Image Report checkbox.
You can also disable images in your browser or try to use a text-only browser; the information that the images convey should still be available (but in text form). Sometimes an image doesn't convey any further information than what the surrounding text already says, and in such cases you should use the empty value: alt="".
For further advice and examples on how to use the
alt attribute, the HTML 5 spec has lots of information on the topic. If you're not going to read it all, just read the section called General guidelines.
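If you want a quick local check before reaching for the validator, something along these lines can list each image with the state of its alternative text. This is a rough sketch using Python's standard html.parser module, not a reimplementation of the validator's image report; the ImageAltReport class, the image_report function, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class ImageAltReport(HTMLParser):
    """Record the src and alt status of every img element."""

    def __init__(self):
        super().__init__()
        self.rows = []  # (src, status, alt text) per image

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or "(no src)"
        if "alt" not in attributes:
            status = "missing"  # no alt attribute at all
        elif not attributes["alt"]:
            status = "empty"    # alt="" (or a bare alt attribute)
        else:
            status = "text"     # some alternative text is present
        self.rows.append((src, status, attributes.get("alt") or ""))


def image_report(markup):
    """Return one (src, status, alt) tuple per img element in the markup."""
    report = ImageAltReport()
    report.feed(markup)
    report.close()
    return report.rows


SAMPLE_PAGE = """
<p><img src="logo.png" alt="Example.Net"> Welcome.</p>
<p>Sales rose this year. <img src="divider.png" alt=""></p>
<p><img src="chart.png"></p>
"""

for src, status, alt in image_report(SAMPLE_PAGE):
    print(f"{src}: alt {status} {alt!r}")
```

A script like this can only tell you whether the attribute is there; whether the text is a good replacement for the image still needs a human reading it.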
The document outline is the structure of sections in the document, built from the
h1 to h6 elements as well as the new sectioning elements (section, article, aside and
nav). The document outline is more commonly known as the Table of Contents.
To make sure that you have used the new sectioning elements correctly, you can check that the resulting outline makes sense with the HTML5 Outliner.
If you see "Untitled Section" and didn't expect them, chances are that you should have used
div instead of
If you have a subtitle of a heading that shouldn't be part of the document outline, you should use the hgroup element:
<hgroup>
<h1>The World Wide Web Consortium</h1>
<h2>Leading the Web to Its Full Potential...</h2>
</hgroup>
In this example, only the
h1 will show up in the document outline.
(This only applies to
table elements used for tabular data — not for layout.)
HTML tables have two types of cells: header cells (
th elements) and data cells (
td elements). These cells are associated together in the table: a data cell in the middle of the table can have associated header cells, typically in the first row and/or the first column of the table. To a user who can see, this association seems obvious, but users who cannot see need some help from the computer to understand which cells are associated with which.
You should mark up your header cells with the
th element and check that your cells get associated as you intended using the Table Inspector. If it isn't as you intended, you can consider simplifying or rearranging your table, or you can override the default association using
If you know about other tools for helping with quality assurance of HTML5, or if you have made your own, please share!
The HTML5 parsing algorithm is meant to demystify HTML parsing and
make it uniform across implementations in a backwards-compatible way.
The algorithm has had “in the lab” testing, but so far it hasn’t
been tested inside a browser by a large number of people. You
can help change that now!
A while ago, an implementation of the HTML5 parsing algorithm
landed on mozilla-central
preffed off. Anyone who is testing Firefox nightly builds can now opt
to turn on the HTML5 parser and test it.
How to Participate?
First, this isn’t release-quality software. Testing the HTML5
parser carries all the same risks as testing a nightly build in
general, and then some. It may crash, it may corrupt your Firefox
profile, etc. If you aren’t comfortable with taking the risks
associated with running nightly builds, you shouldn’t participate.
If you are still comfortable with testing, download a trunk
build, run it, navigate to
about:config and flip the html5.enable preference to true. This
makes Gecko use the HTML5 parser when loading pages into the content
area and when setting
innerHTML. The HTML5 parser is not
used for HTML embedded in feeds, Netscape bookmark import, View
Source, etc., yet.
The html5.enable preference doesn’t require a
restart to take effect. It takes effect the next time you load a page.
What to Test?
The main thing is getting the HTML5 parser exposed to a wide range
of real Web content that people browse. This may turn up crashes or
So the way to help is to use nightly builds with the HTML5 parser
for browsing as usual. If you see no difference, things are going
well! If you see a page misbehaving—or, worse, crashing—with the
HTML5 parser turned on but not with it turned off, please report the
Please file bugs in the
“Core” product under “HTML: Parser” component with “[HTML5]
” at the start of the summary.
First and foremost, please refer to the list
of known bugs.
However, I’d like to highlight a particular issue: Support for
comments ending with
--!> is in the spec, but the
hasn’t landed, yet. Support for similar endings of
pseudo-comment escapes within
script element content is
not in the spec yet. The practical effect is that the rest of the page
may end up being swallowed up inside a comment or a
Another issue is that the new parser doesn’t yet inhibit
document.write() in places where it shouldn’t be
allowed per spec but where the old parser allowed it.
Is There Anything New?
So what’s fun if success is that you notice no change? There are
important technical things under the hood—like TCP packet
boundaries not affecting the parse result and there never being
unnotified nodes in the tree when the event loop spins—but you
aren’t supposed to notice.
However, there is a major new visible feature, too. With the HTML5
parser, you can use SVG and MathML in
This means that you can:
And yes, you can even put SVG inside MathML
or MathML inside
<foreignObject>. The mixing
you’ve seen in XML is now supported in HTML, too.
If you aren’t concerned with taking the steps to make things
degrade nicely in browsers that don’t support SVG and MathML in
HTML, you can simply copy and paste XML output from your favorite SVG
or MathML editor into your HTML source as long as the editor doesn’t
use namespace prefixes for elements and uses the prefix xlink
for XLink attributes.
If you don’t use the XML empty element syntax and you put your
SVG text nodes in CDATA sections, the page will degrade gracefully in
older HTML browsers so that the image simply disappears but the rest
of the page is intact. You can even put a fallback bitmap as
<desc>. Unfortunately, there isn’t a
similar technique for MathML, though if you want to develop one, I
suggest experimenting with the
There are known issues with matching camelCase names with