I hope you enjoyed your summer. My oldest son started kindergarten today. Let's talk about HTML 5.
When last we checked, HTML 5 was humming along towards Last Call in October. Much has been made of this date; I won't bore you with the details, except to say that HTML 5 is very close to entering the next phase of its existence. Regular readers of this blog already know that parts of HTML 5 are already shipping in major browsers. The recently-released Firefox 3.5 supports <audio> and <video>, offline web applications, the drag-and-drop API, and the <canvas> text API. (Technically Firefox 3.0 supported the <canvas> text API too, properly cordoned off in its own vendor-specific functions because the API was not finalized at the time. You can paper over the differences fairly easily.)
So what new and exciting stuff has been added to HTML 5 this summer?
Microdata
At the table in the kitchen, there were three bowls of porridge. Goldilocks was hungry. She tasted the porridge from the first bowl. "This porridge is too hot!" she exclaimed.
So, she tasted the porridge from the second bowl. "This porridge is too cold," she said.
So, she tasted the last bowl of porridge. "Ahhh, this porridge is just right," she said happily and she ate it all up.
— The Story of Goldilocks and the Three Bears
r3074 introduces the concept of microdata. Microdata is designed to allow authors to include additional semantics in their pages for which there is no appropriate HTML element or attribute. For example, HTML is not expressive enough to mark up a contact in an address book (complete with individual fields for name, street address, email, and phone number) or an event on a calendar (complete with start date, end date, and location). Instead of creating new elements and attributes for every possible vocabulary, you can use the microdata attributes to enhance existing elements.
There are a number of other technologies with goals similar to microdata, including microformats and RDFa. As Ian Hickson explained in the message "Annotating structured data that HTML has no semantics for" that introduced microdata, microformats are fine for specific formats but are not flexible enough to be parseable by a generic parser, while RDFa relies on CURIEs and XML namespaces in a way that would require changes to HTML parsing algorithms to work interoperably between text/html and application/xhtml+xml. (Forgive me if I didn't explain that very well. There was a lot of yelling and very little explaining once it became clear that RDFa was not going to be included in HTML 5, so I probably missed some of the nuances.) Work is ongoing to create an RDFa-in-HTML specification.
ARIA
ARIA stands for "Accessible Rich Internet Applications." It is an emerging standard for making web applications more accessible to people using assistive technologies (including, but not limited to, blind people who browse the web with the help of screenreaders). The basic technique is for authors to define "roles" and "states" on individual elements to indicate what sort of control the element represents. For example, HTML has no "treeview" control, but JavaScript libraries like Dojo let you include a treeview in your web-based application with a combination of generic HTML elements, a few images, and a whole lotta JavaScript. ARIA gives you a way to say that the "treeview" HTML element (which is probably just a <div>) is acting as a treeview (that's its "role"). Each item in the treeview can be in the "expanded" or "collapsed" state, and the state changes as the user interacts with the control. Major browsers, including Microsoft Internet Explorer (8) and Firefox (2+), will notice the custom role on the element and announce to assistive technologies that this <div> element is acting as a treeview. (In fact, Dojo already supports these roles and states, due to work funded by IBM.)
r3657 adds the section Annotations for assistive technology products to HTML 5. There are still a number of unanswered questions about how the custom semantics defined by ARIA interact with the native semantics defined by HTML 5.
Everything Old is New Again
As regular readers of this blog already know, HTML 5 goes to great lengths to specify existing browser behavior, even to the point of "willfully violating" other specifications. Vast stretches of the HTML 5 specification are devoted to elements, attributes, and scripting features that nobody likes but everyone is required to support. To that end, r3502 defines the <listing>, <plaintext>, <acronym>, <xmp>, and <dir> elements; r3133 and r3141 define the <marquee> element; r3155, r3403, r3409, and r3410 define document.all.
Other important changes include the location.reload() method (r3220), the textarea.textLength property (r3177), a new rollback() method for synchronous SQL transactions (r3210), and the ability to upload multiple files at a time from a web form (r3544 and r3545).
Features Removed
"The food here is terrible!"
"I know, and such small portions!"
(variously attributed)
Everyone complains that HTML 5 is too big, but nobody has any reasonable solution for making it smaller. (Splitting it into multiple specifications to make it "smaller" is like cutting a pie into slices to give it fewer calories.) However, based on implementor feedback, HTML 5 has shed a few features this summer. To wit:
- r3555 removes the <datagrid> element and its associated APIs. Originally envisioned as a two-dimensional editable "spreadsheet-lite," it was never implemented in any browser.
- r3621 removes the <bb> element, which was originally designed to support "installing" web applications as standalone programs. There were a number of security-related concerns, and browser vendors flatly refused to implement it.
- r3342 removes any mention of what an optimal video codec would look like. Contrary to popular belief, this revision does not remove the <video> element itself; the <video> element is alive and well and implemented in Safari, Firefox, Google Chrome, and an experimental build of Opera. However, it is true that there is no single video codec that is supported out-of-the-box by all browsers. Firefox and Opera only support Ogg Theora, Google Chrome supports H.264 and Theora, and Safari supports whatever QuickTime supports (which doesn't include Ogg Theora unless you install a third-party plugin).
Administrative Stuff
"Man didn't have the right form."
"What man?"
"The man from the cat detector van."
"The loony detector van, you mean."
"Look, it's people like you what cause unrest."
— Monty Python's "Fish License"
When web servers send you HTML, they are supposed to label it as such with the HTTP Content-Type header. Each content type (an HTML page, a JPEG image, an MPEG-4 video) has its own "MIME type." MIME types must be registered with the IANA.
r3552 adds the registration information for text/html, application/xhtml+xml, text/event-stream, text/cache-manifest, and application/microdata+json. r3582 adds the registration information for text/ping.
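To make the labelling concrete, here is a minimal Python sketch of how a client might split a Content-Type header value into its MIME type and charset. It leans on the standard library's email.message module; the helper name parse_content_type is made up for this illustration and is not part of HTML 5 or of any browser.

```python
from email.message import Message


def parse_content_type(header_value):
    """Return (mime_type, charset) for a Content-Type header value.

    Hypothetical helper for illustration only; charset is None when
    the header does not carry a charset parameter.
    """
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type("text/html; charset=UTF-8"))
print(parse_content_type("application/xhtml+xml"))
```

Both values come back lowercased, which is convenient because MIME type names are case-insensitive.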
Standards frequently include references to other standards. References can be "normative" or "informative." To quote RFC 3967 (a standard about creating standards), "a normative reference specifies a document that must be read to fully understand or implement the subject matter in the new [standard], or whose contents are effectively part of the new [standard], as its omission would leave the new [standard] incompletely specified. An informative reference is not normative; rather, it provides only additional background information." r3580 adds a list of references to HTML 5.
Tune in next week as we return to our regular weekly schedule of "This Week in HTML 5."
One of the features we've added in HTML5 is a way to include
machine-readable annotations that people can scrape in a simple and
well-defined way. This means that if a site wants to make the
information available, you don't have to rely on brittle
screen-scraping to get the information out.
This is easiest to understand with an example.
Suppose that you had an issue tracking database like Bugzilla, and that you wanted
other tools to be able to pull information about issues in that
database.
Today, Bugzilla exposes an XML file for each bug, but this means
maintaining two parallel formats for the bug page. Instead of
providing such a separate interface, you can use microdata, the
new attributes in HTML5. That way, even as your issue tracker changes
its interface from version to version, the underlying data can still
be reliably readable from the same HTML page.
Imagine the markup today looks like this:
<body>
<h1>Issue 12941: Too many pies in the pie factory</h1>
<dl>
<dt>Reporter</dt>
<dd>[email protected]</dd>
<dt>Priority</dt>
<dd>AAA</dd>
...
To annotate this with microdata, we just mint some names, and then
label each field with those names. The names are in "reverse-DNS"
form; if the bug system was at "example.net", then the names would be
"net.example.bug", "net.example.number", and so on. Thus we get:
<body item="net.example.bug">
<h1>Issue <span itemprop="net.example.number">12941</span>:
<span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
<dl>
<dt>Reporter</dt>
<dd itemprop="net.example.reporter">[email protected]</dd>
<dt>Priority</dt>
<dd itemprop="net.example.priority">AAA</dd>
...
The item="net.example.bug" attribute says "here is a
bug". The various itemprop attributes provide name/value
pairs for the bug. The snippet above would result in the following
tree of data:
net.example.bug:
  net.example.number = "12941"
  net.example.title = "Too many pies in the pie factory"
  net.example.reporter = "[email protected]"
  net.example.priority = "AAA"
Now it doesn't matter if the page changes dramatically; the same
data can still be made unambiguously available:
<body>
<h1>Example.Net Bugs Database</h1>
<section item="net.example.bug">
<h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
<p>#<span itemprop="net.example.number">12941</span>; reported
by <span itemprop="net.example.reporter">[email protected]</span>.</p>
<p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
...
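To see why a generic tool can read both layouts, here is a rough Python sketch of a scraper that only looks at the item and itemprop attributes used in this post. It is an illustration under simplifying assumptions, not the extraction algorithm from the spec: values are always taken from text content, nested items are not handled, and the class name, function name, and the reporter address are all made up for the example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class MicrodataSketch(HTMLParser):
    """Collect itemprop name/value pairs for each element carrying item="..."."""

    def __init__(self):
        super().__init__()
        self.items = []         # (item type, {property name: text value})
        self._open = []         # stack of open elements
        self._item_props = []   # property dicts of the items currently open

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attrs = dict(attrs)
        opens_item = "item" in attrs
        if opens_item:
            props = {}
            self.items.append((attrs["item"], props))
            self._item_props.append(props)
        self._open.append({"tag": tag, "prop": attrs.get("itemprop"),
                           "text": [], "opens_item": opens_item})

    def handle_data(self, data):
        # Text belongs to every open element that is collecting a property value.
        for element in self._open:
            if element["prop"] is not None:
                element["text"].append(data)

    def handle_endtag(self, tag):
        if not any(element["tag"] == tag for element in self._open):
            return  # stray end tag: ignore it
        while self._open:
            element = self._open.pop()
            self.finish(element)
            if element["tag"] == tag:
                break

    def finish(self, element):
        if element["opens_item"]:
            self._item_props.pop()
        elif element["prop"] is not None and self._item_props:
            value = " ".join("".join(element["text"]).split())
            self._item_props[-1][element["prop"]] = value


def extract_items(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    while parser._open:  # flush elements left open by truncated markup
        parser.finish(parser._open.pop())
    return parser.items


ORIGINAL_PAGE = """
<body item="net.example.bug">
<h1>Issue <span itemprop="net.example.number">12941</span>:
<span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
<dl>
<dt>Reporter</dt>
<dd itemprop="net.example.reporter">reporter@example.net</dd>
<dt>Priority</dt>
<dd itemprop="net.example.priority">AAA</dd>
</dl>
</body>
"""

REDESIGNED_PAGE = """
<body>
<h1>Example.Net Bugs Database</h1>
<section item="net.example.bug">
<h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
<p>#<span itemprop="net.example.number">12941</span>; reported
by <span itemprop="net.example.reporter">reporter@example.net</span>.</p>
<p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
</section>
</body>
"""

# Both layouts should yield the same item with the same four properties.
print(extract_items(ORIGINAL_PAGE) == extract_items(REDESIGNED_PAGE))
```

The scraper never looks at element names or page structure, only at the annotations, which is what lets the page be redesigned without breaking it.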
This concludes this brief introduction to microdata! Some future blog posts will introduce a few aspects of microdata that I didn't discuss here:
- How to annotate URIs, dates and times, and hidden data using microdata.
- How to nest items within each other.
- How to annotate an item with more than one type, or how to give a single value multiple names.
- The predefined vocabularies.
- How to add annotations outside of an item="", using subject="".
I see more and more people switching over to HTML5 these days. To help you check that you did things correctly, here are some tools that are good to know about.
HTML5 validator
To check that you didn't misspell a tag, nest elements in a way that is not allowed, or make similar mistakes in your markup, you can use Validator.nu.
Alt text for images
The above-mentioned validator has a feature to help you quality-check your alternative text for your img elements. Check the Show Image Report checkbox.
You can also disable images in your browser or try to use a text-only browser; the information that the images convey should still be available (but in text form). Sometimes an image doesn't convey any further information than what the surrounding text already says, and in such cases you should use the empty value: alt="".
For further advice and examples on how to use the alt attribute, the HTML 5 spec has lots of information on the topic. If you're not going to read it all, just read the section called General guidelines.
Document outline
The document outline is the structure of sections in the document, built from the h1-h6 elements as well as the new sectioning elements (section, article, aside, nav). The document outline is more commonly known as the Table of Contents.
To make sure that you have used the new sectioning elements correctly, you can check that the resulting outline makes sense with the HTML5 Outliner.
If you see "Untitled Section" entries that you didn't expect, chances are that you should have used div instead of section.
If you have a subtitle of a heading that shouldn't be part of the document outline, you should use the hgroup element:
<hgroup>
<h1>The World Wide Web Consortium</h1>
<h2>Leading the Web to Its Full Potential...</h2>
</hgroup>
In this example, only the h1 will show up in the document outline.
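If you want a feel for how an outliner arrives at these results, here is a deliberately simplified Python sketch. It is not the HTML5 Outliner's code and not the full outline algorithm: it assumes one outline entry per sectioning element, lets only the first heading (or hgroup) in each section name it, and ignores the implicit sections that heading ranks create. All names in it are invented for the illustration.

```python
from html.parser import HTMLParser

SECTIONING = {"section", "article", "aside", "nav"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineSketch(HTMLParser):
    """Simplified outline: one entry per sectioning element (assumption)."""

    def __init__(self):
        super().__init__()
        self.root = {"title": None, "children": []}
        self._sections = [self.root]  # stack of open sections
        self._hgroup = None           # list of (rank, text) while inside <hgroup>
        self._heading = None          # (rank, text parts) while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in SECTIONING:
            section = {"title": None, "children": []}
            self._sections[-1]["children"].append(section)
            self._sections.append(section)
        elif tag == "hgroup":
            self._hgroup = []
        elif tag in HEADINGS:
            self._heading = (int(tag[1]), [])

    def handle_data(self, data):
        if self._heading is not None:
            self._heading[1].append(data)

    def handle_endtag(self, tag):
        if tag in SECTIONING and len(self._sections) > 1:
            self._sections.pop()
        elif tag in HEADINGS and self._heading is not None:
            rank, parts = self._heading
            self._heading = None
            text = " ".join("".join(parts).split())
            if self._hgroup is not None:
                self._hgroup.append((rank, text))
            else:
                self._name_section(text)
        elif tag == "hgroup" and self._hgroup is not None:
            candidates, self._hgroup = self._hgroup, None
            if candidates:
                # Only the highest-ranked heading represents the hgroup.
                best = min(candidates, key=lambda candidate: candidate[0])
                self._name_section(best[1])

    def _name_section(self, text):
        section = self._sections[-1]
        if section["title"] is None:
            section["title"] = text


def render(section, depth=0):
    lines = ["  " * depth + (section["title"] or "Untitled Section")]
    for child in section["children"]:
        lines.extend(render(child, depth + 1))
    return lines


def outline(html):
    parser = OutlineSketch()
    parser.feed(html)
    parser.close()
    return render(parser.root)


PAGE = """
<body>
<hgroup>
<h1>The World Wide Web Consortium</h1>
<h2>Leading the Web to Its Full Potential...</h2>
</hgroup>
<section><p>A wrapper that should have been a div.</p></section>
<article><h1>News</h1></article>
</body>
"""

print("\n".join(outline(PAGE)))
```

In this sketch the hgroup contributes only its h1, and the heading-less section shows up as "Untitled Section", which is the hint that a div would have been the better choice.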
Table inspector
(This only applies to table elements used for tabular data, not for layout.)
HTML tables have two types of cells: header cells (th elements) and data cells (td elements). These cells are associated together in the table: a data cell in the middle of the table can have associated header cells, typically in the first row and/or the first column of the table. To a user who can see, this association seems obvious, but users who cannot see need some help from the computer to understand which cells are associated with which.
You should mark up your header cells with the th element and check that your cells get associated as you intended using the Table Inspector. If it isn't as you intended, you can consider simplifying or rearranging your table, or you can override the default association using scope or headers attributes.
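The default association can be pictured with a small Python sketch. This is not how the Table Inspector or the spec computes it; it is a simplification that assumes a plain grid with explicit end tags, no colspan or rowspan, and no scope or headers attributes. A data cell is paired with the th cells above it in the same column and to its left in the same row. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableSketch(HTMLParser):
    """Read a simple table into rows of (cell kind, text) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []     # each row is a list of ("th" or "td", text)
        self._cell = None  # (kind, text parts) while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = (tag, [])

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            kind, parts = self._cell
            self.rows[-1].append((kind, " ".join("".join(parts).split())))
            self._cell = None


def read_table(html):
    parser = TableSketch()
    parser.feed(html)
    parser.close()
    return parser.rows


def headers_for(rows, r, c):
    """Header texts for the cell at row r, column c (both zero-based)."""
    above = [rows[i][c][1] for i in range(r)
             if c < len(rows[i]) and rows[i][c][0] == "th"]
    left = [text for kind, text in rows[r][:c] if kind == "th"]
    return above + left


TABLE = """
<table>
<tr><th></th><th>Mon</th><th>Tue</th></tr>
<tr><th>Pies baked</th><td>12</td><td>15</td></tr>
<tr><th>Pies sold</th><td>9</td><td>14</td></tr>
</table>
"""

rows = read_table(TABLE)
print(headers_for(rows, 2, 2))
```

For the cell containing 14, the sketch reports the column header "Tue" and the row header "Pies sold", which is roughly what a screenreader user needs to hear alongside the number.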
Other tools?
If you know about other tools for helping with quality assurance of HTML5, or if you have made your own, please share!
The HTML5 parsing algorithm is meant to demystify HTML parsing and
make it uniform across implementations in a backwards-compatible way.
The algorithm has had “in the lab” testing, but so far it hasn’t
been tested inside a browser by a large number of people. You
can help change that now!
A while ago, an implementation of the HTML5 parsing algorithm
landed on mozilla-central
preffed off. Anyone who is testing Firefox nightly builds can now opt
to turn on the HTML5 parser and test it.
How to Participate?
First, this isn’t release-quality software. Testing the HTML5
parser carries all the same risks as testing a nightly build in
general, and then some. It may crash, it may corrupt your Firefox
profile, etc. If you aren’t comfortable with taking the risks
associated with running nightly builds, you shouldn’t participate.
If you are still comfortable with testing, download a trunk nightly
build, run it, navigate to about:config and flip the preference named
html5.enable to true. This makes Gecko use the HTML5 parser when
loading pages into the content area and when setting innerHTML. The
HTML5 parser is not yet used for HTML embedded in feeds, Netscape
bookmark import, View Source, etc.
The html5.enable preference doesn’t require a restart to take
effect. It takes effect the next time you load a page.
What to Test?
The main thing is getting the HTML5 parser exposed to a wide range
of real Web content that people browse. This may turn up crashes or
compatibility problems.
So the way to help is to use nightly builds with the HTML5 parser
for browsing as usual. If you see no difference, things are going
well! If you see a page misbehaving—or, worse, crashing—with the
HTML5 parser turned on but not with it turned off, please report the
problem.
Reporting Bugs
Please file bugs in the “Core” product under the “HTML: Parser”
component with “[HTML5]” at the start of the summary.
Known Problems
First and foremost, please refer to the list
of known bugs.
However, I’d like to highlight a particular issue: Support for
comments ending with --!> is in the spec, but the patch hasn’t landed
yet. Support for similar endings of pseudo-comment escapes within
script element content is not in the spec yet. The practical effect
is that the rest of the page may end up being swallowed up inside a
comment or a script element.
Another issue is that the new parser doesn’t yet inhibit
document.write() in places where it shouldn’t be allowed per spec but
where the old parser allowed it.
Is There Anything New?
So what’s fun if success is that you notice no change? There are
important technical things under the hood—like TCP packet
boundaries not affecting the parse result and there never being
unnotified nodes in the tree when the event loop spins—but you
aren’t supposed to notice.
However, there is a major new visible feature, too. With the HTML5
parser, you can use SVG and MathML in text/html pages.
This means that you can:
And yes, you can even put SVG inside MathML <annotation-xml> or
MathML inside <foreignObject>. The mixing you’ve seen in XML is now
supported in HTML, too.
If you aren’t concerned with taking the steps to make things
degrade nicely in browsers that don’t support SVG and MathML in
HTML, you can simply copy and paste XML output from your favorite SVG
or MathML editor into your HTML source as long as the editor doesn’t
use namespace prefixes for elements and uses the prefix xlink for
XLink attributes.
If you don’t use the XML empty element syntax and you put your
SVG text nodes in CDATA sections, the page will degrade gracefully in
older HTML browsers so that the image simply disappears but the rest
of the page is intact. You can even put a fallback bitmap as <img>
inside <desc>. Unfortunately, there isn’t a similar technique for
MathML, though if you want to develop one, I suggest experimenting
with <annotation> as your <desc>-like container.
There are known issues with matching camelCase names with Selectors
or getElementsByTagName, though.
Version 1.2.1 of the Validator.nu HTML Parser is now available. It fixes an incompatibility with the DOM implementation of the latest Xerces.