Microdata (part 1)
One of the features we've added in HTML5 is a way to include machine-readable annotations that people can scrape in a simple and well-defined way. This means that if a site wants to make the information available, you don't have to rely on brittle screen-scraping to get the information out.
This is easiest to understand with an example.
Suppose that you had an issue tracking database like Bugzilla, and that you wanted other tools to be able to pull information about issues in that database.
Today, Bugzilla exposes an XML file for each bug, but this means maintaining two parallel formats for the bug page. Instead of providing such a separate interface, you can use microdata, the new attributes in HTML5. That way, even as your issue tracker changes its interface from version to version, the underlying data can still be reliably readable from the same HTML page.
Imagine the markup today looks like this:
<body>
 <h1>Issue 12941: Too many pies in the pie factory</h1>
 <dl>
  <dt>Reporter</dt> <dd>[email protected]</dd>
  <dt>Priority</dt> <dd>AAA</dd>
  ...
To annotate this with microdata, we just mint some names, and then label each field with those names. The names are in "reverse-DNS" form; if the bug system was at "example.net", then the names would be "net.example.bug", "net.example.number", and so on. Thus we get:
<body item="net.example.bug">
 <h1>Issue <span itemprop="net.example.number">12941</span>:
  <span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
 <dl>
  <dt>Reporter</dt> <dd itemprop="net.example.reporter">[email protected]</dd>
  <dt>Priority</dt> <dd itemprop="net.example.priority">AAA</dd>
  ...
The item="net.example.bug" attribute says "here is a bug". The various itemprop attributes provide name/value pairs for the bug. The snippet above would result in the following tree of data:
net.example.bug:
  net.example.number = "12941"
  net.example.title = "Too many pies in the pie factory"
  net.example.reporter = "[email protected]"
  net.example.priority = "AAA"
Now it doesn't matter if the page changes dramatically; the same data can still be made unambiguously available:
<body>
 <h1>Example.Net Bugs Database</h1>
 <section item="net.example.bug">
  <h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
  <p>#<span itemprop="net.example.number">12941</span>; reported by
   <span itemprop="net.example.reporter">[email protected]</span>.</p>
  <p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
  ...
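To make the "same data, different page" point concrete, here is a rough sketch of a generic extractor. It is not part of any spec or tool: the class name `ItemExtractor`, the helper `extract`, and the placeholder address `reporter@example.net` are all made up for illustration. It only understands the draft `item` and `itemprop` attributes as used in this post, takes each property's value from its text content, and does not handle nested items.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they are not tracked as "open".
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ItemExtractor(HTMLParser):
    """Hypothetical helper: collects itemprop text under each item element."""

    def __init__(self):
        super().__init__()
        self.items = []   # every item found: {"type": ..., "props": {...}}
        self._open = []   # open elements as (tag, item_or_None, prop_or_None)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        item = None
        if "item" in attrs:
            item = {"type": attrs["item"], "props": {}}
            self.items.append(item)
        prop = None
        if "itemprop" in attrs:
            prop = {"name": attrs["itemprop"], "text": []}
        self._open.append((tag, item, prop))

    def handle_data(self, data):
        # Text belongs to every itemprop element that is currently open.
        for _tag, _item, prop in self._open:
            if prop is not None:
                prop["text"].append(data)

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if not any(open_tag == tag for open_tag, _, _ in self._open):
            return
        while self._open:
            open_tag, _item, prop = self._open.pop()
            if prop is not None:
                # The property belongs to the nearest enclosing item, if any.
                owner = next(
                    (it for _, it, _ in reversed(self._open) if it is not None),
                    None,
                )
                if owner is not None:
                    owner["props"][prop["name"]] = "".join(prop["text"]).strip()
            if open_tag == tag:
                break


def extract(html):
    """Return the list of items found in an HTML string."""
    parser = ItemExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# The two layouts from this post, with a placeholder reporter address.
OLD_PAGE = """
<body item="net.example.bug">
 <h1>Issue <span itemprop="net.example.number">12941</span>:
  <span itemprop="net.example.title">Too many pies in the pie factory</span></h1>
 <dl>
  <dt>Reporter</dt> <dd itemprop="net.example.reporter">reporter@example.net</dd>
  <dt>Priority</dt> <dd itemprop="net.example.priority">AAA</dd>
 </dl>
</body>
"""

NEW_PAGE = """
<body>
 <h1>Example.Net Bugs Database</h1>
 <section item="net.example.bug">
  <h1 itemprop="net.example.title">Too many pies in the pie factory</h1>
  <p>#<span itemprop="net.example.number">12941</span>; reported by
   <span itemprop="net.example.reporter">reporter@example.net</span>.</p>
  <p>PRIORITY: <strong itemprop="net.example.priority">AAA</strong>.</p>
 </section>
</body>
"""

if __name__ == "__main__":
    for item in extract(NEW_PAGE):
        print(item["type"], item["props"])
```

Both pages go through the same code path and come out as the same item, which is the property a scraper cares about: nothing in the extractor knows about headings, definition lists, or paragraphs.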
This concludes this brief introduction to microdata! Some future blog posts will introduce a few aspects of microdata that I didn't discuss here:
- How to annotate URIs, dates and times, and hidden data using microdata.
- How to nest items within each other.
- How to annotate an item with more than one type, or how to give a single value multiple names.
- The predefined vocabularies.
- How to add annotations outside of an
Out of interest, why reverse DNS format? Easier to parse?
This is convenient, but to be honest I do despair that there are so many ways now to embed data. Add microdata to the following list: microformats, embedded XML, embedded RDF, and link’d documents. We really need to standardise on a method to do it, as right now it’s just painful to try and create a progressive strategy towards a semantic web when everything’s going in so many directions.
I know different ways have different advantages, but I think the benefits of a not-quite-perfect standard approach outweigh the convenience of picking the easiest format to implement. It’s like what XML achieved is being reversed by the new generation of web standardisers.
Why the f reinvent the wheel? What the f was wrong with aligning with microformats and RDFa?
What problem does this solve that’s not already solved by microformats, RDFa or some combination of the two?
bruce: It’s short and unambiguous, and used for this purpose in a number of other places already. If anyone has any suggestions for something even shorter but still unambiguous, please suggest it — this hasn’t been set in stone yet.
Why not use XML data islands instead of hacks like this?
With xml data islands it would be easy to know where the data is, how to grab it with any extension and show it to the user on demand, like bizcards, profiles, etc.
All we need in html5 is to allow one/many XML hidden tags to coexist with html tags and use datasource and dataitem from html to reference xml data.
I dunno if I am convinced that joe web developer will actually use these type of facilities, and that more complex facilities like RDFa would be so far out of reach for the type of person who would use them, that it justifies Yet Another Mechanism?
I considered RDFa, but it’s an order of magnitude more complex than microdata without really adding anything useful, and people have had huge trouble getting used to RDFa (just look at the disaster that was the Google documentation when they talked about RDFa support recently).
Microformats is still supported in HTML5, but isn’t extensible in a machine-readable fashion (you can’t just use a generic parser to grab out the name-value pairs, you have to have a per-vocabulary dedicated parser), so it isn’t quite as useful for just arbitrary small-use data (as opposed to specific problem spaces, where Microformats excels).
It’s still very early days in this space, I don’t think there’s anything wrong with a little competition. 🙂
Well… I’m really skeptical about microdata. From a web designer’s point of view, why didn’t you use RDFa to do such a thing? I can understand that in some cases this could be interesting for browser vendors. It could allow them to extract data without using the DOM (well, at least, it can speed up some processing). But for people who make websites it’s just another technique that is redundant with others. I’m not sure this is really interesting for web designers.
By the way, you’re right when you say that there’s a need for a little competition on how to put some semantics into HTML, but is this the right way at this point? To me RDFa is not that hard, and when I look at the whole microdata spec, frankly I don’t see much difference from RDFa: you need a grammar, you need some vocabularies, you need some URIs! So apart from the syntax, what is the real difference?
Definitely, I think there is something to be done to have a way to put semantics into HTML that is simpler to understand (and to use) than RDFa but with the extensibility that microformats lack.
To achieve this, microdata could be a way. But at this point, I’m not convinced. In the end it will be the users who choose.
And thanks for all the work done on HTML5.
Ian Hickson wrote: people have had huge trouble getting used to RDFa
Ian wrote: [RDFa is an] order of magnitude more complex than microdata without really adding anything useful
That’s your personal opinion, which you are certainly entitled to assert. However, it is not a criticism that is, IMHO, widely held. You keep repeating this opinion, making it sound as if it is fact. In reality, your opinion was not widely raised during the extensive internal and public review process that RDFa went through.
I’ve been one of the more aggressive proponents for simplifying, as much as is reasonably possible, RDFa in the RDFa community. I believe that RDFa is slightly more complex than Microdata because it solves a larger set of problems. When you attempt to solve a larger set of problems, you usually end up with somewhat more complex solutions. The statements you keep repeating are analogous to “Java/C++ is so much more complex than shell scripting!” – well yes, it is… but C++ is capable of solving a larger set of problems than shell scripting. (I still owe you a list of use cases that Microdata does not address — I haven’t forgotten).
However, to say that RDFa is an “order of magnitude” more complex than Microdata is not only vague, it overly simplifies the current state of solutions for structured data on the Web.
RDFa is out there, it’s starting to enjoy wide deployment, and it works for many people.
Ian wrote: don’t think there’s anything wrong with a little competition.
On this point, I am in complete agreement with you. The real test for RDFa, Microformats and Microdata is in the marketplace… authors will pick the best set of solutions that work for them.
Why are the properties at the same ‘level’ as the item? Why not use the item as a namespace and add the properties one level down? Wouldn’t that also remove the need to have two different attributes?
Which would extract out to:
some.thing.foo = “value”
some.thing.bar = “value”
Ben: That would prevent an item from having multiple types or from having items from multiple different sources. We do end up doing something very similar to that for the predefined vocabularies though, in that those are all one-word identifiers.
1) `itemprop` is an ugly name
2) The rest sounds good to me. Let’s roll with it and can everyone else please STFU up RDFa vs. Microformats vs. Microdata?? Somebody just pick a format and standardize it.
It’s not like anyone is actually using these constructs in a practical way currently. It’ll be another 4 years before anyone even gives an F about this.
Garv said: “It’s not like anyone is actually using these constructs in a practical way currently.”
Fortunately, that’s not true, at least for microformats. There are many millions of instances “in the wild”, and they’re already recognised by Google and Yahoo, among others.
I think that the best solution is to have a solution-based (format-type-based) umm…solution. And this has already been addressed by Microformats.
Microformats provide predefined vocabularies, like you mentioned here. They can also “markup” a variety of data (addresses, people, events, locations, etc.) Furthermore, they are already being deployed a lot and (some of them) are recognized by browsers (Web Slices in IE8, etc.)
I think that Microformats are the way to go. And if Microformats do not cover whatever you are trying to describe, others here have mentioned RDFa, I haven’t really investigated or read about RDFa, but it sounds as though this is an unnecessary addition of attributes that few people will use.
I agree with Garv, “itemprop” is ugly. Why not “property”?
It was called “property” originally, but people complained because RDFa already had an attribute called “property”.
Having all microdata attributes start with “item” makes them appear next to each other in lists and autocomplete, which seems useful. (subject="" was renamed to itemfor="".)
As a webmaster and semantic web student, I personally think using RDFa is a more appropriate way to go. Things can’t be 100% accurate, though, but it will help greatly. All major search engines have included microformats in their algorithms, which is a green signal. An excellent web world is coming online soon 🙂 .. I am keen to see how trust and other factors will come into place.
Let’s see how the world takes it from theory to real-life RDFa and microformats development.