The WHATWG Blog

html5lib 0.10 Released

October 9th, 2007 by James Graham

html5lib 0.10 is now available for your HTML-parsing pleasure.

html5lib is an implementation of the HTML 5 parsing algorithm, available in both Python and Ruby flavours. The HTML 5 algorithm is based on reverse engineering the behaviour of popular web browsers, so it is compatible with the myriad forms of broken HTML encountered on the web.

Features in 0.10:

Parse HTML to a variety of common tree formats, including minidom, ElementTree and BeautifulSoup (Python) and hpricot and rexml (Ruby), as well as a custom simpletree format
Automatic detection of character encoding from meta elements and, if chardet is available, from frequency analysis
Sanitization of markup and CSS using a whitelist approach
Liberal XML parsing
Conversion of trees to event streams and Genshi-inspired filters for those streams
Flexible serializers for writing out streams in HTML and XHTML-syntax
A prototype HTML 5 validator
A large test suite

Download:

Tags: html5lib, Parsing
Posted in WHATWG | 3 Comments »

La loterie du `longdesc`

September 18th, 2007 by Quillery Pierre

This post was published as a French translation of this article: The longdesc lottery

Let's talk about the longdesc attribute. In HTML 4, it is defined as a pointer to a long description of a complex image. Anyone can learn to write a relevant long description. There is just one problem: in practice, hardly anyone bothers, and those who do bother get it wrong.

Now let's quantify the phenomenon. In August 2007, Ian Hickson analyzed a sample of one billion <img> elements in Google's index. Approximately 1.3 million (0.13%) actually had a longdesc attribute. Well, you might say, that's to be expected: not every image needs such an attribute. And you would be right. But regardless of whether it is needed or not, longdesc is not used that often: just over one image in a thousand.

Now let's see in how many cases the longdesc attribute is used sensibly. Of course, this criterion is more subjective, but we can still pick out the most obvious errors. From the 1.3 million images that had a longdesc attribute, let's remove those where the longdesc attribute...

is empty
is not a valid URL
points to the image itself (that is, the same URL as the src attribute)
points to the page you are already on
points to the root of another domain
is the same as the href attribute of the link surrounding the image (the longdesc is redundant, since you can follow the image's link instead)

That eliminates 1.25 million images (about 96%) from the set outright. That is not 96% of all images on the web; it is 96% of the 0.13% of images that included a longdesc attribute in the first place. And when you look more closely at the remaining 50,000 images (4% of 1.3 million), the results get worse still: links to other images, broken links, links to a one-line description identical to the alt attribute, and links to a page that tells you the image's dimensions but not its content (Wikipedia, I'm talking about you). Extrapolated to the full 1.3 million images, the 50,000 shrink to 10,000. This means that fewer than 1% of the images that provide a longdesc attribute are actually useful. No more than one image in a hundred gets it right, out of the one in a thousand that bother to try.

Meanwhile, the same people who wanted to keep the longdesc attribute recently carried out some user testing. That is, they tested how accurately a real blind person with a real screen reader could read a real web page. It turned out that the test subject did not know the longdesc attribute existed until the tester mentioned it. Can you really blame him? 99.87% of the images he had encountered did not even have a longdesc attribute. Even if he had known about it and had come across one by chance, there was still a 99% chance that the information provided would be of no use. He has a better chance of winning the lottery.

I am not saying there isn't a real problem here that needs solving. There certainly is one. People can publish complex images that need equally complex text alternatives: diagrams, charts, and highly detailed photographs. Whatever. "A picture is worth a thousand words" and all that... The longdesc attribute is, in theory, a solution to this problem. But that does not mean it is a good solution, much less the only one. We have been living with longdesc for 10 years now, and I can assure you: it does not work. So could we skip the outcry and start talking about a better solution?

Posted in Elements | Comments Off on La loterie du longdesc

Not that 80

September 18th, 2007 by Henri Sivonen

In his post Parroting Pareto, Jeremy Keith says that HTML5 needs to cover cases that “fall far outside the 80%-90% curve”, in particular accessibility. “By their very nature, accessibility concerns are not going to affect the majority of users. That doesn’t mean they can be dismissed.”

My understanding of applying the 80/20 rule to the design of HTML5 is that the “80” isn’t about 80% of users. It is about the proverbial 80% of authoring cases. That is, it doesn’t make sense to support (for accessibility or otherwise) things that people would publish only very rarely if engineering support for those rare cases would significantly complicate the implementation of the language.

See Hixie’s email to the HTML WG on the topic.

Posted in WHATWG | 4 Comments »

The `longdesc` lottery

September 14th, 2007 by Mark Pilgrim, Google

Let's talk about the longdesc attribute. In HTML 4, it's defined as a pointer to a long description for a complex image. Anyone can learn how to write a good long description. There's only one problem: virtually no one bothers, and virtually everyone who does bother gets it wrong.

Let's quantify that. In August 2007, Ian Hickson analyzed a sample of 1 billion <img> elements in Google's index. Approximately 1.3 million (0.13%) had a longdesc attribute. That's OK, you say, not every image needs a longdesc attribute. And you would be right. But regardless of whether it's needed or not, it's not being used that often: just over one in a thousand images.

Now let's look at how often the longdesc attribute is actually used correctly. Of course this is a more subjective question, but we can spot some obvious errors. Out of those 1.3 million images with a longdesc attribute, let's subtract the ones where the longdesc attribute...

is blank
is not a valid URL
points to the image itself (i.e. the same URL as the src attribute)
points to the page you're already on
points to the root level of another domain
is the same as a parent link's href attribute (i.e. the longdesc is redundant because you could just follow the image link instead)
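For the curious, the six checks above could be sketched roughly as follows. This is only an illustration, not the code used for the survey: the function name `is_obviously_bad_longdesc` and its parameters are hypothetical, and the "valid URL" test is a crude stand-in (no whitespace, resolves to an http or https URL with a host).

```python
from urllib.parse import urljoin, urlsplit


def is_obviously_bad_longdesc(longdesc, page_url, img_src, parent_href=None):
    """Return True if a longdesc value fails one of the six obvious checks.

    Hypothetical helper for illustration; not the code used in the survey.
    """
    # Check 1: blank (missing, empty, or whitespace only).
    if longdesc is None or not longdesc.strip():
        return True
    value = longdesc.strip()

    # Check 2: not a valid URL. Assumption: a value containing whitespace,
    # or one that does not resolve to an http(s) URL with a host, is invalid.
    if any(ch.isspace() for ch in value):
        return True
    try:
        resolved = urljoin(page_url, value)
        target = urlsplit(resolved)
        page = urlsplit(page_url)
        image = urljoin(page_url, img_src)
        link = urljoin(page_url, parent_href) if parent_href is not None else None
    except ValueError:
        return True
    if target.scheme not in ("http", "https") or not target.netloc:
        return True

    # Check 3: points to the image itself (same URL as the src attribute).
    if resolved == image:
        return True

    # Check 4: points to the page you're already on.
    if resolved == page_url:
        return True

    # Check 5: points to the root level of another domain.
    if target.netloc != page.netloc and target.path in ("", "/") and not target.query:
        return True

    # Check 6: same as the parent link's href, so the longdesc is redundant.
    if link is not None and resolved == link:
        return True

    return False
```

A value that passes all six checks is not necessarily a good long description; it has only avoided the mistakes that can be detected without reading the target page.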

That knocks out a whopping 1.25 million (about 96%) right off the bat. That's not 96% of all the images on the web; that's 96% of the 0.13% of images that included a longdesc attribute in the first place. And when you take a closer look at the remaining 50,000 (4% of 1.3 million), the results get even worse: links to other images, links gone 404, links to one-line text descriptions identical to the alt attribute, and links to pages that describe the image size but not its contents (Wikipedia, I'm looking at you). Extrapolating back to 1.3 million, that 50,000 shrinks to about 10,000. That means that less than 1% of images that provide a longdesc attribute are actually useful. No more than one in a hundred get it right, of one in a thousand that even try.
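If you want to check the arithmetic yourself, here is a short sketch using only the rounded figures quoted above. The variable names are mine, and the 10,000 figure is the extrapolated estimate from the paragraph, not a measured count.

```python
from fractions import Fraction

SAMPLE = 1_000_000_000        # <img> elements in the sample
WITH_LONGDESC = 1_300_000     # images that had a longdesc attribute
OBVIOUSLY_BAD = 1_250_000     # knocked out by the obvious-error checks
ESTIMATED_USEFUL = 10_000     # extrapolated estimate of useful ones

share_with_longdesc = Fraction(WITH_LONGDESC, SAMPLE)
share_obviously_bad = Fraction(OBVIOUSLY_BAD, WITH_LONGDESC)
remaining = WITH_LONGDESC - OBVIOUSLY_BAD
share_useful = Fraction(ESTIMATED_USEFUL, WITH_LONGDESC)
useful_overall = Fraction(ESTIMATED_USEFUL, SAMPLE)

print(f"images with longdesc:  {float(share_with_longdesc):.2%}")  # 0.13%
print(f"obviously wrong:       {float(share_obviously_bad):.1%}")  # 96.2%
print(f"remaining candidates:  {remaining:,}")                     # 50,000
print(f"useful, of those with: {float(share_useful):.2%}")         # 0.77%
print(f"useful, of all images: 1 in {useful_overall.denominator // useful_overall.numerator:,}")  # 1 in 100,000
```

The last line is the "one in a hundred of one in a thousand" figure: roughly one useful longdesc per 100,000 images.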

Meanwhile, the very people advocating for keeping the longdesc attribute have recently conducted some user testing. That is, testing how well an actual blind person with an actual screen reader can read actual web pages. It turned out that the test subject didn't know that longdesc even existed before the tester told him about it. Can you blame him? 99.87% of the images he'd ever encountered had no longdesc attribute at all. Even if he had known about it, and he had actually stumbled across one, he would still be up against 99 to 1 odds that following it would be worth his time. He has a better chance of winning the lottery.

I'm not saying there isn't a real problem to be solved here. There is. People can publish complex images that require complex text alternatives. Charts, graphs, detailed photographs. Whatever. "A picture is worth 1000 words," and all that. The longdesc attribute is, theoretically, a solution to this problem. But that doesn't mean it's a good solution, and it's certainly not the only solution. We've been living with longdesc for 10 years now, and let me tell you, it's not working out. So can we please get past the grandstanding and start talking about a better solution?

Posted in Elements | 40 Comments »