Traditionally, SGML-based HTML validation has treated most attribute values as “anything goes” strings. This has meant that all kinds of bogus values have passed as valid. W3C XML Schema added a fixed set of datatypes. The spec is mostly useless for HTML5 validation, since the HTML5 microsyntaxes do not exactly match the XSD datatypes for the same concepts. XSD regular expressions were suitable for representing the syntax of a number of HTML5 microsyntaxes, though. XSD datatypes can be used from RELAX NG, and Validator.nu used to use XSD regular expressions for many HTML5 microsyntaxes.
The problem with using XSD regular expressions has been that they are not user-friendly. When an attribute value did not match the required regular expression, the UI told the user that the attribute value was “bad”. Nothing else.
Fortunately, unlike XSD, RELAX NG allows pluggable datatype libraries. A datatype library is a library written in a general-purpose programming language. The RELAX NG engine calls the library to check if a string conforms to a named datatype. Validator.nu has used this approach for a long time for the more complex microsyntaxes in HTML5.
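The division of labour can be sketched roughly as follows. This is a minimal illustration in Python, not Validator.nu's actual code; the registry, the function names and the simplified rules for the two datatypes are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical registry: datatype name -> function that checks a string.
DATATYPES: Dict[str, Callable[[str], bool]] = {}


def datatype(name: str):
    """Register a checker function under a datatype name (illustrative helper)."""
    def register(func: Callable[[str], bool]) -> Callable[[str], bool]:
        DATATYPES[name] = func
        return func
    return register


@datatype("integer")
def is_integer(value: str) -> bool:
    # Simplified rule assumed here: optional "-" followed by one or more ASCII digits.
    digits = value[1:] if value.startswith("-") else value
    return digits != "" and all(c in "0123456789" for c in digits)


@datatype("non-negative-integer")
def is_non_negative_integer(value: str) -> bool:
    # Simplified rule assumed here: one or more ASCII digits.
    return value != "" and all(c in "0123456789" for c in value)


def allows(name: str, value: str) -> bool:
    """The question the schema engine asks the library:
    does this string conform to the named datatype?"""
    return DATATYPES[name](value)
```

The schema only mentions a datatype by name; everything about what that name means lives in ordinary code, which is what makes the library pluggable.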
I have recently made an effort to move Validator.nu away from XSD regular expressions to a more comprehensive custom datatype library. Even though regular expressions are sufficient as a formalism for many syntaxes, writing the checking code by hand allows more useful error messages when a check fails. Moreover, having identifiers for the datatypes makes it possible to tell which datatype failed, whereas previously the UI could only tell that some regular expression had failed under the hood. This allows the UI to pull in per-datatype advice from a wiki page.
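To make the difference concrete, here is a small Python sketch that contrasts a regular-expression check with a hand-written one. It is an illustration under stated assumptions, not the Validator.nu implementation: the `DatatypeError` class, the simplified integer rule, the message wording and the `ADVICE` dictionary (standing in for advice pulled from the wiki) are all hypothetical.

```python
import re

# Regular-expression approach: the only possible answers are "matched" or "did not match".
INTEGER_RE = re.compile(r"-?[0-9]+")


def check_integer_regex(value: str) -> bool:
    return INTEGER_RE.fullmatch(value) is not None


class DatatypeError(ValueError):
    """Hypothetical error type that carries the identifier of the failed datatype."""

    def __init__(self, datatype: str, message: str):
        super().__init__(message)
        self.datatype = datatype


def check_integer(value: str) -> None:
    """Hand-written check: same simplified rule, but it can say what went wrong."""
    if value == "":
        raise DatatypeError("integer", "The empty string is not a valid integer.")
    digits = value[1:] if value[0] == "-" else value
    if digits == "":
        raise DatatypeError("integer", "Expected a digit after the minus sign.")
    offset = len(value) - len(digits)  # 1 if a minus sign was skipped, else 0
    for index, ch in enumerate(digits, start=offset):
        if ch not in "0123456789":
            raise DatatypeError(
                "integer", f"Expected a digit but saw {ch!r} at index {index}."
            )


# Hypothetical stand-in for per-datatype advice fetched from a wiki page.
ADVICE = {
    "integer": "An integer is an optional minus sign followed by one or more digits.",
}


def describe(value: str) -> str:
    """Build a UI message: the specific error plus advice keyed by datatype identifier."""
    try:
        check_integer(value)
    except DatatypeError as error:
        return f"Bad value {value!r}: {error} {ADVICE[error.datatype]}"
    return "ok"
```

With the regular expression, `"4.2"` is simply rejected. With the hand-written check, the message can point at the offending character, and the datatype identifier carried by the error selects the matching advice text.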
In addition to improving the user experience with previously supported microsyntaxes such as integers, I have implemented support for previously unsupported microsyntaxes such as MIME types and Media Queries.
There is still work to do. For example, the syntaxes for accept-charset and WF2 type=email are not done. data: and mailto: IRIs are not properly validated yet. The syntaxes for image map coordinates still use XSD regular expressions. The advice on the wiki page is far from complete. (You can help!)
For the parts already implemented, please try the new features out and let me know what needs improvement.
Tags: datatypes, microsyntaxes, validator
Posted in Conformance Checking, Syntax | Comments Off on Validating attribute values
Currently, Validator.nu mines the HTML 5 spec for UI text describing permissible content models, element contexts and element-specific attributes. The text is shown when an element or attribute is misplaced or missing.
Unfortunately, the spec does not contain similarly extractable text for microsyntax descriptions. Microsyntaxes are syntaxes that appear mostly in attribute values, for example HTML5 integer, Web Forms 2.0 week, RFC 2616 media types (a.k.a. MIME types) or CSS3 Media Queries.
Based on IRC discussions, there is interest in producing the descriptions collaboratively. To that end, I have seeded the WHATWG wiki with a page for microsyntax descriptions. If you would like to help make validator messages better, please feel free to edit the wiki (under the MIT license).
Tags: wiki microsyntax collaboration authoring
Posted in Conformance Checking, Syntax | 1 Comment »
The W3C is having its technical plenary day today, and a number of WHATWG contributors are there. It's hard to participate remotely in this event, but you can watch and listen: the W3C is publishing an audio stream (in Ogg; a Java applet alternative is available too), and has commissioned realtime captioning for the event. There's also a W3C IRC channel on the topic on irc.w3.org, port 6665, channel #tp. The password was first beantown and then * (a single asterisk); it was not clear why there was a password, and there is no password anymore. You can also chat with WHATWG contributors who are present at the event on our own IRC channel.
The agenda for the day is available from the W3C site. Don't forget to adjust the times from the Boston timezone to your timezone if you want to listen to a particular session.
Tags: irc, W3C
Posted in W3C, WHATWG | Comments Off on The WHATWG at the W3C technical plenary
The WHATWG has now published a snapshot version of the HTML5 spec for review. Ian Hickson wrote to the WHATWG mailing list:
Last November, as part of the feedback on the W3C HTML WG charter, I wrote
an e-mail saying that I thought a realistic timetable would have a
first working draft released in October 2007.
We don't really need archived copies with the way the WHATWG works, since
everything happens in the open with a Subversion interface and everything,
but, I figured that I should "publish" an archived copy anyway, so today I
put out a frozen "call for comments" draft:
http://www.whatwg.org/specs/web-apps/2007-10-26/multipage/
If anyone was hoping for a semi-stable version to start reviewing the
draft, I would say that this is it. We're pretty much feature-complete at
this point, which is to say I don't think we'll be adding any major
features to HTML5 going forward (though of course minor features like
additions to certain APIs are likely to still occur).
There is a public issues list:
http://www.whatwg.org/issues/
...which has about 3700 issues in it. The next order of business is simply
to go through all of those issues. I've been tracking the issue count
since early October, and at the moment the count is reducing at a rate of
about 7 a day, which works out to being about a year and a bit of solid
work, which puts us on track to reach Last Call in 2009, as I predicted in
the aforementioned e-mail.
I'd like to thank everyone here in the WHATWG community for helping make
this work fun and pleasant. It's really nice to be able to work in such a
friendly atmosphere. I hope the coming year will continue the same way!
Cheers,
I'd like to thank Ian for his hard work on editing the spec. Keep it up! 🙂
Posted in WHATWG | Comments Off on Call for Comments
html5lib 0.10 is now available for your HTML-parsing pleasure.
html5lib is an implementation of the HTML 5 parsing algorithm, available in both Python and Ruby flavours. The HTML 5 algorithm is based on reverse engineering the behaviour of popular web browsers and so is compatible with the myriad of broken HTML encountered on the web.
Features in 0.10:
- Parse HTML to a variety of common tree formats including minidom, ElementTree and BeautifulSoup (Python), and hpricot and rexml (Ruby) as well as a custom simpletree format
- Automatic detection of character encoding from meta elements and using frequency analysis (if chardet is available)
- Sanitization of markup and CSS using a whitelist approach
- Liberal XML parsing
- Conversion of trees to event streams and Genshi-inspired filters for those streams
- Flexible serializers for writing out streams in HTML and XHTML-syntax
- A prototype HTML 5 validator
- A large test suite
Download:
Tags: html5lib, Parsing
Posted in WHATWG | 3 Comments »