html5lib 0.9 is now available for your parsing pleasure.
html5lib is an implementation of the WHATWG HTML parsing algorithm in Python, released under an MIT license. It enables malformed HTML to be parsed into standard minidom and ElementTree structures, in a way that is highly compatible with the behavior of major desktop web browsers. As well as parsing to trees, html5lib contains a DOM to SAX converter; it is hoped that by supporting these standard APIs, toolchains based on draconian XML parsers can be repurposed to process HTML content with minimal effort.
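As a rough illustration of the gap html5lib is meant to close, here is a minimal sketch. The strict path uses only the standard library. The lenient path assumes a recent html5lib release is installed and uses its `html5lib.parse` function with `treebuilder="dom"`; the 0.9 API may differ. The helper names `draconian_parse` and `lenient_parse` and the sample markup are made up for this example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Typical tag soup: unclosed elements and a bare ampersand.
MALFORMED = "<p>Unclosed paragraph<br>with a bare & ampersand"


def draconian_parse(markup):
    """Return a minidom Document, or None if the XML parser rejects the input."""
    try:
        return minidom.parseString(markup)
    except ExpatError:
        return None


def lenient_parse(markup):
    """Parse with html5lib into a minidom tree; None if html5lib is not installed."""
    try:
        import html5lib  # third-party; assumed to be a recent release
    except ImportError:
        return None
    return html5lib.parse(markup, treebuilder="dom")


strict_result = draconian_parse(MALFORMED)   # the XML parser gives up
lenient_result = lenient_parse(MALFORMED)    # a minidom Document when html5lib is available
```

Because both paths hand back the same kind of minidom object, code written against the strict parser should in principle be able to consume the lenient result unchanged.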
In addition to the HTML parsing capability, html5lib 0.9 contains an experimental liberal XML parser based on the WHATWG algorithm without the HTML-specific error handling. This is suitable for parsing XML from sources that cannot guarantee well-formedness, e.g. web feeds.
The 0.9 release is expected to be the last major release before 1.0, and no new features will be added before 1.0 is released. Instead we will work on any remaining correctness issues, other bugs, and on improving the messages reported when parse errors are encountered. Bug reports are very much appreciated. Users or people looking to get involved are encouraged to join the mailing list or visit the #WHATWG channel on freenode.net.
Posted in WHATWG | 2 Comments »
this post refers to the "Write" interface from WordPress utilized to post comments to WHATWG blogs.
there is well-intentioned, but mis-implemented markup in the edit form; namely, improper implementation of the FIELDSET.
for a proper FIELDSET, one needs to do 4 things:
- open the FIELDSET (which this form does)
- define a LEGEND for the FIELDSET (which this form does NOT); the natural candidates for LEGEND are the level 3 headers (H3) classed dbx-handle, so instead of repeatedly hearing "click to open this box", i would also get the pseudo-box (which i would call sub-forms) LEGEND as an indicator of what i am about to open or close. i would also make the alt text device independent - instead of "click here to open this box", i would propose "show sub-form" and "hide sub-form"
- bind individual FORM controls to their textual labels by use of the LABEL element and the for/id mechanism that ties the form control (which takes the "id") to a LABEL (which takes the "for") or multiple labels; the LABEL should contain the actual, textual label, and NOT the FORM control, as in this form; this form has the attribute set correctly to bind the LABEL to the FORM control, but since the LABEL element is opened PRIOR to the INPUT element, no labeling is available to the user - in my case (i use a screen-reader) the sub-forms that appear when one opens a FIELDSET to reveal a FORM appear unlabeled to my screen-reader, because of invalid markup.
- close the FIELDSET (which this form does)
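the four steps above can be sketched as a small sample form plus a rough checker. this is only an illustration: the markup in `SAMPLE_FORM`, the id `post-title`, and the `FormAudit` and `audit` names are hypothetical, not taken from the WordPress form, and the checker only counts tags rather than validating nesting.

```python
from html.parser import HTMLParser

# Hypothetical markup following the four steps: FIELDSET opened, LEGEND
# defined, LABEL bound to the control via for/id, FIELDSET closed.
SAMPLE_FORM = """
<form action="/post" method="post">
  <fieldset>
    <legend>Post details</legend>
    <label for="post-title">Title</label>
    <input type="text" id="post-title" name="title">
  </fieldset>
</form>
"""


class FormAudit(HTMLParser):
    """Collects fieldset/legend counts, label targets, and control ids."""

    def __init__(self):
        super().__init__()
        self.fieldsets = 0
        self.legends = 0
        self.label_targets = set()
        self.controls = []  # id of each control, or None when it has no id

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "fieldset":
            self.fieldsets += 1
        elif tag == "legend":
            self.legends += 1
        elif tag == "label" and attrs.get("for"):
            self.label_targets.add(attrs["for"])
        elif tag in ("input", "select", "textarea"):
            self.controls.append(attrs.get("id"))


def audit(markup):
    """Report fieldsets lacking a legend and controls no LABEL points at."""
    parser = FormAudit()
    parser.feed(markup)
    parser.close()
    unlabelled = [
        control_id or "(no id)"
        for control_id in parser.controls
        if control_id is None or control_id not in parser.label_targets
    ]
    return {
        "fieldsets_without_legend": parser.fieldsets - parser.legends,
        "unlabelled_controls": unlabelled,
    }


report = audit(SAMPLE_FORM)
```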
Posted in Forms, WHATWG | 10 Comments »
The W3C today publicly announced that they are restarting an HTML specification effort. This is great news and a clear validation of the WHATWG effort, which has been leading the maintenance and development of HTML since 2004.
Surprisingly, the W3C never actually contacted the WHATWG during the chartering process. However, the WHATWG model has clearly had some influence on the creation of this group, and the charter says that the W3C will try to "actively pursue convergence with WHATWG". Hopefully they will get in contact soon.
In the meantime, apparently anyone can actually join the W3C effort. The instructions to join the group are as follows:
- Fill in the Public Access Request Form; in the "Reason" field, put: "To apply for participation in the HTML Working Group as an Invited Expert."
- Within about five minutes you'll receive a confirmation code by e-mail. Follow the instructions in that e-mail.
- You should get a reply back from that within two days, giving you a username and password. Fill in the W3C Invited Expert Application form. Under "Financial Support", if you're not going to attend any meetings or if you're going to attend meetings on your own dime, just put "Self-supported". Under "Possible W3C Membership", if you're employed but your employer doesn't know you're doing this, or doesn't care, just pick "My employer does not intend to join".
- E-mail Dan Connolly and Mike Smith ([email protected], [email protected]) asking for approval. (Just say "Hi, I'd like to join the HTML working group. Thanks.")
- You should get a reply back within about ten days, at which point you can fill in the Joining the HTML Working Group form.
I would encourage everyone interested in working with the HTML working group to go through these steps as soon as possible, so that you will be a member of the group before the work starts.
Joining the group doesn't commit you to anything (e.g. you won't have to attend meetings or anything if you don't want to). The group's charter clearly says that all decisions will be made in ways that don't require attending meetings.
This post has been updated a few times to take into account new information about how to join the group.
Posted in WHATWG | 21 Comments »
Last month, I did a presentation on the future of HTML at the WSG meeting in Sydney. For those of you who couldn't make it, or those who wish to hear it again, I have finally got around to publishing the slides, audio recording and transcript.
Posted in WHATWG | Comments Off on The Future of HTML Presentation Slides
HTML requires that authors declare the character encoding of the file either
using HTTP headers (when served over HTTP) or metadata in the file. In previous
versions of HTML, authors could specify the character encoding using a relatively
complex meta element like this:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
The idea of the http-equiv attribute was that it would act as
a substitute for real HTTP headers. However, in practice, that is not entirely
true. Only a few headers actually have any effect in browsers. In fact,
HTML4 even suggested that servers use this attribute to gather information
for HTTP response message headers; but in reality, no known server ever did
this.
Although the MIME type is included in the value for the Content-Type header
above, it has no effect in browsers. The only useful and practical piece of
information in that element is: charset=UTF-8.
In order to simplify the meta element and remove unnecessary markup, HTML5
has changed it slightly. The new way to declare the character encoding in the
file will be to use the following:
<meta charset="UTF-8">
Obviously, that is much shorter and easier to remember. Luckily, due to the
way encoding detection has been implemented by browsers, it is backwards compatible
and believed to be supported by all known browsers.
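One way to picture why the short form can be backwards compatible is a scanner that simply looks for "charset=" inside a meta tag, wherever it appears. The sketch below is a deliberate simplification under that assumption; it is not the algorithm any particular browser uses, and the `META_CHARSET` pattern and `sniff_charset` name are made up for this example.

```python
import re

# Simplified pattern: inside a <meta ...> tag, find "charset", an "=",
# an optional quote, and then the encoding label itself.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:\-]+)",
    re.IGNORECASE,
)


def sniff_charset(head: bytes):
    """Return the declared encoding label in lowercase, or None if absent."""
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


old_style = b'<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
new_style = b'<meta charset="UTF-8">'

# Both declarations yield the same label for this scanner.
old_label = sniff_charset(old_style)
new_label = sniff_charset(new_style)
```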
Along with this, the spec has recently defined how encoding detection must
be implemented by browsers and imposed a few additional restrictions for documents
to be considered conforming.
- When serialised, the charset attribute and its value must be contained completely
in the first 512 bytes of the file.
- The attribute value must be serialised without the use of character entity
references of any kind, e.g. you cannot use <meta charset="&#x55;TF-8"> to
declare UTF-8. This is because encoding detection occurs before the actual
parsing begins, so the algorithm does not decode character references.
- The character encoding used must be a rough superset of US-ASCII, e.g. you
can’t use this for EBCDIC encoded files.
- User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
If the encoding is either UTF-8, UTF-16 or UTF-32, then authors can use a
BOM at the start of the file to indicate the character encoding.
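The restrictions above can be approximated with a few checks. This is a rough sketch, not the spec's detection algorithm: the regular expression is a simplification, the 512-byte check measures only up to the end of the attribute value, and all function and constant names are invented for this example.

```python
import codecs
import re

# UTF-32 comes first because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
FORBIDDEN = {"cesu-8", "utf-7", "bocu-1", "scsu"}
CHARSET_ATTR = re.compile(rb"charset\s*=\s*[\"']?([^\"'\s>]*)", re.IGNORECASE)


def bom_encoding(data: bytes):
    """Return the encoding family indicated by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def is_ascii_compatible(codec_name: str) -> bool:
    """True if the codec encodes a sample declaration to the same bytes as ASCII."""
    probe = '<meta charset="x">'
    return probe.encode(codec_name) == probe.encode("ascii")


def declaration_problems(data: bytes):
    """List the restrictions that the first charset declaration appears to break."""
    match = CHARSET_ATTR.search(data)
    if match is None:
        return ["no charset declaration found"]
    problems = []
    value = match.group(1)
    if match.end() > 512:
        problems.append("declaration extends past the first 512 bytes")
    if b"&" in value:
        problems.append("value contains a character reference")
    if value.decode("ascii", "replace").lower() in FORBIDDEN:
        problems.append("encoding is one that user agents must not support")
    return problems
```

For example, `is_ascii_compatible("cp500")` is false because cp500 is an EBCDIC code page, so a byte-level scan for an ASCII "<meta" would never find the declaration in such a file.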
Posted in WHATWG | 12 Comments »