Wednesday, July 8th, 2009
The HTML5 parsing algorithm is meant to demystify HTML parsing and
make it uniform across implementations in a backwards-compatible way.
The algorithm has had “in the lab” testing, but so far it hasn’t
been tested inside a browser by a large number of people. You
can help change that now!
A while ago, an implementation of the HTML5 parsing algorithm
landed on mozilla-central, disabled by default behind a preference.
Anyone who is testing Firefox nightly builds can now opt to turn on
the HTML5 parser and test it.
How to Participate?
First, this isn’t release-quality software. Testing the HTML5
parser carries all the same risks as testing a nightly build in
general, and then some. It may crash, it may corrupt your Firefox
profile, etc. If you aren’t comfortable with taking the risks
associated with running nightly builds, you shouldn’t participate.
If you are still comfortable with testing, download a trunk nightly
build, run it, navigate to about:config and set the preference named
html5.enable to true. This makes Gecko use the HTML5 parser when loading
pages into the content area and when setting innerHTML. The HTML5 parser
is not yet used for HTML embedded in feeds, Netscape bookmark import,
View Source, etc.
The html5.enable preference doesn’t require a
restart to take effect. It takes effect the next time you load a
page.
What to Test?
The main thing is getting the HTML5 parser exposed to a wide range
of real Web content that people browse. This may turn up crashes or
compatibility problems.
So the way to help is to use nightly builds with the HTML5 parser
for browsing as usual. If you see no difference, things are going
well! If you see a page misbehaving—or, worse, crashing—with the
HTML5 parser turned on but not with it turned off, please report the
problem.
Reporting Bugs
Please file bugs in the “Core” product under the “HTML: Parser”
component with “[HTML5]” at the start of the summary.
Known Problems
First and foremost, please refer to the list
of known bugs.
However, I’d like to highlight a particular issue: Support for
comments ending with --!> is in the spec, but the patch hasn’t landed
yet. Support for similar endings of pseudo-comment escapes within
script element content is not in the spec yet. The practical effect is
that the rest of the page may end up being swallowed up inside a comment
or a script element.
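To make the comment issue concrete, here is a simplified sketch of how the choice of accepted terminators changes where a comment ends. It is not the HTML5 tokenizer and ignores its edge cases; the names findCommentEnd and acceptBang are hypothetical and exist only for this illustration.

```javascript
// Simplified illustration only; this is not the HTML5 tokenizer.
// findCommentEnd and acceptBang are hypothetical names.
function findCommentEnd(html, start, acceptBang) {
  // `start` is the index just past the opening "<!--".
  const ends = [];
  const plain = html.indexOf("-->", start);
  if (plain !== -1) ends.push(plain + 3);
  if (acceptBang) {
    const bang = html.indexOf("--!>", start);
    if (bang !== -1) ends.push(bang + 4);
  }
  // -1 means no terminator was found: everything after `start`
  // stays inside the comment.
  return ends.length === 0 ? -1 : Math.min(...ends);
}

const page = "<!-- note --!><p>Visible text</p>";
const lenient = findCommentEnd(page, 4, true);  // treats --!> as an end
const strict = findCommentEnd(page, 4, false);  // only --> ends a comment
```

With acceptBang set, the comment ends right after --!> and the paragraph is parsed as content. Without it, no terminator is found in this input, so the paragraph would be swallowed into the comment.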
Another issue is that the new parser doesn’t yet inhibit
document.write() in places where it shouldn’t be
allowed per spec but where the old parser allowed it.
Is There Anything New?
So what’s fun if success is that you notice no change? There are
important technical things under the hood—like TCP packet
boundaries not affecting the parse result and there never being
unnotified nodes in the tree when the event loop spins—but you
aren’t supposed to notice.
However, there is a major new visible feature, too. With the HTML5
parser, you can use SVG and MathML in text/html pages.
This means that you can:
And yes, you can even put SVG inside MathML <annotation-xml>
or MathML inside <foreignObject>. The mixing
you’ve seen in XML is now supported in HTML, too.
If you aren’t concerned with taking the steps to make things
degrade nicely in browsers that don’t support SVG and MathML in
HTML, you can simply copy and paste XML output from your favorite SVG
or MathML editor into your HTML source as long as the editor doesn’t
use namespace prefixes for elements and uses the prefix xlink
for XLink attributes.
If you don’t use the XML empty element syntax and you put your
SVG text nodes in CDATA sections, the page will degrade gracefully in
older HTML browsers so that the image simply disappears but the rest
of the page is intact. You can even put a fallback bitmap as <img>
inside <desc>. Unfortunately, there isn’t a
similar technique for MathML, though if you want to develop one, I
suggest experimenting with the <annotation> as
your <desc>-like container.
There are known issues with matching camelCase names with Selectors
or getElementsByTagName, though.
Posted in Browsers, Processing Model, Syntax | 8 Comments »
Friday, March 27th, 2009
I put together a new release of the Validator.nu HTML Parser. This is a highly recommended update for everyone who is using a previous version of the parser in an application.
- Fixed an issue where under rare circumstances attribute values were leaking into element content.
- Fixed a bug where isindex processing added attributes to all elements that were supposed to have no attributes.
- Implemented spec changes. (Too numerous to enumerate, but, as a highlight, framesets parse much better now.)
- Moved to WebKit-style foster parenting.
- Changed the API for tree builder subclasses again due to new constraints. If you have previously written your own tree builder subclass, you need to change it.
- Fixed the bundled XML serializer.
- Made it possible to generate a C++ version that does not leak memory from the Java source.
- Removed the C++ translator from the release. (Get it from SVN.)
Posted in Processing Model, Syntax | No Comments »
Friday, September 12th, 2008
There has been a certain amount of controversy over the supposed date of 2022 for HTML 5 to be "finished". It is somewhat important to realise the significance that should be attached to this date:
None at all
OK, strictly speaking that's not quite true, but it's a pretty good approximation to the truth. What really matters is when browsers ship HTML5 features. Given that's already happening, there is really no cause for alarm. By 2022 we hope to have a full testsuite and two full implementations but then we also expect to see products shipping with features from HTML 6.
Posted in Processing Model, WHATWG | 4 Comments »
Wednesday, August 6th, 2008
Welcome to a new semi-regular column, "This Week in HTML 5," where I'll try to summarize the major activity in the ongoing standards process in the WHATWG and W3C HTML Working Group.
The biggest news is the birth of the Web Workers draft specification. Quoting the spec, "This specification defines an API that allows Web application authors to spawn background workers running scripts in parallel to their main page. This allows for thread-like operation with message-passing as the coordination mechanism." This is the standardization of the API that Google Gears pioneered last year. See also: initial Workers thread, announcement of new spec, response to Workers feedback.
Also notable this week: even more additions to the Requirements for providing text to act as an alternative for images. 4 new cases were added:
- A link containing nothing but an image
- A group of images that form a single larger image
- An image not intended for the user (such as a "web bug" tracking image)
- Text that has been rendered to a graphic for typographical effect
Additionally, the spec now tries to define what authors should do if they know they have an image but don't know what it is. Quoting again from the spec:
If the src attribute is set and the alt attribute is set to a string whose first character is a U+007B LEFT CURLY BRACKET character ({) and whose last character is a U+007D RIGHT CURLY BRACKET character (}), the image is a key part of the content, and there is no textual equivalent of the image available. The string consisting of all the characters between the first and the last character of the value of the alt attribute gives the kind of image (e.g. photo, diagram, user-uploaded image). If that value is the empty string (i.e. the attribute is just "{}"), then even the kind of image being shown is not known.
- If the image is available, the element represents the image specified by the src attribute.
- If the image is not available or if the user agent is not configured to display the image, then the user agent should display some sort of indicator that the image is not being rendered, and, if possible, provide to the user the information regarding the kind of image that is (as derived from the alt attribute).
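As a rough illustration of the rule quoted above, the following sketch classifies an alt value by its curly-bracket form. The function name classifyAlt and the shape of its return value are hypothetical and chosen only for this example; the spec text defines the rule, not an API.

```javascript
// Sketch of the quoted alt-text rule. classifyAlt and its return shape
// are hypothetical; the spec describes the rule, not this function.
function classifyAlt(alt) {
  // The rule applies only when the first character is "{" and the
  // last character is "}" (so the value is at least two characters).
  if (typeof alt !== "string" || alt.length < 2 ||
      alt[0] !== "{" || alt[alt.length - 1] !== "}") {
    return { braced: false, kind: null };
  }
  // The characters between the brackets give the kind of image.
  const kind = alt.slice(1, -1);
  // An empty string ("{}") means even the kind of image is not known.
  return { braced: true, kind: kind === "" ? null : kind };
}

const photo = classifyAlt("{photo}");
const unknown = classifyAlt("{}");
const ordinary = classifyAlt("A red bicycle");
```

Here photo.kind is "photo", unknown.kind is null because the value is just "{}", and ordinary.braced is false because the value is regular alternative text.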
See also: revision 1972, revision 1976, revision 1978, revision 1979, Images and alternate text.
Other interesting changes this week:
- revision 1951: define window.top
- revision 1956: "User agents must not run executable code embedded in the image resource."
- revision 1958: more notes on what is a valid image (a surprisingly difficult question)
- revision 1965: allow <a> elements to straddle paragraphs
- revision 1998: define what happens when you set onclick='' on a document outside a Window
- revision 1999: define javascript: in Window-less environments
- revision 2001: define 'directionality' in terms of the dir='' attribute for cases where the 'direction' property has no computed value
- revision 2002: define processing for the second argument to toDataURL() for image/jpeg
- revision 2004: specify how to handle transparent images in the toDataURL() method
- revision 2008: make patterns required in the <canvas> API
- revision 2016: when <script type=''> is given, it must match the type of the script, even if the script is JavaScript
- revision 2019: remove autosubmit='' from the <menu> element
Tune in next week for another exciting episode of "This Week in HTML 5."
Posted in Processing Model, Weekly Review, WHATWG | 21 Comments »