Wednesday, July 8th, 2009
The HTML5 parsing algorithm is meant to demystify HTML parsing and
make it uniform across implementations in a backwards-compatible way.
The algorithm has had “in the lab” testing, but so far it hasn’t
been tested inside a browser by a large number of people. You
can help change that now!
A while ago, an implementation of the HTML5 parsing algorithm
landed on mozilla-central preffed off. Anyone who is testing Firefox
nightly builds can now opt to turn on the HTML5 parser and test it.
How to Participate?
First, this isn’t release-quality software. Testing the HTML5
parser carries all the same risks as testing a nightly build in
general, and then some. It may crash, it may corrupt your Firefox
profile, etc. If you aren’t comfortable with taking the risks
associated with running nightly builds, you shouldn’t participate.
If you are still comfortable with testing, download a trunk nightly
build, run it, navigate to about:config and flip the preference named
html5.enable to true. This makes Gecko use the HTML5 parser when
loading pages into the content area and when setting innerHTML. The
HTML5 parser is not yet used for HTML embedded in feeds, Netscape
bookmark import, View Source, etc.
The html5.enable preference doesn’t require a restart to take effect.
It takes effect the next time you load a page.
What to Test?
The main thing is getting the HTML5 parser exposed to a wide range
of real Web content that people browse. This may turn up crashes or
compatibility problems.
So the way to help is to use nightly builds with the HTML5 parser
for browsing as usual. If you see no difference, things are going
well! If you see a page misbehaving—or, worse, crashing—with the
HTML5 parser turned on but not with it turned off, please report the
problem.
Reporting Bugs
Please file bugs in the “Core” product under the “HTML: Parser”
component with “[HTML5]” at the start of the summary.
Known Problems
First and foremost, please refer to the list
of known bugs.
However, I’d like to highlight a particular issue: Support for
comments ending with --!> is in the spec, but the patch hasn’t landed
yet. Support for similar endings of pseudo-comment escapes within
script element content is not in the spec yet. The practical effect is
that the rest of the page may end up being swallowed up inside a
comment or a script element.
Another issue is that the new parser doesn’t yet inhibit
document.write() in places where it shouldn’t be allowed per spec but
where the old parser allowed it.
Is There Anything New?
So what’s fun if success is that you notice no change? There are
important technical things under the hood—like TCP packet
boundaries not affecting the parse result and there never being
unnotified nodes in the tree when the event loop spins—but you
aren’t supposed to notice.
However, there is a major new visible feature, too. With the HTML5
parser, you can use SVG and MathML in text/html pages.
This means that you can:
And yes, you can even put SVG inside MathML <annotation-xml> or MathML
inside <foreignObject>. The mixing you’ve seen in XML is now supported
in HTML, too.
If you aren’t concerned with taking the steps to make things
degrade nicely in browsers that don’t support SVG and MathML in
HTML, you can simply copy and paste XML output from your favorite SVG
or MathML editor into your HTML source as long as the editor doesn’t
use namespace prefixes for elements and uses the prefix xlink
for XLink attributes.
If you don’t use the XML empty element syntax and you put your
SVG text nodes in CDATA sections, the page will degrade gracefully in
older HTML browsers so that the image simply disappears but the rest
of the page is intact. You can even put a fallback bitmap as <img>
inside <desc>. Unfortunately, there isn’t a similar technique for
MathML, though if you want to develop one, I suggest experimenting
with the <annotation> element as your <desc>-like container.
There are known issues with matching camelCase names with Selectors
or getElementsByTagName, though.
Posted in Browsers, Processing Model, Syntax | 8 Comments »
Monday, May 25th, 2009
Version 1.2.1 of the Validator.nu HTML Parser is now available. It fixes an incompatibility with the DOM implementation of the latest Xerces.
Posted in DOM, Processing Model, Syntax | Comments Off on Validator.nu HTML Parser 1.2.1
Friday, March 27th, 2009
I put together a new release of the Validator.nu HTML Parser. This is a highly recommended update for everyone who is using a previous version of the parser in an application.
- Fixed an issue where under rare circumstances attribute values were leaking into element content.
- Fixed a bug where isindex processing added attributes to all elements that were supposed to have no attributes.
- Implemented spec changes. (Too numerous to enumerate, but, as a highlight, framesets parse much better now.)
- Moved to WebKit-style foster parenting.
- Changed the API for tree builder subclasses again due to new constraints. If you have previously written your own tree builder subclass, you need to change it.
- Fixed the bundled XML serializer.
- Made it possible to generate a C++ version that does not leak memory from the Java source.
- Removed the C++ translator from the release. (Get it from SVN.)
Posted in Processing Model, Syntax | Comments Off on Validator.nu HTML Parser 1.2.0
Monday, August 25th, 2008
I have released a new version of the Validator.nu HTML Parser (an implementation of the HTML5 parsing algorithm in Java). The new release supports SVG and MathML subtrees, is faster than the old version, fixes bugs, is more portable and supports applications that want to do document.write().
The parser comes with a sample app that makes it possible to use XSLT programs written for XHTML5+SVG+MathML with text/html.
Warning! The internal APIs have changed. Please refer to the Upgrade Guide below.
Change Log
- Made the SAX, DOM and XOM parser entry point constructors default to altering the infoset instead of throwing when the input needs coercing to be an XML 1.0 4th ed. plus Namespaces infoset.
- Isolated Java IO dependent code from the parser core. The parser core now compiles on Google Web Toolkit.
- Refactored the tokenizer to use a switch branch per state instead of a method per state.
- Made various performance tweaks to the tokenizer.
- Implemented support for MathML and SVG foreign content. (Note that the SVG part is based on spec text that has been commented out from the spec at the request of the SVG WG.)
- Made the parser suspendable after any input character.
- Made it possible for custom TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
- Made changes to the parser core to make it more suitable for mechanical translation into other object-oriented programming languages that have C-like control structures but not necessarily a garbage collector (with focus on targeting C++). This work is not complete.
- Made the HTML serializer do the right thing when input represents a conforming XHTML+SVG+MathML tree. (Results may be bad for non-conforming input trees.)
- Developed sample programs for converting between HTML5 and XHTML5 when the input is known to be conforming.
- Provided an XML serializer so that the sample code no longer depends on the Xalan serializer.
- Improved API documentation.
- Fixed bugs in the tokenizer, tree builder and the input stream character encoding decoder.
- Made coercion to an XML infoset work according to the HTML5 spec.
- Added ID uniqueness checking.
- Various other fixes.
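Two of the items above, the switch branch per state and suspension after any input character, can be illustrated with a toy sketch. This is not the Validator.nu Tokenizer: the class name, the three states and the string token format are all assumptions made up for illustration. The point is only that when every bit of progress lives in fields and a single switch dispatches on the current state, the caller can stop feeding input after any character and resume later without changing the result.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy tokenizer, not the Validator.nu Tokenizer class.
class SwitchTokenizerSketch {
    // Hypothetical states; the real HTML5 tokenizer has many more.
    private enum State { DATA, TAG_OPEN, TAG_NAME }

    private State state = State.DATA;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> tokens = new ArrayList<>();

    // Feed any chunk of input. All progress lives in fields, so the caller
    // may stop after any character and resume later with the next chunk.
    void feed(String chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            switch (state) {
                case DATA:
                    if (c == '<') {
                        flush("text:");
                        state = State.TAG_OPEN;
                    } else {
                        buffer.append(c);
                    }
                    break;
                case TAG_OPEN:
                    // Keep the first character after '<' (a '/' here marks an end tag).
                    buffer.append(c);
                    state = State.TAG_NAME;
                    break;
                case TAG_NAME:
                    if (c == '>') {
                        flush("tag:");
                        state = State.DATA;
                    } else {
                        buffer.append(c);
                    }
                    break;
            }
        }
    }

    // Signal end of input and return the tokens collected so far.
    List<String> end() {
        if (state == State.DATA) {
            flush("text:");
        }
        return tokens;
    }

    // Emit the buffered characters as one token, if there are any.
    private void flush(String prefix) {
        if (buffer.length() > 0) {
            tokens.add(prefix + buffer);
            buffer.setLength(0);
        }
    }

    public static void main(String[] args) {
        SwitchTokenizerSketch whole = new SwitchTokenizerSketch();
        whole.feed("<p>Hi</p>");
        System.out.println(whole.end()); // [tag:p, text:Hi, tag:/p]
    }
}
```

Feeding the same input one character at a time produces the same token list, which is the property that makes chunk boundaries irrelevant to the parse result.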
Upgrade Guide from 1.0.7 to 1.1.0
In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.
- If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:
If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.
If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.
- If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:
You do not need to change your code to upgrade.
- If you have your own subclass of TreeBuilder:
The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)
The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.
- If you have your own implementation of TokenHandler:
Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.
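The note above about comparing namespace URIs with == relies on string interning. As a minimal sketch of why that works, the following uses only the Java standard library. The class, constants and method are hypothetical names, not part of the parser's API, and the sketch assumes the caller really does pass interned strings, as the guide says the parser does.

```java
// Illustrative only: these names are hypothetical, not part of the
// Validator.nu TreeBuilder API.
class InternedNamespaceSketch {
    // Compile-time string constants are interned by the JVM.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

    // Identity comparison is safe only because the caller promises
    // that ns is an interned string.
    static String describe(String ns) {
        if (ns == SVG_NS) return "svg";
        if (ns == MATHML_NS) return "mathml";
        if (ns == XHTML_NS) return "html";
        return "other";
    }

    public static void main(String[] args) {
        // A string built at run time is a distinct object until interned.
        String built = new StringBuilder("http://www.w3.org/2000/").append("svg").toString();
        System.out.println(describe(built));          // other
        System.out.println(describe(built.intern())); // svg
    }
}
```

If your subclass ever receives namespace strings from a source other than the parser, use equals() or call intern() first, since == will silently fail to match an equal but non-interned string.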
Posted in Syntax | Comments Off on Validator.nu HTML Parser 1.1.0
Thursday, August 14th, 2008
Earlier, I blogged about running the Validator.nu HTML Parser inside Hixie’s Live DOM Viewer using the magic of the hosted mode of the Google Web Toolkit. Back then, a compiler bug in GWT 1.5 RC1 prevented the parser from running as JavaScript in the Web mode. Google has now released GWT 1.5 RC2, which contains a fix for the bug.
So without further ado, here’s Live DOM Viewer with an HTML5 parser running as JavaScript in your browser.
Try pasting in the SVG lion or some MathML in Firefox 3 and Opera 9.5.
Known problems:
- SVG use does not work in Firefox. Update: Fixed in Minefield nightlies.
- SVG does not render in Safari.
- IE does not support createElementNS and, thus, does not work at all.
A big thanks to the GWT team for making this work!
Posted in DOM, Syntax | Comments Off on HTML5 Live DOM Viewer—Now in Your Browser