WHATWG Weekly: Encoding woes and WebVTT
If you want to contribute to the WHATWG Blog or Wiki, join IRC (#whatwg
on Freenode); unfortunately, we had to shut down user registration due to excessive spam. Welcome to another WHATWG Weekly. If it were themed, this one would be about Sinterklaas.
Encoding problem
In response to Faruk Ateş' plea for defaulting to UTF-8, David Baron explained the platform encoding problem. Currently the default encoding varies per user (depending primarily on locale) and sites rely on these locale-specific defaults. Such a site will render differently for a user on a Dutch computer than for a user on a Chinese computer, because its byte stream will be decoded using a different encoding in each case. The implication is that the web is less global than it should be. How exactly we are to overcome the platform encoding problem, without everyone explicitly opting in to an encoding using <meta charset=utf-8>
(please do so if you are a web developer), is still unclear. Ideas welcome!
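To make the breakage concrete, here is a small sketch using Node.js's Buffer as a stand-in for a browser's decoder (the byte pair below is the UTF-8 encoding of "é"; this is an illustration, not code from the discussion):

```javascript
// The same byte stream, decoded with two different default encodings.
// 0xC3 0xA9 is "é" in UTF-8, but a legacy Western (Latin-1) default
// turns it into the classic mojibake "Ã©".
const bytes = Buffer.from([0xC3, 0xA9]);

const asUtf8 = bytes.toString("utf8");     // what the author intended
const asLatin1 = bytes.toString("latin1"); // what a locale-specific default may produce

console.log(asUtf8);   // é
console.log(asLatin1); // Ã©
```

The same two bytes, read under a Chinese or Cyrillic default, would produce yet other characters, which is exactly why sites that omit a declaration render differently per user.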
WebVTT
Revision 6837 made it possible for WebVTT to be published as a standalone Living Standard. It will primarily be developed by the Web Media Text Tracks Community Group on the [email protected] mailing list. WebVTT is the platform's captioning and subtitle format (for HTML video) and its development can be tracked on Twitter via @webvtt.
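For readers unfamiliar with the format, a minimal WebVTT file looks like this (the cue text and timings are invented for illustration):

```
WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, I am a caption.

00:00:05.000 --> 00:00:09.000
Cues are separated by blank lines.
```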
Video conferencing
The same revision that let WebVTT be published as a standalone document also removed everything related to peer-to-peer connections and video conferencing. The W3C Web Real-Time Communications Working Group forked our work in WebRTC 1.0: Real-time Communication Between Browsers, and we (the WHATWG) are okay with them working on it instead.
Miscellaneous
My colleague Karl has been blogging again on the W3C Blog; read his summaries from the weeks of November 14 and November 21.
Yours truly added native JSON support to XMLHttpRequest. Just set responseType to "json" and response will give you a JSON-decoded object once fetching is done.
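Since XMLHttpRequest only exists in browsers, here is a sketch that mocks just enough of it to show the new behavior (fakeXHR is a made-up helper for this post, not part of any API):

```javascript
// Mock of the relevant XMLHttpRequest behavior: with responseType set to
// "json", reading .response yields a JSON-decoded object instead of text.
function fakeXHR(rawBody) {
  const mock = { responseType: "", responseText: rawBody };
  Object.defineProperty(mock, "response", {
    get() {
      return mock.responseType === "json"
        ? JSON.parse(mock.responseText) // native JSON decoding
        : mock.responseText;            // default: raw text
    },
  });
  return mock;
}

const xhr = fakeXHR('{"answer": 42}');
xhr.responseType = "json";
console.log(xhr.response.answer); // 42
```

In a real browser you would set responseType = "json" before calling send(), and read response in the load event handler.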
> Ideas welcome
“HTML as a living standard” is designed to be compatible with legacy UAs. So there’s no way to get rid of the opt-in.
At least user agents should default to render undeclared content as UTF-8, and the standard should enforce that. It’s the only way to bring sites to declare their content correctly or recode it to the standard. I don’t think it would cause more mojibake than having the default encoding depend on the locale, a hidden system setting that most users never set or don’t even have the admin privileges to set. The locale should be used to choose a default language (and date format etc.), not an encoding. Conflating locale and encoding is just as wrong as using national flags for languages.
As 7-bit ASCII is a subset of UTF-8, ASCII pages wouldn’t break. For ISO-8859-x pages, there can be no reasonable expectation for any particular value of x.
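The ASCII point is easy to verify with a quick check (a Node.js sketch; Buffer and these encoding names are Node-specific, used only as a stand-in for a browser decoder):

```javascript
// Bytes in the 7-bit ASCII range decode to the same characters under
// ASCII, Latin-1, and UTF-8, so a UTF-8 default cannot break ASCII-only pages.
const ascii = Buffer.from("Hello, world!", "ascii");

console.log(ascii.toString("utf8"));   // Hello, world!
console.log(ascii.toString("latin1")); // Hello, world!
```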
When will folks accept that the phrase “Living Standard” is an oxymoron? Standards, by their very definition, do not change over time. Imagine if UTF-8 (previously mentioned) changed over time so that their byte streams started being decoded using a different encoding from the one in use today. What would that do to all the legacy content in existence? If you insist on working with a moving target and referencing it as something, then call it an evolving specification. But Standard? Hardly.
WebVTT *IS ONE OF* the platform’s captioning and subtitle formats, a fact well known by those in the know, but studiously not discussed by the majority of the browser vendors so far. However, a sneak peek of Internet Explorer 10 (http://ie.microsoft.com/testdrive/Graphics/VideoCaptions/) shows that at least Microsoft will also be supporting the industry standard TTML (Timed Text Markup Language) – and likely the superset of TTML developed by the Society of Motion Picture and Television Engineers (SMPTE) with their SMPTE-TT (SMPTE Timed Text) – which is TTML with a few additions to address the legacy needs of certain metadata requirements.
That said, work on WebVTT is indeed under way inside the W3C, with a goal of ensuring that WebVTT can deliver all of the existing functionality that TTML gives the commercial media producers today. This includes ensuring that there is a direct one-to-one mapping of all features/functions of TTML in WebVTT, as well as ensuring that WebVTT can deliver on the US Federally mandated CEA 608/708 Standards (which are living but unchanging) and the EBU-STL subtitles format (European Broadcasting Union format for broadcast subtitles files). If you are interested in this topic, I echo Anne’s invitation to become involved: http://www.w3.org/community/texttracks/
Funny that you bring UTF-8 up, John. It has changed. It previously supported characters of up to six bytes. And HTML recently defined its error handling. (These kinds of changes happen for pretty much any piece of technology on the web.)
And as for WebVTT, one browser supporting another format does not really make it a standard. Remember VML?
Lorna, as David explained in that email, it does cause significant breakage of existing content.
Anne,
Virtually every Flash-based media player today also supports TTML (or at least, DFXP, a profile of TTML) so it’s not like this is some “from outer space” format we are talking about. TTML has been accepted and adopted by many large-scale content producers, has been incorporated into many authoring systems and is a real and viable Standard today. It likely will remain so for some time to come.
If WebVTT can meet the needs of these existing content producers as an output format however, I think we will see that TTML will recede somewhat into the background, remaining an exchange format (it was designed to be both), and that those same large-scale content producers (who will have existing legacy TTML content) will convert to the more-web-friendly WebVTT format moving forward, as TTML, being XML based, will lend itself to that nicely.
My only point is that WebVTT is *not* the only web-format for time-stamping out there today – other options exist that have their benefits as well: truth in advertising.
I am not saying there are no other formats, I am saying there is only one viable format for the platform.
Anne,
LOL, with all due respect, that’s like saying there is only one viable codec for video on the platform – how’s that working out for y’all?
With Internet Explorer *still* holding the largest share of the global “platform”, (http://gs.statcounter.com/) and IE 10 (preview) natively supporting TTML, TTML *IS* a viable format today – DFXP (a profile of TTML) can also be used for “fallback” to flash-based players for legacy users, something that WebVTT cannot do today (http://www.longtailvideo.com/support/jw-player/22/making-video-accessible | http://www.cpcweb.com/solutions/web.htm).
“Please leave your sense of logic at the door” does not mean to suspend belief or ignore facts. Pretending TTML is not a viable time-stamp option for content creators is to ignore reality on the ground today.
What remains to be seen is whether any _other_ browser supports TTML; if just one more does, it will continue to tip the scales in favor of supporting both formats on “the platform” – something that IE is doing (now/soon).
As previously noted (in my earlier response) *IF* the WebVTT group at the W3C can also wrangle WebVTT to deliver everything that TTML can deliver today (so far, it can’t, but it’s close) *THEN* WebVTT will likely emerge as the de facto standard for a caption time-stamp format, which I’ve also already stated. But we are not there yet, something that mainstream readers of this blog should be aware of. Hiding inconvenient truths does nobody any service.
I’m afraid I fail to see your point. Internet Explorer also supported VML at one point, and tried XDomainRequest out for a while until they also implemented CORS for XMLHttpRequest. That TTML is out there is an unfortunate truth, but I’m not trying to hide it. I’m just stating what I think we will end up with.