Sniffing for RSS 1.0 feeds served as text/html
I recently found myself testing how browsers sniff for RSS 1.0 feeds that are served with an incorrect MIME type. (Yes, my life is full of delicious irony.) I thought I'd share my findings so far.
Firefox
Firefox's feed sniffing algorithm is located in nsFeedSniffer.cpp
. As you can see, starting at line 353, it takes the first 512 bytes of the page, looks for a root tag called rss
(for RSS 2.0), atom
(for Atom 0.3 and 1.0), or rdf:RDF
(for RSS 1.0). The RSS 1.0 marker is really a generic RDF marker, so it then does some additional checks for the two required namespaces of an RSS 1.0 feed, http://www.w3.org/1999/02/22-rdf-syntax-ns#
and http://purl.org/rss/1.0/
. This check is quite simple; it literally just checks for the presence of both strings, not caring whether they are the value of an xmlns
attribute (or indeed any attribute at all).
Firefox has an additional feature which tripped up my testing until I understood it. IE and Safari both have a mode where they essentially say "I detected this page as a feed and tried to parse it, but I failed, so now I'm giving up, and here's an error message describing why I gave up." Firefox does not have a mode like this. As far as I can tell, if it decides that a resource is a feed but then fails to parse the resource as a feed, it reparses the resource with feed handling disabled. So an non-well-formed feed served as application/rss+xml
will actually trigger a "Do you want to download this file" dialog, because Firefox tried to parse it as a feed, failed, then reparsed it as some-random-media-type-that-I-don't-handle. A non-well-formed feed served as text/html
will actually render as HTML, but only after Firefox silently tries (and fails) to parse it as a feed.
There's nothing wrong with this approach; in fact, it seems much more end-user-friendly than throwing up an incomprehensible error message. I just mention it because it tripped me up while testing.
Internet Explorer
Internet Explorer's feed sniffing algorithm is documented by the Windows RSS team. About RSS 1.0, it states:
IE7 detects a RSS 1.0 feed using the content types
application/xml
ortext/xml
. ... The document is checked for the strings<rdf:RDF
,http://www.w3.org/1999/02/22-rdf-syntax-ns#
andhttp://purl.org/rss/1.0/
. IE7 detects that it is a feed if all three strings are found within the first 512 bytes of the document. ... IE7 also supports other genericContent-Type
s by checking the document for specific Atom and RSS strings.
Now that I understand IE's algorithm, I have to concede that this documentation is 100% accurate. However, it doesn't tell the full story. Here's what actually happens. If the Content-Type
is
application/xml
text/xml
application/octet-stream
text/plain
text/html
- the empty string, or
- missing altogether
...then IE will trigger its feed sniffing. Once IE triggers its feed sniffing, it will never change its mind (unlike Firefox). If feed parsing fails, IE will throw up an error message complaining of feed coding errors or an unsupported feed format. The presence or absence of a charset
parameter in the Content-Type
header made absolutely no difference in any of the cases I tested.
And how exactly does IE detect an RSS 1.0 feed, once it decides to sniff? The documentation on MSDN is literally true: "The document is checked for the strings <rdf:RDF
, http://www.w3.org/1999/02/22-rdf-syntax-ns#
and http://purl.org/rss/1.0/
. IE7 detects that it is a feed if all three strings are found within the first 512 bytes of the document." Combined with our knowledge of which Content-Type
s IE considers "generic," we can conclude that the following page, served as text/html
, will be treated as a feed in IE:
<!-- <rdf:RDF --> <!-- http://www.w3.org/1999/02/22-rdf-syntax-ns# --> <!-- http://purl.org/rss/1.0/ --> <script>alert('Hi!');</script>
Why Bother?
I am working with Adam Barth and Ian Hickson to update draft-abarth-mime-sniff-01 (the content sniffing algorithm referenced by HTML5) to sniff RSS 1.0 feeds served as text/html
. It is unlikely that we will adopt IE's algorithm, since it seems unnecessarily pathological. I am proposing the following change, which would bring the content sniffing specification in line with Firefox's sniffing algorithm:
In the "Feed or HTML" section, insert the following steps between step 10 and step 11:
10a. Initialize /RDF flag/ to 0.
10b. Initialize /RSS flag/ to 0.
10c. If the bytes with positions pos to pos+23 in s are exactly equal to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x70, 0x75, 0x72, 0x6C, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x72, 0x73, 0x73, 0x2F, 0x31, 0x2E, 0x30, 0x2F respectively (ASCII for "http://purl.org/rss/1.0/"), then:
- Increase pos by 23.
- Set /RSS flag/ to 1.
10d. If the bytes with positions pos to pos+42 in s are exactly equal to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x77, 0x77, 0x77, 0x2E, 0x77, 0x33, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x31, 0x39, 0x39, 0x39, 0x2F, 0x30, 0x32, 0x2F, 0x32, 0x32, 0x2D, 0x72, 0x64, 0x66, 0x2D, 0x73, 0x79, 0x6E, 0x74, 0x61, 0x78, 0x2D, 0x6E, 0x73, 0x23 respectively (ASCII for "http://www.w3.org/1999/02/22-rdf-syntax-ns#"), then:
- Increase pos by 42.
- Set /RDF flag/ to 1.
10e. Increase pos by 1.
10f. If /RDF flag/ is 1, and /RSS flag/ is 1, then the /sniffed type/ of the resource is "application/rss+xml". Abort these steps.
10g. If pos points beyond the end of the byte stream s, then continue to step 11 of this algorithm.
10h. Jump back to step 10c of this algorithm.
Further Reading
You can see the results of my research to date and test the feeds for yourself. Because my research results are plain text with embedded HTML tags, I have added 512 bytes of leading whitespace to the page to foil browsers' plain-text-or-HTML content sniffing. Mmmm -- delicious, delicious irony.
Update: Belorussian translation is available.