Sniffing for RSS 1.0 feeds served as text/html
I recently found myself testing how browsers sniff for RSS 1.0 feeds that are served with an incorrect MIME type. (Yes, my life is full of delicious irony.) I thought I'd share my findings so far.
Firefox's feed sniffing algorithm is located in
nsFeedSniffer.cpp. As you can see, starting at line 353, it takes the first 512 bytes of the page, looks for a root tag called
rss (for RSS 2.0),
atom (for Atom 0.3 and 1.0), or
rdf:RDF (for RSS 1.0). The RSS 1.0 marker is really a generic RDF marker, so it then does some additional checks for the two required namespaces of an RSS 1.0 feed,
http://purl.org/rss/1.0/. This check is quite simple; it literally just checks for the presence of both strings, not caring whether they are the value of an
xmlns attribute (or indeed any attribute at all).
Firefox has an additional feature which tripped up my testing until I understood it. IE and Safari both have a mode where they essentially say "I detected this page as a feed and tried to parse it, but I failed, so now I'm giving up, and here's an error message describing why I gave up." Firefox does not have a mode like this. As far as I can tell, if it decides that a resource is a feed but then fails to parse the resource as a feed, it reparses the resource with feed handling disabled. So an non-well-formed feed served as
application/rss+xml will actually trigger a "Do you want to download this file" dialog, because Firefox tried to parse it as a feed, failed, then reparsed it as some-random-media-type-that-I-don't-handle. A non-well-formed feed served as
text/html will actually render as HTML, but only after Firefox silently tries (and fails) to parse it as a feed.
There's nothing wrong with this approach; in fact, it seems much more end-user-friendly than throwing up an incomprehensible error message. I just mention it because it tripped me up while testing.
Internet Explorer's feed sniffing algorithm is documented by the Windows RSS team. About RSS 1.0, it states:
IE7 detects a RSS 1.0 feed using the content types
text/xml. ... The document is checked for the strings
http://purl.org/rss/1.0/. IE7 detects that it is a feed if all three strings are found within the first 512 bytes of the document. ... IE7 also supports other generic
Content-Types by checking the document for specific Atom and RSS strings.
Now that I understand IE's algorithm, I have to concede that this documentation is 100% accurate. However, it doesn't tell the full story. Here's what actually happens. If the
- the empty string, or
- missing altogether
...then IE will trigger its feed sniffing. Once IE triggers its feed sniffing, it will never change its mind (unlike Firefox). If feed parsing fails, IE will throw up an error message complaining of feed coding errors or an unsupported feed format. The presence or absence of a
charset parameter in the
Content-Type header made absolutely no difference in any of the cases I tested.
And how exactly does IE detect an RSS 1.0 feed, once it decides to sniff? The documentation on MSDN is literally true: "The document is checked for the strings
http://purl.org/rss/1.0/. IE7 detects that it is a feed if all three strings are found within the first 512 bytes of the document." Combined with our knowledge of which
Content-Types IE considers "generic," we can conclude that the following page, served as
text/html, will be treated as a feed in IE:
<!-- <rdf:RDF --> <!-- http://www.w3.org/1999/02/22-rdf-syntax-ns# --> <!-- http://purl.org/rss/1.0/ --> <script>alert('Hi!');</script>
I am working with Adam Barth and Ian Hickson to update draft-abarth-mime-sniff-01 (the content sniffing algorithm referenced by HTML5) to sniff RSS 1.0 feeds served as
text/html. It is unlikely that we will adopt IE's algorithm, since it seems unnecessarily pathological. I am proposing the following change, which would bring the content sniffing specification in line with Firefox's sniffing algorithm:
In the "Feed or HTML" section, insert the following steps between step 10 and step 11:
10a. Initialize /RDF flag/ to 0.
10b. Initialize /RSS flag/ to 0.
10c. If the bytes with positions pos to pos+23 in s are exactly equal to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x70, 0x75, 0x72, 0x6C, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x72, 0x73, 0x73, 0x2F, 0x31, 0x2E, 0x30, 0x2F respectively (ASCII for "http://purl.org/rss/1.0/"), then:
- Increase pos by 23.
- Set /RSS flag/ to 1.
10d. If the bytes with positions pos to pos+42 in s are exactly equal to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x77, 0x77, 0x77, 0x2E, 0x77, 0x33, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x31, 0x39, 0x39, 0x39, 0x2F, 0x30, 0x32, 0x2F, 0x32, 0x32, 0x2D, 0x72, 0x64, 0x66, 0x2D, 0x73, 0x79, 0x6E, 0x74, 0x61, 0x78, 0x2D, 0x6E, 0x73, 0x23 respectively (ASCII for "http://www.w3.org/1999/02/22-rdf-syntax-ns#"), then:
- Increase pos by 42.
- Set /RDF flag/ to 1.
10e. Increase pos by 1.
10f. If /RDF flag/ is 1, and /RSS flag/ is 1, then the /sniffed type/ of the resource is "application/rss+xml". Abort these steps.
10g. If pos points beyond the end of the byte stream s, then continue to step 11 of this algorithm.
10h. Jump back to step 10c of this algorithm.
You can see the results of my research to date and test the feeds for yourself. Because my research results are plain text with embedded HTML tags, I have added 512 bytes of leading whitespace to the page to foil browsers' plain-text-or-HTML content sniffing. Mmmm -- delicious, delicious irony.