The WHATWG Blog

Please leave your sense of logic at the door, thanks!

The URL Pattern Standard

Welcome to the newest standard maintained by the WHATWG: the URL Pattern Standard! The URL Pattern Standard gives a generic pattern syntax for matching URLs, and extracting the parts from them. It is inspired by the path-to-regexp library, although it extends beyond paths to encompass all the parts of a URL. You can read more about the API on MDN.

The URL Pattern Standard joins us as a graduation from the WICG, where it was authored by Ben Kelly. As part of the move to becoming a Living Standard, Jeremy Roman and Shunya Shishido are joining Ben as editors to help maintain and evolve the standard.

We see the URL Pattern Standard being adopted in many upcoming proposals, including speculation rules, compression dictionary transport, and service worker static routing. It has also seen adoption with implementations beyond web browsers, such as in Deno, Cloudflare Workers, Next.js Edge Runtime, and Netlify Functions. We are excited to provide a home for this primitive going forward.

Posted in What's Next, WHATWG | 1 Comment »

Retro-specifying fetch/performance

In the last year or so, my main task was to tackle some of the "specification technical debt" that had been accumulating over a few years, specifically at the crossroads of Fetch, performance APIs, and the HTML Standard.

Monkey patching and hand waving

When starting this work, the behaviors of some of the features relating to web performance were specified in the form of "monkey patching" — describing how another spec needs to be modified rather than modifying it; or "hand waving" — describing the expected behavior of a feature in general or vague terms.

These ways of specifying have an advantage at early stages of specification, and allow moving fast without requiring consensus prematurely.

However, as the specs mature and become widely used, these practices accumulate and come with a few costs.

Interoperability

Some details and flows may be too vaguely defined in the hand-waved description or monkey-patch. This, in turn, creates interoperability issues — one browser vendor may interpret a loosely defined phrase in some way, and another vendor in a different way.

Example: Resource Timing specified that loading images should result in a resource timing entry. However, it was not specified (or tested) whether those entries should be added before or after the image's load event.

Bugs and security flaws

When having to spell out all the details of a previously loose spec, sometimes bugs and security/privacy flaws emerge. This has happened as part of this work more than once, for example in w3c/resource-timing#260: Resource Timing was exposing some of its properties for cross-origin resources, and by specifying it more rigorously we found that some properties that were exposed should actually be hidden.

The role of web platform tests

Of course, writing a few words into a standard doesn't fix interoperability issues or security flaws — it requires the two other pillars, tests and implementations. A big part of this debt-paying work involves adding missing coverage to web platform tests, and posting implementation bugs.

Work done so far

Work in progress

Retroactive consensus

Some of the features in question are already shipped, and vendors took liberties to interpret them in different ways. This raises a challenge of achieving retroactive consensus. Vendors might have good reasons to do things differently, but the differences make it difficult for web developers to write interoperable code. In the area of performance APIs, interoperability issues mean that performance measurements or hints might create different results across browsers.

Because the feature is already shipped, achieving consensus retroactively means that browser vendors would have to go back to their implementations and change things, which apart from the work involved could potentially make things suboptimal for some existing cases. Is interoperability worth it? That's a good question and the answer is dynamic.

A current example of this is the attempts to specify prefetch. It is a relatively old feature that is implemented differently across browsers and those two facts make it difficult to spec after the fact. Another bigger example is the discussion about navigation start time being the zero point, which is a cross-origin information leak, but where fixing it would have big implications.

What's next

The technical debt in the fetch/HTML/performance crossroads is a little smaller than before, but still far from being paid. Apart from the in-progress pull requests, some of the remaining debt is tracked here.

How you could help

Apart from the retroactive consensus challenge, an additional challenge is involving more people in the conversation. You can help by reading the issues in progress, forming opinions, finding flaws, posting what you find, proactively becoming involved. If you're not part of the conversation already and are curious, follow the links in this post, read, ask questions, and make suggestions. We'd love to hear from you!

Posted in Processing Model | Comments Off on Retro-specifying fetch/performance

New Living Standards

The last time we introduced a new Living Standard was Infra, in 2016. This year has seen a flurry of activity, with four new standards joining the WHATWG!

The Web IDL Standard defines the interface language and JavaScript mapping for all web platform APIs. It migrated to the WHATWG from its old location, on the personal GitHub page of its original editor Cameron McCormack. Thanks to Cameron for his many years of stewardship, and thanks to the current editors Edgar Chen and Tiancheng "Timothy" Gu for their help in the move!

The Test Utils Standard defines APIs that are not exposed on the open web by default, but are specifically useful in testing web browser functionality. So far it defines the testUtils.gc() method, allowing us to test garbage collection-dependent APIs. James Graham is the editor.

The WebSockets Standard consolidates content that was formerly spread across the HTML Standard and the Fetch Standard. This new home will provide a natural place for standardizing the WebSocketStream API, which integrates WebSockets with streams. Adam Rice is the editor.

Finally, the File System Standard will specify an API for an origin-private filesystem, drawn from part of the existing File System Access specification as well as the AccessHandle proposal. (The portions of the File System API specification for accessing the local file system will remain in incubation, until they gather multi-implementer interest.) Marijn Kruisselbrink will be the editor.

We're happy to see such excitement about working in the WHATWG, and will strive to continue to provide a welcoming community where new features can be developed or existing incubations can graduate. We hope you enjoy these new Living Standards. As always, feel free to join us on GitHub to discuss improvements and additions. And if there are more specifications that would like to become WHATWG Living Standards, please get in touch!

Posted in What's Next, WHATWG | Comments Off on New Living Standards

Newline normalizations in form submission

If you work with form submissions, you might have noticed that form values containing newlines are normalized to CRLF, no matter whether the DOM value had LF or CR instead:

<form action="./post" method="post" enctype="application/x-www-form-urlencoded">
  <input type="hidden" name="hidden" value="a&#x0D;b&#x0A;c&#x0D;&#x0A;d" />
  <input type="submit" />
</form>

<script>
  // Checking that the DOM has the correct newlines and the normalization
  // happens during the form submission.
  const hiddenInput = document.querySelector("input[type=hidden]");
  console.log("%s", JSON.stringify(hiddenInput.value)); // "a\rb\nc\r\nd"
</script>
hidden=a%0D%0Ab%0D%0Ac%0D%0Ad

But although it might seem simple on the surface, newline normalization in form submissions is a topic that runs deeper than I thought, with bugs in the spec and differences across browsers. This post goes through what the spec used to do, what browsers (used to) implement, and how we went about fixing it.

First, some background on form submission

The data to submit from a form is modeled as an entry list – entries being pairs of names (strings) and values (either strings or File objects). This is a list rather than a map because a form can have multiple values for each name – which is is how <input type="file" multiple> and <select multiple> work – and their relative order with the rest of form entries matters.

The algorithm that does the job of going through every submittable element associated with a particular form and collecting their corresponding form entries is the "construct the entry list" algorithm. This algorithm does what you'd expect – discard disabled controls and buttons not pressed, and then for each control it calls the "append an entry" algorithm, which used to replace any newlines in the entry name and value (if that value is a string) with CRLF before appending the entry.

"Construct the entry list" is called early into the form submission algorithm, and the resulting entry list is then passed to the encoding algorithm for the corresponding enctype. Since only the multipart/form-data enctype supports uploading files, the algorithms for both application/x-www-form-urlencoded and text/plain encode the value's filename instead.

First signs of trouble

My first foray into the encoding of form payloads was in defining precisely how entry names (and filenames of file entry values) had to be escaped in multipart/form-data payloads, and since LF and CR have to be percent escaped, newlines came up during testing.

One thing I noticed is that, if you have newlines inside a filename – yes, that is something you can do – they're normalized differently than for an entry name or a string value.

<form id="form" action="./post" method="post" enctype="multipart/form-data">
  <input type="hidden" name="hidden a&#x0D;b" value="a&#x0D;b" />
  <input id="fileControl" type="file" name="file a&#x0D;b" />
</form>

<script>
  // A file with filename "a\rb", empty contents, and "application/octet-stream"
  // MIME type.
  const file = new File([], "a\rb");

  const dataTransfer = new DataTransfer();
  dataTransfer.items.add(file);
  document.getElementById("fileControl").files = dataTransfer.files;

  document.getElementById("form").submit();
</script>

Here is the resulting multipart/form-data payload in Chrome and Safari (newlines are always CRLF):

------WebKitFormBoundaryjlUA0jn3NUYxIh2A
Content-Disposition: form-data; name="hidden a%0D%0Ab"

a
b
------WebKitFormBoundaryjlUA0jn3NUYxIh2A
Content-Disposition: form-data; name="file a%0D%0Ab"; filename="a%0Db"
Content-Type: application/octet-stream


------WebKitFormBoundaryjlUA0jn3NUYxIh2A--

And this is in Firefox 88 (the current stable version as of this writing):

-----------------------------26303884030012461673680556885
Content-Disposition: form-data; name="hidden a b"

a
b
-----------------------------26303884030012461673680556885
Content-Disposition: form-data; name="file a b"; filename="a b"
Content-Type: application/octet-stream


-----------------------------26303884030012461673680556885--

As you can see, Firefox substitutes a space for any newlines (CR, LF or CRLF) in the multipart/form-data encoding of entry names and filenames, rather than percent-encoding them as do Chrome and Safari. This behavior was made illegal in the spec in pull request #6282, but it couldn't be fixed in Firefox until the spec decided on a normalization behavior. In the case of values, Firefox normalizes to CRLF as the other browsers do.

As for Chrome and Safari, here we see that newlines in entry names and string values are normalized to CRLF, but filenames are not normalized. From the entry list construction algorithm as described above, this makes sense because entry values are only normalized to CRLF when they are strings – files are unchanged, and so are their filenames.

Except that, if you change the form's enctype in the above example to application/x-www-form-urlencoded, you get this in every browser:

hidden+a%0D%0Ab=a%0D%0Ab&file+a%0D%0Ab=a%0D%0Ab

Since multipart/form-data is the only enctype that allows file uploads, other enctypes use their filenames instead. But here it seems like every browser is CRLF-normalizing the filenames, even though in the spec that substitution happens long after constructing the entry list.

Normalizations with FormData and fetch

The FormData class started out as a way to send multipart/form-data form payloads through the XMLHttpRequest and fetch APIs without having to generate that payload in JavaScript. As such, FormData instances are basically a JS-accessible wrapper over an entry list.

So let's try the same with FormData:

const formData = new FormData();
formData.append("hidden a\rb", "a\rb");
formData.append("file a\rb", new File([], "a\rb"));

// FormData objects in fetch will always be serialized as multipart/form-data.
await fetch("./post", { method: "POST", body: formData });

Safari sends the same form payload as above, with names and values normalized to CRLF, and so does Firefox 88 with values normalized to CRLF (and names and values having their newlines escaped as spaces). But Chrome keeps names, filenames and values unnormalized (here the ? character stands for CR):

------WebKitFormBoundarySMGkMfD8mVOnmGDP
Content-Disposition: form-data; name="hidden a%0Db"

a?b
------WebKitFormBoundarySMGkMfD8mVOnmGDP
Content-Disposition: form-data; name="file a%0Db"; filename="a%0Db"
Content-Type: application/octet-stream


------WebKitFormBoundarySMGkMfD8mVOnmGDP--

Since FormData is just a wrapper over an entry list, and fetch simply calls the multipart/form-data encoding algorithm, no normalizations should take place. So it looks like Chrome was following the spec here, while Firefox and Safari were apparently doing some newline normalization (for Firefox, on string values only) at the time of serializing as multipart/form-data.

With FormData you can also investigate what the "construct the entry list" algorithm does, since if you pass a <form> element to the FormData constructor, it will call that algorithm outside of a form submission context, and let you inspect the resulting entry list.

<form id="form">
  <input type="hidden" name="a&#x0D;b" value="a&#x0D;b" />
</form>

<script>
  const formData = new FormData(document.getElementById("form"));
  for (const [name, value] of formData.entries()) {
    console.log("%s %s", JSON.stringify(name), JSON.stringify(value));
  }
  // Firefox and Safari print: "a\rb" "a\rb"
  // Chrome prints: "a\r\nb" "a\r\nb"
  // These results don't depend on the form's enctype.
</script>

So it seems like Firefox and Safari are not normalizing as they construct the entry list, and instead normalize names and values at the time that they encode the form into an enctype. In particular, since the application/x-www-form-urlencoded and text/plain enctypes don't allow file uploads, file entry values are substituted with their filenames before the normalization. Entry lists that aren't created from the "construct an entry list" algorithm get normalized all the same.

Chrome instead follows the specification (as it used to be) in normalizing in "construct an entry list" and not normalizing later, even for entry lists created through other means. But that doesn't explain why filenames in the application/x-www-form-urlencoded and text/plain enctypes are normalized. Does Chrome also have an additional normalization layer?

Investigating late normalization with the formdata event

It would be great to investigate in more detail what Chrome and other browsers do after constructing the entry list. Since the entry list construction already normalizes entries, any further normalizations that might happen further down the line are obscured in the common case.

In the case of multipart/form-data, we can test this because using a FormData object with fetch doesn't invoke "construct an entry list", and so can see what happens to unnormalized entries. For other enctypes there is no way to create an entry list that doesn't go through "construct an entry list", but as it turns out, the "construct an entry list" algorithm itself offers two ways to end up with unnormalized entries: form-associated custom elements (only implemented in Chrome so far) and the formdata event (implemented in Chrome and Firefox). Here we'll only be covering the latter, since their results are equivalent.

One thing I skipped when I covered the "construct an entry list" algorithm above is that, at the end of the algorithm, after all entries corresponding to controls have been added to the entry list, a formdata event is fired on the relevant <form> element. This event has a formData attribute which allows you not only to inspect the entry list at that point, but to modify it.

<form
  id="form"
  action="./post"
  method="post"
  enctype="application/x-www-form-urlencoded"
>
  <!-- Empty -->
</form>

<script>
  const form = document.getElementById("form");
  form.addEventListener("formdata", (evt) => {
    evt.formData.append("string a\rb", "a\rb");
    evt.formData.append("file a\rb", new File([], "a\rb"));
  });
  form.submit();
</script>

For both Chrome and Firefox (not Safari because it doesn't support the formdata event), trying this with the application/x-www-form-urlencoded enctype gets you a normalized result:

string+a%0D%0Ab=a%0D%0Ab&file+a%0D%0Ab=a%0D%0Ab

Firefox shows the same normalizations for the text/plain enctype; Chrome instead normalizes only filenames, not names and values. And with multipart/form-data we get the same result as with fetch and FormData above: Chrome doesn't normalize anything, Firefox normalizes string values (with names and filenames being replaced with spaces).

So in short:

Remember that these differences across browsers don't really affect the encoding of normal forms, they only matter if you're using FormData with fetch, the formdata event, or form-associated custom elements.

Fixing the spec

So which behavior do we choose? Firefox replacing newlines in multipart/form-data names and values with a space is illegal as per PR #6282, but anything else is fair game.

For text/plain, we have Firefox and Safari behaving in the same way, and Chrome disagreeing. Since text/plain cannot represent inputs unambiguously, is little used in any case, and you would need either form-associated custom elements or to use the formdata event to see a difference, it seems extremely unlikely that there is web content that depends on either case. So it makes more sense to treat text/plain just like application/x-www-form-urlencoded and normalize names, filenames and values.

For multipart/form-data, there is the added compatibility risk that you can observe this case by using FormData and fetch, so it's more likely to cause webcompat issues, no matter whether we went with Safari's or Chrome's behavior. In the end we choose to go with Safari's, in order to be consistent at normalizing all enctypes – although the multipart/form-data has to be different in not normalizing filenames, of course.

So in pull request #6287 we fixed this by:

  1. Adding a new algorithm that runs before the application/x-www-form-urlencoded and text/plain serializers. This algorithm first extracts a filename from the entry value, if it's a file, and then CRLF-normalizes both the name and value.
  2. We changed the multipart/form-data encoding algorithm to have a first step that CRLF-normalizes names and string values, leaving file values intact.

Do we need the early normalization though?

At the time that we decided the above changes to the spec, I still thought that the spec's (and Chrome's) behavior of normalizing in "construct the entry list" was correct. But later on I realized that, once we have the late normalizations mentioned above, the early normalization in "construct the entry list" doesn't matter for form submission, since the late normalizations do everything the early one can do and more. The only way you could observe whether that early normalization is present or not is through the FormData constructor. So it would make sense to remove that early normalization and standardize on Firefox and Safari's behavior here, as we did in pull request #6624.

One kink remained, though: <textarea> elements support the wrap="hard" attribute, which adds linebreaks into the submitted value corresponding to how the text is linewrapped in the UI. In the spec, this is done through the "textarea wrapping transformation", which takes the textarea's "raw value", normalizes it to CRLF, and when wrap="hard", adds CRLF newlines to wrap the contents. But if you test this on Safari, all newlines (both normalized and added) are LF – and Firefox currently doesn't implement wrap="hard", but it does normalize newlines to LF. So should this be changed?

I thought it was better to align on Firefox's and Safari's behavior, especially since this could simplify the mess that is the difference between "raw value", "API value" and "value" for <textarea> in the spec. Chrome disagreed at first – but as it turns out, Chrome's implementation of the textarea wrapping transformation normalizes to LF, and it's only the normalization in "construct an entry list" that normalizes those newlines to CRLF. So Chrome would align with Firefox and Safari in that area by just removing the early normalization.

Pull request #6697 fixes this issue with <textarea> in the spec, and the follow-up issue #6662 will take care of simplifying the textarea wrapping transformation in the spec, as well as ensure that it matches implementations.

Implementation fixes

After those three pull requests, the current spec mandates that no normalization happen in "construct the entry list", and that the entry lists to be encoded with some enctype go through a transformation that depends on the enctype:

Safari implements the current spec behavior, and so doesn't need any fixes as a result of these spec changes. However, when working on them I noticed that Safari had a preexisting bug in that it wasn't converting entry names and string values into scalar value strings in the "construct an entry list" algorithm, which could lead to FormData objects containing DOMString values despite the WebIDL declaration. What's more, those lone surrogates would remain there until the time of serializing the form, and might show up in the form payload as WTF-8 surrogate byte sequences. This is bug 225299.

Firefox acts much like Safari, except in that it escapes newlines in multipart/form-data names and filenames as spaces rather than CRLF-normalizing in the case of names and then percent-encoding. With the normalization now defined in the spec, this could now be changed. This is bug 1686765. I additionally also found the same bug as Safari with not converting strings into scalar value strings, except in this case the normalization did take place when encoding (bug 1709066). Both issues are now fixed in Firefox Nightly and will ship on Firefox 90.

Finally, Chrome would need to remove the normalization it performs in "construct an entry list", update the normalization it performs on the text/plain enctype to cover not only filenames but also names and values, and add an additional normalization for multipart/form-data names and values. This is covered in issue 1167095. Since Chrome is the browser that needs the most changes as a result of these spec changes, and they are understandably concerned with backwards compatibility, these fixes are expected to ship behind a flag so they can be quickly rolled back in case of trouble.

Conclusion

Although form submission is not necessarily seen as an easy topic, newline normalization in that context might not seem like such a big deal from the outside. But in a platform like the web, where spec and implementation decisions can easily snowball into compatibility concerns, even things that might look simple might have a history that makes standardizing on one behavior hard.

Posted in Browsers, Forms, WHATWG | Comments Off on Newline normalizations in form submission

Update from the Steering Group

A couple things the Steering Group has been working on recently reached a new milestone and seemed noteworthy enough to highlight to the wider WHATWG community.

As you may know the WHATWG formally collaborates with W3C on the DOM and HTML standards. In practice there is also a significant overlap in membership and sharing of ideas. We recently reached a new milestone in this endeavor as W3C has marked Review Drafts of the DOM and HTML standards as W3C Recommendations. We’d like to take this opportunity to thank the W3C community for their cooperation on these important web standards.

In response to community feedback we clarified “work in the field of web technologies” in the Contributor and Workstream Participant Agreement. And while for the most part Individuals and Entities alike have had no trouble signing up and contributing en masse to the WHATWG, we added the possibility for individuals to be invited. This has some similarities to W3C’s Invited Expert program and will be used for a select few cases where this might be worthwhile.

Finally, to make it easier to embed code from our standards into software, the BSD 3-Clause license can now be used for that purpose as stipulated in the IPR Policy.

Thanks to everyone for their continued feedback on these matters over the years!

Posted in WHATWG | Comments Off on Update from the Steering Group