Requirements for URL schemes for web applications. December 2007

The question of how to design URLs for web applications does not seem to get a lot of attention from web developers — although it is a very visible feature and often the first usability test a site has to pass. So many apps and websites out there come with unreadable, unhackable URLs and/or, even worse, do not produce meaningful filenames when you try “save as…” on a page or downloadable file. I wanted to get this right from the start, so I began by writing down the things that annoy me about how URLs are constructed elsewhere, added a few “nice to have” features and lessons learned from my own previous experiences, and compiled a requirements list that my scheme(1) should meet.

Requirements for a perfect URL scheme

When I talk about “modern web applications” here I have a rough concept of the state of the art in mind (buzzword alert!): AJAX interface with a fallback to server-driven interactivity, REST APIs(2), widgetizable applications and an object-oriented data model. Other websites may have different requirements, but most of what I say here will apply to a broad range of applications.

1. URLs should be human readable, easy to remember and robust for voice communication (e.g. phone)

One of the most important requirements for URLs is that they should make sense to a human. Machine-generated identifiers, deep folder hierarchies, special characters, strange extensions or query parameters are examples of URL parts that humans cannot easily parse or remember. More often than you might think, URLs are communicated by voice, over the phone or on the radio, so they should serialize to and deserialize from spoken language unambiguously (which, by the way, is also a strong case for case-insensitive or all-lowercase URLs).

Obviously this requirement cannot be met by a specification or scheme alone, but is also the application’s responsibility. For example, if you have a document with the name ‘qa4@@67g$O’, the best scheme will not help you provide a human readable URL.
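
Just to make this more concrete, here is a rough sketch (in Python, purely for illustration) of how an application might derive a readable, all-lowercase URL part from a document title; the function name and the exact rules are my own assumptions, not part of any specification:

    import re
    import unicodedata

    def slugify(title):
        # Transliterate accented characters to their closest ASCII equivalents.
        ascii_title = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore').decode('ascii')
        # Lowercase and collapse runs of non-alphanumeric characters into single dashes.
        return re.sub(r'[^a-z0-9]+', '-', ascii_title.lower()).strip('-')

    print(slugify('Requirements for a Perfect URL Scheme'))
    # -> requirements-for-a-perfect-url-scheme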

2. URLs should be short

While this requirement is slightly related to the previous one, it is not the same. A longer URL may be much easier to remember, as you can easily verify by trying to memorize one of those tinyurl.com-generated URLs. Still, partly because URLs have to be transmitted and displayed on devices with limited capabilities (think text messages), the URLs of our system should be naturally short, with no superfluous namespaces, extensions or query parameters.

3. The scheme should produce meaningful filenames & extensions for every page and file served

This is one of my top annoyances on the web, as most web applications fail to fulfill this requirement. Even with document servers that are supposed to provide files for download (e.g. PDFs), I frequently encounter missing extensions, let alone meaningful and unambiguous filenames. Whenever we serve a non-temporary page or document for viewing, it should be possible for the user to save that page to her local machine and get a meaningful filename with the correct filename extension.
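
One way to meet this, sketched below in Python with made-up helper names, is to derive the served filename from the resource name plus the format’s extension, and to additionally hint it through the Content-Disposition header for clients that ignore the URL path:

    # Illustrative only: the resource name would come from the application.
    EXTENSIONS = {'application/pdf': 'pdf', 'text/html': 'html', 'text/plain': 'txt'}

    def download_headers(resource_name, content_type):
        filename = '%s.%s' % (resource_name, EXTENSIONS[content_type])
        return [('Content-Type', content_type),
                ('Content-Disposition', 'inline; filename="%s"' % filename)]

    # e.g. a URL like /reports/annual-report-2007.pdf rather than /download?id=4711
    print(download_headers('annual-report-2007', 'application/pdf'))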

4. The scheme should make use of established specifications and conventions — filename extensions (‘.’), hierarchy (‘/’), queries (‘?’), fragments (‘#’)

Obviously I do not want to reinvent URLs, but build on top of established standards and conventions. The problem is that these standards emerged in a time before XML, JavaScript and AJAX, so we have to find an interpretation of the concepts offered to us that matches the requirements of modern web applications.

5. The scheme should use as few reserved characters as possible, leaving most of the allowed character set for the application

I want to avoid introducing reserved characters in addition to the ones mentioned above, partly because that would be a nonstandard extension of what we already have, and partly because human readability would suffer from additional non-letter characters.

6. URLs should be “hackable”

Closely related to the human readability requested in requirement 1, “hackability” means that parts of the URL can be removed or added by users, leading to deterministic, meaningful results. Some examples: removing everything after the last ‘/’ should give you an overview of the content one level up in the hierarchy, removing or changing extensions should change the content type served, and adding parts of a URL seen elsewhere on the site should result in similar behavior on any other resource, and so on. Note that a system with “hackable” URLs should also protect the user from accidentally modifying or deleting resources this way, so that users can safely play around with the URL scheme without causing damage.

7. URLs should be able to address resources AND actions to be performed on these resources

This is the major new requirement I see with modern web applications. When HTTP was designed, URLs were meant as a way to access resources, in most cases static files. The operations to be performed on these resources were pushed into the HTTP protocol in the form of the GET, PUT, POST, … methods. In modern web applications it has turned out that a predefined set of methods cannot cover all the different requirements that may emerge. We therefore need a way to express actions to be performed on resources within the URL, since the HTTP protocol provides no other mechanism for this.
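
I am not proposing a concrete convention in this post, but to illustrate the idea, a scheme could reserve the last path segment for an action name, roughly like this (Python, names made up for illustration):

    ACTIONS = {'edit', 'delete', 'history'}

    def split_resource_action(path):
        # e.g. /articles/my-first-post/edit -> ('articles/my-first-post', 'edit')
        segments = [s for s in path.strip('/').split('/') if s]
        if segments and segments[-1] in ACTIONS:
            return '/'.join(segments[:-1]), segments[-1]
        return '/'.join(segments), 'view'   # default action if none is given

    print(split_resource_action('/articles/my-first-post/edit'))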

8. Resources or results of an action should be presentable in different formats (HTML, XML, JSON, plain text, …), providing correct file name extensions

As static pages play a less important role in modern web applications and standards for the exchange of various data flavours emerge, it is important to be able to present a resource (e.g. an article) in different formats (e.g. HTML, plain text and RSS). New standards appear all the time and may become relevant later on, so the pool of formats to choose from should be expandable to support as yet unanticipated uses of the data.
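
A simple way to picture this is a mapping from filename extensions to content types that can grow over time; the sketch below (Python, illustrative only) assumes the format is selected through the extension:

    FORMATS = {
        'html': 'text/html',
        'txt':  'text/plain',
        'xml':  'application/xml',
        'json': 'application/json',
        'rss':  'application/rss+xml',
    }

    def content_type_for(path, default='text/html'):
        # Pick the content type from the extension, falling back to a default.
        name, dot, ext = path.rpartition('.')
        return FORMATS.get(ext, default) if dot else default

    print(content_type_for('/articles/my-first-post.json'))   # application/json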

9. The scheme should support content to be served in a format determined by the preferences of the client application

In addition to the explicit choice of format through the URL scheme discussed in the last item, the mechanism of implicit content format selection offered by HTTP through the “Accept” header(3) should be supported. The Accept header lets the client send a list of supported content types along with each request, allowing the server software to select the content type best suited for the task.

This way our system can send different responses to a mobile phone and a full-fledged web browser, although both are using the same URL to access the content.
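
The sketch below shows the rough idea of such negotiation; a real implementation would follow RFC 2616 section 14 more closely (wildcards, precedence rules), this one only honours the q parameter:

    def parse_accept(header):
        # Turn 'application/xml;q=0.9,text/html' into types ordered by preference.
        prefs = []
        for part in header.split(','):
            pieces = part.strip().split(';')
            media_type, q = pieces[0].strip(), 1.0
            for param in pieces[1:]:
                if param.strip().startswith('q='):
                    q = float(param.strip()[2:])
            prefs.append((q, media_type))
        return [m for q, m in sorted(prefs, reverse=True)]

    def negotiate(header, available):
        for media_type in parse_accept(header):
            if media_type in available:
                return media_type
        return available[0]   # fall back to the server's preferred type

    print(negotiate('application/xml;q=0.9,text/html', ['text/html', 'application/xml']))
    # -> text/html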

10. It should be possible to mix static files and generated content

Although static files play a less important role in modern web applications, it is obvious that every web application will need to serve static files from the file system to the client. While in production environments this is often solved with a dedicated subdomain (e.g. static.foo.com), it should still be possible to serve static files and generated content from a single server.
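
The dispatch logic for this can be as simple as the following sketch (paths and names are made up): if the requested path maps to an existing file under the static document root, serve it as-is, otherwise hand the request to the application:

    import os

    STATIC_ROOT = '/var/www/static'

    def is_static(path):
        candidate = os.path.normpath(os.path.join(STATIC_ROOT, path.lstrip('/')))
        # Make sure '..' segments cannot escape the document root.
        return candidate.startswith(STATIC_ROOT + os.sep) and os.path.isfile(candidate)

    def dispatch(path):
        return 'serve the file' if is_static(path) else 'run the application'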

11. A single resource should have a single (normalized) URL

Flexible URL schemes often provide multiple paths to a single resource, leading to problems in unambiguously identifying resources. Search engines, for example, may not be able to conclude that two different URLs point to the same resource, usually leading to a reduced ranking for both pages. If multiple paths to a single resource are possible, an unambiguous normalization method (often called ‘URL canonicalization’) should be provided.
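
As a sketch of what such a normalization step could look like (assuming, purely for illustration, that the canonical form is lowercase without a trailing slash), non-canonical requests would simply be answered with a permanent redirect:

    def canonicalize(path):
        # Lowercase and strip the trailing slash; '/' itself stays untouched.
        return path.lower().rstrip('/') or '/'

    def maybe_redirect(path):
        canonical = canonicalize(path)
        if canonical != path:
            return ('301 Moved Permanently', canonical)
        return None   # already canonical, serve the resource

    print(maybe_redirect('/Articles/My-First-Post/'))
    # -> ('301 Moved Permanently', '/articles/my-first-post')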

12. Support the features of “standard” global filenames like robots.txt, sitemap.xml, favicon.ico

Due to very unfortunate historical developments, some “standards” have emerged (none of them is actually a standard accepted by any official standardization authority) that rely on the presence of static files in the URL hierarchy of a server. Since these “standards” provide convenient or important functionality for many web applications, the URL scheme must either support these filenames by ensuring that no other resource can exist with an identical URL, or we need to find ways to support the features provided by these files through other means.
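
The first option amounts to reserving these names at the root of the hierarchy so that no generated resource can shadow them, roughly like this (illustrative sketch):

    RESERVED_ROOT_NAMES = {'robots.txt', 'sitemap.xml', 'favicon.ico'}

    def is_valid_resource_name(name, at_root=True):
        # Refuse to create resources that would collide with the well-known files.
        return not (at_root and name in RESERVED_ROOT_NAMES)

    print(is_valid_resource_name('robots.txt'))      # False, the name is reserved
    print(is_valid_resource_name('my-first-post'))   # True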

This concludes the requirements I have identified for a perfect URL scheme. Originally, I was planning for this to be part of a series of blog posts, discussing the constraints derived from the URL and HTTP specifications, a concrete proposed URL scheme for a web application I was planning at the time, and implementation issues, but I never got around to writing those ;).

Footnotes

1 - Let me state in advance that when I say “URL scheme” here, I do not mean to propose a new scheme outside the scope of what HTTP already supports. My intention is to find a way to structure the path component (everything that comes after the server name and port in the URL) in a way that reflects the needs of modern web applications.

2 - I have to admit I have never quite grasped what REST really is or why everyone is talking about it. As far as I understand, it is simply the concept of a stateless API where every call is an HTTP request, and I have the vague idea that the URL scheme proposed here goes along well with it. If anyone has any further insights, please enlighten me! (And yes, I have skimmed through the dissertation.)

3 - The Accept header is discussed in detail in the HTTP/1.1 specification (RFC2616), section 14.