URLs

A URL, short for "Uniform Resource Locator," is a compact string of characters identifying an abstract or physical resource. It has these five parts, which may be optional or disallowed depending on the context:

PartsDiagram

Each part’s syntax is defined by a set of production rules in rfc3986. All valid URLs conform to this grammar, also called the "generic syntax." Here is an example URL which describes a file and its location on a network host:

https://www.example.com/path/to/file.txt?userid=1001&pages=3&results=full#page1

The parts and their corresponding text are as follows:

Part       Text
scheme     "https"
authority  "www.example.com"
path       "/path/to/file.txt"
query      "userid=1001&pages=3&results=full"
fragment   "page1"

The production rule for the example above is called a URI, which can contain all five parts. The specification using ABNF notation is:

URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty

In this notation, the square brackets ("[" and "]") denote optional elements, quoted text represents character literals, and slashes indicate a choice among several alternatives. For the complete specification of ABNF notation please consult rfc2234, "Augmented BNF for Syntax Specifications."

When using this library to process or create URLs, it is necessary to choose which of these top-level production rules applies to a given use case: absolute-URI, origin-form, relative-ref, URI, or URI-reference. These are discussed in greater depth later.
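As a rough illustration of how the generic syntax divides a string into the five parts, the sketch below applies the regular expression given in Appendix B of rfc3986, using only the C++ standard library. It is not how this library parses URLs and it performs no validation; the names url_parts and split_url are invented for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the five parts; not a type from this library.
struct url_parts
{
    std::string scheme;
    std::string authority;
    std::string path;
    std::string query;
    std::string fragment;
};

// Splits s using the regular expression from rfc3986 Appendix B.
// Parts that are absent come back as empty strings, so this sketch
// cannot tell "absent" apart from "present but empty".
url_parts split_url(const std::string& s)
{
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    url_parts p;
    std::smatch m;
    if (std::regex_match(s, m, re))
    {
        p.scheme    = m[2].str();
        p.authority = m[4].str();
        p.path      = m[5].str();
        p.query     = m[7].str();
        p.fragment  = m[9].str();
    }
    return p;
}
```

Applied to the example URL above, split_url yields the same five strings listed in the table of parts.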

Scheme

The most important part is the scheme, whose production rule is:

scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
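Read as code, the rule says that a scheme starts with a letter and continues with letters, digits, "+", "-", or ".". The following is a minimal sketch of that check using only the standard library; is_valid_scheme and its helpers are hypothetical names, not functions provided by this library.

```cpp
#include <cstddef>
#include <string>

// ALPHA in the ABNF: an ASCII letter.
static bool is_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// DIGIT in the ABNF: an ASCII decimal digit.
static bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

// Returns true if s matches
//   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
bool is_valid_scheme(const std::string& s)
{
    // The leading ALPHA is mandatory, so an empty string never matches.
    if (s.empty() || !is_alpha(s[0]))
        return false;
    for (std::size_t i = 1; i < s.size(); ++i)
    {
        char c = s[i];
        if (!(is_alpha(c) || is_digit(c) || c == '+' || c == '-' || c == '.'))
            return false;
    }
    return true;
}
```

Under this check "https" and "svn+ssh" are accepted, while "1http" and the empty string are rejected.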

The scheme, which some informal texts incorrectly refer to as "protocol", defines how the rest of the URL is interpreted. Public schemes are registered and managed by the Internet Assigned Numbers Authority (IANA). Here are some registered schemes and their corresponding specifications:

Scheme Specification

http

magnet

mailto

payto

telnet

urn

Private schemes are possible, defined by organizations to enumerate internal resources such as documents or physical devices, or to facilitate the operation of their software. These are not subject to the same rigor as the registered ones; they can be developed and modified by the organization to meet specific needs with less concern for interoperability or backward compatibility. Note that private does not imply secret; some private schemes such as Amazon’s "s3" have publicly available specifications and are quite popular. Here are some examples:

Scheme Specification

app

odbc

slack

In some cases the scheme is implied by the surrounding context and therefore omitted. Here is a complete HTTP/1.1 GET request for the target URL "/index.htm":

GET /index.htm HTTP/1.1
Host: www.example.com
Accept: text/html
User-Agent: Beast

The scheme of "http" is implied here because the context is already an HTTP request. The production rule for the URL in the request above is called origin-form, defined in the HTTP specification as follows:

origin-form    = absolute-path [ "?" query ]

absolute-path  = 1*( "/" segment )
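To show what the origin-form rule permits, here is a small sketch that separates a request target into its path and optional query. It only checks the overall shape (a leading slash, no fragment) and does not validate individual characters; origin_form and split_origin_form are names made up for this example and are not the library's own parsing functions.

```cpp
#include <string>

// Hypothetical result type for this sketch.
struct origin_form
{
    std::string path;
    std::string query;
    bool has_query = false;
};

// Splits target according to
//   origin-form = absolute-path [ "?" query ]
// Returns false if the target does not start with "/" or contains "#",
// since origin-form has no scheme, no authority, and no fragment.
bool split_origin_form(const std::string& target, origin_form& out)
{
    if (target.empty() || target[0] != '/')
        return false;
    if (target.find('#') != std::string::npos)
        return false;

    out = origin_form{};
    std::string::size_type q = target.find('?');
    if (q == std::string::npos)
    {
        out.path = target;
    }
    else
    {
        out.path = target.substr(0, q);
        out.query = target.substr(q + 1);
        out.has_query = true;
    }
    return true;
}
```

For the request shown earlier, the target "/index.htm" is accepted with no query, whereas a string such as "index.htm" is rejected because it lacks the leading slash.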

All URLs have a scheme, whether it is explicit or implicit. The scheme determines what the rest of the URL means.

Here are some more examples of URLs using various schemes (and one example of something that is not a URL):

URL                               Notes
https://www.boost.org/index.html  Hierarchical URL with https scheme. Resource in the HTTP protocol.
ftp://host.dom/etc/motd           Hierarchical URL with ftp scheme. Resource in the FTP protocol.
urn:isbn:045145052                Opaque URL with urn scheme. Identifies isbn resource.
mailto:person@example.com         Opaque URL with mailto scheme. Identifies e-mail address.
index.html                        URL reference. Missing scheme and authority.
www.boost.org                     A Protocol-Relative Link (PRL). Not a URL.

Authority

The authority determines how a resource can be accessed. It contains two parts: the userinfo that holds identity credentials, and the host and port which identify a communication endpoint having dominion over the resource described in the remainder of the URL. This is the ABNF specification for the authority part:

authority   = [ user [ ":" password ] "@" ] host [ ":" port ]

The combination of user and optional password is called the userinfo.
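The following sketch walks through the authority rule by hand, using only the standard library: everything before "@" is the userinfo, and the remainder is the host with an optional port. It assumes the input is already a syntactically valid authority and does no validation or percent-decoding; authority_parts and split_authority are hypothetical names, not part of this library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct authority_parts
{
    bool has_userinfo = false;
    std::string user;
    bool has_password = false;
    std::string password;
    std::string host;      // always defined, possibly empty
    bool has_port = false;
    std::string port;
};

// Splits s according to
//   authority = [ user [ ":" password ] "@" ] host [ ":" port ]
authority_parts split_authority(const std::string& s)
{
    authority_parts a;
    std::string rest = s;

    // Optional userinfo ends at the "@".
    std::string::size_type at = s.find('@');
    if (at != std::string::npos)
    {
        a.has_userinfo = true;
        std::string userinfo = s.substr(0, at);
        std::string::size_type colon = userinfo.find(':');
        a.user = userinfo.substr(0, colon);
        if (colon != std::string::npos)
        {
            a.has_password = true;
            a.password = userinfo.substr(colon + 1);
        }
        rest = s.substr(at + 1);
    }

    // A bracketed host (IPv6 or IPvFuture) may contain ":" itself,
    // so the host ends at the closing bracket in that case.
    std::size_t host_end;
    if (!rest.empty() && rest[0] == '[')
    {
        std::string::size_type close = rest.find(']');
        host_end = (close == std::string::npos) ? rest.size() : close + 1;
    }
    else
    {
        std::string::size_type colon = rest.find(':');
        host_end = (colon == std::string::npos) ? rest.size() : colon;
    }
    a.host = rest.substr(0, host_end);

    // Optional port follows a ":" directly after the host.
    if (host_end < rest.size() && rest[host_end] == ':')
    {
        a.has_port = true;
        a.port = rest.substr(host_end + 1);
    }
    return a;
}
```

For example, "user:pass@www.example.com:8080" splits into user "user", password "pass", host "www.example.com", and port "8080", while an empty authority produces an empty host and nothing else.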

AuthorityDiagram

Some observations:

  • The use of the password field is deprecated.

  • The authority always has a defined host field, even if empty.

  • The host can be a name, or an IPv4, an IPv6, or an IPvFuture address.

  • All but the port field use percent-encoding to escape delimiters.
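The last observation refers to percent-encoding, in which an octet is written as "%" followed by two hexadecimal digits so that a delimiter such as "@" or ":" can appear inside a field. Below is a minimal decoding sketch using only the standard library; pct_decode and hex_value are hypothetical helpers for illustration and are not this library's decoding interface.

```cpp
#include <cstddef>
#include <string>

// Returns the value of a hexadecimal digit, or -1 if c is not one.
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes every "%XX" escape in `in` and writes the result to `out`.
// Returns false if a "%" is not followed by two hexadecimal digits.
bool pct_decode(const std::string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '%')
        {
            out.push_back(in[i]);
            continue;
        }
        // Two more characters are required after the "%".
        if (i + 2 >= in.size())
            return false;
        int hi = hex_value(in[i + 1]);
        int lo = hex_value(in[i + 2]);
        if (hi < 0 || lo < 0)
            return false;
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2; // skip the two digits just consumed
    }
    return true;
}
```

With this helper, the encoded user "john%40doe" decodes to "john@doe", which shows how a user name can contain the "@" that would otherwise end the userinfo.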

The host subcomponent represents where resources are located.

Note that if an authority is present, the host is always defined even if it is the empty string (corresponding to a zero-length reg-name in the BNF).

url_view u( "https:///path/to_resource" );
assert( u.has_authority() );
assert( u.authority().buffer().empty() );
assert( u.path() == "/path/to_resource" );

The authority component also influences how we should interpret the URL path. If the authority is present, the path component must either be empty or begin with a slash.

Although the specification allows the format username:password, the password component should be used with care. Transferring password data through URLs is not recommended unless the password is an empty string, which indicates no password.

Containers

This library provides the following containers, which are capable of storing any possible URL:

  • url: A modifiable container for a URL.

  • url_view: A non-owning reference to a valid URL.

  • static_url: A URL with fixed-capacity storage.

These containers maintain a useful invariant: they always contain a valid URL. In addition, the library provides the authority_view container which holds a non-modifiable reference to a valid authority. An authority by itself is not a valid URL.

In the sections that follow we describe the mechanisms used to parse strings using various specific grammars, followed by the interface for inspecting and modifying each of the main parts of the URL. Finally we discuss important algorithms available to use with URLs.