How You Are Reading This Page

Andrew Stephens, Sunday the 22nd of October, 2017 in Computing, Popular or Notable

We use the Internet every day and most of us take it for granted that you click on a link and your computer (or your phone, or tablet, or console; all these devices are just computers in disguise) displays the new page for you to read. But how does that work? What happens to get a web page from wherever it is stored to your eyes?

The short answer that you have probably read before is that the page is transferred from a web server to your browser over the internet. This is true but what does it mean?

It would be a lifetime's work to explain every facet of the process, but what follows is, broadly speaking, the series of steps taken for you to see this page.

The Internet

First things first; some quick definitions:

The Internet is a worldwide network of computers that can pass information to each other. If your device is "on the Internet" it can send information to, and receive information from, any other public computer on the Internet. (Here I am ignoring firewalls that restrict what can be sent, and NAT devices, which complicate the story somewhat.) Information is transferred in small blocks called packets that are sent to the destination computer's IP address. (The IP just stands for Internet Protocol.)

A packet works like a postcard - it has a sender, a destination, and a short message. The Internet works much like a postal service in this analogy. When you send a postcard you write your return address and the destination address along with your message and put it in a postbox. The card gets picked up and taken to a depot where the destination address is looked at. The card is then placed on an appropriate vehicle and taken to a depot that is closer to the destination. This process might be repeated several times if the destination is in another country. Eventually the card will be delivered to the destination, having passed through several mail centers on the way.

The Internet works the same way. Your computer will send a packet to the closest computer, which will examine the destination address and send the packet onwards. A packet can go through dozens of intermediate computers before reaching the destination IP address. (These intermediate computers are called routers. If you are connected through a residential ISP, the little box with the blinking lights they gave you is a router, and it is the first port of call for packets sent from your computer.)

The IP address of a computer is basically like a street address that postcards are sent to. It is written as a set of 4 numbers separated by periods. For instance, the IP address of the server that hosts this website is 45.55.153.122. (Technically what I describe here is an IPv4 address. There is a newer addressing system, IPv6, which looks like this: 2607:f8b0:4007:804::200e. For the purposes of this document they work the same way. I am also ignoring private addresses behind NATs.)
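
To make the numbers concrete, here is a minimal sketch using the ipaddress module from Python's standard library. It is only an illustration: it shows that the dotted form is a friendlier way of writing a single 32-bit number, and that the newer IPv6 form is a much larger 128-bit number. The two addresses are the ones quoted in this article.

```python
import ipaddress

server = ipaddress.ip_address("45.55.153.122")
newer = ipaddress.ip_address("2607:f8b0:4007:804::200e")

# Each of the four numbers is one byte (0-255), so the whole address is 32 bits.
octets = list(server.packed)
as_number = int(server)

print(server.version, octets, as_number)
print(newer.version, newer.max_prefixlen)  # max_prefixlen is the address size in bits
```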

Postcards get delivered whether they are transferred by foot, horse, boat, truck or airplane. Depending on how you connect to the Internet and the destination, a packet might be transferred via a physical wire, a radio broadcast over your wireless network, a radio link to your nearest cell-phone tower or orbiting satellite, or an undersea fibre-optic cable. Most likely a single packet will end up being transferred in multiple ways before it gets to where it needs to be. (Packets sent over the Internet are often broken up and reassembled before delivery, something that doesn't usually happen to postcards. Packets can also arrive out of order or even be delivered twice. The postcard analogy only goes so far.)

And that is all the Internet is. All the cool stuff we do over the Internet is built on this foundation.

The Web Browser and the URL

A Web Browser is the software that requests and displays web pages like the one you are viewing now.

The web browser displaying a page. The URL is visible near the top.

Internet Explorer, Chrome (pictured), and Safari are all web browsers and they all perform the same basic functions. When you click a link, the web browser uses the URL (Uniform Resource Locator) to make the request. The URL points to a specific page; in most browsers you can see the URL being requested in the location bar at the top of the page.

For instance, the URL of the page you are reading is:

The (simplified) anatomy of a URL showing the different sections

The URL is made up of several different sections:

The URL contains everything needed to make a request over the Internet, but first your browser must look up the domain name to find the IP address it needs to send the request to. It is time to talk about the Domain Name System.
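
As a rough illustration of how software takes a URL apart, the sketch below uses urlsplit from Python's standard library. The URL in it is an assumption: it is assembled from the scheme, authority and path that appear in the example request later in this article, and may not match the address in your location bar exactly.

```python
from urllib.parse import urlsplit

# Assumed URL, pieced together from the example request later in the article.
url = "https://sheep.horse/2017/10/how_you_are_reading_this_page.html"

parts = urlsplit(url)
print(parts.scheme)    # how to talk to the server
print(parts.hostname)  # the domain name that has to be looked up
print(parts.path)      # which page to ask the server for
```

Only the domain name is needed for the lookup described next; the scheme and path come into play once a connection to the server exists.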

Domain Name System (DNS)

Remember that addresses on the Internet are just quartets of numbers (172.217.12.174, for example), but URLs typically contain a human-readable domain name such as google.com or sheep.horse. The Domain Name System consists of computers on the Internet programmed to act as name servers, which can look up domain names like you might look up a name in an old-fashioned phone book.

Take sheep.horse for example. To find the IP address, your browser first sends a packet to the name server that your computer is configured to use. (Normally this will be set by your ISP if you are using a home network, or your cellular data provider if you are using your phone. If you are at work or school, they might have their own local name server.) That name server doesn't know anything about sheep.horse but it does know the addresses of the root name servers, so it resends the query in a packet addressed to one of them. The root name server doesn't know anything about sheep.horse either but it does know the address of the name server that knows about the .horse domain, so it sends that back.

Block diagram showing the parties involved in the DNS query process

Your local name server then sends a query to the .horse name server asking for sheep.horse again. The .horse name server doesn't know either, but it does know the authoritative name server for sheep.horse because I pay money to the owners of the .horse name server to register that domain name with them. Your local name server gets the address of this authoritative name server and performs one final query to it for sheep.horse. The authoritative name server knows the address and returns the result. In this case the answer is 45.55.153.122. (The authoritative name server is not unique to sheep.horse; it stores the addresses for lots of web sites. It is run by the company I pay to host the computer that serves this web site.)

(Again, this description is a simplified overview of the process. In reality there are multiple levels of caching going on, domains can have more than one address, and requests and responses can be too large to fit into a single packet. DNS requests always follow this basic process though.)

The whole process takes a fraction of a second. Once your local name server has seen the answer to the query it will store it locally for a period of time so that subsequent queries for the address of sheep.horse will be even quicker.
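
The chain of queries can be sketched as a toy model in Python. Nothing here touches the network: each "name server" is just a dictionary, and the server names (horse-tld-server, sheep-authoritative-server) are made up for the illustration. Only the final address comes from this article.

```python
# Toy model of the lookup described above; all server names are hypothetical.
ROOT = {"horse": "horse-tld-server"}
SERVERS = {
    "horse-tld-server": {"sheep.horse": "sheep-authoritative-server"},
    "sheep-authoritative-server": {"sheep.horse": "45.55.153.122"},
}

cache = {}  # what the local name server remembers between queries


def resolve(domain):
    """Return (address, number of queries the local name server sent)."""
    if domain in cache:
        return cache[domain], 0

    queries = 1  # ask a root server who handles the top-level domain
    tld = domain.rsplit(".", 1)[-1]
    tld_server = ROOT[tld]

    queries += 1  # ask the top-level domain server who is authoritative
    authoritative = SERVERS[tld_server][domain]

    queries += 1  # ask the authoritative server for the address itself
    address = SERVERS[authoritative][domain]

    cache[domain] = address
    return address, queries


first = resolve("sheep.horse")
second = resolve("sheep.horse")  # answered from the cache, no queries needed
print(first, second)
```

In real programs none of this is done by hand; in Python, for example, socket.getaddrinfo asks the operating system to perform the lookup.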

Now that the browser has the IP address of sheep.horse it is time to actually make the request. But there are a few more things to explain first.

TCP

The Internet works with packets and packets only contain very short messages. Web pages are huge by comparison - they will never fit. What we need is a system of sending large amounts of data via very short packets.

Going back to the postcard analogy, imagine you wanted to send a short novel to your friend on the other side of the world but all you had access to was a stack of postcards and a lot of stamps. What you need is a system to pack the contents of the novel into the postcards in a way that your friend could reassemble the whole novel on the other side.

You could start by copying the first paragraph to a postcard and sending it, then the second paragraph to a second postcard, numbering each postcard so your friend knows which order to reassemble them in. This would be all you needed if the postal service was infallible but in reality some postcards are likely to go missing en route. (TCP is, of course, more complicated than this, mainly because the simplified system outlined here would not perform well in the real world. I have also skipped over the 3-way handshake that initiates a TCP connection. It is all just packets though.)

However, since you numbered your postcards, your friend knows which ones went missing and could send postcards back to you saying "Hey, could you resend postcards #263, #264 and #265? They never arrived." Once you resend these postcards your friend has the whole novel.

Or, stretching the analogy slightly, he might send back a postcard saying "Hey, stop sending these postcards for a few days, I can't read that fast".

Eventually at the end of the process he will send back an acknowledgement saying "Thanks, I received all 2873 postcards and I read all the way to the end. What else do you have?". The important thing is that the two of you have established a way of sending large streams of information reliably using only small unreliable postcards.

The Internet version of this system is called TCP (Transmission Control Protocol) and it works in pretty much the same way. It allows computers to set up a two-way communication channel that allows large amounts of information to flow back and forth. This channel is still operating using packets but the software using the channel can act as though the packets don't exist; all it sees is the reassembled information.

Most things that get transferred over the Internet are sent via TCP, and this web page was no exception.
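
The numbered-postcard scheme can be sketched in a few lines of Python. This is the simplified system from the analogy, not real TCP: the function names, the loss rate and the fixed random seed are all choices made for the illustration, and the receiver is assumed to know how many postcards to expect.

```python
import random


def split_into_segments(data, size):
    """Number each chunk so the receiver can put them back in order."""
    return {n: data[i:i + size] for n, i in enumerate(range(0, len(data), size))}


def unreliable_send(segments, wanted, rng, loss_rate=0.3):
    """Deliver the requested segments, losing some at random and shuffling the rest."""
    delivered = [(n, segments[n]) for n in wanted if rng.random() > loss_rate]
    rng.shuffle(delivered)
    return delivered


def transfer(data, size=8, seed=1):
    """Send data over the lossy channel, resending until everything has arrived."""
    rng = random.Random(seed)
    segments = split_into_segments(data, size)
    received = {}
    rounds = 0
    missing = sorted(segments)
    while missing:
        rounds += 1
        # The first round sends everything; later rounds resend only what
        # the receiver reports as missing.
        for number, chunk in unreliable_send(segments, missing, rng):
            received[number] = chunk
        missing = [n for n in segments if n not in received]
    reassembled = b"".join(received[n] for n in sorted(received))
    return reassembled, rounds


novel = b"A short novel, sent one small postcard at a time."
copy, rounds = transfer(novel)
print(copy == novel, rounds)
```

The caller of transfer never sees the individual segments, the losses or the reordering, which is the point: the software using the channel only sees the reassembled data.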

Making the Connection to the Web Server

Your browser has looked up the IP address of sheep.horse's web server (45.55.153.122) and has a means of communication (TCP). The web server is a computer that I rent specifically to run a program that just sits there 24/7 waiting for TCP connections from browsers like yours. (You don't get a whole computer for the $5 a month I pay; in reality this web site runs off a Virtual Machine - one of many running on a single physical computer with a split personality. It doesn't matter for the purposes of this document.) In case you are wondering, I believe it is physically located somewhere in New York; I have not gone to visit it.

There is a problem I haven't mentioned. TCP works over IP packets, and IP packets are just like postcards in yet another way - anyone who handles your postcards (a corrupt mail sorter, for instance) can read them. You don't want people to eavesdrop on your web browsing.

Even worse, somebody could manipulate the packets in order to make this page appear to say something I never wrote. Or they could intercept the packets your browser sent and answer themselves, sending back whatever they want. How can you be sure that you are reading the real sheep.horse?

None of this matters much for this web page but you don't want people listening in on your banking transactions and personal emails, or manipulating the information you send and read.

The solution to all these problems is yet another protocol called TLS (Transport Layer Security). TLS works on top of TCP and it ensures that anything transmitted over the connection is encrypted (to prevent eavesdroppers) and authenticated (preventing someone else pretending to be sheep.horse).

TLS works by having the server send back a tamperproof certificate that contains the name of the site (sheep.horse) and a key that can be used to encrypt data in such a way that only the server can decrypt it. (This is called the public key; the server has a corresponding secret private key that it never divulges. Data encrypted with the public key can only be read with the private key and vice versa. How does that work? The answer is maths - lots of maths.)

But if people can intercept and modify packets, how does the browser know that this certificate is legitimate and not somebody sending back a fake one? The certificate was created and signed by a Certificate Authority. A Certificate Authority is just a third-party organization that I managed to convince that I owned sheep.horse - they did some minimal amount of work to confirm that this was true and issued me with a certificate.

Your browser contains a built-in list of Trusted Certificate Authorities that it knows it can rely on. The certificate for sheep.horse was signed by an organization called Let's Encrypt, which is on this list, so your browser can use its built-in knowledge of how it expects a certificate signed by Let's Encrypt to look. If this fails then the browser will not continue with the request. (I am deliberately glossing over how certificate signing works. Suffice to say that the browser can detect that the certificate was issued by who it claims to have been issued by. The CA signs certificates with its own private key and the browser knows the CA's public key. Again, the answer is lots of maths.)

Since you are reading this, the certificate was found to be valid and the browser could be sure that it was talking to the real sheep.horse web server. Next the browser generates a random key, encrypts it so that only the sheep.horse web server can read it (this is done using the server's public key and the aforementioned lots of maths), and transmits it. Now both the browser and the web server have a secret key that is known only to them. From here on, all data transmitted between these two sides will be encrypted using this key.

This all may seem overly complicated but now we have a reliable (thanks to TCP), private, tamper-proof, and authenticated (via TLS) channel of communication built entirely on just the unreliable and easily manipulated packets that are sent over the insecure Internet. All this work has taken a fraction of a second and now your browser is in a position to actually request the page.
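
For a sense of how little of this a programmer normally has to write, here is a minimal sketch using Python's standard ssl module. It is not what your browser runs; browsers have their own TLS code and their own list of trusted authorities, while Python relies on the system's default set. The helper open_secure_connection is a name invented for this example, and it is defined but not called because it needs network access.

```python
import socket
import ssl

# The default context loads the system's trusted certificate authorities and
# switches on both checks described above.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # certificate must chain to a trusted CA
print(context.check_hostname)                    # certificate must name the site requested


def open_secure_connection(host, port=443):
    """Open a TCP connection and wrap it in TLS (requires network access)."""
    raw = socket.create_connection((host, port), timeout=10)
    return context.wrap_socket(raw, server_hostname=host)
```

Calling open_secure_connection("sheep.horse") would perform the handshake; if the certificate could not be verified, wrap_socket would raise an ssl.SSLError rather than return a connection.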

It has been a long road but we are nearly at the end.

Requesting the Page

With a secure and reliable connection, your browser can now request the page from the server. It does this using HTTP, a protocol that all web sites understand; this site uses version 2, HTTP/2. (HTTP stands for Hypertext Transfer Protocol. Older websites might use HTTP version 1, which is conceptually much the same.) First your browser sent a request to the server, similar to this one:

---HEADERS---
:method = GET
:scheme = https
:authority = sheep.horse
:path = /2017/10/how_you_are_reading_this_page.html
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: br, gzip, deflate
Accept-Language: en-us
DNT: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Safari/604.1.38

Each line of this request helps specify exactly what the browser wants back. In this case the browser wants to GET (the method) the data at /2017/10/how_you_are_reading_this_page.html (the path) from the website sheep.horse (the authority) using https (another name for HTTP over TLS). (I am skipping over vast parts of the HTTP/2 protocol. The requests and responses don't look exactly like this and in practice the responses are broken up into smaller sections. I have edited out some headers from the examples to keep things clearer.)

The rest of the lines make up the headers of a request, and specify how the browser expects the data to be returned, or information about the browser that the server can use to modify the response. For instance, the Accept-Language: header in this case is en-us, for US English, but if you are in France your browser might send fr-fr, for French. This would tell the web server that it should return the French version of the requested page. (sheep.horse is not clever enough to do this.)
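
A server that did want to honour Accept-Language might do something along these lines. This is a hedged sketch, not how any particular web server implements it: the function names are invented, the optional q= weights are a standard part of the header that the example request above does not happen to use, and only exact matches are considered (real negotiation would also let "fr" match "fr-fr").

```python
def parse_accept_language(header):
    """Turn 'fr-fr, en;q=0.8' into [('fr-fr', 1.0), ('en', 0.8)], best first."""
    choices = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        weight = 1.0
        params = params.strip()
        if params.startswith("q="):
            weight = float(params[2:])
        choices.append((tag.strip().lower(), weight))
    # sorted() is stable, so equal weights keep the order the browser sent.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


def choose_language(header, available, default="en-us"):
    """Pick the most preferred language that the site actually has."""
    for tag, weight in parse_accept_language(header):
        if weight > 0 and tag in available:
            return tag
    return default


site_languages = {"en-us", "fr-fr"}  # hypothetical: languages a site offers
print(choose_language("en-us", site_languages))
print(choose_language("fr-fr, en-us;q=0.5", site_languages))
```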

The web server saw this request and sent back the following response:

---HEADERS---
:status = 200
ETag: W/"59beb7fe-18fd"
Content-Type: text/html
Date: Sat, 07 Oct 2017 18:09:25 GMT
Last-Modified: Sun, 17 Sep 2017 17:59:26 GMT
Server: nginx/1.10.3 (Ubuntu)
Expires: Sat, 07 Oct 2017 19:09:25 GMT
Content-Encoding: gzip
Cache-Control: max-age=3600
Strict-Transport-Security: max-age=15768000
---DATA---
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>How You Are Reading This Page</title>
  <meta name="author" content="Andrew Stephens" />
  <meta property="og:url"                content="https://sheep.horse/2017/9/how_you_are_reading_this_page.html" />
.
.
.

The first line is the status code, in this case 200, which simply means that the server has found what the browser requested and is sending the data. (You may be familiar with the infamous 404 status code, which is returned when the server can't find the requested information; usually this means that somebody mistyped a URL.)

The response also contains some other headers; these are mainly concerned with how the data is returned and for how long the browser should cache the response.
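
The caching headers in the example response can be read with a few lines of Python. This is only a sketch of the arithmetic, using header values copied from the response above; max_age_seconds is a helper name made up for the illustration, and a real browser weighs several more rules than this.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the example response.
headers = {
    "Date": "Sat, 07 Oct 2017 18:09:25 GMT",
    "Expires": "Sat, 07 Oct 2017 19:09:25 GMT",
    "Cache-Control": "max-age=3600",
}


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            return int(value)
    return None


sent = parsedate_to_datetime(headers["Date"])
expires = parsedate_to_datetime(headers["Expires"])
lifetime = (expires - sent).total_seconds()

print(max_age_seconds(headers["Cache-Control"]), lifetime)
```

Both headers say the same thing in different ways: the browser may reuse its copy of the page for an hour before asking the server again.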

Finally, after the headers comes the actual data, in the form of an HTML (Hypertext Markup Language) page. This consists of structured text that contains the contents of this page interspersed with tags that tell the browser how the text should be displayed. You can look at the HTML source to this page right in the browser by right-clicking on the page and selecting View Page Source from the menu. (You probably won't be able to do this if you are using a mobile device.)

But web pages don't just consist of text. Within the HTML source code to this page are references to additional image files that the browser needs to download and insert into the page as it is displayed. Each image has its own URL that your browser uses to make another request back to the server. To speed things up, your browser will reuse the same TLS connection so that the overhead of validating the web server's certificate does not have to be repeated.

It is not unusual for the browser to have to make hundreds of requests back to the original server before everything on the page can be displayed. It is also possible for a page to include a resource that exists on a completely different server, in which case the browser has to make a whole new connection. Sheep.horse is a very simple website; on other sites your browser could be keeping track of dozens of simultaneous connections.
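
Finding those extra files is a matter of scanning the HTML for tags that point elsewhere. The sketch below does this with Python's standard html.parser module. The HTML fragment, the page URL and the example.com address are all made up for the illustration; the real page's markup and file locations may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceFinder(HTMLParser):
    """Collect the URLs of images and stylesheets referenced by a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            # Relative references are resolved against the page's own URL.
            self.resources.append(urljoin(self.page_url, attributes["src"]))
        elif tag == "link" and attributes.get("rel") == "stylesheet" and attributes.get("href"):
            self.resources.append(urljoin(self.page_url, attributes["href"]))


# A made-up fragment of HTML, not the real source of this page.
sample = """
<link rel="stylesheet" href="/tufte.css">
<p>Some text <img src="dns.svg" alt="DNS diagram"></p>
<img src="https://example.com/elsewhere.png">
"""

finder = ResourceFinder("https://sheep.horse/2017/10/how_you_are_reading_this_page.html")
finder.feed(sample)
print(finder.resources)
```

The first two results point back at the same server and could reuse the existing connection; the third names a different server, so fetching it would require a new one.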

Table showing the files that were downloaded by your browser to show this page. This is actually a short list - it is not unusual for a single page to require hundreds of files.

File Name | Purpose
how_you_are_reading_this_page.html | The HTML source for the page
tufte.css | Stylesheet that contains general instructions on how the page should look (colors, etc)
view_page_source.png, 404_error.png, dns.svg, uri.svg, webbrowser.png | Each image in this document is stored in a separate file
et-book-blod-line-figures.woff, et-book-roman-old-style-figures.woff, et-book-display-italic-old-style-figures.woff, et-book-roman-line-figures.woff | The stylesheet specifies that the text be displayed using certain typefaces; these need to be downloaded as well

Conclusion

I hope this gave you some understanding of the enormous complexity that underlies viewing a webpage - something that we all do hundreds of times a day. I did not even go into some other important details such as how exactly your browser lays out what you see on that page - itself a very involved process.

I sometimes curse when a web page fails to load properly but the fact that the Internet works at all is something of a modern miracle.