
tex: add stuff on browsers

also mention Gemini (https://gemini.circumlunar.space/)
leo 2023-05-23 00:03:47 +02:00
parent dd8eb1d3c5
commit f7db0cb375
Signed by: wanderer
SSH Key Fingerprint: SHA256:Dp8+iwKHSlrMEHzE3bJnPng70I7LEsa3IJXRH/U+idQ
3 changed files with 196 additions and 12 deletions

@@ -13,11 +13,15 @@ PID & Process ID \\
Cgroup & Control group \\
TLS & Transport Layer Security \\
TCP & Transmission Control Protocol \\
SSH & Secure Shell \\
DNS & Domain Name System \\
ISP & Internet Service Provider \\
GPG & GNU Privacy Guard \\
GNU & GNU's Not Unix! \\
CSS & Cascading Style Sheets \\
API & Application Programming Interface \\
CLI & Command Line Interface \\
SCM & Source Code Management \\
HIBP & Have I Been Pwned \\

@@ -162,7 +162,7 @@
title = {A simple, modern and secure encryption tool (and Go library) with small explicit keys, no config options, and UNIX-style composability.},
author = {Filippo Sottile and Ben Cox and age contributors},
year = 2021,
note={{Available from: \url{https://github.com/FiloSottile/age}. [viewed 2023-05-23]}}
}
@misc{x25519rfc7748,
@@ -186,7 +186,49 @@
publisher = "GitHub",
howpublished = {[online]},
year = "2007",
note={{Available from: \url{https://github.com/504ensicsLabs/LiME}. [viewed 2023-05-23]}},
}
@misc{wwwf,
howpublished = {[online]},
title = {History of the Web},
author = {{World Wide Web Foundation}},
year = 2021,
note={{Available from: \url{https://webfoundation.org/about/vision/history-of-the-web/}. [viewed 2023-05-23]}}
}
@misc{ddvweb,
howpublished = {[online]},
title = {What is this Gemini thing anyway, and why am I excited about it?},
author = {Drew DeVault},
year = 2020,
month = nov,
note={{Available from: \url{https://drewdevault.com/2020/11/01/What-is-Gemini-anyway.html}. [viewed 2023-05-23]}}
}
@misc{gemini,
howpublished = {[online]},
title = {Project Gemini},
author = {Solderpunk and Sean Conner and {{The Gemini Contributors}}},
year = 2019,
note={{Available from: \url{https://gemini.circumlunar.space/} and over Gemini from: \url{gemini://gemini.circumlunar.space/}. [viewed 2023-05-23]}}
}
@misc{geminispec,
howpublished = {[online]},
title = {Speculative Specification},
author = {Solderpunk and Sean Conner and {{The Gemini Contributors}}},
year = 2019,
note={{Available from: \url{https://gemini.circumlunar.space/docs/specification.gmi} and over Gemini from: \url{gemini://gemini.circumlunar.space/docs/specification.gmi}. [viewed 2023-05-23]}}
}
@misc{chromiumrootdns,
howpublished = {[online]},
title = {This well-intentioned Chrome feature is causing serious problems},
author = {Anthony Spadafora},
year = 2020,
month = aug,
note={{Available from: \url{https://www.techradar.com/news/this-well-intentioned-chrome-feature-is-causing-serious-problems}. [viewed 2023-05-23]}}
}
% =========================================================================== %

@@ -120,31 +120,169 @@ Entropy, dictionaries, multiple factors.
\n{1}{Web security}\label{sec:websecurity}

The internet, being the vast space of intertwined concepts and ideas, is a
superset of the Web, since not everything that is available on the internet
can be described as a web \emph{resource}. But it is precisely that part of
the internet that the next sections discuss, covering what browsers are, what
they do and how they relate to web security.

\n{2}{Browsers}\label{sec:browsers}
TODO: describe how browsers find out where the web page lives, get a webpage, TODO: describe how browsers find out where the web page lives, get a webpage,
parse it, parse stylesheets, run scripts, apply SAMEORIGIN restrictions etc. parse it, parse stylesheets, run scripts, apply SAMEORIGIN restrictions etc.
TODO: (privileged process running untrusted code on user's computer), history, TODO: (privileged process running untrusted code on user's computer), history,
present, security focus of the development teams, user facing signalling present, security focus of the development teams, user facing signalling
(padlock colours, scary warnings). (padlock colours, scary warnings).
Browsers, sometimes referred to by the fuller name that is a real tell for
their specialisation, \emph{web} browsers, are programs intended for
\emph{browsing} \emph{the web}. In more technical terms, browsers are programs
that facilitate (directly or via intermediary tools) domain name lookups,
connecting to web servers, optionally establishing a secure connection,
requesting the web page in question, determining its \emph{security policy},
resolving what accompanying resources the web page specifies and, depending on
the applicable security policy, requesting those from their respective
origins, applying stylesheets and running scripts. Constructing a program that
speaks many protocols and securely runs untrusted code from the internet is no
easy task.
\n{3}{Complexity}

Browsers these days are also quite ubiquitous programs, running on
\emph{billions} of consumer-grade mobile devices (which are also notorious for
bad update hygiene) and desktop devices all over the world. Regular users
expect them to work flawlessly under a multitude of network conditions and
scenarios (café WiFi, cellular data in a remote location, home broadband that
is DNS-poisoned by the ISP), with differently tuned (or commonly
misconfigured) web servers, with a combination of modern and \emph{legacy}
encryption schemes, and with differing levels of conformance to web standards
from both web server and website developers. Of course, if a website is
broken, in the eyes of the user it is the browser's fault. Browsers are also
expected to detect whether \emph{captive portals} (a type of access control
that usually tries to force the user through a webpage with terms of use) are
active and offer redirects. All of this amounts to immense complexity, and the
combination of ubiquity and great exposure this type of software gets is, in
the author's opinion, the cause behind the staggering number of
vulnerabilities found, reported and fixed in browsers every year.
\n{3}{Standardisation}

Over the years, a consortium of parties interested in promoting and developing
the web (also due to its potential as a digital marketplace, i.e.\ financial
incentives), together with the browser vendors (of which the most neutral
participant is perhaps \emph{Mozilla}, with Chrome being run by Google, Edge
by Microsoft and Safari/WebKit by Apple), have evolved a great volume of web
standards, which are also relatively frequently updated or deprecated and
replaced by revised or new ones, rendering browser maintenance essentially a
cat-and-mouse game.
It is the web's extensibility that enabled this build-up, and ironically it
has been proclaimed by some to be the web's greatest asset. It has also been
criticised~\cite{ddvweb} in the past, and the frustration with the status quo
of web standards has relatively recently prompted a group of people to create
``\textit{a new application-level internet protocol for the distribution of
arbitrary files, with some special consideration for serving a lightweight
hypertext format which facilitates linking between files}'':
Gemini~\cite{gemini}\cite{geminispec}. In the words of its authors, it can be
thought of as ``\textit{the web, stripped right back to its essence}'' or as
``\textit{Gopher, souped up and modernised just a little}'', depending upon
the reader's perspective, noting that the latter view is probably more
accurate.
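To give an impression of how radically stripped back the protocol is, below is
a sketch of a complete Gemini transaction as laid out in the
specification~\cite{geminispec}; the capsule name is illustrative. The client
opens a TLS connection to port 1965, sends a single CRLF-terminated line
containing the requested URL, and the server answers with a one-line header
(a two-digit status code, here 20 for success, and a MIME type) followed by
the body, then closes the connection:

\begin{verbatim}
C: (opens a TLS connection to example.org, port 1965)
C: gemini://example.org/<CR><LF>
S: 20 text/gemini<CR><LF>
S: # Welcome
S: => gemini://example.org/about.gmi  About this capsule
S: (closes the connection)
\end{verbatim}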
\n{3}{HTTP}

Originally, HTTP was designed just for fetching hypertext \emph{resources},
but it has evolved since then, particularly thanks to its extensibility, to
allow fetching all sorts of web resources a modern website of today provides,
such as scripts or images, or even to \emph{post} content back to servers.
HTTP relies on TCP (Transmission Control Protocol), one of the \emph{reliable}
protocols (reliability being mandated by HTTP) used to send data across
contemporary IP (Internet Protocol) networks, to deliver the data it requests
or sends. When Tim Berners-Lee invented the World Wide Web (WWW) in 1989 while
working at CERN (The European Organization for Nuclear Research), with the
rather noble intent of a ``\emph{wide-area hypermedia information retrieval
initiative to give universal access to a large universe of
documents}''~\cite{wwwf}, he also invented the HyperText Markup Language
(HTML) to serve as a formatting method for these new hypermedia documents. The
first website was written roughly the same way as today's websites are, using
HTML, although the markup language has changed since, with the current version
being HTML5.
It has been mentioned that the client \textbf{requests} a \textbf{resource}
and receives a \textbf{response}, so those terms should be defined. A request
is what the client sends to the server, a resource is what it requests, and a
response is the answer provided by the server. HTTP follows a classic
client-server model whereby it is \textbf{always} the client that initiates
the request.
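To make the terms concrete, the following is a minimal sketch of an HTTP/1.1
request as a browser might send it; the host \texttt{example.com} and the
user agent string are illustrative. The \texttt{Host} header is mandatory in
HTTP/1.1, since a single server may host multiple websites:

\begin{verbatim}
GET /index.html HTTP/1.1
Host: example.com
User-Agent: ExampleBrowser/1.0
Accept: text/html,application/xhtml+xml
\end{verbatim}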
A web page is, to be blunt, a chunk of \emph{hypertext}. To display a web
page, a browser first needs to send a request to fetch the HTML representing
the page, which is then parsed, and additional requests for sub-resources are
made. If a page defines layout information in the form of CSS, that is parsed
as well.
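As an illustration, parsing the following (hypothetical) page would trigger
two additional sub-resource requests, one for the stylesheet and one for the
script:

\begin{verbatim}
<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
    <link rel="stylesheet" href="/styles/main.css">
    <script src="/scripts/app.js" defer></script>
  </head>
  <body>
    <h1>Hello</h1>
  </body>
</html>
\end{verbatim}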
A web page needs to be present on the local computer \emph{before} it can be
parsed by the browser, and since websites are usually still served by programs
called \emph{web servers}, as in the \emph{early days}, that presents the
problem of how to tell the browser where the resource should be pulled from.
In today's browsers, the issue is solved (short of the CLI) by the
\emph{address bar}, a place into which the user types what they wish the
browser to fetch for them.
The formal name of this segment is a \emph{Uniform Resource Locator}, or URL,
and it contains the scheme (the protocol, such as \texttt{http://}), the host
address or a domain name, and a (TCP) port number.
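Schematically, using an illustrative domain name and a non-default port:

\begin{verbatim}
scheme://host:port/path
http://www.example.com:8080/index.html
\end{verbatim}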
Since a TCP connection needs to be established first, to connect to a server
whose URL contains only a domain name the browser needs to perform a domain
name \emph{lookup} using system facilities, or, as was the case for a couple
of notorious Chromium versions, send some additional and unrelated queries,
which (given the numbers of Chromium-based derivatives) ended up placing
unnecessary load directly on the root DNS servers~\cite{chromiumrootdns}.
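For an uncached name, the resolution proceeds roughly as sketched below (the
name is illustrative); in practice, caches at the stub and the recursive
resolver answer the vast majority of queries long before the root servers are
involved:

\begin{verbatim}
browser -> OS stub resolver -> recursive resolver (often ISP-run)
  -> root servers            (referral: servers for .com)
  -> .com TLD servers        (referral: servers for example.com)
  -> example.com nameservers -> A/AAAA record with the IP address
\end{verbatim}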
If a raw IP address and port combination is used, the browser attempts to
connect to it directly and requests the page, by default using the
\texttt{GET} \emph{method}. The \emph{well-known} HTTP port 80 is assumed
unless another port is explicitly specified, and the port can be omitted
whether the host is a domain name or an IP address.
The method is a way for the user agent to define what operation it wants to
perform: \texttt{GET} is used for fetching resources, while \texttt{POST} is
used to send data to the server, such as the values of an HTML form.
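A sketch of what submitting a (hypothetical) login form might look like on the
wire follows; the path, the field names and the \texttt{Content-Length} value
are illustrative:

\begin{verbatim}
POST /login HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 28

username=alice&submit=Log+in
\end{verbatim}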
A server response comprises a \textbf{status code}, a status message, HTTP
\textbf{headers} and an optional \textbf{body} containing the content. The
status code indicates whether the original request was successful, and the
browser is generally there to interpret these status codes for the user. There
are enough status codes to be confused by the sheer numbers but, luckily,
there is a method to the madness: they can be divided into groups/classes,
with an example status line for each class shown right after the list:
\begin{itemize}
\item 1xx: Informational responses
\item 2xx: Successful responses
\item 3xx: Redirection responses
\item 4xx: Client error responses
\item 5xx: Server error responses
\end{itemize}
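One real status line per class, as a browser would receive it (the class
annotations in parentheses are not part of the protocol):

\begin{verbatim}
HTTP/1.1 100 Continue              (informational)
HTTP/1.1 200 OK                    (successful)
HTTP/1.1 301 Moved Permanently     (redirection)
HTTP/1.1 404 Not Found             (client error)
HTTP/1.1 500 Internal Server Error (server error)
\end{verbatim}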
When the \emph{user agent} (a web \emph{client}) such as a browser receives a
response with content, it has to parse it. A header is additional information
sent along by both the server and the client.
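Putting the pieces together, a complete successful response could look as
follows (the body and the \texttt{Content-Length} value are illustrative): the
status line comes first, then the headers, then an empty line, then the body:

\begin{verbatim}
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 74

<!DOCTYPE html>
<html><head><title>Hi</title></head><body>Hi</body></html>
\end{verbatim}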
\n{2}{Cross-site scripting}\label{sec:xss}

\n{2}{Content Security Policy}\label{sec:csp}

Content Security Policy has been an important addition to the arsenal of
website operators, even though not everybody has necessarily been utilising it
properly or even taken notice. To understand what guarantees it provides and
what kind of protections it employs, it is first necessary to grok how
websites are parsed and displayed, which has been discussed in depth in the
previous sections.
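As a taste of what is to come, a possible \texttt{Content-Security-Policy}
response header is sketched below (wrapped here for presentation; the trusted
scripts host is illustrative). It restricts all resources to the page's own
origin by default, additionally allows scripts from one trusted host, and
forbids plugin content entirely:

\begin{verbatim}
Content-Security-Policy: default-src 'self';
    script-src 'self' https://scripts.example.com;
    object-src 'none'
\end{verbatim}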
\n{1}{Sandboxing}\label{sec:sandboxing}

\n{2}{User isolation}