tex: add extensive changes

2023-07-18 20:52:41 +02:00 · 39428c908a
commit 39428c908a
parent b8dcac8235
4 changed files with 950 additions and 794 deletions
--- a/graphics/pcmt-use-case.pdf
+++ b/graphics/pcmt-use-case.pdf
--- a/tex/appendices.tex
+++ b/tex/appendices.tex
@ -51,4 +51,70 @@ Blake3:\\
 SHA3-256:\\
 \texttt{66ebbdb20b5459360368d29615e6e80f36bcf464d5519ca08ae651f27a8970bf}\\

+
+\priloha{Whys}\label{appendix:whys}
+
+This appendix is concerned with explaining why certain technologies were used.
+
+\n{2}{Why Go}\label{appendix:whygo}
+
+First, a question of \textit{`Why pick Go for building a web application?'}
+might arise, so the following few lines will try to address that.
+
+Go~\cite{golang}, or \emph{Golang} for SEO-friendliness and to disambiguate it
+from Go the ancient game, is a strongly typed, high-level
+\emph{garbage-collected} language where functions are first-class citizens and
+errors are values.
+
+The appeal for the author comes from a number of features of the language, such
+as built-in support for concurrency and unit testing, sane \emph{zero} values,
+the absence of pointer arithmetic, inheritance and implicit type conversions,
+easy-to-read syntax and producing a statically linked binary by default. On
+top of that, the language has got a cute mascot. Thanks to the foresight of the
+Go Authors regarding \emph{the formatting question} (i.e.\ where to put the
+braces, \textbf{tabs vs.\ spaces}, etc.), most of the discussions on this topic
+have been pre-empted. Every \emph{gopher}\footnote{euph.\ a person writing in the
+Go programming language} is expected to format their source code with the
+official formatter (\texttt{gofmt}), which automatically ensures that the code
+adheres to the one formatting standard. Then, there is \emph{The Promise} of
+backwards compatibility for Go 1.x, which makes the language a good choice for
+long-term projects without the fear of being rug-pulled.
+
+
+\n{2}{Why Nix/devenv}\label{appendix:whynix}
+
+Nix (\url{https://builtwithnix.org/}) is a functional programming language
+resembling Haskell and a declarative package manager, which has been used in
+this project in the form of the \texttt{devenv} tool (\url{https://devenv.sh/})
+to create a \textbf{declarative} and \textbf{reproducible} development
+environment. The author has previously used Nix directly with \emph{flakes} and
+liked \texttt{devenv}, as it effectively exposed only a handful of parameters
+for configuration and removed the need to manage the full flake, which is of
+course still an option for people who choose to do so. See \texttt{devenv.nix}
+in the repository root.
+
+
+\priloha{Terminology}\label{appendix:terms}
+
+\n{2}{Linux}
+
+The term \emph{Linux} is exclusively used in the meaning of the
+Linux kernel~\cite{linux}.
+
+\n{2}{GNU/Linux}
+
+As far as a Linux-based operating system is concerned, the term ``GNU/Linux''
+as defined by the Free Software Foundation~\cite{fsfgnulinux} is used. While it
+is longer and arguably a little bit cumbersome, the author aligns with the
+opinion that this term more correctly describes its actual target. Being aware
+that there are many people who conflate the complete operating system with its
+(be it core) component, the kernel, the author is taking care to distinguish
+the two, although writing from experience, colloquially, this probably brings
+more confusion and a lengthy explanation is usually required.
+
+\n{2}{The program}
+
+By \emph{the program} or \emph{the application} without any additional context
+the author most probably means the Password Compromise Monitoring Tool program.
+
+
 % =========================================================================== %
--- a/tex/references.bib
+++ b/tex/references.bib
@ -179,7 +179,7 @@
@misc{age,
 	howpublished = {[online]},
 	title = {A simple, modern and secure encryption tool (and Go library) with small explicit keys, no config options, and UNIX-style composability.},
-	author = {Filippo Sotille and Ben Cox and age contributors},
+	author = {Filippo Valsorda and Ben Cox and age contributors},
 	year = 2021,
 	note={{Available from: \url{https://github.com/FiloSottile/age}. [viewed 2023-05-23]}}
 }
--- a/tex/text.tex
+++ b/tex/text.tex
@ -72,177 +72,130 @@ practices in an effort to build a maintainable and long-lasting piece of
 software that serves its users well. When deployed, it could provide real
 value.

+Terminology is located in Appendix~\ref{appendix:terms}; feel free to give it a
+read.

 % =========================================================================== %
 \part{Theoretical part}

-\n{1}{Terminology}
-
-\n{2}{Linux}
-
-The term \emph{Linux} is exclusively used in the meaning of the
-Linux kernel~\cite{linux}.
-
-
-\n{2}{GNU/Linux}
-
-As far as a Linux-based operating system is concerned, the term ``GNU/Linux''
-as defined by the Free Software Foundation~\cite{fsfgnulinux} is used. While it
-is longer and arguably a little bit cumbersome, the author aligns with the
-opinion that this term more correctly describes its actual target. Being aware
-there are many people that conflate the complete operating system with its (be
-it core) component, the kernel, the author is taking care to distinguish the
-two, although writing from experience, colloquially, this probably brings more
-confusion and a lengthy explanation is usually required.
-
-
-\n{2}{Containers}
-
-When the concept of \emph{containerisation} and \emph{containers} is mentioned
-throughout this work, the author has OCI containers~\cite{ocicontainers} in
-mind, which is broadly a superset of \emph{Linux Containers} where some set of
-processes is presented with a view of kernel resources (there are multiple
-kinds of resources, such as IPC queues; network devices, stacks, ports; mount
-points, process IDs, user and group IDs, Cgroups and others) that differs for
-each different set of processes, similar in thought to FreeBSD
-\emph{jails}~\cite{freebsdjails} with the distinction being that they are, of
-course, facilitated by the Linux kernel namespace
-functionality~\cite{linuxnamespaces}, which is in turn regarded to be
-\emph{inspired} by Plan 9's namespaces~\cite{plan9namespaces}, Plan 9 being a
-Bell Labs successor to Unix 8th Edition, discontinued in 2015.
-While there without a doubt \emph{is} specificity bound to using each of the
-tools that enable creating (Podman vs.\ Buildah vs.\ Docker BuildX) or running
-(ContainerD vs.\ runC vs.\ crun) container images, when describing an action
-that gets performed with or onto a container, the process should generally be
-explained in such a way that it is repeatable using any spec-conforming tool
-that is available and \emph{intended for the job}.
-
-\vspace*{-\baselineskip}
-\n{2}{The program}
-
-By \emph{the program} or \emph{the application} without any additional context
-the author usually means the Password Compromise Monitoring Tool program.
-
-
 \n{1}{Cryptography primer}\label{sec:cryptographyprimer}

 \n{2}{Encryption}

-Encryption is the process of transforming certain data, called a
-\emph{message}, using, as Aumasson writes in Serious Cryptography, ``an
-algorithm called a \emph{cipher} and a secret value called the
-key''~\cite{seriouscryptography}. Its purpose is to protect the said message so
-that only its intended recipients that know/hold the key are able to
-\emph{decipher} and read it.
-
-\n{3}{Symmetric encryption}
-
-Symmetric encryption is simply when the \emph{key} used is to facilitate both
-encryption and decryption operations.
-
-\n{3}{Asymmetric encryption}
-
-Asymmetric encryption is different from symmetric encryption in that there are
-now two keys in use - a key \emph{pair}. One part is used solely for
-encryption, while the other part's only purpose is to decrypt. This notion of
-two keys is generally transposed to a domain called \emph{public key
-cryptography}, whereby the decryption component is declared private and the
-encryption component is called \emph{public}, hence the name. The rationale is
-that everybody can encrypt messages \emph{for} the recipient but only they are
-able to \emph{decrypt} them, which is a feature allowed by the mathematical
-complementarity of the two components, and also explains why the private key
-should be kept \emph{private}. Compared to symmetric encryption, this variant
-is generally slower.
-
-\n{3}{The key exchange problem}
-
-Suppose a communication scheme that is protected by a pre-shared secret.
-In order to establish secure communications, this secret needs to be
-distributed to the other party via untrusted channels. In 1976 Whitfield Diffie
-and Martin Hellman published a paper in which they devised a \emph{public-key
-distribution scheme}, which allows the two parties to arrive at a shared secret
-by exchanging information via insecure channels with the presence of an
-eavesdropper. This scheme (or its variations) is in use to this day.
+\textbf{TODO:} add \emph{why} we care and how it's going to be used.

 \n{2}{Hash functions}

-Hash functions are cryptographic algorithms used to help with a number of
-things: integrity verification, password protection, digital signature,
-public-key encryption and others. Hashes are used in forensic analysis to prove
-authenticity of digital artifacts, to uniquely identify a change-set within
-revision-based source code management systems such as Git, Subversion or
-Mercurial, to detect known-malicious software by anti-virus programs or by
-advanced filesystems in order to verify block integrity and enable repairs, and
-also in many other applications that each person using a modern computing
-device has come across, such as when connecting to a website protected by the
-famed HTTPS.
+Hash functions are algorithms used to help with a number of things: integrity
+verification, password protection, digital signatures, public-key encryption and
+others. Hashes are used in forensic analysis to prove authenticity of digital
+artifacts, to uniquely identify a change-set within revision-based source code
+management systems such as Git, Subversion or Mercurial, to detect
+known-malicious software by anti-virus programs or by advanced filesystems in
+order to verify block integrity and enable repairs, and also in many other
+applications that each person using a modern computing device has come across,
+such as when connecting to a website protected by the famed HTTPS.

-The popularity stems from a common use case: the need to identify a chunk of
-data. Of course, two chunks of data, two files, frames or packets could always
-be compared bit by bit, but that can get prohibitive from both cost and energy
-point of view relatively quickly. That is when the hash functions come in,
-since they are able to take a long input and produce a short output, named a
-digest or a hash value. It also does not work the other way around, a file
-cannot be reconstructed from the hash digest, it is a one-way function.
+The popularity of hash functions stems from a common use case: the need to
+reliably identify a chunk of data in a simple way. Of course, two chunks of
+data, two files, frames or packets could always be compared bit by bit, but
+that can get prohibitive from both a cost and an energy point of view
+relatively quickly. That is when the hash functions come in, since they are
+able to take a long input and produce a short output, named a digest or a hash
+value. The function also only works one way.

-\n{3}{Rainbow tables}
+A file, or any original input data for that matter, cannot be reconstructed
+from the hash digest alone by somehow \emph{reversing} the hashing operation,
+since at the heart of any hash function there is essentially a compression
+function.

-As passwords are in more responsible scenarios stored not directly but as
-hashes, attackers that would be interested in recovering the passwords really
-only have one option (except finding a critical vulnerability in the hash
-function): rainbow tables. Rainbow tables are lists of pre-computed hashes
-paired with the passwords that were used to create them. When attackers gain
-access to a password breach that contains hashes, all it takes is to find a
-match within the rainbow table and reversely resolve that to the known
-message: the password.
+Most alluringly, hashes are frequently used with the intent of
+\emph{protecting} passwords by making those unreadable, while still being able
+to verify that the user knows the password and should therefore be authorised.

-One of the popular counter-measures to pre-computed tables is adding a
-\emph{salt} to the user-provided password before passing it to the KDF (Key
-Derivation Function) or the hash function. Of course, the salt should be random
-\textbf{per-user} and not reused, as that would mean that two users with the
-same password would still end up with the same hash, and the salt should also
-be adequately long to be effective. As the salt is supposed to be
-\emph{random}, it would be a good idea to use an actual CSPRNG, such as
-\textbf{Fortuna}~\cite{fortuna} as a source of entropy (randomness). In
-FreeBSD, Fortuna is in fact the one serving \texttt{/dev/random}.
+As the hashing operation is irreversible, once the one-way function produces a
+short digest, there is no way to reconstruct the original message from it.
+That is, unless the input of the hash function is also known, in which case all
+it takes is hashing the supposed input and comparing the digest with existing
+digests that are known to be digests of passwords.
+
+\textbf{TODO:} add more on \emph{why} we care and what types of hashes should be
+used (with refs) and why.


-\n{3}{TLS}\label{sec:tls}
+\n{3}{Types and use cases}
+
+Hash functions can be loosely categorised based on their intended use case into
+\emph{password protection hashes}, \emph{integrity verification hashes},
+\emph{message authentication codes} and \emph{cryptographic hashes}. Each of
+these possesses unique characteristics and using the wrong type of hash
+function for the job can potentially result in a security breach.
+
+As an example, consider \texttt{MD5}, a popular hash function built on the
+\emph{Merkle-Damgård} construction, and \texttt{BLAKE3}, which is built on a
+Merkle tree instead. While the former produces 128-bit digests, the latter by
+default outputs a 256-bit digest and can extend its output to a practically
+arbitrary length.
+
+There is a list of differences that could further be mentioned; however, they
+both have one thing in common: they are \emph{designed} to be \emph{fast}. The
+latter, as a cryptographic hash function, is conjectured to be \emph{random
+oracle indifferentiable} and secure against length extension, but it is also
+in fact faster than all of \texttt{MD5}, \texttt{SHA3-256}, \texttt{SHA-1} and
+even the \texttt{BLAKE2} family of functions.
+
+The use case of both is to (quickly) verify the integrity of a given chunk of
+data, in the case of \texttt{BLAKE3} with pre-image and collision resistance in
+mind, not to secure a password by hashing it first, which poses a big issue
+when used to\ldots{} secure passwords by hashing them first.
+
+Password hash functions such as \texttt{argon2} or \texttt{bcrypt} are good
+choices for securely storing hashed passwords, namely because they place CPU
+and memory burden on the host computing the digest, as well as limit potential
+parallelism, thus limiting the scale at which an exhaustive search could be
+launched. Additionally, both functions automatically \emph{salt} the passwords
+before hashing them, which means that two exact same passwords of two different
+users will not end up hashing to the same digest value, making it that much
+harder to recover the original, supposedly weak password.
+
+
+\n{3}{Why are hashes interesting}
+
+As already mentioned, since hashes are often used to store the
+representation of the password instead of the password itself, they become a
+subject of interest when they get leaked. There have been enough instances of
+leaked raw passwords that anyone with enough interest can put together a neat
+list of hashes of the most popular passwords.
+
+So while the service does not store plain text passwords, which is good, using
+a hashing function not designed to protect passwords does not offer much
+additional protection in case of weak passwords, which are the most commonly
+used ones.
+
+It seems logical that a service that is not using cryptographic primitives
+correctly is more likely to get hacked and have its users' passwords/hashes
+leaked. Then, the Internet ends up serving as a storage of every data dump,
+often exposing these passwords/hashes for everyone to access.
+
+
+\n{2}{TLS}\label{sec:tls}

 The Transport Layer Security protocol (or TLS) serves as as an encryption and
 \emph{authentication} protocol to secure internet communications. An important
-part of the protocol is the \emph{handhake}, during which the two communicating
+part of the protocol is the \emph{handshake}, during which the two communicating
 parties exchange messages that acknowledge each other's presence, verify each
 other, choose what cryptographic algorithms will be used and decide session
 keys. As there are multiple versions of the protocol in active duty even at the
 moment, the server together with the client need to agree upon the version they
-are going to use (these days it should be 1.2 or 1.3), pick cipher suites
-(TLSv1.3 dramatically reduced the number of available suites), the client
-verifies the server's public key (and the signature of the certificate
-authority that issued it) and they both generate session keys for use after
-handshake completion.
+are going to use (these days it is recommended to use either 1.2 or 1.3) and
+pick cipher suites; the client verifies the server's public key (and the
+signature of the certificate authority that issued it) and they both generate
+session keys for use after handshake completion.

-The handshake consists of multiple stages (again, depending on the version), for
-TLSv1.3 that would be:
-
-\begin{itemize}
-  \item \textbf{Client hello}: client sends a client hello message containing
-    the protocol version, a list of cipher suites and the client random value.
-    The client in this step also includes the ephemeral Diffie-Helman (EDH)
-    parameters, which are later used for calculating the pre-master key.
-  \item \textbf{Server generating a master secret}: the server has got the
-    cipher suites, the client's paramaters and client random and already has
-    the server random, which means it can create the master secret.
-  \item \textbf{Server hello and ``Finished''}: the server includes in the
-    hello its certificate, digital signature, server random, the chosen
-    cipher suite, and sends a ``Finished'' (meaning \emph{ready}) message.
-  \item \textbf{Signature and certificate verification}: the client at this
-    step verifies server's certificate and signature, generates the master
-    secret and is ready (sends the ``Finished'' message).
-\end{itemize}
-
-At the end of the process, the connection is protected by symmetric encryption
-using the session key that the both parties have arrived at.
+TLSv1.3 dramatically reduced the number of available suites to only include the
+ones deemed secure enough, which is why it is no longer necessary to manually
+specify what cipher suite should be used (or rely on the client/server to
+choose wisely). While possibly facing compatibility issues with legacy devices,
+the simplicity that enabling TLSv1.3 brings is a worthy trade-off.


 \n{1}{Passwords}\label{sec:passwords}
@ -381,151 +334,7 @@ internet that is discussed in the next sections and covers what browsers are,
 what they do and how they relate to web security.


-\n{2}{Browsers}\label{sec:browsers}
-
-Browsers, sometimes used together with the word that can serve as a real tell
-for their specialisation - \emph{web} browsers - are programs intended for
-\emph{browsing} of \emph{the web}. In more technical terms, browsers are
-programs that facilitate (directly or via intermediary tools) domain name
-lookups, connecting to web servers, optionally establishing a secure
-connection, requesting the web page in question, determining its \emph{security
-policy} and resolving what accompanying resources the web page specifies and
-depending on the applicable security policy, requesting those from their
-respective origins, applying stylesheets and running scripts. Constructing a
-program that can speak many protocols and securely runs untrusted code from the
-internet is no easy task.
-
-\n{3}{Complexity}
-
-Browsers these days are also quite ubiquitous programs running on
-\emph{billions} of consumer grade mobile devices (which are also notorious for
-bad update hygiene) or desktop devices all over the world. Regular users
-usually expect them to work flawlessly with a multitude of network conditions,
-network scenarios (the proverbial café WiFi, cellular data in a remote
-location, home broadband that is DNS-poisoned by the ISP), differently tuned
-(or commonly misconfigured) web servers, a combination of modern and
-\emph{legacy} encryption schemes and different levels of conformance to web
-standards from both web server and website developers. Of course, if a website
-is broken, it is the browser's fault.  Browsers are expected to detect if
-\emph{captive portals} (a type of access control that usually tries to force
-the user through a webpage with terms of use) are active and offer redirects.
-All of this is immense complexity and the combination of ubiquity and great
-exposure that this type of software gets is, in the author's opinion, the cause
-behind a staggering amount of vulnerabilities found, reported and fixed in
-browsers every year.
-
-\n{3}{Standardisation}
-
-Over the years, a consortium of parties interested in promoting and developing
-the web (also due to its potential as a digital marketplace, i.e.\ financial
-incentives) and browser vendors (of which the most neutral participant is
-perhaps \emph{Mozilla}, with Chrome being run by Google, Edge by Microsoft and
-Safari/Webkit by Apple) has evolved a great volume of web standards, which are
-also relatively frequently getting updated or deprecated and replaced by
-revised or new ones, rendering the browser maintenance task into essentially a
-cat-and-mouse game.
-
-It is the web's extensibility that enabled this build-up and ironically has
-been proclaimed by some to be its greatest asset. It has also been ostensibly
-been criticised~\cite{ddvweb} in the past and the frustration with the status
-quo of web standards has relatively recently prompted a group of people to even
-create ``\textit{a new application-level internet protocol for the distribution
-of arbitrary files, with some special consideration for serving a lightweight
-hypertext format which facilitates linking between files}'':
-Gemini~\cite{gemini}\cite{geminispec} that in the words of its authors can be
-thought of as ``\textit{the web, stripped right back to its essence}'' or as
-``\textit{Gopher, souped up and modernised just a little}'', depending upon the
-reader's perspective, noting that the latter view is probably more accurate.
-
-\n{3}{HTTP}
-
-Originally, HTTP was also designed just for fetching hypertext
-\emph{resources}, but it has evolved since then, particularly due to its
-extensibility, to allow for fetching of all sorts of web resources a modern
-website of today provides, such as scripts or images, or even to \emph{post}
-content back to servers.
-
-HTTP relies on TCP (Transmission Control Protocol), which is one of the
-\emph{reliable} (mandated by HTTP) protocols used to send data across
-contemporary IP (Internet Protocol) networks, to deliver the data it requests
-or sends. When Tim Berners-Lee invented the World Wide Web (WWW) in 1989 while
-working at CERN (The European Organization for Nuclear Research) with a rather
-noble intent as a ``\emph{wide-area hypermedia information retrieval initiative
-  to give universal access to a large universe of documents}''~\cite{wwwf}, he
-  also invented the HyperText Markup Language (HTML) to serve as a formatting
-  method for these new hypermedia documents. The first website was written
-  roughly the same way as today's websites are, using HTML, although the markup
-  language has changed since, with the current version being HTML5.
-
-It has been mentioned that the client \textbf{requests} a \textbf{resource} and
-receives a \textbf{response}, so those terms should probably be defined.
-
-A request is what the client sends to the server. A resource is what it
-requests and a response is the answer provided by the server.
-
-HTTP follows a classic client-server model whereby it is \textbf{always} the
-client that initiates the request.
-
-A web page is, to be blunt, a chunk of \emph{hypertext}. To display a web page,
-a browser first needs to send a request to fetch the HTML representing the
-page, which is then parsed and additional requests for sub-resources are made.
-If a page defines a layout information in the form of CSS, that is parsed as
-well.
-
-A web page needs to be present on the local computer first \emph{before} it can
-be parsed by the browser, and since websites are usually still served by
-programs called \emph{web servers} as in the \emph{early days}, that presents a
-problem of how tell the browser where the resource should be fetched from. In
-today's browsers, the issue is sorted (short of the CLI) by the \emph{address
-bar}, a place into which user types what they wish the browser to fetch for
-them.
-
-The formal name of this segment is a \emph{Universal Resource Locator}, or URL,
-and it contains the schema (or the protocol, such as \texttt{http://}), the
-host address or a domain name and a (TCP) port number.
-
-Since a TCP connection needs to be established first, to connect to a server
-whose only URL contains a domain name, the browser needs to perform a domain
-name \emph{lookup} using system facilities, or as was the case for a couple of
-notorious Chromium versions, send some additional and unrelated queries which
-(with Chromium-based derivatives' numbers) ended up placing unnecessary load
-directly at the root DNS servers~\cite{chromiumrootdns}.
-
-If a raw IP address+port combination is used, the browser attempts to connect
-to it directly and requests the user-requested page by default using the
-\texttt{GET} \emph{method}. A \emph{well-known} HTTP port 80 is assumed unless
-other port is explicitly specified and it can be omitted both if host is a
-domain name or an IP address.
-
-The method is a way for the user-agent to define what operation it wants to
-perform. \texttt{GET} is used for fetching resources while \texttt{POST} is
-used to send data to the server, such as to post the values of an HTML form.
-
-A server response is comprised of a \textbf{status code}, a status message,
-HTTP \textbf{headers} and an optional \textbf{body} containing the content. The
-status code indicates if the original request was successful or not and the
-browser is generally there to interpret these status codes to the user. There
-is enough status codes to be confused by the sheer numbers but luckily, there
-is a method to the madness and they can be divided into groups/classes:
-
-\begin{itemize}
-  \item 1xx: Informational responses
-  \item 2xx: Successful responses
-  \item 3xx: Redirection responses
-  \item 4xx: Client error responses
-  \item 5xx: Server error responses
-\end{itemize}
-
-In case the \emph{user agent} (a web \emph{client}) such as a browser receives
-a response with content, it has to parse it.
-
-A header is additional information sent by both the server and the client that
-can guide or alter the behaviour of software reading it. For instance a
-\texttt{Cache-control} header with a duration value can be used by the server
-to signify that the client can store certain resources for some time before
-needing to re-fetch them, if they are not \emph{expired}.
-
-\n{3}{Site Isolation}
+\n{2}{Site Isolation}

 Modern browsers such as Firefox or Chromium come with a security focus in mind.
 Their developers are acutely aware of the dangers that parsing untrusted code
@ -558,6 +367,7 @@ access to session tokens and any cookies associated with the website's origin,
 apart from being able to rewrite the HTML content. The results of XSS can
 range from account compromise to identity theft.

+
 \n{2}{Content Security Policy}\label{sec:csp}

 Content Security Policy (CSP) has been an important addition to the arsenal of
@ -600,15 +410,380 @@ in production. There are many more directives and settings than mentioned in
 this section, the author encourages anybody interested to give it a read, e.g.\
 at \url{https://web.dev/csp/}.

-\n{2}{Summary}
+\textbf{TODO}: add more concrete examples.

-Passwords are in use everywhere and probably will be for the foreseeable
-future. As long as passwords are going to be handled and stored by
-service/application providers, they are going to get leaked, be it due to
-provider carelessness or the attackers' resolve and wit. Of course, sifting
-through all the available password breach data by hand is not a reasonable
-option, and therefore tools should come in to provide assistance. The next part
-of the thesis will explore that and offer a solution.
+
+\n{1}{Configuration}
+
+Every non-trivial program usually offers at least \emph{some} way to
+tweak/manage its behaviour, and these changes are usually persisted
+\emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a
+\emph{LocalStorage} key-value store in the browser, a binary or plain text
+configuration file.  These configuration files need to be read and checked at
+least on program start-up and either stored into operating memory for the
+duration of the runtime of the program, or loaded and parsed and the memory
+subsequently \emph{freed} (initial configuration).
+
+There is an abundance of configuration languages (or file formats used to craft
+configuration files, whether they were intended for it or not) available: TOML,
+INI, JSON and YAML, to name some of the popular ones (as of today).
+
+Dhall stood out as a language that was designed with both security and the
+needs of dynamic configuration scenarios in mind, borrowing a concept or two
+from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
+few of its concepts from Haskell), and in its apparent core being very similar
+to JSON, which adds to the familiar feel. In fact, in Dhall's authors' own
+words it is: ``a programmable configuration language that you can think of as:
+JSON + functions + types + imports''~\cite{dhalllang}.
+
+Among all of the listed features, the especially intriguing one to the author
+was the promise of \emph{types}. There are multiple examples directly on the
+project's documentation webpage demonstrating for instance the declaration and
+usage of custom types (that are, of course, merely combinations of the
+primitive types that the language provides, such as \emph{Bool},
+\emph{Natural} or \emph{List}, to name just a few), so it was not exceedingly
+hard to start designing a custom configuration \emph{schema} for the program.
+Dhall not being a Turing-complete language also guarantees that evaluation
+\emph{always} terminates eventually, which is a good attribute to possess as a
+configuration language.
+
+\n{3}{Safety considerations}
+
+With a programmable configuration language that understands functions and
+allows importing not only arbitrary text from random internet URLs, but also
+importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
+is important that there are safety mechanisms in place that the user can rely
+on. Dhall offers several: it enforces a same-origin policy and (optionally)
+allows pinning a cryptographic hash of the value of the expression being
+imported.
+
+\n{3}{Possible alternatives}
+
+While developing the program, the author has also come across certain
+shortcomings of Dhall, namely a long start-up with a \emph{cold cache}, which
+can generally be observed when running the program in an environment that
+either does not allow writing the cache files (a read-only filesystem) or does
+not keep the written cache files, such as a container that is not configured to
+mount a persistent volume at the pertinent location.
+
+To describe the way Dhall works when performing an evaluation, it resolves
+every expression down to a combination of its most basic types (eliminating all
+abstraction and indirection) in the process called
+\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the
+host's cache. The \texttt{dhall-haskell} binary attempts to resolve the
+variable \texttt{\$\{XDG\_CACHE\_HOME\}} (have a look at \emph{XDG Base
+Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the
+results of the normalisation will be written for repeated use. Do note that
+this behaviour has been observed on a GNU/Linux host and the author has not
+verified this behaviour on a non-GNU/Linux host, such as FreeBSD.
+
+If normalisation is performed inside an ephemeral container (as opposed to, for
+instance, an interactive desktop session), the results effectively get lost on
+each container restart. That is both wasteful and not great for user
+experience, since the normalisation of just a handful of imports (which
+internally branch widely) can take upwards of two minutes, during which
+the user is left waiting for the hanging application with no reporting on the
+progress or current status.
+
+While workarounds for the above mentioned problem can be devised relatively
+easily (such as bind mounting persistent volumes inside the container in place
+of the \texttt{\$\{XDG\_CACHE\_HOME\}/dhall} and
+\texttt{\$\{XDG\_CACHE\_HOME\}/dhall-haskell} to preserve the cache between
+restarts, or let the cache be pre-computed during container build, since the
+application is only really expected to run together with a compatible version
+of the configuration schema and this version \emph{is} known at container build
+time), it would certainly feel better if there was no need to work
+\emph{around} the configuration system of choice.
+
+Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
+as a potentially almost drop-in replacement for Dhall feature-wise, while also
+avoiding the costly \emph{cold cache} normalisation, which is, in the author's
+view, Dhall's principal issue.
+
+
+\n{1}{Compromise Monitoring}
+
+There are, of course, several ways one could approach the monitoring of
+compromised credentials, some more \emph{manual} in nature than others. When using a
+service that is suspected/expected to be breached in the future, one can always
+create a unique username/password combination specifically for the subject
+service and never use that combination anywhere else. That way, if the
+credentials ever \emph{do} happen to appear in a data dump online in the
+future, it is going to be a safe assumption as to where they came from.
+
+Unfortunately, the task of actually \emph{monitoring} the credentials can prove
+to be a little more arduous than one could expect at first. There are a couple
+of points that can prove to pose a challenge in case the search is performed by
+hand, namely:
+
+\begin{itemize}
+  \item finding the breached data to look through
+  \item verifying the trustworthiness of the data
+  \item varying quality of the data
+  \item sifting through (possibly) unstructured data by hand
+\end{itemize}
+
+Of course, as this is a popular topic for a number of people, the above
+mentioned work has already been packaged into neat and practical online
+offerings. In case one decides in favour of using those, an additional range of
+issues (with the previous ones still applicable) arises:
+
+\begin{itemize}
+  \item the need to trust the provider with input credentials
+  \item relying on the goodwill of the provider to be able to access the data
+  \item hoping that the terms of service are kept
+\end{itemize}
+
+Besides that, there is a plethora of breaches floating around the Internet
+available simply as zip files, which makes the job even harder.
+
+The overarching goal of this thesis is devising and implementing a system in
+which the user can \emph{monitor} whether their credentials have been
+\emph{compromised} (at least as far as the data can tell), and allowing them to
+do so without needing to entrust their sensitive data to a provider.
+
+
+\n{2}{Data Sources}\label{sec:dataSources}
+
+A data source is, in this context, considered to be anything that provides the
+application with data that it understands.
+
+Of course, the results of credential compromise verification/monitoring are only
+going to be as good as the data underpinning it, which is why it is imperative
+that high quality data sources be used, if at all possible. While great care
+does have to be taken to only choose the highest quality data sources, the
+application must offer a means to be able to utilise these.
+
+The sources from which breached data can be loaded into an application can be
+split into two basic categories: \textbf{online} or \textbf{local}, and it is
+possible to further discern between \emph{structured} and \emph{unstructured}
+data.
+
+An online source is generally a service that ideally exposes a programmatic
+API, which an application can query and from which it can request the necessary
+subsets of data.
+These types of services often additionally front the data by a user-friendly
+web interface for one-off searches, which is, however, not of use here.
+
+Examples of such online services include:
+
+\begin{itemize}
+  \item {Have I Been Pwned?} - \url{https://haveibeenpwned.com}
+  \item {DeHashed} - \url{https://dehashed.com}
+\end{itemize}
+
+Large lumps of unstructured data available on forums or shady web servers would
+technically also count here, given that they provide data and are available
+online. However, even though data is frequently found online precisely in this
+form, it is also not of direct use for the application without manual
+\emph{preprocessing}, as it is attended to in
+Section~\ref{sec:localDatasetPlugin}.
+
+Another source is then simply any locally supplied data, which, of course,
+could have been obtained from a breach available online beforehand.
+
+Locally supplied data is specific in that it needs to be formatted in such a
+way that it can be understood by the application. That is, the data is not in
+its raw form anymore but has been morphed into the precise shape the
+application needs for further processing. Once imported, the application can
+query the data at will, as it knows exactly the shape of it.
+
+This supposes the existence of a \emph{format} for importing, schema of which
+is devised in Section~\ref{sec:localDatasetPlugin}.
+
+
+\n{3}{Local Dataset Plugin}\label{sec:localDatasetPlugin}
+
+Unstructured breach data from locally available datasets can be imported into
+the application by first making sure it adheres to the specified schema (have a
+look at the \emph{Breach Data Schema} in Listing~\ref{breachDataGoSchema}). If
+it does not (which is very likely with random breach data, as already mentioned
+in Section~\ref{sec:dataSources}), it needs to be converted to a form that
+\emph{does} before importing it to the application, e.g.\ using a Python script
+or a similar method.
+
+Attempting to import data that does not follow the outlined schema should
+result in an error. Equally so, importing a dataset which is over a reasonable
+size limit should by default be rejected by the program as a precaution.
+Unmarshaling, for instance, a 1 TiB document would most likely result in an
+out-of-memory (OOM) situation on the host running the application, assuming
+contemporary consumer hardware conditions (not HPC).
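+
+A minimal sketch of such a size guard is shown in
+Listing~\ref{sketchImportSizeLimit}. The function name \texttt{readLimited}
+and the 64 MiB default are illustrative assumptions and are not taken from the
+actual implementation; the idea is merely to stop reading as soon as the input
+exceeds the configured limit, instead of unmarshaling it first.
+
+\vspace{\parskip}
+\begin{lstlisting}[language=Go, caption={Sketch of rejecting an oversized
+dataset before unmarshaling (hypothetical names and limit)},
+label=sketchImportSizeLimit]
+    package main
+
+    import (
+        "errors"
+        "fmt"
+        "io"
+        "strings"
+    )
+
+    // maxImportSize is a hypothetical default limit (64 MiB).
+    const maxImportSize int64 = 64 << 20
+
+    var errTooLarge = errors.New("dataset exceeds the size limit")
+
+    // readLimited reads at most limit bytes from r and returns
+    // an error if the input turns out to be larger than that.
+    func readLimited(r io.Reader, limit int64) ([]byte, error) {
+        // read one extra byte to be able to detect oversized input.
+        b, err := io.ReadAll(io.LimitReader(r, limit+1))
+        if err != nil {
+            return nil, err
+        }
+        if int64(len(b)) > limit {
+            return nil, errTooLarge
+        }
+        return b, nil
+    }
+
+    func main() {
+        // a small input passes with the default limit.
+        b, err := readLimited(strings.NewReader("name: x"), maxImportSize)
+        fmt.Println(len(b), err)
+
+        // the same reader logic rejects input over a tiny limit.
+        _, err = readLimited(strings.NewReader("0123456789"), 4)
+        fmt.Println(err)
+    }
+\end{lstlisting}
+\vspace*{-\baselineskip}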
+
+\vspace{\parskip}
+\begin{lstlisting}[language=Go, caption={Breach Data Schema represented as a Go
+struct with imports from the standard library assumed},
+label=breachDataGoSchema]
+    type breachDataSchema struct {
+        Name              string
+        Time              time.Time
+        IsVerified        bool
+        ContainsPasswords bool
+        ContainsHashes    bool
+        HashType          string
+        HashSalted        bool
+        HashPeppered      bool
+        ContainsUsernames bool
+        ContainsEmails    bool
+        Data              any
+    }
+\end{lstlisting}
+\vspace*{-\baselineskip}
+
+The Go representation shown in Listing~\ref{breachDataGoSchema} will in
+actuality translate to a YAML document written and supplied by an
+administrative user of the program. The YAML format was chosen for several
+reasons:
+
+\begin{itemize}
+  \item relative ease of use (plain text, readability)
+  \item capability to store multiple \emph{documents} inside of a single file
+  \item most of the inputs being implicitly typed as strings
+  \item support for inclusion of comments
+  \item machine readability thanks to being a superset of JSON
+\end{itemize}
+
+The last point specifically should allow for documents similar to what can be
+seen in Listing~\ref{breachDataYAMLSchema} to be ingested by the program, read
+and written by humans and programs alike.
+
+\smallskip
+\begin{lstlisting}[language=YAML, caption={Example Breach Data Schema supplied
+to the program as a YAML file, optionally containing multiple documents},
+label=breachDataYAMLSchema]
+    ---
+    name: Horrible breach
+    time: 2022-04-23T00:00:00+02:00
+    isVerified: false
+    containsPasswords: false
+    containsHashes: true
+    containsEmails: true
+    hashType: md5
+    hashSalted: false
+    hashPeppered: false
+    data:
+      hashes:
+        - hash1
+        - hash2
+        - hash3
+      emails:
+        - email1
+        - 
+        - email3
+    ---
+    # document #2, describing another breach.
+    name: Horrible breach 2
+    ...
+\end{lstlisting}
+\vspace*{-\baselineskip}
+
+Notice how the emails list in Listing~\ref{breachDataYAMLSchema} is missing one
+record, perhaps because it was not supplied or was mistakenly omitted. This is a
+valid scenario (mistakes happen) and the application needs to be able to handle
+it. The alternative would be to require the user to prepare the data in such a
+way that the empty/partial records would be dropped entirely.
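+
+One way the application could tolerate such gaps is sketched in
+Listing~\ref{sketchDropEmptyRecords}. It assumes that the YAML decoder yields
+an empty string for the missing list item; the helper name
+\texttt{dropEmpty} is hypothetical and only serves the illustration.
+
+\vspace{\parskip}
+\begin{lstlisting}[language=Go, caption={Sketch of skipping empty records
+after decoding (hypothetical helper)}, label=sketchDropEmptyRecords]
+    package main
+
+    import (
+        "fmt"
+        "strings"
+    )
+
+    // dropEmpty returns the non-blank records together with
+    // the number of records that were skipped.
+    func dropEmpty(records []string) ([]string, int) {
+        kept := make([]string, 0, len(records))
+        for _, r := range records {
+            if strings.TrimSpace(r) == "" {
+                continue
+            }
+            kept = append(kept, r)
+        }
+        return kept, len(records) - len(kept)
+    }
+
+    func main() {
+        emails, skipped := dropEmpty([]string{"email1", "", "email3"})
+        fmt.Println(emails, skipped)
+    }
+\end{lstlisting}
+\vspace*{-\baselineskip}
+
+Reporting the number of skipped records back to the administrative user would
+let them decide whether the source data needs another look.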
+
+\n{3}{Have I Been Pwned? Integration}
+
+Troy Hunt's \textbf{Have I Been Pwned?} online service
+(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
+compromised data. The service offers private APIs that are protected by API
+keys. The application's \texttt{hibp} module and database representation model
+the values returned by this API, which allows searching in large breaches using
+email addresses.\\
+The architecture there is relatively simple: the application administrator
+configures an API key for HIBP, the user enters the query parameters, the
+application constructs a query and calls the API and waits for a response. As
+the API is rate-limited based on the key supplied, this can pose an issue and
+it has not been fully resolved in the UI. The application then parses the
+returned data and binds it to the local model for validation. If that goes
+well, the data is saved into the database as a cache and the search query is
+performed on the saved data. If it returns anything, it is displayed to the
+user for browsing.
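+
+The request-building part of this flow can be sketched as in
+Listing~\ref{sketchHibpRequest}. This is not the code of the \texttt{hibp}
+module: the function names and the user agent value are hypothetical, and the
+sketch assumes the publicly documented HIBP v3 \texttt{breachedaccount}
+endpoint, the \texttt{hibp-api-key} request header and a \texttt{Retry-After}
+header (in seconds) on rate-limited responses.
+
+\vspace{\parskip}
+\begin{lstlisting}[language=Go, caption={Sketch of preparing an HIBP query and
+reading the rate-limit hint (hypothetical names)}, label=sketchHibpRequest]
+    package main
+
+    import (
+        "fmt"
+        "net/http"
+        "net/url"
+        "strconv"
+        "time"
+    )
+
+    // assumed base URL of the HIBP v3 API.
+    const hibpBase = "https://haveibeenpwned.com/api/v3"
+
+    // newBreachRequest prepares a breached-account lookup.
+    func newBreachRequest(key, account string) (*http.Request, error) {
+        u := hibpBase + "/breachedaccount/" + url.PathEscape(account)
+        req, err := http.NewRequest(http.MethodGet, u, nil)
+        if err != nil {
+            return nil, err
+        }
+        req.Header.Set("hibp-api-key", key)
+        // hypothetical user agent value.
+        req.Header.Set("User-Agent", "pcmt-sketch")
+        return req, nil
+    }
+
+    // retryAfter reads the delay suggested by a rate-limited
+    // response, falling back to def when the header is absent
+    // or malformed.
+    func retryAfter(h http.Header, def time.Duration) time.Duration {
+        s, err := strconv.Atoi(h.Get("Retry-After"))
+        if err != nil || s < 0 {
+            return def
+        }
+        return time.Duration(s) * time.Second
+    }
+
+    func main() {
+        // no network access is performed in this sketch.
+        req, err := newBreachRequest("<api-key>", "user@example.com")
+        if err != nil {
+            panic(err)
+        }
+        fmt.Println(req.Method, req.URL)
+
+        h := http.Header{}
+        h.Set("Retry-After", "2")
+        fmt.Println(retryAfter(h, time.Second))
+    }
+\end{lstlisting}
+\vspace*{-\baselineskip}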
+
+
+\n{1}{Deployment recommendations}\label{sec:deploymentRecommendations}
+
+It is, of course, recommended that the application runs in a secure environment
+\allowbreak although definitions of that almost certainly differ depending on
+who you ask. General recommendations would be either to effectively reserve a
+machine for a single use case - running this program - so as to dramatically
+decrease the potential attack surface of the host, or run the program isolated
+in a container or a virtual machine. Further, if the host does not need
+management access (it is a deployed-to-only machine that is configured
+out-of-band, such as with a \emph{golden} image/container or declaratively with
+Nix), then an SSH \emph{daemon} should not be running in it, since it is not
+needed. In an ideal scenario, the host machine would have as little software
+installed as possible besides what the application absolutely requires.
+
+System-wide cryptographic policies should target the highest feasible security
+level, if at all available (such as by default on Fedora or RHEL), covering
+SSH, DNSSEC, IPsec, Kerberos and TLS protocols. Firewalls should be configured
+and SELinux (kernel-level mandatory access control and security policy
+mechanism) running in \emph{enforcing} mode, if available.
+
+\n{2}{Transport security}
+
+Users connecting to the application should rightfully expect their data to be
+protected \textit{in transit} (i.e.\ on the way between their browser and the
+server), which is what the \emph{Transport Layer Security} family of
+protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
+of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
+client authenticate the server (verify that it is who it claims it is) and
+negotiate a symmetric key for encryption in the process named the \emph{TLS
+handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
+which is establishing a secure communications connection. The operator should
+configure the program to either directly utilise TLS using configuration or
+have it listen behind a TLS-terminating \emph{reverse proxy}.
+
+
+\n{2}{Containerisation}
+
+Whether the pre-built or a custom container image is used to deploy the
+application, it still needs access to secrets, such as database connection
+string (containing database host, port, user, password/encrypted password,
+authentication method and database name).
+
+The application should be able to handle the most common Postgres
+authentication methods~\cite{pgauthmethods}, namely \emph{peer},
+\emph{scram-sha-256}, \emph{user name maps} and raw \emph{password}, although
+the \emph{password} option should not be used in production, \emph{unless} the
+connection to the database is protected by TLS.\ In any case, using the
+\emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable. One of the
+ways to verify in a development environment that everything works as intended is
+the \emph{Password generator for PostgreSQL} tool~\cite{goscramsha256}, which
+allows retrieving the encrypted string from a raw user input.
+
+If the application running in a container wants to use the \emph{peer}
+authentication method, it is up to the operator to supply the Postgres socket
+to the application (e.g.\ as a volume bind mount). This scenario was not
+tested, however, and the author is also not entirely certain how \emph{user
+namespaces} (on GNU/Linux) would influence the process (as in when the
+\emph{UIDs} of a user \textbf{outside} the container are mapped to a range of
+\emph{UIDs} \textbf{inside} the container), for which the setup would likely
+need to account.
+
+Equally, if the application is running inside the container, the operator needs
+to make sure that the database is either running in a network that is also
+directly attached to the container or that there is a mechanism in place that
+routes the requests for the database hostname to the destination.
+
+One such mechanism is container name based routing inside \emph{pods}
+(Podman/Kubernetes), where the resolution of container names is the
+responsibility of a specially configured (often auto-configured) piece of
+software called Aardvark for the former and CoreDNS for the latter.
+
+
+\n{1}{Summary}
+
+Passwords (and/or passphrases) are in use everywhere and quite probably will be
+for the foreseeable future. If not as \textit{the} principal way to
+authenticate, then at least as \textit{a} way to authenticate. As long as
+passwords are going to be handled and stored by service/application providers,
+they are going to get leaked, be it due to provider carelessness or the
+attackers' resolve and wit. Of course, sifting through all the available
+password breach data by hand is not a reasonable option, and therefore tools
+providing assistance come in handy. The next part of this diploma thesis will
+explore that issue and introduce a solution.


 % =========================================================================== %
@ -616,14 +791,10 @@ of the thesis will explore that and offer a solution.

 \n{1}{Kudos}

-\textbf{Disclaimer:} the author is not affiliated in any way with any of the
-projects described on this page.
-
-The \textit{Password Compromise Monitoring Tool} (\texttt{pcmt}) program has
-been developed using and utilising a great deal of free (as in Freedom) and
-open-source software in the process, either directly or as an outstanding work
-tool, and the author would like to take this opportunity to recognise that
-fact.
+The program that has been developed as part of this thesis used and utilised a
+great deal of free (as in \textit{freedom}) and open-source software in the
+process, either directly or as an outstanding work tool, and the author would
+like to take this opportunity to recognise that fact\footnotemark.

 In particular, the author acknowledges that this work would not be the same
 without:
@ -641,9 +812,12 @@ without:

 All of the code written has been typed into VIM (\texttt{9.0}), the shell used
 to run the commands was ZSH, both running in the author's terminal emulator of
-choice - \texttt{kitty} on a \raisebox{.8ex}{\texttildelow}8 month (at the time
-of writing) installation of \textit{Arch Linux (by the way)} using a
-\texttt{6.3.x-wanderer-zfs-xanmod1} variant of the Linux kernel.
+choice, \texttt{kitty}. The development machines ran a recent installation of
+\textit{Arch Linux (by the way)} and Fedora 38, both using a \texttt{6.3.x}
+XanMod variant of the Linux kernel.
+
+\footnotetext{\textbf{Disclaimer:} the author is not affiliated in any way with any
+of the projects described on this page.}


 \n{1}{Development}
@ -689,9 +863,9 @@ There is one caveat to this though, git first needs some additional
 configuration for the code in Listing~\ref{gitverif} to work as one would
 expect. Namely that the public key used to verify the signature needs to be
 stored in git's ``allowed signers file'', then git needs to be told where that
-file is using the configuration value \texttt{gpg.ssh.allowedsignersfile} and
-finally the configuration value of the \texttt{gpg.format} field needs to be
-set to \texttt{ssh}.
+file is located using the configuration value
+\texttt{gpg.ssh.allowedsignersfile} and finally the configuration value of the
+\texttt{gpg.format} field needs to be set to \texttt{ssh}.

 Because git allows the configuration values to be local to each repository,
 both of the mentioned issues can be solved by running the following commands
@ -703,10 +877,11 @@ label=gitsshprep, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
 % # set the signature format for the local repository.
 % git config --local gpg.format ssh
 % # save the public key.
-% cat >./tmp/.allowed_signers \
-  <<<'leo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKwshTdBgLzwY4d8N7VainZCngH88OwvPGhZ6bm87rBO'
+% cat > ./.tmp-allowed_signers \
+  <<<'surtur <insert literal surtur pubkey>
+  leo <insert literal leo pubkey>'
 % # set the allowed signers file path for the local repository.
-% git config --local gpg.ssh.allowedsignersfile=./tmp/.allowed_signers
+% git config --local gpg.ssh.allowedsignersfile ./.tmp-allowed_signers
 \end{lstlisting}
 \vspace*{-\baselineskip}

@ -767,17 +942,17 @@ The fourth pipeline focuses on linting the Containerfile and building the
 container, although the latter action is only performed on feature branches,
 \emph{pull requests} or \emph{tag} events.

-The median build time as of writing was 1 minute, which includes running all
-four pipelines, and that is acceptable. Build times might of course vary
-depending on the hardware, for reference, these builds were being run on a
-machine equipped with a Zen 3 Ryzen 5 5600 CPU with nominal clock times, DDR4
-3200MHz RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS)
-and a 400Mbps downlink, software-wise running Arch with an author-flavoured
-Xanmod kernel version 6.3.x.
-
 \obr{Drone CI median build
 time}{fig:drone-median-build}{.84}{graphics/drone-median-build}

+The median build time as of writing was 1 minute, which includes running all
+four pipelines, and that is acceptable. Build times might of course vary
+depending on the hardware, for reference, these builds were run on a machine
+equipped with a Zen 3 Ryzen 5 5600 CPU with nominal clock times, DDR4 3200MHz
+RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS) and a
+400Mbps downlink, software-wise running Arch with an author-flavoured Xanmod
+kernel version 6.3.x.
+

 \n{2}{Source code repositories}\label{sec:repos}

@ -805,20 +980,28 @@ The repository containing the \LaTeX{} source code of this thesis:\\

 \n{2}{Toolchain}

-Throughout the creation of this work, the \emph{current} version of the Go
+Throughout the creation of this work, the \emph{then-current} version of the Go
 programming language was used, i.e. \texttt{go1.20}.

+To read more on why Go was chosen, see Appendix~\ref{appendix:whygo}.
+Nix/\texttt{devenv} tools have also aided heavily during development, see
+Appendix~\ref{appendix:whynix} to learn more.
+
 \tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
  \textbf{Name} & \textbf{Usage} \\
  Go programming language & program core \\
  Dhall configuration language & program configuration \\
  Echo & HTTP handlers, controllers, web server \\
  ent & ORM using graph-based modelling \\
-  bluemonday & HTML sanitising \\
+  bluemonday & sanitising HTML \\
  TailwindCSS & stylesheets using a utility-first approach \\
-  PostgreSQL & storing data \\
+  PostgreSQL & persistently storing data \\
 }

+Table~\ref{tab:depsversionmx} contains the names and versions of the most
+important libraries and supporting software that were used to build the
+application.
+
 \tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
  \textbf{Name} & \textbf{version} \\
  \texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
@ -829,90 +1012,85 @@ programming language was used, i.e. \texttt{go1.20}.
  \texttt{PostgreSQL} (\url{https://www.postgresql.org/}) & 15.2 \\
 }

-\n{2}{A word about Go}
-First, a question of \textit{`Why pick Go for building a web
-application?'} might arise, so the following few lines will try to address
-that.
-
-Go~\cite{golang}, or \emph{Golang} for SEO-friendliness and disambiguating Go
-the ancient game, is a strongly typed, high-level \emph{garbage-collected}
-language where functions are first-class citizens and errors are values.
-
-The appeal for the author comes from a number of features of the language, such
-as built-in support for concurrency and unit testing, sane \emph{zero} values,
-lack of pointer arithmetic, inheritance and implicit type conversions,
-easy-to-read syntax, producing a statically linked binary by default, etc., on
-top of that, the language has got a cute mascot. Thanks to the foresight of the
-Go Authors regarding \emph{the formatting question} (i.e.\ where to put the
-braces, \textbf{tabs vs.\ spaces}, etc.), most of the discussions on this topic
-have been foregone. Every \emph{gopher}~\footnote{euph.\ a person writing in
-the Go programming language} is expected to format their source code with the
-official formatter (\texttt{gofmt}), which automatically ensures that the code
-adheres to the one formatting standard. Then, there is \emph{The Promise} of
-backwards compatibility for Go 1.x, which makes it a good choice for long-term
-without the fear of being rug-pulled.
-
-
-\n{2}{A word about Nix/devenv}
-
-Nix (\url{https://builtwithnix.org/}) is a declarative package manager and a
-functional programming language resembling Haskell, which has been used in this
-project in the form of \texttt{devenv} tool (\url{https://devenv.sh/}) to
-create \textbf{declarable} and \textbf{reproducible} development environment.
-The author has previously used Nix directly with \emph{flakes} and liked
-\texttt{devenv}, as it effectively exposed only a handful of parameters for
-configuration, and rid of the need to manage the full flake, which is of course
-still an option for people who choose so. See \texttt{devenv.nix} in the
-repository root.

 \n{1}{Application architecture}

-The source code of the main module further is split into Go \emph{packages}
-appropriately along a couple of domains: logging, core application, web
-routers, configuration and settings, etc. In Go, packages are delimited by
-folder structure -- each folder can be package.
+\n{2}{Package structure}
+
+The source code of the main module is organised into smaller, self-contained Go
+\emph{packages} appropriately along a couple of domains: logging, core
+application, web routers, configuration and settings, etc. In Go, packages are
+delimited by folder structure -- each folder can be a package.

 Generally speaking, the program aggregates decision points into central places,
-such as \texttt{run.go}, which imports child packages that facilitate each of
-loading the configuration, connecting to the database and running migrations,
-consolidating flag, environment variable and configuration-based values into
-canonical \emph{settings}, setting up routes and handling graceful shutdown.
+such as \texttt{run.go}, which then imports child packages that facilitate each
+of the tasks of loading the configuration, connecting to the database and
+running migrations, consolidating flag, environment variable and
+configuration-based values into canonical \emph{settings}, setting up routes
+and handling graceful shutdown.
+
+\n{3}{Internal package}
+
+The \texttt{internal} package was not used as of writing, but the author plans
+to eventually migrate \emph{internal} logic of the program into the internal
+package to prevent accidental imports.
+
+
+\n{2}{Logging}

 The program uses dependency injection to share a single logger instance,
 similar applies to the database client. These are passed around as a pointer,
 so the underlying data stays the same. As a rule of thumb, every larger
 \texttt{struct} that needs to be passed around is passed around as a pointer.

-The \texttt{internal} package was not used as of writing, but the author plans
-to eventually migrate \emph{internal} logic of the program into the internal
-package to prevent accidental imports.

-The authentication logic is relatively simple and the author would like to
+\n{2}{Authentication}
+
+The authentication logic is relatively simple and the author attempted to
 isolate it into a custom \emph{middleware}. User passwords are hashed using a
-secure KDF before being sent to the database. The KDF used is \texttt{bcrypt}
-(with a sane \emph{Cost} of 10), which automatically includes \emph{salt} for
-the password and provides ``length-constant'' time hash comparisons. The author
-plans to add support for the more modern \texttt{scrypt} and the
-state-of-the-art, P-H-C (Password Hashing Competition) winner algorithm
-\texttt{Argon2} (\url{https://github.com/P-H-C/phc-winner-argon2}). Besides, no
-raw queries are used to access the database, helping decrease the likelihood of
-SQL injection attacks.
+secure KDF before being sent to the database. The KDF of choice is
+\texttt{bcrypt} (with a sane \emph{Cost} of 10), which automatically includes
+\emph{salt} for the password and provides ``length-constant'' time hash
+comparisons. The author plans to add support for the more modern
+\texttt{scrypt} and the state-of-the-art, P-H-C (Password Hashing Competition)
+winner algorithm \texttt{Argon2}
+(\url{https://github.com/P-H-C/phc-winner-argon2}) for flexibility.
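+
+For illustration, hashing and verifying a password with \texttt{bcrypt} at
+\emph{Cost} 10 could look as in Listing~\ref{sketchBcrypt}. The sketch assumes
+the \texttt{golang.org/x/crypto/bcrypt} package and is not a copy of the
+program's authentication code; the sample password is made up.
+
+\vspace{\parskip}
+\begin{lstlisting}[language=Go, caption={Sketch of bcrypt hashing and
+verification (assuming golang.org/x/crypto/bcrypt)}, label=sketchBcrypt]
+    package main
+
+    import (
+        "fmt"
+
+        "golang.org/x/crypto/bcrypt"
+    )
+
+    func main() {
+        pw := []byte("correct horse battery staple")
+
+        // a fresh salt is generated and stored inside the hash.
+        hash, err := bcrypt.GenerateFromPassword(pw, 10)
+        if err != nil {
+            panic(err)
+        }
+
+        // a nil error means that the password matches the hash.
+        err = bcrypt.CompareHashAndPassword(hash, pw)
+        fmt.Println(err == nil)
+    }
+\end{lstlisting}
+\vspace*{-\baselineskip}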
+
+\n{2}{SQLi prevention}
+
+No raw SQL queries are directly used to access the database, thus decreasing
+the likelihood of SQL injection attacks. Instead, parametric queries are
+constructed in code using a graph-like API of the \texttt{ent} library, which
+is attended to in-depth in Section~\ref{sec:dbschema}.
+
+
+\n{2}{Configurability}
+
+Virtually any important value in the program has been made into a configuration
+value, so that the operator can customise the experience as needed. A choice of
+sane configuration defaults was attempted, which resulted in the configuration
+file essentially only needing to contain secrets, unless there is a need to
+override the defaults. It is not an entirely \emph{zero-config} situation, rather
+a \emph{minimal-config} one. An example can be seen in
+Section~\ref{sec:configuration}.
+
+
+\n{2}{Embedded assets}

 An important thing to mention is embedded assets and templates. Go has multiple
 mechanisms to natively embed arbitrary files directly into the binary during
 the regular build process. The built-in \texttt{embed} package was used to
 bundle all template files and web assets, such as images, logos and stylesheets
 at the package level, and these are also the passed around the application as
-needed.  There is also a toggle in the application configuration, which can
-instruct the program at start to either rely entirely on embedded assets or
-pull live files from the filesystem. The former option makes the application
-more portable, while the latter allows for flexibility not only during
-development. Basically, any important value in the program has been made into a
-configuration value, so that the operator can customise the experience as
-needed. A choice of sane configuration defaults was attempted, which resulted
-in the configuration file essentially only needing to contain secrets, unless
-there is a need to override the defaults. It is not entirely \emph{zero-config}
-situation, rather a \emph{minimal-config} one.
+needed.
+
+There is also a toggle in the application configuration, which can instruct the
+program at start to either rely entirely on embedded assets or pull live files
+from the filesystem. The former option makes the application more portable,
+while the latter allows for flexibility not only during development.
+
+
+\n{2}{Composability}

 Templates used for rendering of the web pages were created in a composable
 manner, split into smaller, reusable parts, such as \texttt{footer.tmpl} and
@ -924,6 +1102,9 @@ performed ergonomically and directly using Echo's built-in facilities. A
 popular HTML sanitiser \emph{bluemonday} has been employed to aid with battling
 XSS.

+
+\n{2}{Server-side rendering}
+
 The application constructs the web pages entirely server-side and it runs
 without a single line of JavaScript, of which the author is especially proud.
 It improves load times, decreases attack surface, increases maintainability and
@ -933,12 +1114,8 @@ updates (where \texttt{PUT}s should be used) and the accompanying frequent
 full-page refreshes, but that still is not enough to warrant the use of
 JavaScript.

-As an aside, the author has briefly experimented with WebAssembly for this
-project, but has ultimately scrapped the functionality in favour of the
-entirely server-side rendered one. It is possible that it would get revisited
-if the client-side dynamic functionality was necessary and performance
-mattered. Even from the short experiments it was obvious how much faster
-WebAssembly was compared to JavaScript.
+
+\n{2}{Frontend}

 Frontend-side, the application was styled using TailwindCSS, which promotes
 using of flexible \emph{utility-first} classes in the markup (HTML) instead of
@ -950,61 +1127,112 @@ need to be parsed by Tailwind in order to construct its final stylesheet and
 there is also an original CLI tool for that called \texttt{tailwindcss}.
 Overall, simple and accessible layouts had preference over convoluted ones.

+\n{3}{Frontend experiments}
+
+As an aside, the author has briefly experimented with WebAssembly for this
+project, but has ultimately scrapped the functionality in favour of the
+entirely server-side rendered one. It is possible that it would get revisited
+if the client-side dynamic functionality was necessary and performance
+mattered. Even from the short experiments it was obvious how much faster
+WebAssembly was compared to JavaScript.
+
+
+\newpage
+\n{2}{User isolation}
+
+Users are allowed into certain parts of the application based on the role they
+currently possess. For the moment, two basic roles were envisioned, while this
+list might get amended in the future, should the need arise:
+
+\begin{itemize}
+  \item Administrator
+  \item User
+\end{itemize}
+
+\obr{Application use case diagram}{fig:usecasediagram}{.9}{graphics/pcmt-use-case.pdf}
+
+It is paramount that the program protects itself from insider threats as
+well and therefore each role is only able to perform actions that it is
+explicitly assigned. While there definitely is certain overlap between the
+capabilities of the two outlined roles, each also possesses unique features
+that the other one does not.
+
+For example, the administrator role is not able to perform searches on the
+breach data directly using their administrator account; for that, a separate
+user account has to be created. Similarly, the regular user is not able to
+manage breach lists and other users, because that is a privileged operation.
+
+In-application administrators are not able to view sensitive (any) user data
+and should therefore only be able to perform the following actions:
+
+\begin{itemize}
+  \item Create user accounts
+  \item View list of users
+  \item View user email
+  \item Change user email
+  \item Toggle whether user is an administrator
+  \item Delete user accounts
+\end{itemize}
+
+Let us consider the case of a user managing their own account: while demoting
+oneself from administrator to a regular user is permitted, promoting oneself to
+administrator would constitute a \emph{privilege escalation} and likely be a
+precursor to at least a \emph{denial of service} of sorts.
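+
+The rule above can be expressed as a small authorisation check. The following
+is a sketch only; the \texttt{user} type and the \texttt{mayToggleAdmin}
+function are hypothetical names introduced for illustration and do not
+necessarily mirror the application's actual model.
+
+\begin{lstlisting}[language=Go, caption={Sketch of a check guarding the
+administrator toggle (hypothetical names)}]
+    package main
+
+    import "fmt"
+
+    // user is a hypothetical, trimmed-down account model.
+    type user struct {
+        ID      int
+        IsAdmin bool
+    }
+
+    // mayToggleAdmin reports whether actor may set target's administrator
+    // flag to makeAdmin.
+    func mayToggleAdmin(actor, target user, makeAdmin bool) bool {
+        if !actor.IsAdmin {
+            // Regular users cannot manage roles, which also rules out
+            // self-promotion (privilege escalation).
+            return false
+        }
+        if actor.ID == target.ID {
+            // An administrator managing self may only demote.
+            return !makeAdmin
+        }
+        return true
+    }
+
+    func main() {
+        admin := user{ID: 1, IsAdmin: true}
+        regular := user{ID: 2}
+
+        fmt.Println(mayToggleAdmin(regular, regular, true)) // false
+        fmt.Println(mayToggleAdmin(admin, admin, false))    // true
+        fmt.Println(mayToggleAdmin(admin, regular, true))   // true
+    }
+\end{lstlisting}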
+
+
+\n{2}{Zero trust principle}
+
+\textit{Confidentiality, i.e.\ not trusting the provider}
+
+There is no way for the application (and consequently, the in-application
+administrator) to read user's data. This is possible by virtue of encrypting
+the pertinent data before saving them in the database by a state-of-the-art
+\emph{age} key~\cite{age} (backed by X25519~\cite{x25519rfc7748}), which is in
+turn safely stored encrypted by a passphrase that only the user controls. Of
+course, the user-supplied password is run through a password-based key derivation
+function (PBKDF: a key derivation function with a sliding computational cost)
+before letting it encrypt the \emph{age} key.
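+
+To illustrate the idea, the following sketch generates an \emph{age} key and
+wraps it with a user-supplied passphrase. It is not necessarily how the
+application does it: the sketch assumes the \texttt{filippo.io/age} Go module
+is available and uses its passphrase-based (scrypt) recipient as the key
+derivation step, which is an assumption made here; the passphrase literal is a
+placeholder.
+
+\begin{lstlisting}[language=Go, caption={Sketch of wrapping an age key with a
+passphrase (assumes the filippo.io/age module)}]
+    package main
+
+    import (
+        "bytes"
+        "fmt"
+        "io"
+        "log"
+
+        "filippo.io/age"
+    )
+
+    func main() {
+        // Generate the user's X25519-backed age key.
+        identity, err := age.GenerateX25519Identity()
+        if err != nil {
+            log.Fatal(err)
+        }
+
+        // Passphrase-based recipient; age derives the wrapping key using
+        // scrypt (chosen for this sketch, an assumption).
+        recipient, err := age.NewScryptRecipient("user-chosen passphrase")
+        if err != nil {
+            log.Fatal(err)
+        }
+
+        // Encrypt the textual form of the private key.
+        var wrapped bytes.Buffer
+        w, err := age.Encrypt(&wrapped, recipient)
+        if err != nil {
+            log.Fatal(err)
+        }
+        if _, err := io.WriteString(w, identity.String()); err != nil {
+            log.Fatal(err)
+        }
+        if err := w.Close(); err != nil {
+            log.Fatal(err)
+        }
+
+        // Only the wrapped key would be persisted; the public part may be
+        // kept in the clear and used to encrypt user data.
+        fmt.Println("public key:", identity.Recipient())
+        fmt.Println("wrapped key size:", wrapped.Len())
+    }
+\end{lstlisting}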
+
+The \emph{age} key is only generated when the user changes their password for
+the first time to prevent scenarios such as in-application administrator with
+access to physical database being able to both \textbf{recover} the key from
+the database and \textbf{decrypt} it given that they already know the user
+password (because they set it), which would subsequently give them unbounded
+access to any future encrypted data, as long as they would be able to maintain
+their database access. This is why the \emph{age} key generation and protection
+are bound to the first password change. Of course, the evil administrator could
+just perform the change themselves; however, the user would at least be able to
+find those changes in the activity logs and know not to use the application.
+But given the scenario of a total database compromise, the author finds all
+hope is already lost at that point. At least when the database is dumped, it
+only contains non-sensitive, functional information in plain text; everything
+else should be encrypted.
+
+Consequently, both the application operators and the in-application
+administrators should never be able to learn the details of what the user is
+tracking, the same being applicable even to potential attackers with direct
+access to the database. Thus the author maintains that every scenario that
+could potentially lead to a data breach (apart from a compromised user machine
+and the like) would have to entail some form of operating memory acquisition,
+for instance using \texttt{LiME}~\cite{lime}, or perhaps directly the
+\emph{hypervisor}, if considering a virtualised (``cloud'') environment.
+

 \n{1}{Implementation}

-\n{2}{Configuration}
-
-Every non-trivial program usually offers at least \emph{some} way to
-tweak/manage its behaviour, and these changes are usually persisted
-\emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a
-\emph{LocalStorage} key-value store in the browser, a binary or plain text
-configuration file.  These configuration files need to be read and checked at
-least on program start-up and either stored into operating memory for the
-duration of the runtime of the program, or loaded and parsed and the memory
-subsequently \emph{freed} (initial configuration).
-
-There is an abundance of configuration languages (or file formats used to craft
-configuration files, whether they were intended for it or not) available, TOML,
-INI, JSON, YAML, to name some of the popular ones (as of today).
-
-Dhall stood out as a language that was designed with both security and the
-needs of dynamic configuration scenarios in mind, borrowing a concept or two
-from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
-few of its concepts from Haskell), and in its apparent core being very similar
-to JSON, which adds to familiar feel. In fact, in Dhall's authors' own words it
-is: ``a programmable configuration language that you can think of as: JSON +
-functions + types + imports''~\cite{dhalllang}.
-
-Among all of the listed features, the especially intriguing one to the author
-was the promise of \emph{types}. There are multiple examples directly on the
-project's documentation webpage demonstrating for instance the declaration and
-usage of custom types (that are, of course merely combinations of the primitive
-types that the language provides, such as \emph{Bool}, \emph{Natural} or
-\emph{List}, to name just a few), so it was not exceedingly hard to start
-designing a custom configuration \emph{schema} for the program.
-Dhall not being a Turing-complete language also guarantees that evaluation
-\emph{always} terminates eventually, which is a good attribute to possess as a
-configuration language. 
-
-
-\n{3}{Dhall Schema}
+\n{2}{Dhall Configuration Schema}\label{sec:configuration}

 The configuration schema was at first being developed as part of the main
 project's repository, before it was determined that it would benefit both the
 development and overall clarity if the schema lived in its own repository (see
-Section~\ref{sec:repos} for details). This enabled it to be independently
-developed and versioned, and only pulled into the main application whenever it
-is determined the application is ready for it.
+Section~\ref{sec:repos} for details). This now enables the schema to be
+independently developed and versioned, and only pulled into the main
+application whenever the application is determined to be ready for it.

-The full schema with type annotations can be seen in Listing~\ref{dhallschema}.
-The \texttt{let} statement declares a variable called \texttt{Schema} and
-assigns it the result of the expression on the right side of the equals sign,
-which has for practical reasons been trimmed and is displayed without the
-\emph{default} block, which is instead shown in its own
-Listing~\ref{dhallschemadefaults}.

-\vspace{\parskip}
+% \vspace{\parskip}
+\smallskip
+% \vspace{\baselineskip}
 \begin{lstlisting}[language=Haskell, caption={Dhall configuration schema version 0.0.1-rc.2},
 label=dhallschema, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
 let Schema =
@ -1055,8 +1283,16 @@ let Schema =
 \end{lstlisting}
 \vspace*{-\baselineskip}

-The main configuration is comprised of both raw attributes and child records,
-which allow for grouping of related functionality. For instance, configuration
+The full schema with type annotations can be seen in Listing~\ref{dhallschema}.
+
+The \texttt{let} statement declares a variable called \texttt{Schema} and
+assigns to it the result of the expression on the right side of the equals
+sign, which has for practical reasons been trimmed and is displayed without the
+\emph{default} block. The default block is instead shown in its own
+Listing~\ref{dhallschemadefaults}.
+
+The main configuration comprises both raw attributes and child records, which
+allow for grouping of related functionality. For instance, configuration
 settings pertaining mailserver setup are grouped in a record named
 \textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool},
 which was deemed appropriate for an on-off switch-like functionality, with the
@ -1067,10 +1303,19 @@ while \textbf{true} is evaluated as an \emph{unbound} variable, that is, a
 variable \emph{not} defined in the current \emph{scope} and thus not
 \emph{present} in the current scope.

-\vspace{\parskip}
+Another one of Dhall's specialties is that the `$==$' and `$!=$' (in)equality
+operators \textbf{only} work on values of type \texttt{Bool}. This means, for
+example, that variables of type \texttt{Natural} (\texttt{uint}) or
+\texttt{Text} (\texttt{string}) cannot be compared directly as in other
+languages. The comparison is either left to a higher-level language (such as
+Go) or, from the perspective of the Dhall authors, \emph{enums} are promoted
+when the value matters.
+
+\newpage
+% \vspace{\parskip}
 \begin{lstlisting}[language=Haskell, caption={Dhall configuration defaults for
 schema version 0.0.1-rc.2},
-label=dhallschemadefaults, basicstyle=\linespread{0.9}\scriptsize\ttfamily]
+label=dhallschemadefaults, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
  , default =
      -- | have sane defaults.
      { Host = ""
@ -1122,8 +1367,7 @@ label=dhallschemadefaults, basicstyle=\linespread{0.9}\scriptsize\ttfamily]
      , Init =
        { CreateAdmin =
            -- | if this is True, attempt to create a user with admin
-            -- | privileges with the password specified below (or better -
-            -- | overriden); it fails if users already exist in the DB.
+            -- | privileges with the password specified below
            False
        , AdminPassword =
            -- | used for the first admin, forced change on first login.
@ -1135,71 +1379,9 @@ label=dhallschemadefaults, basicstyle=\linespread{0.9}\scriptsize\ttfamily]

 in  Schema
 \end{lstlisting}
-
-Another one of specialties of Dhall is that $==$ and $!=$ equality operators
-only work on values of type \texttt{Bool}, which for example means that
-variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text}
-(\texttt{string}) cannot be compared directly as in other languages, which
-either leaves the work for a higher-level language (such as Go), or from the
-perspective of the Dhall authors, \emph{enums} are promoted when the value
-matters.
-
-
-\n{3}{Safety considerations}
-
-Having a programmable configuration language that understands functions and
-allows importing not only arbitrary text from random internet URLs, but also
-importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
-is important that there are some safety mechanisms employed, which can be
-relied on by the user. Dhall offers this in multiple features: enforcing a
-same-origin policy and (optionally) pinning a cryptographic hash of the value
-of the expression being imported.
-
-
-\n{3}{Possible alternatives}
-
-While developing the program, the author has also
-come across certain shortcomings of Dhall, namely long start-up with \emph{cold
-cache}, which can generally be observed in the scenario of running the program
-in an environment that does not allow to write the cache files (a read-only
-filesystem), of does not keep the written cache files, such as a container that
-is not configured to mount a persistent volume at the pertinent location.
-
-To describe the way Dhall works when performing an evaluation, it resolves
-every expression down to a combination of its most basic types (eliminating all
-abstraction and indirection) in the process called
-\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the
-host's cache. The \texttt{dhall-haskell} binary attempts to resolve the
-variable \texttt{\$\{XDG\_CACHE\_HOME\}} (have a look at \emph{XDG Base
-Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the
-results of the normalisation will be written for repeated use. Do note that
-this behaviour has been observed on a GNU/Linux host and the author has not
-verified this behaviour on a non-GNU/Linux host, such as FreeBSD.
-
-If normalisation is performed inside an ephemeral container (as opposed to, for
-instance, an interactive desktop session), the results effectively get lost on
-each container restart, which is both wasteful and not great for user
-experience, since the normalisation of just a handful of imports (which
-internally branches widely) can take an upwards of two minutes, during which
-the user is left waiting for the hanging application with no reporting on the
-progress or current status.
-
-While workarounds for the above mentioned problem can be devised relatively
-easily (such as bind mounting persistent volumes inside the container in place
-of the \texttt{\$\{XDG\_CACHE\_HOME\}/dhall} and
-\texttt{\$\{XDG\_CACHE\_HOME\}/dhall-haskell} to preserve the cache between
-restarts, or let the cache be pre-computed during container build, since the
-application is only really expected to run together with a compatible version
-of the configuration schema and this version \emph{is} known at container build
-time), it would certainly feel better if there was no need to work
-\emph{around} the configuration system of choice.
-
-Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
-as a potentially almost drop-in replacement for Dhall feature-wise, while also
-resolving costly \emph{cold cache} normalisation operations, which is in
-author's view Dhall's titular issue.
-
-
+\vspace*{-\baselineskip}
+\vspace*{-\baselineskip}
+\vspace*{-\baselineskip}
 \n{2}{Data integrity and authenticity}

 The user can interact with the application via a web client, such as a browser,
@ -1243,194 +1425,20 @@ e.g.\ for tamper protection purposes and similar; however, that work remains
 yet to be materialised.


-\n{2}{User isolation}
-
-Users are allowed into certain parts of the application based on the role they
-currently posses. For the moment, two basic roles were envisioned, while this
-list might get amended in the future, if the need arises:
-
-\begin{itemize}
-  \item Administrator
-  \item User
-\end{itemize}
-
-It is paramount that the program protects itself from the insider threats as
-well and therefore each role is only able to perform actions that it is
-explicitly assigned. While there definitely is certain overlap between the
-capabilities of the two outlined roles, each also possesses unique features
-that the other does not.
-
-For example, the administrator role is not able to perform searches on the
-breach data directly using their administrator account, for that a separate
-user account has to be devised. Similarly, the regular user is not able to
-manage breach lists and other users, because that is a privileged operation.
-
-In-application administrators are not able to view sensitive (any) user data
-and should therefore only be able to perform the following actions:
-
-\begin{itemize}
-  \item Create user accounts
-  \item View list of users
-  \item View user email
-  \item Change user email
-  \item Change user email
-  \item Toggle whether user is an administrator
-  \item Delete user accounts
-\end{itemize}
-
-Let us consider a case when a user manages self, while demoting from
-administrator to a regular user is permitted, promoting self to be an
-administrator would constitute a \emph{privilege escalation} and likely be a
-precursor to at least a \emph{denial of service} of sorts.
-
-
-\n{2}{Zero trust principle}
-
-\textit{Data confidentiality, i.e.\ not trusting the provider}
-
-There is no way for the application (and consequently, the in-application
-administrator) to read user's data. This is possible by virtue of encrypting
-the pertinent data before saving them in the database by a state-of-the-art
-\emph{age} key~\cite{age} (backed by X25519~\cite{x25519rfc7748}), which in
-turn is safely stored encrypted by a passphrase that only the user controls. Of
-course, the user-supplied password is run by a password based key derivation
-function (PBKDF: a key derivation function with a sliding computational cost)
-before letting it encrypt the \emph{age} key.
-
-The \emph{age} key is only generated when the user changes their password for
-the first time to prevent scenarios such as in-application administrator with
-access to physical database being able to both \textbf{recover} the key from
-the database and \textbf{decrypt} it given that they already know the user
-password (because they set it), which would subsequently give them unbounded
-access to any future encrypted data, as long as they would be able to maintain
-their database access. This is why the \emph{age} key generation and protection
-are bound to the first password change. Of course, the evil administrator could
-just perform the change themselves; however, the user would at least be able to
-find those changes in the activity logs and know not to use the application.
-But given the scenario of a total database compromise, the author finds all
-hope is already lost at that point. At least when the database is dumped, it
-only contains non-sensitive, functional information in plain test, everything
-else should be encrypted.
-
-Consequently, both the application operators and the in-application
-administrators should never be able to learn the details of what the user is
-tracking, the same being applicable even to potential attackers with direct
-access to the database. Thus the author maintains that every scenario that
-could potentially lead to a data breach (apart from a compromised user machine
-and the like) would have to entail some form of operating memory acquisition,
-for instance using \texttt{LiME}~\cite{lime}, or perhaps directly the
-\emph{hypervisor}, if considering a virtualised (``cloud'') environments.
-
-
-\n{2}{Compromise Monitoring}
-
-\n{3}{Have I Been Pwned? Integration}
-
-Troy Hunt's Have I Been Pwned? online service
-(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
-compromised data. The service offers private APIs that are protected by API
-keys. The application's \texttt{hibp} module and database representation models
-the values returned by this API, which allows searching in large breaches using
-email addresses.\\
-The architecture there is relatively simple: the application administrator
-configures an API key for HIBP, the user enters the query parameters, the
-application constructs a query and calls the API and waits for a response. As
-the API is rate-limited based on the key supplied, this can pose an issue and
-it has not been fully resolved in the UI. The application then parses the
-returned data and binds it to the local model for validation. If that goes
-well, the data is saved into the database as a cache and the search query is
-performed on the saved data. If it returns anything, it is displayed to the
-user for browsing.
-
-
-\n{3}{Local Dataset Plugin} Breach data from locally available datasets can be
-imported into the application by first making sure it adheres to the specified
-schema (have a look at the \emph{breach data schema} in
-Listing~\ref{breachDataGoSchema}). If it doesn't (which is very likely with
-random breach data), it needs to be converted to a form that does before
-importing it to the application, e.g.\ using a Python script or similar.
-Attempting to import data that does not follow the outlined schema would result
-in an error. Also, importing a dataset which is over a reasonable size limit
-would by default be rejected by the program as a precaution, since marshaling
-e.g.\ a 1 TiB document would likely result in an OOM situation on the host,
-assuming regular consumer hardware conditions, not HPC.
-
-\vspace{\parskip}
-\begin{lstlisting}[language=Go, caption={Breach Data Schema represented as a Go struct with imports from the standard library are assumed},
-label=breachDataGoSchema]
-    type breachDataSchema struct {
-        Name              string
-        Time              time.Time
-        IsVerified        bool
-        ContainsPasswords bool
-        ContainsHashes    bool
-        HashType          string
-        HashSalted        bool
-        HashPepperred     bool
-        ContainsUsernames bool
-        ContainsEmails    bool
-        Data              any
-    }
-\end{lstlisting}
-\vspace*{-\baselineskip}
-
-The Go representation shown in Listing~\ref{breachDataGoSchema} will in
-actuality be written and supplied by the user of the program as a YAML
-document. YAML was chosen for multiple reasons: relative ease of use (plain
-text, readable, support for inclusion of comments, its capability to store
-multiple \emph{documents} inside of a single file with most of the inputs
-implicitly typed as strings while thanks to being a superset of JSON it sports
-machine readability. That should allow for documents similar to what can be
-seen in Listing~\ref{breachDataYAMLSchema} to be ingested by the program,
-read and written by humans and programs alike.
-
-\smallskip
-\begin{lstlisting}[language=YAML, caption={Example Breach Data Schema supplied
-to the program as a YAML file, optionally containing multiple documents},
-label=breachDataYAMLSchema]
-    ---
-    name: Horrible breach
-    time: 2022-04-23T00:00:00Z+02:00
-    isVerified: false
-    containsPasswds: false
-    containsHashes: true
-    containsEmails: true
-    hashType: md5
-    hashSalted: false
-    hashPeppered: false
-    data:
-      hashes:
-        - hash1
-        - hash2
-        - hash3
-      emails:
-        - email1
-        - 
-        - email3
-    ---
-    # document #2, describing another breach.
-    name: Horrible breach 2
-    ...
-\end{lstlisting}
-\vspace*{-\baselineskip}
-
-Notice how the emails list in Listing~\ref{breachDataYAMLSchema} misses one
-record, perhaps because it was not supplied or mistakenly omitted. This is a
-valid scenario (mistakes happen) and the application needs to be able to handle
-it. The alternative would be to require the user to prepare the data in such a
-way that the empty/partial records would be dropped entirely.
-
-
-\n{2}{Database configuration}
+\n{2}{Database schema}\label{sec:dbschema}

 The database schema is not being created manually in the database. Instead, an
 Object-relational Mapping (ORM) tool named ent is used, which allows defining
-the table schema and relations entirely in Go.
+the table schema and relations entirely in Go. The database schema does not
+need to be pre-created when the application starts; in that case, the
+application only requires a connection string providing access to the database
+for a reasonably privileged user.

-The best part about ent is that there is no need to define supplemental methods
-on the models, since with ent these are meant to be \emph{code generated} (in
-the older sense of word, not with Large Language Models). That creates files
-with models based on the types of the attributes in the database model, and the
+The best part about \texttt{ent} is that there is no need to define
+supplemental methods on the models, as with \texttt{ent} these are meant to be
+\emph{code generated} (in the older sense of the word, not with Large Language
+Models) into existence. Code generation creates files with actual Go models
+based on the types of the attributes in the database schema model, and the
 respective relations are transformed into methods on the receiver or functions
 taking object attributes as arguments.

@ -1449,90 +1457,164 @@ These methods can further be imported into other packages and this makes
 working with the database a morning breeze.


-\n{1}{Production}
+\n{1}{Deployment}

-It is, of course, recommended that the application runs in a secure environment
-\allowbreak although definitions of that almost certainly differ depending on
-who you ask. General recommendations would be either to effectively reserve a
-machine for a single use case - running this program - so as to dramatically
-decrease the potential attack surface of the host, or run the program isolated
-in a container or a virtual machine. Further, if the host does not need
-management access (it is a deployed-to-only machine that is configured
-out-of-band, such as with a \emph{golden} image/container or declaratively with
-Nix), then an SSH \emph{daemon} should not be running in it, since it is not
-needed. In an ideal scenario, the host machine would have as little software
-installed as possible besides what the application absolutely requires.
+\textbf{TODO}: mention how \texttt{systemd} aids in running the pod.

-A demonstration of the above can be found in the multi-stage Containerfile that
-is available in the main sources. The resulting container image only contains a
-statically linked copy of the program, a default configuration file and
-corresponding Dhall expressions cached at build time, which only support the
-main configuration file. Since the program also needs a database, an example
-scenario could include the container being run in a Podman pod together with
-the database, which would not have to be exposed from the pod and would
-therefore only be available over \texttt{localhost}.
+A deployment setup as suggested in Section~\ref{sec:deploymentRecommendations}
+is already partially covered by the multi-stage \texttt{Containerfile} that is
+available in the main sources. Once built, the resulting container image only
+contains a handful of things it absolutely needs:

-It goes without saying that the operator should substitute values of any
-default configuration secrets with the new ones that were securely generated.
+\begin{itemize}
+  \item a statically linked copy of the program
+  \item a default configuration file and corresponding Dhall expressions cached
+    at build time
+  \item a recent CA certs bundle
+\end{itemize}

-System-wide cryptographic policies should target highest feasible security
-level, if at all available (such as by default on Fedora or RHEL), covering
-SSH, DNSSec, IPsec, Kerberos and TLS protocols. Firewalls should be configured
-and SELinux (kernel-level mandatory access control and security policy
-mechanism) running in \emph{enforcing} mode, if available.
+Since the program also needs a database for proper functioning, an example
+scenario includes the application container being run in a Podman \textbf{pod}
+together with the database. That results in not having to expose the database
+to the entire host or out of the pod at all; it is only available over the
+pod's \texttt{localhost}.
+
+It goes without saying that the default values of any configuration secrets
+should be substituted by the application operator with new, securely generated
+ones.


-\n{2}{Deployment recommendations}
+\n{2}{Rootless Podman}

-\n{3}{Transport security}
+Assuming rootless Podman is set up and the \texttt{just} tool is installed on
+the host, the application could be deployed by following a series of relatively
+simple steps:

-User connecting to the application should rightfully expect for their data to
-be protected \textit{in transit} (i.e.\ on the way between their browser and
-the server), which is what \emph{Transport Layer Security} family of
-protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
-of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
-client authenticate the server (verify that it is who it claims it is) and
-negotiate a symmetric key for encryption in the process named the \emph{TLS
-handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
-which is establishing a secure communications connection. The operator should
-configure the program to either directly utilise TLS using configuration or
-have it listen behind a TLS-terminating \emph{reverse proxy}.
+\begin{itemize}
+  \item build (or pull) the application container image
+  \item create a pod with user namespacing, exposing the application port
+  \item run the database container inside the pod
+  \item run the application inside the pod
+\end{itemize}

+In concrete terms, it would resemble something along the lines of
+Listing~\ref{podmanDeployment}. Do note that all the commands are executed
+under the unprivileged \texttt{user@containerHost} that is running rootless
+Podman, i.e.\ it has \texttt{UID}/\texttt{GID} mapping entries in the
+\texttt{/etc/subuid} and \texttt{/etc/subgid} files \textbf{prior} to running
+any Podman commands.

-\n{3}{Containerisation}
-Whether the pre-built or a custom container image is used to deploy the
-application, it still needs access to secrets, such as database connection
-string (containing database host, port, user, password/encrypted password,
-authentication method and database name).
+% \newpage

-Currently, the application is able to handle \emph{peer}, \emph{scram-sha-256},
-\emph{user name maps} and raw \emph{password} as Postgres authentication
-methods~\cite{pgauthmethods}, although the \emph{password} option should not be
-used in production, \emph{unless} the connection to the database is protected
-by TLS.\ In any case, using the \emph{scram-sha-256}~\cite{scramsha256rfc7677}
-method is preferable and one way to verify in development environment that
-everything works as intended is the \emph{Password generator for PostgreSQL}
-tool~\cite{goscramsha256}, which allows to get the encrypted string from a raw
-user input.
+\begin{lstlisting}[language=bash, caption={Example application deployment using
+rootless Podman},
+label=podmanDeployment, basicstyle=\linespread{0.9}\small\ttfamily]
+# From inside the project folder, build the image locally using kaniko.
+just kaniko

-If the application running in a container wants to use the \emph{peer}
-authentication method, it is up to the operator to supply the Postgres socket
-to the application (e.g.\ as a volume bind mount). This scenario was not
-tested; however, and the author is also not entirely certain how \emph{user
-namespaces} (on GNU/Linux) would influence the process (given that the
-\emph{ID}s of a user \textbf{outside} the container are mapped to a range of
-\emph{UIDs} \textbf{inside} the container), for which the setup would likely
-need to account.
+# Create a pod.
+podman pod create --userns=keep-id -p3005:3000 --name pcmt

-Equally, if the application is running inside the container, the operator needs
-to make sure that the database is either running in a network that is also
-directly attached to the container or that there is a mechanism in place that
-routes the requests for the database hostname to the destination.
+# Run the database in the pod.
+podman run --pod pcmt --replace -d --name "pcmt-pg" --rm \
+    -e POSTGRES_INITDB_ARGS="--auth-host=scram-sha-256 \
+         --auth-local=scram-sha-256" \
+    -e POSTGRES_PASSWORD=postgres -v $PWD/tmp/db:/var/lib/postgresql/data \
+    docker.io/library/postgres:15.2-alpine3.17

-One such mechanism is container name based routing inside \emph{pods}
-(Podman/Kubernetes), where the resolution of container names is the
-responsibility of a specially configured piece of software called Aardvark for
-the former and CoreDNS for the latter.
+# Run the application in the pod.
+podman run --pod pcmt --replace --name pcmt-og -d --rm \
+    -e PCMT_LIVE=False \
+    -e PCMT_DBTYPE="postgres" \
+    -e PCMT_CONNSTRING="host=pcmt-pg port=5432 sslmode=disable \
+         user=postgres dbname=postgres password=postgres" \
+    -v $PWD/config.dhall:/config.dhall:ro \
+    docker.io/immawanderer/pcmt:testbuild -config /config.dhall
+\end{lstlisting}
+
+To summarise Listing~\ref{podmanDeployment}: first, the application container
+image is built from inside the project folder using \texttt{kaniko}.
+Alternatively, the container image could be pulled from the container
+repository, but it makes more sense to show the image being built from sources,
+since the listing uses the \texttt{:testbuild} tag.
+
+Next, a \emph{pod} is created and given a name, setting the port binding for
+the application. Then, the database container is started inside the pod.
+
+As a final step, the application container itself is run inside the pod. The
+application configuration file named \texttt{config.dhall}, located in
+\texttt{\$PWD}, is mounted read-only into the container as
+\texttt{/config.dhall}, providing the application with its configuration. The
+default container image does contain a default configuration for reference;
+however, running the container as is, without additional configuration, would
+fail, as the bundled configuration does not contain the necessary secrets.
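+
+As a quick check that both containers actually came up, the state of the pod
+can be inspected directly. The following is a minimal sketch that assumes the
+pod and container names used in Listing~\ref{podmanDeployment}; the listing
+label is introduced here for illustration only, and no output is shown since it
+differs between hosts:
+
+\begin{lstlisting}[language=bash, caption={Checking the state of the pod and
+its containers (sketch)}, label=podmanPodStatus,
+basicstyle=\linespread{0.9}\small\ttfamily]
+# List the pods together with the number of containers in each.
+podman pod ps
+
+# List the containers that belong to the pcmt pod.
+podman ps --pod --filter pod=pcmt
+
+# Show the application's log output so far (the container has to be
+# running, as it was started with --rm and is removed on exit).
+podman logs pcmt-og
+\end{lstlisting}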
+
+\n{3}{Sanity checks}
+
+Note also that the application connects to the database using its
+\emph{container} name, i.e.\ not the IP address. That is possible thanks to
+Podman setting up DNS inside the pod in such a way that all containers in the
+pod can reach each other using their (container) names. Interestingly,
+connecting via \texttt{localhost} would also work, as from inside the pod, any
+container in the pod can reach any other container in the same pod via the
+pod's \texttt{localhost}.
+In fact, \emph{pinging} the database or application containers from an ad-hoc
+\texttt{alpine} container added to the pod yields:
+
+\vspace{\parskip}
+\begin{lstlisting}[language=bash, caption={Pinging pod containers using their
+names}, label=podmanPing, basicstyle=\linespread{0.9}\small\ttfamily]
+user@containerHost % podman run --rm -it --user=0 --pod=pcmt \
+    docker.io/library/alpine:3.18
+/ # ping -c2 pcmt-og
+PING pcmt-og (127.0.0.1): 56 data bytes
+64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.072 ms
+64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.118 ms
+
+--- pcmt-og ping statistics ---
+2 packets transmitted, 2 packets received, 0% packet loss
+round-trip min/avg/max = 0.072/0.095/0.118 ms
+/ # ping -c2 pcmt-pg
+PING pcmt-pg (127.0.0.1): 56 data bytes
+64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.045 ms
+64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.077 ms
+
+--- pcmt-pg ping statistics ---
+2 packets transmitted, 2 packets received, 0% packet loss
+round-trip min/avg/max = 0.045/0.061/0.077 ms
+/ #
+\end{lstlisting}
+\vspace*{-\baselineskip}
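+
+Since the containers in a pod share its network namespace, the connection
+string from Listing~\ref{podmanDeployment} could presumably address the
+database via \texttt{localhost} as well. The fragment below is an untested
+sketch of that variant, in which only the \texttt{host} value differs; the
+listing label is introduced here for illustration only:
+
+\begin{lstlisting}[language=bash, caption={Connection string variant using the
+pod's localhost (sketch)}, label=podmanConnLocalhost,
+basicstyle=\linespread{0.9}\small\ttfamily]
+# Fragment of the application's "podman run" command: same settings as
+# before, except that the database is addressed via the pod's localhost.
+    -e PCMT_CONNSTRING="host=localhost port=5432 sslmode=disable \
+         user=postgres dbname=postgres password=postgres" \
+\end{lstlisting}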
+
+The pod created in Listing~\ref{podmanDeployment} only sets a binding for the
+port used by the application (\texttt{3000/tcp} inside the pod, published as
+\texttt{5005/tcp} on the host). The Postgres default port \texttt{5432/tcp} is
+not among the pod's port bindings, as can be seen in the pod creation command.
+This can also easily be verified using the command in
+Listing~\ref{podmanPortBindings}:
+
+\begin{lstlisting}[language=bash, caption={Podman pod port bindings},
+label=podmanPortBindings, basicstyle=\linespread{0.9}\small\ttfamily]
+user@containerHost % podman pod inspect pcmt \
+    --format="Port bindings: {{.InfraConfig.PortBindings}}\n\
+    Host network: {{.InfraConfig.HostNetwork}}"
+Port bindings: map[3000/tcp:[{ 5005}]]
+Host network: false
+\end{lstlisting}
+\vspace*{-\baselineskip}
+
+To be absolutely sure, trying to connect to the database from outside of the
+pod (i.e.\ from the container host) should \emph{fail}, unless, of course,
+there is another process listening on that port:
+
+\begin{lstlisting}[language=bash, caption={In-pod database is unreachable from
+the host}, breaklines=true, label=podDbUnreachable,
+basicstyle=\linespread{0.9}\small\ttfamily]
+user@containerHost % curl localhost:5432
+--> curl: (7) Failed to connect to localhost port 5432 after 0 ms: Couldn't connect to server
+\end{lstlisting}
+\vspace*{-\baselineskip}
+
+The error in Listing~\ref{podDbUnreachable} is expected, as it is the result of
+the database port not being exposed from the pod.
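+
+Conversely, the application itself should be reachable from the container host
+through the published port. A minimal sketch, assuming the host port
+\texttt{5005} reported in Listing~\ref{podmanPortBindings} and that the
+application answers plain HTTP requests on the root path (both are assumptions
+made for illustration, as is the listing label):
+
+\begin{lstlisting}[language=bash, caption={Checking that the application is
+reachable from the host (sketch)}, label=podAppReachable,
+basicstyle=\linespread{0.9}\small\ttfamily]
+# Print only the HTTP status code of the response (output omitted here).
+user@containerHost % curl -sS -o /dev/null -w '%{http_code}\n' \
+    http://localhost:5005/
+\end{lstlisting}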


 \n{1}{Validation}
@ -1541,19 +1623,15 @@ the former and CoreDNS for the latter.

 Unit testing is a hot topic for many people and the author does not count
 himself a staunch supporter of either extreme. The ``no unit tests'' stance
-seems to discount any benefit there is to unit testing, while a `` TDD-only''
-(TDD, or Test Driven Development is a development methodology whereby tests are
-written first, then a complementary piece of code that is supposed to be
-tested, just enough to get past the compile errors and to see the test fail,
-then the code is refactored to make the test pass and then it can be fearlessly
-extended because the test is the safety net catching us when the user slips and
-alters the originally intended behaviour) approach can be a little too much for
-some people's taste. The author tends to sport a \emph{middle ground} approach
-here, with writing enough tests where meaningful but not necessarily testing
-everything or writing tests prior to code, although arguably that practice
-should result in writing a \emph{better} designed code, particularly because
-there has to be a prior though about it because it needs to be tested
-\emph{first}.
+seems to discount any benefit there is to unit testing, while a
+``TDD-only''\footnotemark{} approach can be a little too much for some people's
+taste. The author tends to prefer a \emph{middle ground} approach in this
+particular case, i.e.\ writing enough tests where meaningful but not
+necessarily testing everything or writing tests prior to business logic code.
+Arguably, following the practice of TDD should result in \emph{better designed}
+code, particularly because prior thought has to be given to the shape and
+function of the code, as it is tested for before it is even written, but it
+adds a slight inconvenience to what is otherwise a straightforward process.

 Thanks to Go's built-in support for testing via its \texttt{testing} package
 and the tooling in the \texttt{go} tool, writing tests is relatively simple. Go
@ -1578,6 +1656,15 @@ informing the developer that no tests were found, which is handy to learn if it
 was not intended/expected. When compiling regular source code, the Go files
 with \texttt{\_test} in the name are simply ignored by the build tool.

+\footnotetext{TDD, or Test Driven Development, is a development methodology
+  whereby tests are written \emph{first}, then a complementary piece of code
+  that is supposed to be tested is added, just enough to get past the compile
+  errors and to see the test \emph{fail}, and only then is the code
+  refactored to make the test \emph{pass}. The code can then be fearlessly
+extended because the test is the safety net catching the programmer when the
+mind slips and alters the originally intended behaviour of the code.}
+
+
 \n{2}{Integration tests}

 Integrating with external software, namely the database in case of this
@ -1724,26 +1811,29 @@ by \emph{Let's Encrypt}\allowbreak issued, short-lived, ECDSA
 a testing instance; therefore, limits to prevent abuse might be imposed.


+\n{3}{Deployment validation}
+
+TODO: show the results of testing the app in prod using
+\url{https://testssl.sh/}.
+
+
 % =========================================================================== %
 \nn{Conclusion}

 The objectives of the thesis have been to create the Password Compromise
 Monitoring Tool aimed at security-conscious users in order to validate their
-assumptions on the security of their credentials. The thesis opened by
-introducing common terminology and continued with a dive into cryptography
-topics such as encryption, Diffie-Hellman key distribution scheme and briefly
-mentioned TLS. Furthermore, it discussed the inner workings of browsers and the
-protocols that underpin them.
+assumptions on the security of their credentials. The thesis opened by diving
+into cryptography topics such as encryption and briefly mentioned TLS.

 Additionally, security mechanisms such as Site Isolation and Content Security
-Policy, commonly employed by mainstream browsers of today, were
-introduced and the reader learnt how Content Security Policy is easily and
-dynamically configured.
+Policy, commonly employed by mainstream browsers of today, were introduced and
+the reader learnt how Content Security Policy is easily and dynamically
+configured.

 An extensive body of the thesis then revolved around the practical part,
-describing everything from tooling used through application high-level-view
-architecture to implementation of specific parts of the application across the
-stack.
+describing everything from the tooling used, through a high-level view of the
+application's architecture, to the implementation of specific parts of the
+application across the stack.

 Finally, the practical part concluded by broadly depicting validation
 methods used to verify if the application worked correctly.