% =========================================================================== %
% Encoding: UTF-8 (žluťoučký kůň úpěl ďábelské ódy)
% =========================================================================== %
% =========================================================================== %
\nn{Introduction}
Introduce the goals and the methods attempted to achieve the goals.
% =========================================================================== %
\part{Theoretical part}
\n{1}{Terminology}
\n{2}{Linux}
The term \emph{Linux} is exclusively used in the meaning of the
Linux kernel~\cite{linux}.
\n{2}{GNU/Linux}
As far as a Linux-based operating system is concerned, the term ``GNU/Linux''
as defined by the Free Software Foundation~\cite{fsfgnulinux} is used. While it
is longer and arguably a little cumbersome, the author holds the opinion that
this term describes its actual target more accurately. Since many people
conflate the complete operating system with its (albeit core) component, the
kernel, the author takes care to distinguish the two, even though, in the
author's experience, doing so colloquially tends to bring more confusion and
usually requires a lengthy explanation.
\n{2}{Containers}
When the concept of \emph{containerisation} and \emph{containers} is mentioned
throughout this work, the author has OCI containers~\cite{ocicontainers} in
mind, which are broadly a superset of \emph{Linux Containers}: a set of
processes is presented with a view of kernel resources (there are multiple
kinds of resources, such as IPC queues; network devices, stacks and ports;
mount points; process IDs; user and group IDs; Cgroups and others) that
differs for each set of processes. The idea is similar to FreeBSD
\emph{jails}~\cite{freebsdjails}, with the distinction that containers are, of
course, facilitated by the Linux kernel namespace
functionality~\cite{linuxnamespaces}, which is in turn regarded as
\emph{inspired} by the namespaces of Plan 9~\cite{plan9namespaces}, Bell Labs'
successor to Unix 8th Edition, discontinued in 2015.
While there is, without a doubt, specificity bound to using each of the
tools that enable creating (Podman vs.\ Buildah vs.\ Docker BuildX) or running
(ContainerD vs.\ runC vs.\ crun) container images, when describing an action
that gets performed with or on a container, the process should generally be
explained in such a way that it is repeatable using any spec-conforming tool
that is available and \emph{intended for the job}.
\vspace*{-\baselineskip}
\n{2}{The program}
By \emph{the program} or \emph{the application} without any additional context
the author usually means the Password Compromise Monitoring Tool program.
\n{1}{Cryptography primer}\label{sec:cryptographyprimer}
\n{2}{Encryption}
Encryption is the process of transforming certain data, called a
\emph{message}, using, as Aumasson writes in Serious Cryptography, ``an
algorithm called a \emph{cipher} and a secret value called the
key''~\cite{seriouscryptography}. Its purpose is to protect the said message so
that only its intended recipients that know/hold the key are able to
\emph{decipher} and read it.
\n{3}{Symmetric encryption}
Symmetric encryption is simply when the same \emph{key} is used to facilitate
both the encryption and decryption operations.
\n{3}{Asymmetric encryption}
Asymmetric encryption is different from symmetric encryption in that there are
now two keys in use: a key \emph{pair}. One part is used solely for
encryption, while the other part's only purpose is to decrypt. This notion of
two keys is generally transposed to a domain called \emph{public key
cryptography}, whereby the decryption component is declared private and the
encryption component is called \emph{public}, hence the name. The rationale is
that everybody can encrypt messages \emph{for} the recipient but only they are
able to \emph{decrypt} them, which is a feature allowed by the mathematical
complementarity of the two components, and also explains why the private key
should be kept \emph{private}. Compared to symmetric encryption, this variant
is generally slower.
\n{3}{The key exchange problem}
Suppose a communication scheme that is protected by a pre-shared secret.
In order to establish secure communications, this secret needs to be
distributed to the other party via untrusted channels. In 1976 Whitfield Diffie
and Martin Hellman published a paper in which they devised a \emph{public-key
distribution scheme}, which allows the two parties to arrive at a shared secret
by exchanging information over insecure channels, even in the presence of an
eavesdropper. This scheme (or its variations) is in use to this day.
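To illustrate the idea with today's tooling, the following sketch in Go (the
language of the practical part) performs the elliptic-curve variant of the
exchange (X25519) using the \texttt{golang.org/x/crypto/curve25519} package;
this is a demonstration only and not part of the program itself.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={A sketch of an X25519 key exchange},
label=dhsketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/curve25519"
)

func main() {
	// Each party generates a random private scalar that never
	// leaves their machine.
	alicePriv := make([]byte, curve25519.ScalarSize)
	bobPriv := make([]byte, curve25519.ScalarSize)
	if _, err := rand.Read(alicePriv); err != nil {
		panic(err)
	}
	if _, err := rand.Read(bobPriv); err != nil {
		panic(err)
	}

	// The corresponding public values may be exchanged over an
	// untrusted channel.
	alicePub, _ := curve25519.X25519(alicePriv, curve25519.Basepoint)
	bobPub, _ := curve25519.X25519(bobPriv, curve25519.Basepoint)

	// Combining one's own private scalar with the peer's public
	// value yields the same shared secret on both sides.
	aliceShared, _ := curve25519.X25519(alicePriv, bobPub)
	bobShared, _ := curve25519.X25519(bobPriv, alicePub)

	fmt.Println(bytes.Equal(aliceShared, bobShared)) // true
}
\end{lstlisting}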
\n{2}{Hash functions}
Hash functions are cryptographic algorithms used to help with a number of
things: integrity verification, password protection, digital signatures,
public-key encryption and others. Hashes are used in forensic analysis to prove
authenticity of digital artifacts, to uniquely identify a change-set within
revision-based source code management systems such as Git, Subversion or
Mercurial, to detect known-malicious software by anti-virus programs or by
advanced filesystems in order to verify block integrity and enable repairs, and
also in many other applications that each person using a modern computing
device has come across, such as when connecting to a website protected by the
famed HTTPS.
The popularity stems from a common use case: the need to identify a chunk of
data. Of course, two chunks of data, two files, frames or packets could always
be compared bit by bit, but that can get prohibitive from both the cost and
energy points of view relatively quickly. That is where hash functions come in,
since they are able to take a long input and produce a short output, named a
digest or a hash value. It does not work the other way around: a file
cannot be reconstructed from the hash digest alone, as a hash function is a
one-way function.
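A minimal Go demonstration of this property, using the standard library's
SHA-256 implementation:
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Hashing data with SHA-256},
label=hashsketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// An input of arbitrary length...
	data := []byte("an arbitrarily long chunk of data")

	// ...is reduced to a fixed-size, 32-byte digest.
	digest := sha256.Sum256(data)

	// The digest identifies the input, but the input cannot be
	// reconstructed from it.
	fmt.Printf("%x\n", digest)
}
\end{lstlisting}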
\n{3}{Rainbow tables}
Since in more responsible scenarios passwords are stored not directly but as
hashes, attackers interested in recovering the passwords really
only have one option (apart from finding a critical vulnerability in the hash
function): rainbow tables. Rainbow tables are lists of pre-computed hashes
paired with the passwords that were used to create them. When attackers gain
access to a password breach that contains hashes, all it takes is to find a
match within the rainbow table and reversely resolve that to the known
message: the password.
One of the popular counter-measures to pre-computed tables is adding a
\emph{salt} to the user-provided password before passing it to the KDF (Key
Derivation Function) or the hash function. Of course, the salt should be random
\textbf{per-user} and not reused, as that would mean that two users with the
same password would still end up with the same hash, and the salt should also
be adequately long to be effective. As the salt is supposed to be
\emph{random}, it would be a good idea to use an actual CSPRNG, such as
\textbf{Fortuna}~\cite{fortuna} as a source of entropy (randomness). In
FreeBSD, Fortuna is in fact the one serving \texttt{/dev/random}.
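A sketch of the salting counter-measure in Go follows, using a random per-user
salt obtained from \texttt{crypto/rand} (which reads from the operating
system's CSPRNG) and the Argon2id KDF from
\texttt{golang.org/x/crypto/argon2}; the KDF parameters shown are illustrative
only.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Salted key derivation sketch},
label=saltsketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/argon2"
)

func main() {
	password := []byte("correct horse battery staple")

	// A fresh, random, per-user salt defeats pre-computed rainbow
	// tables: the same password now hashes differently per user.
	salt := make([]byte, 16)
	if _, err := rand.Read(salt); err != nil {
		panic(err)
	}

	// Derive the value that would actually be stored; the salt is
	// kept alongside it in plain form for later verification.
	key := argon2.IDKey(password, salt, 1, 64*1024, 4, 32)

	fmt.Printf("salt: %x\nkey:  %x\n", salt, key)
}
\end{lstlisting}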
\n{3}{TLS}\label{sec:tls}
The Transport Layer Security protocol (or TLS) serves as an encryption and
\emph{authentication} protocol to secure internet communications. An important
part of the protocol is the \emph{handshake}, during which the two communicating
parties exchange messages that acknowledge each other's presence, verify each
other, choose what cryptographic algorithms will be used and decide session
keys. As there are multiple versions of the protocol in active use even at the
moment, the server together with the client need to agree upon the version they
are going to use (these days it should be 1.2 or 1.3), pick cipher suites
(TLSv1.3 dramatically reduced the number of available suites), the client
verifies the server's public key (and the signature of the certificate
authority that issued it) and they both generate session keys for use after
handshake completion.
The handshake consists of multiple stages (again, depending on the version), for
TLSv1.3 that would be:
\begin{itemize}
\item \textbf{Client hello}: client sends a client hello message containing
the protocol version, a list of cipher suites and the client random value.
The client in this step also includes the ephemeral Diffie-Hellman (EDH)
parameters, which are later used for calculating the pre-master key.
\item \textbf{Server generating a master secret}: the server has got the
cipher suites, the client's parameters and client random and already has
the server random, which means it can create the master secret.
\item \textbf{Server hello and ``Finished''}: the server includes in the
hello its certificate, digital signature, server random, the chosen
cipher suite, and sends a ``Finished'' (meaning \emph{ready}) message.
\item \textbf{Signature and certificate verification}: the client at this
step verifies server's certificate and signature, generates the master
secret and is ready (sends the ``Finished'' message).
\end{itemize}
At the end of the process, the connection is protected by symmetric encryption
using the session key that both parties have arrived at.
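In Go, the entire handshake described above is handled by the standard
library; a minimal sketch that enforces TLSv1.3 follows (the host name is
illustrative).
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={A TLS client sketch enforcing TLSv1.3},
label=tlssketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Dial performs the full handshake described above: version
	// and cipher suite negotiation, certificate verification and
	// session key establishment.
	conn, err := tls.Dial("tcp", "example.com:443", &tls.Config{
		MinVersion: tls.VersionTLS13, // refuse anything older.
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	state := conn.ConnectionState()
	fmt.Printf("TLS version: %#x, cipher suite: %#x\n",
		state.Version, state.CipherSuite)
}
\end{lstlisting}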
\n{1}{Passwords}\label{sec:passwords}
Passwords have been in use since ancient times; Roman sentries apparently
already used passwords or \textit{watchwords} to discern who was allowed
to enter an area, and the Roman army had a special system for distributing
passwords among the encampment members on a wooden tablet. Fast forward a
couple of thousand years: during the days of the Prohibition Era in the United
States, it was the secret ``speakeasies'' that were protecting their
illegitimate alcohol-serving business using
passwords~\cite{speakeasy}~\cite{nisthistory}.
During World War II, the US paratroopers' use of passwords evolved to
even include a counter-password.
According to McMillan, the first \textit{computer} passwords date back to
the mid-1960s Massachusetts Institute of Technology (MIT), where researchers at
the university built a massive time-sharing computer called CTSS. Apparently,
\textit{even then} the passwords did not protect the users as well as they were
expected to~\cite{mcmillan}.
Traditionally, passwords were expected to be memorised, but the large number of
password-protected \emph{services} these days can make this impractical. To
list a few common examples, access to a bank account, electronic mailbox,
personal computer encrypted disk are all protected by some form of a password.
A password often still consists of a \textit{string} of characters typed into a
prompt, but its function remains the same: as per NIST, it enables the
\textit{verifier} to infer the \textit{claimant}'s identity via a secret the
claimant holds.
There are always some arbitrary requirements applied to what the password can
be; only some of them turn out to be smarter than others.
Despite the impression given by the word ``password'', it does not need to be
an actual word; indeed, a non-word (in the dictionary sense) may be harder
to guess, which is a desirable property of passwords. A memorized secret
consisting of a sequence of words or other text separated by spaces is
sometimes called a passphrase. A passphrase is similar to a password in usage,
but the former is generally longer for added security.
\n{2}{Program-imposed constraints}
Some of the following examples might be a bit anecdotal and more of an
exception than a rule; nevertheless, when they come from a large-enough program
creator/service provider, their decisions reach a sufficient share of the
population, enough that the author will call them influential. They shape how
users think when creating passwords and affect what users expect from other
services they happen to visit and use from that point on, as well.
\n{3}{Short arbitrary length}
It has been observed that a requirement for a ``strong'' password generally
represents that a password is:
\begin{itemize}
\item longer than 7 characters,
\item shorter than 11 characters,
\item begins with a letter and ends with a number OR
\item begins with a number and ends with a letter.
\end{itemize}
\obr{Short arbitrary password length
limit~\cite{larsklint}}{fig:arbitrarypasswdlengthlimit}{.8}{graphics/arbitrarypasswdlengthlimit.jpg}
This is wrong for multiple reasons, and it is a classic example of a short
arbitrary length requirement. It essentially prevents users from using
passphrases, makes using a password manager impractical, and all of that has
apparently been done ``because of security''~\cite{etihad}. Moreover, this
might be indicative of the fact that instead of storing passwords hashed (as
they should be), they might be storing them in \textbf{plain text}.
Otherwise, what reason could exist for the limit to be 10 characters?
The recommendation of the US's National Institute of Standards and Technology
(NIST) in this regard is that verifiers should require passwords to be at least
8 characters long and should permit passwords of at least 64 characters, which
should be sufficient for most users' needs.
\n{3}{Restricting special characters}
Service providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords for as long as passwords have been
used to protect privileged access. The ways of achieving this may vary, but the
intent stays the same: preventing users from inputting characters into the
system which the system cannot comfortably handle, for ``reasons'' that are
usually something dubious along the lines of ``an apostrophe may be used in SQL
injection attacks'' or ``angle brackets may be used in XSS attacks''. Instead,
the real message this announces points right to the serious shortcomings of the
password handling of the site in question: passwords should never be
re-displayed in a context that is prone to Cross-Site Scripting (XSS), and
passwords should always be hashed before being sent to the database anyway,
leaving only alphanumeric characters and rendering the SQLi fears baseless.
\obr{Forbidden special characters in
passwords~\cite{forbiddencharacters}}{fig:forbiddencharacters}{.8}{graphics/forbiddencharacters.jpg}
Note that ``Passw0rd!'' would have been a perfectly acceptable password for the
validator displayed in Figure~\ref{fig:forbiddencharacters}.
NIST's recommendations on this are that all printing ASCII~\cite{asciirfc20}
characters as well as the space character SHOULD be acceptable in memorized
secrets and Unicode~\cite{iso10646} characters SHOULD be accepted as well.
\n{3}{Character composition requirements}
There is a tendency to come up with bad passwords when there are character
composition requirements in place, too. The reality is that instead of
creating strong passwords directly, most users first try a basic version and
then keep tweaking characters until the password ends up fulfilling the minimum
requirement.
The \emph{problem} with that is that it has been shown that people use similar
patterns, i.e.\ starting with capital letters, putting a symbol last and a
number in the last two positions. This is also known to cyber criminals
cracking passwords, who run their dictionary attacks using the common
substitutions, such as ``\$'' for ``s'', ``3'' for ``e'', ``1'' for ``l'',
``@'' for ``a'', etc. A password created in this manner will almost certainly
be bad, so all that is achieved is frustrating the user in order to still
arrive at a bad password.
\n{3}{Other common issues}
Some services do not allow users to paste into password fields (disabling
pasting using JavaScript), thereby essentially breaking password manager
functionality, which is an issue because it encourages bad password practices
such as weak passwords and, likewise, password reuse.
Another frequent issue is forced frequent password rotation. Making frequent
password rotations mandatory contributes to users developing a password
creation pattern and is further a modern-day security anti-pattern and
according to the British NCSC the practice ``carries no real benefits as stolen
passwords are generally exploited immediately''~\cite{ncsc}.
\n{1}{Web security}\label{sec:websecurity}
The internet, being a vast space of intertwined concepts and ideas, is a
superset of the Web, since not everything that is available on the internet can
be described as a web \emph{resource}. But it is precisely that part of the
internet that is discussed in the next sections, which cover what browsers are,
what they do and how they relate to web security.
\n{2}{Browsers}\label{sec:browsers}
Browsers, sometimes accompanied by the word that serves as a real tell
for their specialisation (\emph{web} browsers), are programs intended for
\emph{browsing} \emph{the web}. In more technical terms, browsers are
programs that facilitate (directly or via intermediary tools) domain name
lookups, connecting to web servers, optionally establishing a secure
connection, requesting the web page in question, determining its \emph{security
policy} and resolving what accompanying resources the web page specifies and
depending on the applicable security policy, requesting those from their
respective origins, applying stylesheets and running scripts. Constructing a
program that can speak many protocols and securely runs untrusted code from the
internet is no easy task.
\n{3}{Complexity}
Browsers these days are also quite ubiquitous programs running on
\emph{billions} of consumer grade mobile devices (which are also notorious for
bad update hygiene) or desktop devices all over the world. Regular users
usually expect them to work flawlessly with a multitude of network conditions,
network scenarios (the proverbial café WiFi, cellular data in a remote
location, home broadband that is DNS-poisoned by the ISP), differently tuned
(or commonly misconfigured) web servers, a combination of modern and
\emph{legacy} encryption schemes and different levels of conformance to web
standards from both web server and website developers. Of course, if a website
is broken, it is the browser's fault. Browsers are expected to detect if
\emph{captive portals} (a type of access control that usually tries to force
the user through a webpage with terms of use) are active and offer redirects.
All of this is immense complexity and the combination of ubiquity and great
exposure that this type of software gets is, in the author's opinion, the cause
behind a staggering amount of vulnerabilities found, reported and fixed in
browsers every year.
\n{3}{Standardisation}
Over the years, a consortium of parties interested in promoting and developing
the web (also due to its potential as a digital marketplace, i.e.\ financial
incentives) and browser vendors (of which the most neutral participant is
perhaps \emph{Mozilla}, with Chrome being run by Google, Edge by Microsoft and
Safari/WebKit by Apple) have evolved a great volume of web standards, which
also relatively frequently get updated or deprecated and replaced by
revised or new ones, rendering the browser maintenance task into essentially a
cat-and-mouse game.
It is the web's extensibility that enabled this build-up and that has
ironically been proclaimed by some to be its greatest asset. It has also
ostensibly been criticised~\cite{ddvweb} in the past, and the frustration with
the status
quo of web standards has relatively recently prompted a group of people to even
create ``\textit{a new application-level internet protocol for the distribution
of arbitrary files, with some special consideration for serving a lightweight
hypertext format which facilitates linking between files}'':
Gemini~\cite{gemini}\cite{geminispec} that in the words of its authors can be
thought of as ``\textit{the web, stripped right back to its essence}'' or as
``\textit{Gopher, souped up and modernised just a little}'', depending upon the
reader's perspective, noting that the latter view is probably more accurate.
\n{3}{HTTP}
Originally, HTTP was also designed just for fetching hypertext
\emph{resources}, but it has evolved since then, particularly due to its
extensibility, to allow for fetching of all sorts of web resources a modern
website of today provides, such as scripts or images, or even to \emph{post}
content back to servers.
HTTP relies on TCP (Transmission Control Protocol), which is one of the
\emph{reliable} (mandated by HTTP) protocols used to send data across
contemporary IP (Internet Protocol) networks, to deliver the data it requests
or sends. When Tim Berners-Lee invented the World Wide Web (WWW) in 1989 while
working at CERN (The European Organization for Nuclear Research) with a rather
noble intent as a ``\emph{wide-area hypermedia information retrieval initiative
to give universal access to a large universe of documents}''~\cite{wwwf}, he
also invented the HyperText Markup Language (HTML) to serve as a formatting
method for these new hypermedia documents. The first website was written
roughly the same way as today's websites are, using HTML, although the markup
language has changed since, with the current version being HTML5.
It has been mentioned that the client \textbf{requests} a \textbf{resource} and
receives a \textbf{response}, so those terms should probably be defined.
A request is what the client sends to the server. A resource is what it
requests and a response is the answer provided by the server.
HTTP follows a classic client-server model whereby it is \textbf{always} the
client that initiates the request.
A web page is, to be blunt, a chunk of \emph{hypertext}. To display a web page,
a browser first needs to send a request to fetch the HTML representing the
page, which is then parsed and additional requests for sub-resources are made.
If a page defines layout information in the form of CSS, that is parsed as
well.
A web page needs to be present on the local computer \emph{before} it can
be parsed by the browser, and since websites are usually still served by
programs called \emph{web servers}, as in the \emph{early days}, that presents
the problem of how to tell the browser where the resource should be fetched
from. In today's browsers, the issue is sorted (short of the CLI) by the
\emph{address bar}, a place into which the user types what they wish the
browser to fetch for them.
The formal name of this segment is a \emph{Uniform Resource Locator}, or URL,
and it contains the scheme (or the protocol, such as \texttt{http://}), the
host address or a domain name and a (TCP) port number.
Since a TCP connection needs to be established first, to connect to a server
whose only URL contains a domain name, the browser needs to perform a domain
name \emph{lookup} using system facilities, or as was the case for a couple of
notorious Chromium versions, send some additional and unrelated queries which
(with Chromium-based derivatives' numbers) ended up placing unnecessary load
directly at the root DNS servers~\cite{chromiumrootdns}.
If a raw IP address+port combination is used, the browser attempts to connect
to it directly and requests the user-requested page, by default using the
\texttt{GET} \emph{method}. The \emph{well-known} HTTP port 80 is assumed
unless another port is explicitly specified, and the port can be omitted
regardless of whether the host is a domain name or an IP address.
The method is a way for the user-agent to define what operation it wants to
perform. \texttt{GET} is used for fetching resources while \texttt{POST} is
used to send data to the server, such as to post the values of an HTML form.
A server response is comprised of a \textbf{status code}, a status message,
HTTP \textbf{headers} and an optional \textbf{body} containing the content. The
status code indicates whether the original request was successful or not, and
the browser is generally there to interpret these status codes for the user.
There are enough status codes to be confusing by sheer numbers but luckily,
there is a method to the madness and they can be divided into groups/classes:
\begin{itemize}
\item 1xx: Informational responses
\item 2xx: Successful responses
\item 3xx: Redirection responses
\item 4xx: Client error responses
\item 5xx: Server error responses
\end{itemize}
In case the \emph{user agent} (a web \emph{client}) such as a browser receives
a response with content, it has to parse it.
A header is additional information sent by both the server and the client that
can guide or alter the behaviour of the software reading it. For instance, a
\texttt{Cache-Control} header with a duration value can be used by the server
to signify that the client can store certain resources for some time before
needing to re-fetch them once they \emph{expire}.
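To make the request/response mechanics concrete, the following is a minimal Go
sketch performing a \texttt{GET} request and inspecting the status code and a
response header; the URL is illustrative.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={An HTTP request/response sketch},
label=httpsketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// A GET request; port 80 is implied by the http:// scheme.
	resp, err := http.Get("http://example.com/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The status code says whether the request succeeded...
	fmt.Println(resp.StatusCode) // e.g. 200

	// ...and headers can alter the client's behaviour, e.g.
	// caching of the fetched resource.
	fmt.Println(resp.Header.Get("Cache-Control"))
}
\end{lstlisting}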
\n{3}{Site Isolation}
Modern browsers such as Firefox or Chromium come with a security focus in mind.
Their developers are acutely aware of the dangers posed by parsing untrusted
code from the internet, that is, the websites, the stylesheets and the scripts.
This is reflected in the way these programs are architected: instead of
the main browser process running everything, the main privileged process spawns
de-privileged child processes for each website. This extra line of defence
makes it ``harder for untrustworthy websites to access or steal information''
from user accounts or other websites~\cite{siteisolation}, as even if the
misbehaving website manages to ``break some rules'' within its own process, it
should find it more difficult to steal data from other sites. Firefox's version
of Site Isolation is called Project Fission
(\url{https://hacks.mozilla.org/2021/05/introducing-firefox-new-site-isolation-security-architecture/})
but the two are very similar in both architecture and what they try to achieve.
Some Chromium users have in the past complained about its high memory usage,
which is partially due to the enabled Site Isolation features that are
protecting the user.
\n{2}{Cross-site scripting}\label{sec:xss}
As per the OWASP Top Ten list~\cite{owasptop10}, injection is the third most
observed issue across millions of websites. Cross-site scripting is a type of
attack in which scripts are injected into a website that would otherwise be
trusted. It occurs when a malicious user misuses the website to send malicious
code, such as a script, to another user. Since a misconfiguration or a flaw
of the application allowed this, the browser of the victim, which trusts the
website, simply executes the code provided by the attacker. This code thus gains
access to session tokens and any cookies associated with the website's origin,
apart from being able to rewrite the HTML content. The results of XSS can
range from account compromise to identity theft.
\n{2}{Content Security Policy}\label{sec:csp}
Content Security Policy (CSP) has been an important addition to the arsenal of
website operators, even though not everybody has necessarily been utilising it
properly or even taken notice of it. To understand what guarantees it provides
and
what kind of protections it employs, it is first necessary to grok how websites
are parsed and displayed, which has been discussed in depth in previous
sections.
As per Weichselbaum et al.\ CSP is a mechanism designed to mitigate
XSS~\cite{longliveCSP}, a long-lived king of the vulnerability
lists~\cite{reininginCSP}. It is a declarative policy mechanism that allows the
website operator to decide what client-side resources can load and what origins
are permitted. Scripts can be restricted to only load from a list of trusted
domains and inline scripts can be blocked completely. Scripts can also be
allowed based on a cryptographic hash of their content, which should be
information known to website operators prior to or at the time the scripts are
served.
CSP3, which is the current version, can be served either as a header or inside
the website's \texttt{<meta>} tag. Directives are generally named after a topic
keyword and suffixed `-src', such as \texttt{script-src}, \texttt{img-src} or
\texttt{style-src}, although some directives don't follow this pattern,
\texttt{form-action}, \texttt{upgrade-insecure-requests} and \texttt{sandbox}
representing this group nicely. Sections are delimited using semicolons and
each section can only appear once. Special values exist for the origin itself
(\texttt{'self'}) and disallowing any source (\texttt{'none'}). A good CSP is
targeted and not overly broad. To give an example, a website that loads no
JavaScript at all does not need to allow a popular CDN (Content Delivery
Network) origin in its \texttt{script-src}, instead it should be set to
\texttt{'none'}. CSP can also aid with clickjacking protection using its
\texttt{frame-ancestors} directive, which can limit the origins allowed to
embed the website, preventing attackers from embedding the website in random
places, such as malicious websites that masquerade as legitimate ones.
Getting CSP right can be tricky at first but once grokked, it is relatively
straightforward and can increase the security of the site greatly. While
testing, it is best to enable CSP in the report-only mode before turning it on
in production. There are many more directives and settings than mentioned in
this section, the author encourages anybody interested to give it a read, e.g.\
at \url{https://web.dev/csp/}.
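To illustrate how such a policy might be served, below is a sketch using Echo
(the web framework used in the practical part of this work) and its built-in
\texttt{Secure} middleware; the policy value itself is only an example of a
deliberately narrow CSP and not the program's actual policy.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Serving a CSP header via Echo's Secure middleware},
label=cspsketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"net/http"

	"github.com/labstack/echo/v4"
	"github.com/labstack/echo/v4/middleware"
)

func main() {
	e := echo.New()

	// A deliberately narrow example policy: no scripts at all,
	// styles and images only from the site's own origin and no
	// embedding by other sites.
	e.Use(middleware.SecureWithConfig(middleware.SecureConfig{
		ContentSecurityPolicy: "default-src 'none'; img-src 'self'; " +
			"style-src 'self'; frame-ancestors 'none'",
	}))

	e.GET("/", func(c echo.Context) error {
		return c.String(http.StatusOK, "hello")
	})

	e.Logger.Fatal(e.Start(":3000"))
}
\end{lstlisting}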
\n{2}{Summary}
Passwords are in use everywhere and probably will be for the foreseeable
future. As long as passwords are going to be handled and stored by
service/application providers, they are going to get leaked, be it due to
provider carelessness or the attackers' resolve and wit. Of course, sifting
through all the available password breach data by hand is not a reasonable
option, and therefore tools should come in to provide assistance. The next part
of the thesis will explore that and offer a solution.
% =========================================================================== %
\part{Practical part}
\n{1}{Kudos}
\textbf{Disclaimer:} the author is not affiliated in any way with any of the
projects mentioned in this section.
The \textit{Password Compromise Monitoring Tool} (\texttt{pcmt}) program has
been developed utilising a great deal of free (as in freedom) and open-source
software, either directly or as outstanding work tools, and the author would
like to take this opportunity to recognise that fact.
In particular, the author acknowledges that this work would not be the same
without:
\begin{itemize}
\item vim (\url{https://www.vim.org/})
\item Arch Linux (\url{https://archlinux.org/})
\item ZSH (\url{https://www.zsh.org/})
\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
\item Nix (\url{https://nixos.org/explore.html})
\item pre-commit (\url{https://pre-commit.com/})
\item Podman (\url{https://podman.io/})
\item Go (\url{https://go.dev/})
\end{itemize}
All of the code written has been typed into VIM (\texttt{9.0}), the shell used
to run the commands was ZSH, both running in the author's terminal emulator of
choice, \texttt{kitty}, on a \raisebox{.8ex}{\texttildelow}8-month-old (at the
time of writing) installation of \textit{Arch Linux (by the way)}, using a
\texttt{6.3.x-wanderer-zfs-xanmod1} variant of the Linux kernel.
\n{1}{Development}
The source code of the project has been versioned since the start, using the
popular and industry-standard git (\url{https://git-scm.com}) source code
management (SCM) tool. Commits were made frequently and, if at all possible,
for small and self-contained changes of code, trying to follow sane commit
message \emph{hygiene}, i.e.\ striving for meaningful and well-formatted commit
messages. The name of the default branch is \texttt{development}, since that is
what the author likes to choose for new projects that are not yet stable (it is
in fact the default in author's \texttt{.gitconfig}).
\n{2}{Commit signing}
Since git allows cryptographically \emph{signing} all commits, it would be
unwise not to take advantage of this. For the longest time, GPG was the only
method available for signing commits in git; however, that is no longer
the case~\cite{agwagitssh}. These days, it is also possible to both sign and
verify one's git commits (and tags!) using SSH keys, namely those produced by
OpenSSH (the same ones that can be used to log in to remote systems). The
author has, of course, not reused the same key pair that is used to connect to
machines for signing commits. A different, \texttt{Ed25519} elliptic curve key
pair has been used specifically for signing. A public component of this key is
enclosed in this thesis as Appendix~\ref{appendix:signingkey} for future
reference.
The validity of a signature on a particular commit can be viewed with git using
the following commands (the \% sign denotes the shell prompt):
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Verifying signature of a git commit},
label=gitverif, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
% cd <cloned project dir>
% git show --show-signature <commit>
% # alternatively:
% git verify-commit <commit>
\end{lstlisting}
There is one caveat to this, though: git first needs some additional
configuration for the code in Listing~\ref{gitverif} to work as one would
expect. Namely, the public key used to verify the signature needs to be
stored in git's ``allowed signers file'', then git needs to be told where that
file is using the configuration value \texttt{gpg.ssh.allowedsignersfile} and
finally the configuration value of the \texttt{gpg.format} field needs to be
set to \texttt{ssh}.
Because git allows the configuration values to be local to each repository,
both of the mentioned issues can be solved by running the following commands
from inside of the cloned repository:
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Prepare allowed signers file and signature format for git},
label=gitsshprep, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key.
% cat >./tmp/.allowed_signers \
<<<'leo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKwshTdBgLzwY4d8N7VainZCngH88OwvPGhZ6bm87rBO'
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./tmp/.allowed_signers
\end{lstlisting}
After the code in Listing~\ref{gitsshprep} is run, everything from
Listing~\ref{gitverif} should remain applicable for the lifetime of the
repository, or until git changes its implementation of signature verification.
The git \texttt{user.name} that can be seen on the commits in the
\textbf{Author} field is named after the machine that was used to develop the
program, since the author uses different signing keys on each machine. That
way, the committing machine can be determined post-hoc.
For future reference, git has been used in the version \texttt{git version
2.40.1}.
\n{2}{Continuous Integration}
To increase both the author's and the public's confidence in the atomic changes
made over time, the changes were thoroughly \emph{integrated} using a
continuous integration (CI) service that was plugged into the main source code
repository from the early stages of development. This, of course, was again
self-hosted, including the workers. The tool of choice there was Drone
(\url{https://drone.io}) and the ``docker'' runner (in fact it runs any OCI
container) was used to run the builds.
The way this runner works is that it creates an ephemeral container for every
pipeline step and executes the given \emph{commands} inside of it. At the end
of each step the container is discarded, while the repository, which is mounted
into each container at \texttt{/drone/src}, is persisted between steps,
allowing it to be cloned from \emph{origin} only at the start of the pipeline
and then shared for all of the following steps, saving bandwidth, time and disk
writes.
The entire configuration used to run the pipelines can be found in a file named
\texttt{.drone.yml} at the root of the main source code repository. The
workflow consists of four pipelines, which are run in parallel. Two main
pipelines are defined to build the frontend assets, the \texttt{pcmt} binary
and run tests on \texttt{x86\_64} GNU/Linux targets, one for each of Arch and
Alpine (version 3.17). These two pipelines are identical apart from OS-specific
bits such as installing a certain package, etc. For the record, other
OS-architecture combinations were not tested.
A third pipeline contains instructions to build, from sources, a popular static
analysis tool called \texttt{golangci-lint}, which is sort of a meta-linter
bundling a staggering amount of linters (a linter is a tool that performs
static code analysis and can raise awareness of programming errors, flag
potentially buggy code constructs, or \emph{mere} stylistic errors), and to
then perform the analysis of the project's codebase using the freshly built
binary. If the result of this step is successful, a handful of code analysis
services get pinged in the next steps to take notice of the changes to the
project's source code and update their metrics. Details can be found in the
main Drone configuration file \texttt{.drone.yml}, and the configuration of the
\texttt{golangci-lint} tool itself (such as which linters are enabled/disabled
and with what settings) can be found in the root of the repository in the file
named \texttt{.golangci.yml}.
The fourth pipeline focuses on linting the Containerfile and building the
container, although the latter action is only performed on feature branches,
\emph{pull requests} or \emph{tag} events.
The median build time as of writing was 1 minute, which includes running all
four pipelines, and that is acceptable. Build times might of course vary
depending on the hardware; for reference, these builds were run on a
machine equipped with a Zen 3 Ryzen 5 5600 CPU at nominal clocks, DDR4 RAM
clocked at 3200 MHz, a couple of PCIe Gen 4 NVMe drives in a mirrored setup
(using ZFS) and a 400 Mbps downlink, software-wise running Arch with an
author-flavoured Xanmod kernel, version 6.3.x.
\obr{Drone CI median build
time}{fig:drone-median-build}{.84}{graphics/drone-median-build}
\n{2}{Source code repositories}\label{sec:repos}
All of the pertaining source code was published in repositories on a publicly
available git server operated by the author, the reasoning \emph{pro}
self-hosting being that it is the preferred way of guaranteeing autonomy over
one's source code, as opposed to large silos owned by big corporations, which
have a track record of arguably not always deciding with the user's best
interest in mind (although recourse has been observed~\cite{ytdl}). When these
providers act on impulse or under public pressure, they can potentially at
least temporarily disrupt their users' operations, thus beholding their users
not only to their lengthy \emph{terms of service} that \emph{can change at any
time}, but also to factors outside their control. Granted, decentralisation can
take a toll on the discoverability of a project, but that is not a concern
here.
The git repository containing source code of the \texttt{pcmt} project:\\
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.
The git repository hosting the \texttt{pcmt} configuration schema:\\
\url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.
The repository containing the \LaTeX{} source code of this thesis:\\
\url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.
\n{2}{Toolchain}
Throughout the creation of this work, the \emph{current} version of the Go
programming language was used, i.e.\ \texttt{go1.20}.
\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
\textbf{Name} & \textbf{Usage} \\
Go programming language & program core \\
Dhall configuration language & program configuration \\
Echo & HTTP handlers, controllers, web server \\
ent & ORM using graph-based modelling \\
bluemonday & HTML sanitising \\
TailwindCSS & stylesheets using a utility-first approach \\
PostgreSQL & storing data \\
}
\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
\textbf{Name} & \textbf{version} \\
\texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2\\
\texttt{ent} (\url{https://entgo.io/}) & 0.11.10 \\
\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.23 \\
\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
\texttt{PostgreSQL} (\url{https://www.postgresql.org/}) & 15.2 \\
}
\n{2}{A word about Go}
First, a question of \textit{`Why pick Go for building a web
application?'} might arise, so the following few lines will try to address
that.
Go~\cite{golang}, or \emph{Golang} for SEO-friendliness and for disambiguation
from Go, the ancient game, is a strongly typed, high-level,
\emph{garbage-collected} language where functions are first-class citizens and
errors are values.
The appeal for the author comes from a number of features of the language, such
as built-in support for concurrency and unit testing, sane \emph{zero} values,
lack of pointer arithmetic, inheritance and implicit type conversions,
easy-to-read syntax, producing a statically linked binary by default, etc.; on
top of that, the language has got a cute mascot. Thanks to the foresight of the
Go Authors regarding \emph{the formatting question} (i.e.\ where to put the
braces, \textbf{tabs vs.\ spaces}, etc.), most of the discussions on this topic
have been foregone. Every \emph{gopher}\footnote{euph.\ a person writing in
the Go programming language} is expected to format their source code with the
official formatter (\texttt{gofmt}), which automatically ensures that the code
adheres to the one formatting standard. Then, there is \emph{The Promise} of
backwards compatibility for Go 1.x, which makes it a good long-term choice
without the fear of being rug-pulled.
\n{2}{A word about Nix/devenv}
Nix (\url{https://builtwithnix.org/}) is a declarative package manager and a
functional programming language resembling Haskell, which has been used in this
project in the form of the \texttt{devenv} tool (\url{https://devenv.sh/}) to
create a \textbf{declarable} and \textbf{reproducible} development environment.
The author has previously used Nix directly with \emph{flakes} and liked
\texttt{devenv}, as it effectively exposes only a handful of parameters for
configuration and rids the user of the need to manage the full flake, which is
of course still an option for people who choose to do so. See
\texttt{devenv.nix} in the repository root.
\n{1}{Application architecture}
The source code of the main module is further split into Go \emph{packages},
appropriately along a couple of domains: logging, core application, web
routers, configuration and settings, etc. In Go, packages are delimited by
folder structure -- each folder can be a package.
Generally speaking, the program aggregates decision points into central places,
such as \texttt{run.go}, which imports child packages that each facilitate one
concern: loading the configuration, connecting to the database and running
migrations, consolidating flag-, environment variable- and configuration-based
values into canonical \emph{settings}, setting up routes and handling graceful
shutdown.
The program uses dependency injection to share a single logger instance, and
similar applies to the database client. These are passed around as pointers,
so the underlying data stays the same. As a rule of thumb, every larger
\texttt{struct} that needs to be passed around is passed around as a pointer,
as sketched below.
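A minimal sketch of the pattern follows; the names (\texttt{App},
\texttt{handleSignin}) are illustrative, and the standard library's
\texttt{log/slog} package stands in for the project's actual logger.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Dependency injection sketch},
label=disketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"log/slog"
	"os"
)

// App aggregates the shared dependencies; handlers receive a pointer
// to it, so every consumer sees the very same logger (and, in the
// real program, the same database client).
type App struct {
	log *slog.Logger
	// db *ent.Client would be injected the same way.
}

func newApp(log *slog.Logger) *App {
	return &App{log: log}
}

func (a *App) handleSignin() {
	a.log.Info("signin requested")
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	app := newApp(logger)
	app.handleSignin()
}
\end{lstlisting}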
The \texttt{internal} package was not used as of writing, but the author plans
to eventually migrate \emph{internal} logic of the program into the internal
package to prevent accidental imports.
The authentication logic is relatively simple and the author would like to
isolate it into a custom \emph{middleware}. User passwords are hashed using a
secure KDF before being sent to the database. The KDF used is \texttt{bcrypt}
(with a sane \emph{Cost} of 10), which automatically includes \emph{salt} for
the password and provides ``length-constant'' time hash comparisons. The author
plans to add support for the more modern \texttt{scrypt} and the
state-of-the-art PHC (Password Hashing Competition) winning algorithm,
\texttt{Argon2} (\url{https://github.com/P-H-C/phc-winner-argon2}). Besides, no
raw queries are used to access the database, helping decrease the likelihood of
SQL injection attacks.
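The described flow maps onto the \texttt{golang.org/x/crypto/bcrypt} package
roughly as in the following sketch (the password literal is obviously
illustrative):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={bcrypt hashing and verification sketch},
label=bcryptsketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	password := []byte("hunter2!")

	// GenerateFromPassword generates and embeds the salt on its
	// own; the Cost of 10 matches the value mentioned above.
	hash, err := bcrypt.GenerateFromPassword(password, 10)
	if err != nil {
		panic(err)
	}

	// CompareHashAndPassword performs the comparison in
	// length-constant time; a nil error means a match.
	err = bcrypt.CompareHashAndPassword(hash, password)
	fmt.Println(err == nil) // true
}
\end{lstlisting}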
An important thing to mention is embedded assets and templates. Go has multiple
mechanisms to natively embed arbitrary files directly into the binary during
the regular build process. The built-in \texttt{embed} package was used to
bundle all template files and web assets, such as images, logos and
stylesheets, at the package level, and these are then passed around the
application as needed. There is also a toggle in the application configuration,
which can
instruct the program at start to either rely entirely on embedded assets or
pull live files from the filesystem. The former option makes the application
more portable, while the latter allows for flexibility not only during
development. Basically, any important value in the program has been made into a
configuration value, so that the operator can customise the experience as
needed. A choice of sane configuration defaults was attempted, which resulted
in the configuration file essentially only needing to contain secrets, unless
there is a need to override the defaults. It is not an entirely
\emph{zero-config} situation, rather a \emph{minimal-config} one.
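A condensed sketch of how such a toggle can be wired using the \texttt{embed}
package and \texttt{os.DirFS} follows; the directory name and the
\texttt{pick} helper are illustrative, with \texttt{LiveMode} and
\texttt{AppPath} mirroring the configuration values of the same names from the
Dhall schema (Listing~\ref{dhallschema}).
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Embedded vs.\ live assets sketch},
label=embedsketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package assets

import (
	"embed"
	"io/fs"
	"os"
)

// The go:embed directive bundles the listed directory into the
// binary at build time (the name is illustrative).
//
//go:embed templates
var embedded embed.FS

// pick returns either the embedded tree or a live filesystem tree,
// mirroring the LiveMode/AppPath toggle described above.
func pick(liveMode bool, appPath string) fs.FS {
	if liveMode {
		return os.DirFS(appPath)
	}
	return embedded
}
\end{lstlisting}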
Templates used for rendering of the web pages were created in a composable
manner, split into smaller, reusable parts, such as \texttt{footer.tmpl} and
\texttt{head.tmpl}. Specific functionality is then conditionally executed based
on the determined level of access of the user. The core templating
functionality was provided by the native Go \texttt{html/template} package. An
Echo renderer interface has been implemented, so that the rendering can be
performed ergonomically and directly using Echo's built-in facilities. A
popular HTML sanitiser \emph{bluemonday} has been employed to aid with battling
XSS.
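For illustration, implementing Echo's \texttt{Renderer} interface can be as
simple as the following sketch (the template names and paths are illustrative):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Echo renderer sketch using html/template},
label=renderersketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"html/template"
	"io"

	"github.com/labstack/echo/v4"
)

// Tmpl adapts html/template to Echo's Renderer interface, so that
// handlers can render pages via c.Render directly.
type Tmpl struct {
	t *template.Template
}

func (r *Tmpl) Render(w io.Writer, name string,
	data interface{}, c echo.Context) error {
	return r.t.ExecuteTemplate(w, name, data)
}

func main() {
	e := echo.New()
	e.Renderer = &Tmpl{
		t: template.Must(template.ParseGlob("templates/*.tmpl")),
	}
	// a handler can now do:
	//   return c.Render(http.StatusOK, "home.tmpl", data)
	e.Logger.Fatal(e.Start(":3000"))
}
\end{lstlisting}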
The application constructs the web pages entirely server-side and it runs
without a single line of JavaScript, of which the author is especially proud.
It improves load times, decreases the attack surface, increases maintainability
and reduces the cognitive load that is required when dealing with JavaScript.
Of course, that requires extensive usage of non-semantic \texttt{POST}s in
forms, even for updates (where \texttt{PUT}s should be used), and the
accompanying frequent full-page refreshes, but that still is not enough to
warrant the use of JavaScript.
As an aside, the author has briefly experimented with WebAssembly for this
project, but has ultimately scrapped the functionality in favour of the
entirely server-side rendered one. It is possible that it would get revisited
if the client-side dynamic functionality was necessary and performance
mattered. Even from the short experiments it was obvious how much faster
WebAssembly was compared to JavaScript.
Frontend-side, the application was styled using TailwindCSS, which promotes
the use of flexible \emph{utility-first} classes in the markup (HTML) instead
of separating the specific styles out into all-encompassing classes. The author
understands this is somewhat of a preference issue and does not hold hard
opinions in either direction, Tailwind simply looked nice, especially with its
built-in support for dark/light mode. The templates containing the CSS classes
need to be parsed by Tailwind in order to construct the final stylesheet, and
there is an official CLI tool for that, called \texttt{tailwindcss}.
Overall, simple and accessible layouts had preference over convoluted ones.
\n{1}{Implementation}
\n{2}{Configuration}
Every non-trivial program usually offers at least \emph{some} way to
tweak/manage its behaviour, and these changes are usually persisted
\emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a
\emph{LocalStorage} key-value store in the browser, a binary or plain text
configuration file. These configuration files need to be read and checked at
least on program start-up and either stored into operating memory for the
duration of the runtime of the program, or loaded and parsed and the memory
subsequently \emph{freed} (initial configuration).
There is an abundance of configuration languages (or file formats used to craft
configuration files, whether they were intended for it or not) available, TOML,
INI, JSON, YAML, to name some of the popular ones (as of today).
Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), while being in its apparent core very
similar to JSON, which adds to the familiar feel. In fact, in Dhall's authors'
own words, it is ``a programmable configuration language that you can think of
as: JSON + functions + types + imports''~\cite{dhalllang}.
Among all of the listed features, the especially intriguing one to the author
was the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating, for instance, the declaration
and usage of custom types (which are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool}, \emph{Natural}
or \emph{List}, to name just a few), so it was not exceedingly hard to start
designing a custom configuration \emph{schema} for the program.
Dhall not being a Turing-complete language also guarantees that evaluation
\emph{always} terminates eventually, which is a good attribute to possess as a
configuration language.
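For completeness, a sketch of how such a configuration can be loaded into the
program follows, assuming the \texttt{dhall-golang} library listed in
Table~\ref{tab:depsversionmx}; the struct shown mirrors only a fragment of the
schema.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Loading Dhall configuration into a Go struct},
label=dhallgosketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
	"fmt"
	"os"

	"github.com/philandstuff/dhall-golang/v6"
)

// Config mirrors a fragment of the schema; struct fields are matched
// against the record labels during unmarshalling.
type Config struct {
	Host string
	Port uint
}

func main() {
	b, err := os.ReadFile("config.dhall")
	if err != nil {
		panic(err)
	}

	var cfg Config
	// Unmarshal evaluates the (always terminating) Dhall expression
	// and decodes the result into the Go struct.
	if err := dhall.Unmarshal(b, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
\end{lstlisting}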
\n{3}{Dhall Schema}
The configuration schema was at first being developed as part of the main
project's repository, before it was determined that it would benefit both the
development and overall clarity if the schema lived in its own repository (see
Section~\ref{sec:repos} for details). This enabled it to be independently
developed and versioned, and only pulled into the main application whenever it
is determined the application is ready for it.
The full schema with type annotations can be seen in Listing~\ref{dhallschema}.
The \texttt{let} statement declares a variable called \texttt{Schema} and
assigns it the result of the expression on the right side of the equals sign,
which has for practical reasons been trimmed and is displayed without the
\emph{default} block, which is instead shown in its own
Listing~\ref{dhallschemadefaults}.
\vspace{\parskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration schema version 0.0.1-rc.2},
label=dhallschema, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
let Schema =
{ Type =
{ Host : Text
, Port : Natural
, HTTP :
{ Domain : Text
, Secure : Bool
, AutoTLS : Bool
, TLSKeyPath : Text
, TLSCertKeyPath : Text
, HSTSMaxAge : Natural
, ContentSecurityPolicy : Text
, RateLimit : Natural
, Gzip : Natural
, Timeout : Natural
}
, Mailer :
{ Enabled : Bool
, Protocol : Text
, SMTPAddr : Text
, SMTPPort : Natural
, ForceTrustServerCert : Bool
, EnableHELO : Bool
, HELOHostname : Text
, Auth : Text
, From : Text
, User : Text
, Password : Text
, SubjectPrefix : Text
, SendPlainText : Bool
}
, LiveMode : Bool
, DevelMode : Bool
, AppPath : Text
, Session :
{ CookieName : Text
, CookieAuthSecret : Text
, CookieEncrSecret : Text
, MaxAge : Natural
}
, Logger : { JSON : Bool, Fmt : Optional Text }
, Init : { CreateAdmin : Bool, AdminPassword : Text }
, Registration : { Allowed : Bool }
}
}
\end{lstlisting}
The main configuration is comprised of both raw attributes and child records,
which allow for the grouping of related functionality. For instance,
configuration settings pertaining to mailserver setup are grouped in a record
named \textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as
\textbf{Bool}, which was deemed appropriate for an on-off switch-like
functionality, with the only permissible values being either \emph{True} or
\emph{False}. Do note that in Dhall \texttt{true} $\neq$ \texttt{True}, since
\textbf{True} is internally a Bool constant built into Dhall (check out ``The
Prelude''~\cite{dhallprelude}), while \textbf{true} is evaluated as an
\emph{unbound} variable, that is, a variable \emph{not} defined in the current
\emph{scope}.
\vspace{\parskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration defaults for
schema version 0.0.1-rc.2},
label=dhallschemadefaults, basicstyle=\linespread{0.9}\scriptsize\ttfamily]
, default =
-- | have sane defaults.
{ Host = ""
, Port = 3000
, HTTP =
{ Domain = ""
, Secure = False
, AutoTLS = False
, TLSKeyPath = ""
, TLSCertKeyPath = ""
, HSTSMaxAge = 0
, ContentSecurityPolicy = ""
, RateLimit = 0
, Gzip = 0
, Timeout = 0
}
, Mailer =
{ Enabled = False
, Protocol = "smtps"
, SMTPAddr = ""
, SMTPPort = 465
, ForceTrustServerCert = False
, EnableHELO = False
, HELOHostname = ""
, Auth = ""
, From = ""
, User = ""
, Password = ""
, SubjectPrefix = "pcmt - "
, SendPlainText = True
}
, LiveMode =
-- | LiveMode controls whether the application looks for
-- | directories "assets" and "templates" on the filesystem or
-- | in its bundled Embed.FS.
False
, DevelMode = False
, AppPath =
-- | AppPath specifies where the program looks for "assets" and
-- | "templates" in case LiveMode is True.
"."
, Session =
{ CookieName = "pcmt_session"
, CookieAuthSecret = ""
, CookieEncrSecret = ""
, MaxAge = 3600
}
, Logger = { JSON = True, Fmt = None Text }
, Init =
{ CreateAdmin =
-- | if this is True, attempt to create a user with admin
-- | privileges with the password specified below (or better -
-- | overriden); it fails if users already exist in the DB.
False
, AdminPassword =
-- | used for the first admin, forced change on first login.
"50ce50fd0e4f5894d74c4caecb450b00c594681d9397de98ffc0c76af5cff5953eb795f7"
}
, Registration.Allowed = True
}
}
in Schema
\end{lstlisting}
Another of Dhall's specialties is that the \texttt{==} and \texttt{!=} equality
operators only work on values of type \texttt{Bool}, which for example means
that variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text}
(\texttt{string}) cannot be compared directly as in other languages. This
either leaves the work for a higher-level language (such as Go) or, from the
perspective of the Dhall authors, promotes the use of \emph{enums} when the
value matters.
\n{3}{Safety considerations}
With a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are some safety mechanisms in place that the user can
rely on. Dhall offers this in multiple features: enforcing a same-origin policy
and (optionally) pinning a cryptographic hash of the value of the expression
being imported.
\n{3}{Possible alternatives}
While developing the program, the author has also come across certain
shortcomings of Dhall, namely its long start-up with a \emph{cold cache}, which
can generally be observed when running the program in an environment that does
not allow writing the cache files (a read-only filesystem), or does not keep
the written cache files, such as a container that is not configured to mount a
persistent volume at the pertinent location.
To describe the way Dhall works when performing an evaluation, it resolves
every expression down to a combination of its most basic types (eliminating all
abstraction and indirection) in the process called
\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the
host's cache. The \texttt{dhall-haskell} binary attempts to resolve the
variable \texttt{\$\{XDG\_CACHE\_HOME\}} (have a look at \emph{XDG Base
Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the
results of the normalisation will be written for repeated use. Do note that
this behaviour has been observed on a GNU/Linux host and the author has not
verified this behaviour on a non-GNU/Linux host, such as FreeBSD.
If normalisation is performed inside an ephemeral container (as opposed to, for
instance, an interactive desktop session), the results effectively get lost on
each container restart, which is both wasteful and bad for user experience,
since the normalisation of just a handful of imports (which internally branch
widely) can take upwards of two minutes, during which the user is left waiting
for the hanging application with no reporting on the progress or current
status.
While workarounds for the problem mentioned above can be devised relatively
easily (such as bind mounting persistent volumes inside the container in place
of \texttt{\$\{XDG\_CACHE\_HOME\}/dhall} and
\texttt{\$\{XDG\_CACHE\_HOME\}/dhall-haskell} to preserve the cache between
restarts, or pre-computing the cache during the container build, since the
application is only really expected to run together with a compatible version
of the configuration schema and this version \emph{is} known at container
build time), it would certainly feel better if there was no need to work
\emph{around} the configuration system of choice.
Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as a potentially almost drop-in replacement for Dhall feature-wise, while also
avoiding the costly \emph{cold cache} normalisation, which is, in the author's
view, Dhall's cardinal issue.
\n{2}{Data integrity and authenticity}
The user can interact with the application via a web client, such as a browser,
and is required to authenticate for all sensitive operations. To not only know
\emph{who} the user is but also make sure they are \emph{permitted} to perform
the action they are attempting, the program employs an \emph{authorisation}
mechanism in the form of sessions. These are represented on the client side by
cryptographically signed and encrypted (using 256-bit AES) HTTP cookies. That
lays the foundation for a few things: the data saved into the cookies can be
regarded as private, because, short of future \emph{quantum computers}, only
the program itself can decrypt and access it; and the data can be trusted,
since it is both signed using a key that only the program controls and
\emph{encrypted} with \emph{another} key that, equally, only the program
holds.
The cookie data is only ever written \emph{or} read on the server side, which
solidified the author's decision to let it be encrypted, as there is no point
in not encrypting it for some perceived client-side simplification. Users
navigating the website send their session cookie (if it exists) with
\textbf{every request} to the server, which subsequently verifies the
integrity of the data and, in case it is valid, determines the existence and
extent of user privilege that should be granted. Public endpoints do not
mandate the presence of a valid session by definition, while at protected
endpoints the user is authenticated on every request. When a session expires,
or if there is no session to begin with, the user is either shown a \emph{Not
Found} error message, shown an \emph{Unauthorised} error message, or
redirected to \texttt{/signin}; this behaviour is not uniform and depends on
the resource and/or the endpoint.
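The following is a minimal sketch of the signed-and-encrypted cookie pattern
described above, here using the \texttt{gorilla/securecookie} library, which
implements exactly this scheme with a signing key and a separate (AES) block
encryption key; the library and session layout actually used by the program
may differ.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of signed and encrypted session cookies using gorilla/securecookie},
label=securecookieSketch]
package main
import (
	"fmt"
	"github.com/gorilla/securecookie"
)
func main() {
	// In the program, these keys would come from CookieAuthSecret
	// and CookieEncrSecret in the configuration; a 32-byte block
	// key selects AES-256.
	hashKey := securecookie.GenerateRandomKey(64)
	blockKey := securecookie.GenerateRandomKey(32)
	s := securecookie.New(hashKey, blockKey)
	session := map[string]string{"user": "dude"}
	// Encode signs and encrypts the value for the named cookie.
	encoded, err := s.Encode("pcmt_session", session)
	if err != nil {
		panic(err)
	}
	fmt.Println("cookie value:", encoded)
	// Decode verifies the signature, decrypts and unmarshals the
	// value; tampered cookies fail here.
	decoded := map[string]string{}
	if err := s.Decode("pcmt_session", encoded, &decoded); err != nil {
		panic(err)
	}
	fmt.Println("user:", decoded["user"])
}
\end{lstlisting}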
A second aspect that contributes to data integrity is the use of database
\emph{transactions} to bundle together multiple database operations that
collectively change the \emph{state}. In transactional jargon, the data is
only \emph{committed} if each individual change was successful; in case of any
error, the database is instructed to perform an atomic \emph{rollback}, which
brings it back to the state before the changes were ever attempted.
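As a brief illustration of the commit/rollback mechanics, the following sketch
uses nothing but Go's standard \texttt{database/sql} package against an
illustrative \texttt{accounts} table; the program itself drives the database
through the ORM described later, so this is the concept, not the program's
code.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of an atomic multi-statement change using a database transaction},
label=txSketch]
package dbtx
import (
	"context"
	"database/sql"
)
// transfer performs two dependent writes atomically: either both
// are committed or neither is.
func transfer(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// Rollback is a no-op once Commit has succeeded, so deferring
	// it covers every early-return error path below.
	defer tx.Rollback()
	if _, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = balance - 10 WHERE id = 1",
	); err != nil {
		return err // implicit rollback via defer
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = balance + 10 WHERE id = 2",
	); err != nil {
		return err // implicit rollback via defer
	}
	// Only now is the state change made visible to other connections.
	return tx.Commit()
}
\end{lstlisting}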
The author has additionally considered utilising an embedded immutable
database like immudb (\url{https://immudb.io}) for record keeping (verifiably
storing data change history) and additional data integrity checks, e.g.\ for
tamper protection purposes and similar; however, that work remains yet to be
materialised.
\n{2}{User isolation}
Users are allowed into certain parts of the application based on the role they
currently possess. For the moment, two basic roles were envisioned, though
this list might be amended in the future, should the need arise:
\begin{itemize}
\item Administrator
\item User
\end{itemize}
It is paramount that the program protects itself from insider threats as well,
and therefore each role is only able to perform the actions explicitly
assigned to it. While there certainly is some overlap between the capabilities
of the two outlined roles, each also possesses unique features that the other
does not.
For example, the administrator role is not able to perform searches on the
breach data directly using their administrator account; for that, a separate
user account has to be devised. Similarly, the regular user is not able to
manage breach lists and other users, because those are privileged operations.
In-application administrators are not able to view sensitive (any) user data
and should therefore only be able to perform the following actions:
\begin{itemize}
\item Create user accounts
\item View list of users
\item View user email
	\item Change user email
\item Toggle whether user is an administrator
\item Delete user accounts
\end{itemize}
Consider the case of users managing their own accounts: while demoting oneself
from administrator to regular user is permitted, promoting oneself to
administrator would constitute a \emph{privilege escalation} and would likely
be a precursor to at least a \emph{denial of service} of sorts.
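A sketch of how such a guard might look follows; the types and the function
are hypothetical rather than the program's actual implementation.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Hypothetical sketch of a self-promotion guard},
label=rolePromotionGuard]
package access
import "errors"
// user is an illustrative stand-in for the program's user model.
type user struct {
	ID      int
	IsAdmin bool
}
var errSelfPromotion = errors.New("users cannot promote themselves to administrator")
// setAdmin toggles the admin role on target, refusing the one
// transition that would constitute a privilege escalation: a user
// granting themselves the administrator role.
func setAdmin(actor, target *user, admin bool) error {
	// Managing other accounts requires the administrator role.
	if actor.ID != target.ID && !actor.IsAdmin {
		return errors.New("not permitted")
	}
	// Self-demotion is fine; self-promotion is not.
	if actor.ID == target.ID && admin && !actor.IsAdmin {
		return errSelfPromotion
	}
	target.IsAdmin = admin
	return nil
}
\end{lstlisting}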
\n{2}{Zero trust principle}
\textit{Data confidentiality, i.e.\ not trusting the provider}
There is no way for the application (and consequently, the in-application
administrator) to read the user's data. This is achieved by encrypting the
pertinent data before saving it in the database with a state-of-the-art
\emph{age} key~\cite{age} (backed by X25519~\cite{x25519rfc7748}), which is in
turn safely stored encrypted with a passphrase that only the user controls. Of
course, the user-supplied password is first run through a password-based key
derivation function (PBKDF: a key derivation function with a sliding
computational cost) before it is used to encrypt the \emph{age} key.
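To make the mechanism more tangible, here is a condensed sketch of the
key-wrapping idea using the \emph{age} reference library
(\texttt{filippo.io/age}), whose scrypt recipient performs the password-based
key derivation; the program's actual code, storage format and KDF parameters
may differ.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of generating and passphrase-wrapping an age key},
label=ageKeyWrapSketch]
package keywrap
import (
	"bytes"
	"io"
	"filippo.io/age"
)
// wrapNewAgeKey generates a fresh X25519 age identity and returns it
// encrypted ("wrapped") under the user's passphrase, together with
// the public recipient used to encrypt the user's data.
func wrapNewAgeKey(passphrase string) ([]byte, *age.X25519Recipient, error) {
	id, err := age.GenerateX25519Identity()
	if err != nil {
		return nil, nil, err
	}
	// The scrypt recipient derives the wrapping key from the
	// passphrase with a tunable work factor.
	scrypt, err := age.NewScryptRecipient(passphrase)
	if err != nil {
		return nil, nil, err
	}
	var buf bytes.Buffer
	w, err := age.Encrypt(&buf, scrypt)
	if err != nil {
		return nil, nil, err
	}
	if _, err := io.WriteString(w, id.String()); err != nil {
		return nil, nil, err
	}
	if err := w.Close(); err != nil {
		return nil, nil, err
	}
	// Only the wrapped private key ever reaches the database.
	return buf.Bytes(), id.Recipient(), nil
}
\end{lstlisting}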
The \emph{age} key is only generated when the user changes their password for
the first time. This prevents scenarios such as an in-application
administrator with access to the physical database being able to both
\textbf{recover} the key from the database and \textbf{decrypt} it, given that
they already know the user's password (because they set it), which would
subsequently give them unbounded access to any future encrypted data for as
long as they were able to maintain their database access. This is why the
\emph{age} key generation and protection are bound to the first password
change. Of course, an evil administrator could just perform the change
themselves; however, the user would at least be able to find that change in
the activity logs and know not to use the application. Given the scenario of a
total database compromise, the author finds that all hope is already lost at
that point. At least when the database is dumped, it only contains
non-sensitive, functional information in plain text; everything else should be
encrypted.
Consequently, both the application operators and the in-application
administrators should never be able to learn the details of what the user is
tracking, the same being applicable even to potential attackers with direct
access to the database. Thus the author maintains that every scenario that
could potentially lead to a data breach (apart from a compromised user machine
and the like) would have to entail some form of operating memory acquisition,
for instance using \texttt{LiME}~\cite{lime}, or perhaps directly via the
\emph{hypervisor}, if considering virtualised (``cloud'') environments.
\n{2}{Compromise Monitoring}
\n{3}{Have I Been Pwned? Integration}
Troy Hunt's Have I Been Pwned? online service
(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
compromised data. The service offers private APIs that are protected by API
keys. The application's \texttt{hibp} module and database representation model
the values returned by this API, which allows searching in large breaches using
email addresses.\\
The architecture there is relatively simple: the application administrator
configures an API key for HIBP, the user enters the query parameters, and the
application constructs a query, calls the API, and waits for a response. As
the API is rate-limited based on the key supplied, this can pose an issue, and
it has not yet been fully resolved in the UI. The application then parses the
returned data and binds it to the local model for validation. If that goes
well, the data is saved into the database as a cache and the search query is
performed on the saved data. If it returns anything, the results are displayed
to the user for browsing.
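A condensed sketch of such a call is shown below; the endpoint and header
follow HIBP's public API documentation, while the struct covers only a
fragment of the returned fields and the error handling is simplified compared
to the application's \texttt{hibp} module.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of querying the HIBP v3 breachedaccount endpoint},
label=hibpClientSketch]
package hibpsketch
import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)
// breach models a small subset of the fields returned by the API.
type breach struct {
	Name       string `json:"Name"`
	BreachDate string `json:"BreachDate"`
}
// breachedAccount queries HIBP for breaches involving the account.
func breachedAccount(ctx context.Context, apiKey, account string) ([]breach, error) {
	u := "https://haveibeenpwned.com/api/v3/breachedaccount/" +
		url.PathEscape(account) + "?truncateResponse=false"
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	// The key authenticates (and rate-limits) the caller; HIBP
	// also requires a descriptive User-Agent.
	req.Header.Set("hibp-api-key", apiKey)
	req.Header.Set("User-Agent", "pcmt-example")
	c := &http.Client{Timeout: 10 * time.Second}
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		var bs []breach
		return bs, json.NewDecoder(resp.Body).Decode(&bs)
	case http.StatusNotFound:
		return nil, nil // no breaches for this account
	default:
		return nil, fmt.Errorf("hibp: unexpected status %s", resp.Status)
	}
}
\end{lstlisting}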
\n{3}{Local Dataset Plugin} Breach data from locally available datasets can be
imported into the application by first making sure it adheres to the specified
schema (see the \emph{breach data schema} in
Listing~\ref{breachDataGoSchema}). If it does not (which is very likely with
random breach data), it needs to be converted to a conforming form before
being imported into the application, e.g.\ using a Python script or similar.
Attempting to import data that does not follow the outlined schema results in
an error. Also, importing a dataset that exceeds a reasonable size limit is by
default rejected by the program as a precaution, since unmarshaling e.g.\ a
1~TiB document would likely result in an out-of-memory (OOM) situation on the
host, assuming regular consumer hardware conditions, not HPC.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Breach Data Schema represented as a Go struct; imports from the standard library are assumed},
label=breachDataGoSchema]
type breachDataSchema struct {
Name string
Time time.Time
IsVerified bool
ContainsPasswords bool
ContainsHashes bool
HashType string
HashSalted bool
HashPeppered bool
ContainsUsernames bool
ContainsEmails bool
Data any
}
\end{lstlisting}
The Go representation shown in Listing~\ref{breachDataGoSchema} will in
actuality be written and supplied by the user of the program as a YAML
document. YAML was chosen for multiple reasons: relative ease of use (it is
plain text, readable, and supports the inclusion of comments), its capability
to store multiple \emph{documents} inside a single file, most of the inputs
being implicitly typed as strings, and, thanks to YAML being a superset of
JSON, machine readability. That should allow documents similar to the one in
Listing~\ref{breachDataYAMLSchema} to be ingested by the program, and read and
written by humans and programs alike.
\vspace{\parskip}
\begin{lstlisting}[language=YAML, caption={Example Breach Data Schema supplied
to the program as a YAML file, optionally containing multiple documents},
label=breachDataYAMLSchema]
---
name: Horrible breach
time: 2022-04-23T00:00:00+02:00
isVerified: false
containsPasswords: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
hashes:
- hash1
- hash2
- hash3
emails:
- email1
-
- email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{lstlisting}
Notice how the emails list in Listing~\ref{breachDataYAMLSchema} is missing
one record, perhaps because it was not supplied or was mistakenly omitted.
This is a valid scenario (mistakes happen) and the application needs to be
able to handle it. The alternative would be to require the user to prepare the
data in such a way that empty/partial records are dropped entirely.
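A sketch of ingesting such a multi-document file with the
\texttt{gopkg.in/yaml.v3} package follows; the struct covers only a fragment
of the schema, and dropping empty records, as shown, is one way of handling
the omission discussed above.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of decoding multi-document YAML breach data},
label=yamlMultiDocSketch]
package main
import (
	"errors"
	"fmt"
	"io"
	"os"
	"gopkg.in/yaml.v3"
)
// document mirrors a fragment of the breach data schema; only a few
// fields are shown here for brevity.
type document struct {
	Name string `yaml:"name"`
	Data struct {
		Emails []string `yaml:"emails"`
	} `yaml:"data"`
}
func main() {
	f, err := os.Open("breaches.yaml")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	dec := yaml.NewDecoder(f)
	for {
		var d document
		// Decode consumes one "---"-separated document per call.
		if err := dec.Decode(&d); err != nil {
			if errors.Is(err, io.EOF) {
				break // all documents read
			}
			panic(err)
		}
		// Empty list entries arrive as empty strings and can be
		// dropped here instead of burdening the user.
		emails := d.Data.Emails[:0]
		for _, e := range d.Data.Emails {
			if e != "" {
				emails = append(emails, e)
			}
		}
		fmt.Printf("%s: %d emails\n", d.Name, len(emails))
	}
}
\end{lstlisting}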
\n{2}{Database configuration}
The database schema is not created manually in the database. Instead, an
Object-Relational Mapping (ORM) tool named ent is used, which allows defining
the table schema and relations entirely in Go.
The best part about ent is that there is no need to define supplemental
methods on the models, since with ent these are meant to be \emph{code
generated} (in the older sense of the word, not with Large Language Models).
That creates files with models based on the types of the attributes in the
database model, and the respective relations are transformed into methods on
the receiver or functions taking object attributes as arguments.
For instance, if the model's attribute is a string value \texttt{Email}, ent
can be used to generate code that contains methods on the user object like the
following:
\begin{itemize}
\item \texttt{EmailIn(pattern string)}
\item \texttt{EmailEQ(email string)}
\item \texttt{EmailNEQ(email string)}
\item \texttt{EmailHasSuffix(suffix string)}
\end{itemize}
These methods can further be imported into other packages, which makes working
with the database a breeze.
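To make the above concrete, the following is an illustrative ent schema
definition together with an example of how the generated predicates compose
into a query; the fields are an assumption mirroring the prose, not a verbatim
copy of the program's schema.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Illustrative ent schema definition and generated-predicate usage},
label=entSchemaSketch]
// ent/schema/user.go
package schema
import (
	"entgo.io/ent"
	"entgo.io/ent/schema/field"
)
// User holds the schema definition for the User entity.
type User struct {
	ent.Schema
}
// Fields declares the columns; ent generates typed predicates
// (EmailEQ, EmailHasSuffix, ...) from these declarations.
func (User) Fields() []ent.Field {
	return []ent.Field{
		field.String("username").Unique(),
		field.String("email").Unique(),
		field.Bool("is_admin").Default(false),
	}
}
// Elsewhere in the application, the generated code is then used as:
//
//	users, err := client.User.Query().
//		Where(user.EmailHasSuffix("@example.com")).
//		All(ctx)
\end{lstlisting}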
\n{1}{Production}
It is, of course, recommended that the application runs in a secure
environment, \allowbreak although definitions of that almost certainly differ
depending on who is asked. General recommendations would be to either
effectively reserve a machine for a single use case---running this
program---so as to dramatically decrease the potential attack surface of the
host, or to run the program isolated in a container or a virtual machine.
Further, if the host does not need management access (it is a deployed-to-only
machine that is configured out-of-band, such as with a \emph{golden}
image/container or declaratively with Nix), then an SSH \emph{daemon} should
not be running on it, since it is not needed. In an ideal scenario, the host
machine would have as little software installed as possible besides what the
application absolutely requires.
A demonstration of the above can be found in the multi-stage Containerfile that
is available in the main sources. The resulting container image only contains a
statically linked copy of the program, a default configuration file and
corresponding Dhall expressions cached at build time, which only support the
main configuration file. Since the program also needs a database, an example
scenario could include the container being run in a Podman pod together with
the database, which would not have to be exposed from the pod and would
therefore only be available over \texttt{localhost}.
It goes without saying that the operator should substitute the values of any
default configuration secrets with new, securely generated ones. System-wide
cryptographic policies should target the highest feasible security level, if
at all available (as they do by default on Fedora or RHEL), covering the SSH,
DNSSEC, IPsec, Kerberos and TLS protocols. Firewalls should be configured, and
SELinux (a kernel-level mandatory access control and security policy
mechanism) should be running in \emph{enforcing} mode, if available.
\n{2}{Deployment recommendations}
\n{3}{Transport security}
Users connecting to the application should rightfully expect their data to be
protected \textit{in transit} (i.e.\ on the way between their browser and the
server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims to be) and
negotiate a symmetric key for encryption in a process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications channel. The operator should
either configure the program to directly utilise TLS, or have it listen behind
a TLS-terminating \emph{reverse proxy}.
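For the direct-TLS variant, a minimal standard-library sketch follows; the
certificate and key paths are placeholders, and the program's actual listener
setup is not reproduced here. In a reverse-proxy deployment, this
responsibility moves to the proxy instead.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of serving HTTPS directly with a modern minimum TLS version},
label=tlsServerSketch]
package main
import (
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
)
func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello over TLS")
	})
	srv := &http.Server{
		Addr:    ":8443",
		Handler: mux,
		// Refuse legacy protocol versions outright.
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	// cert.pem and key.pem are placeholder paths.
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
\end{lstlisting}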
\n{3}{Containerisation}
Whether the pre-built or a custom container image is used to deploy the
application, it still needs access to secrets, such as database connection
string (containing database host, port, user, password/encrypted password,
authentication method and database name).
Currently, the application is able to handle \emph{peer},
\emph{scram-sha-256}, \emph{user name maps} and raw \emph{password} as
Postgres authentication methods~\cite{pgauthmethods}, although the
\emph{password} option should not be used in production, \emph{unless} the
connection to the database is protected by TLS.\ In any case, using the
\emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable, and one
way to verify in a development environment that everything works as intended
is the \emph{Password generator for PostgreSQL} tool~\cite{goscramsha256},
which produces the encrypted string from raw user input.
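Since common Go Postgres drivers such as pgx negotiate \emph{scram-sha-256}
client-side, connecting from Go needs no extra code beyond a proper connection
string, as the following sketch illustrates; the hostname, credentials and
database name are placeholders, and secrets would in practice be handed to the
container rather than baked into any image.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of opening a TLS-protected Postgres connection from Go},
label=pgConnSketch]
package main
import (
	"context"
	"database/sql"
	"log"
	"time"
	// The pgx driver negotiates scram-sha-256 transparently when
	// the server requests it.
	_ "github.com/jackc/pgx/v5/stdlib"
)
func main() {
	// Illustrative connection string; all values are placeholders.
	dsn := "postgres://pcmt:secret@db:5432/pcmt?sslmode=verify-full"
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		log.Fatal(err)
	}
	log.Println("database reachable")
}
\end{lstlisting}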
If the application running in a container wants to use the \emph{peer}
authentication method, it is up to the operator to supply the Postgres socket
to the application (e.g.\ as a volume bind mount). This scenario was not
tested, however, and the author is also not entirely certain how \emph{user
namespaces} (on GNU/Linux) would influence the process (given that the
\emph{ID}s of a user \textbf{outside} the container are mapped to a range of
\emph{UID}s \textbf{inside} the container), for which the setup would likely
need to account.
Equally, if the application is running inside the container, the operator needs
to make sure that the database is either running in a network that is also
directly attached to the container or that there is a mechanism in place that
routes the requests for the database hostname to the destination.
One such mechanism is container-name-based routing inside \emph{pods}
(Podman/Kubernetes), where the resolution of container names is the
responsibility of a specially configured piece of software: Aardvark in the
case of Podman, and CoreDNS in the case of Kubernetes.
\n{1}{Validation}
\n{2}{Unit tests}
Unit testing is a hot topic for many people, and the author does not count
himself a staunch supporter of either extreme. The ``no unit tests'' camp
seems to discount any benefit there is to unit testing, while the ``TDD-only''
approach can be a little too much for some people's taste. (TDD, or
Test-Driven Development, is a development methodology whereby tests are
written first, followed by just enough of the complementary code under test to
get past the compile errors and see the test fail; the code is then refactored
to make the test pass, and can afterwards be fearlessly extended, because the
test is the safety net that catches the slip when a change alters the
originally intended behaviour.) The author tends to take a \emph{middle
ground} approach here: writing enough tests where meaningful, but not
necessarily testing everything or writing tests prior to code, although
arguably that practice should result in \emph{better} designed code,
particularly because the code has to be thought about before being written, as
it needs to be tested \emph{first}.
Thanks to Go's built-in support for testing via its \texttt{testing} package
and the tooling in the \texttt{go} tool, writing tests is relatively simple.
Go looks for files of the form \texttt{<filename>\_test.go} in the present
working directory, but can be instructed to look for test files in packages
recursively found on any path using the ellipsis, like so: \texttt{go test
./path/to/package/\ldots}, which then \emph{runs} all the tests found and
reports the results, such as the time each test took and whether it succeeded
or failed. To be precise, the test files also need to contain test functions,
which are functions with the signature \texttt{func TestWhatever(t
*testing.T)\{\}}, where the function name prefix ``Test'' is equally as
important as the signature. Without it, the function is not considered a
testing function despite having the required signature, and is therefore
\emph{not} executed during testing.
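For illustration, the following is a complete test file in the common
table-driven style; the \texttt{Abs} function under test is hypothetical and
is defined alongside the test only so that the sketch is self-contained.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of a table-driven unit test for a hypothetical function},
label=unitTestSketch]
// modules/foo/foo_test.go
package foo
import "testing"
// Abs would ordinarily live in foo.go; it is defined here so the
// sketch compiles on its own.
func Abs(n int) int {
	if n < 0 {
		return -n
	}
	return n
}
// TestAbs exercises Abs over a table of named cases; each case runs
// as its own subtest, so failures pinpoint the offending input.
func TestAbs(t *testing.T) {
	cases := []struct {
		name string
		in   int
		want int
	}{
		{"positive", 4, 4},
		{"negative", -4, 4},
		{"zero", 0, 0},
	}
	for _, c := range cases {
		c := c
		t.Run(c.name, func(t *testing.T) {
			if got := Abs(c.in); got != c.want {
				t.Errorf("Abs(%d) = %d, want %d", c.in, got, c.want)
			}
		})
	}
}
\end{lstlisting}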
This test lookup behaviour, however, also has a neat side effect: all the test
files can be kept side by side with their regular source counterparts; there
is no need to segregate them into a specially blessed \texttt{tests} folder or
similar, which in the author's opinion improves readability. As a failsafe, in
case no actual tests are found, the current behaviour of the tool is to print
a note informing the developer that no tests were found, which is handy to
know if that was not intended/expected. When compiling regular source code,
the Go files with \texttt{\_test} in the name are simply ignored by the build
tool.
\n{2}{Integration tests}
Integrating with external software, namely the database in the case of this
program, is designed to utilise the same mechanism that was mentioned in the
previous section: Go's \texttt{testing} package. These tests verify that the
code changes can still perform the same actions with the external software
that were possible before the change; they are run before every commit
locally, and then after pushing to the remote in the CI.
\n{3}{func TestUserExists(t *testing.T)}
The example integration test shown in Listing~\ref{integrationtest} declares a
helper function \texttt{getCtx() context.Context}, which takes no arguments
and returns a new\\ \texttt{context.Context} initialised with the value of the
global logger, which is how the logger gets injected into the user module
functions. The function \texttt{TestUserExists(t *testing.T)} first declares a
database connection string and attempts to open a connection to the database.
The database in use here is SQLite3 running in memory mode, meaning no file is
actually written to disk during the process. Since the testing data is not
needed after the test, this is deemed good enough. Next, a \emph{defer}
statement calling the \texttt{Close()} method on the database object is made,
which is the idiomatic Go way of closing files and network connections (which
are also an abstraction over files on UNIX-like operating systems such as
GNU/Linux). The \emph{defer} statement gets called after all of the statements
in the surrounding function have run, which makes sure no file descriptors
(FDs) are leaked and the file is properly closed when the function returns.
In the next step, a database schema creation is attempted, handling the
potential error in the idiomatic Go way, which uses the return value of the
function in an assignment to a variable declared in the \texttt{if} statement,
and checks whether \texttt{err} is \texttt{nil} or not. If \texttt{err} is not
\texttt{nil}, i.e.\ \emph{there was an error in the callee function}, the
condition evaluates to \texttt{true} and the inner block is entered. Inside
the inner block, the error is announced to the user (likely a developer
running the test in this case) and the testing object's \texttt{FailNow()}
method is called, which marks the test function as having failed and stops its
execution. In this case, that is the desired outcome, since if the database
schema creation call fails, there really is no point in continuing to test
user creation.
Conversely, if the schema does get created without an error, the code continues
to declare a few variables: \texttt{username}, \texttt{email} and \texttt{ctx},
where the context injected with the logger is saved. Some of them are
subsequently passed into the \texttt{UsernameExists} function, \texttt{ctx} as
the first argument, with the database pointer and username being passed next,
while the \texttt{email} variable is only used at a later stage, but was
declared here to give a sense of grouping. The error value returned from this
function is again checked and if everything goes well, the value of the
\texttt{usernameFound} boolean is checked next.
\smallskip
\begin{lstlisting}[language=Go, caption={Example integration test.},
label=integrationtest,basicstyle=\linespread{0.8}\footnotesize\ttfamily]
// modules/user/user_test.go
package user
import (
"context"
"testing"
"git.dotya.ml/mirre-mt/pcmt/ent/enttest"
"git.dotya.ml/mirre-mt/pcmt/slogging"
_ "github.com/xiaoqidun/entps"
)
func getCtx() context.Context {
l := slogging.Init(false)
ctx := context.WithValue(context.Background(), CtxKey{}, l)
return ctx
}
func TestUserExists(t *testing.T) {
connstr := "file:ent_tests?mode=memory&_fk=1"
db := enttest.Open(t, "sqlite3", connstr)
defer db.Close()
if err := db.Schema.Create(context.Background()); err != nil {
t.Errorf("failed to create schema resources: %v", err)
t.FailNow()
}
username := "dude"
email := "dude@b.cc"
ctx := getCtx()
usernameFound, err := UsernameExists(ctx, db, username)
if err != nil {
t.Errorf("error checking for username {%s} existence: %q",
username,
err,
)
}
if usernameFound {
t.Errorf("unexpected: user{%s} should not have been found",
username,
)
}
if _, err := EmailExists(ctx, db, email); err != nil {
t.Errorf("unexpected: user email '%s' should not have been found",
email,
)
}
usr, err := CreateUser(ctx, db, email, username, "so strong")
if err != nil {
t.Errorf("failed to create user, error: %q", err)
t.FailNow()
} else if usr == nil {
t.Error("got nil usr back")
t.FailNow()
}
if usr.Username != username {
t.Errorf("got back wrong username, want: %s, got: %s",
username, usr.Username,
)
} // ...more checks...
}
\end{lstlisting}
Since the database has just been created, it should contain no users, which is
what the next \texttt{if} statement checks. A similar check is then performed
for the earlier-declared user email, which is likewise expected not to be
found.
The final statements of the described test attempt a user creation call, which
is again checked for both an error and \emph{nilability}. The test continues
with more similar checks, but it has been cut short for brevity.
A neat thing about error handling in Go is that it allows for very easy
checking of all paths, not just the \emph{happy path} where there are no
issues.
\n{2}{Testing environment}
The application has been deployed in a testing environment on the author's
modest Virtual Private Server (VPS) at \texttt{https://testpcmt.dotya.ml},
protected by a \emph{Let's Encrypt}-issued,\allowbreak\ short-lived, ECDSA
\texttt{secp384r1} curve TLS certificate, and configured with a strict CSP. It
is a testing instance; therefore, limits to prevent abuse might be imposed.
% =========================================================================== %
\nn{Conclusion}
% =========================================================================== %