% =========================================================================== %
% Encoding: UTF-8 (žluťoučký kůň úpěl ďábelšké ódy)
% =========================================================================== %

% =========================================================================== %
\nn{Introduction}

Passwords. Everybody reading this text most assuredly recalls at least
\emph{some} of their own. The security-minded person perhaps even dozens. They
are complex and at least twelve characters long. They are only ever used in the
one place they were created for. And they are definitely getting rotated at
least once a year. Or are they?

A token so ubiquitous that it becomes tiring for a human being to keep track of
all the places where it is required in some form or another. At some point, it
almost feels easier to stop caring and use the password intended for \emph{the
other site} for this one, too. What harm could that possibly do? The answer
ranges from none to unimaginable, depending on the services in question, their
relevance to the person being discussed, and also on \emph{how many other}
services share this password. A service requires registration? No problem, the
password will be the name of the cat plus the current year, so as to make it
more secure. It is password rotation day again this month, and a handful of
logins will be disabled if their passwords are not changed in the next couple
of hours. No worries, it is already covered by a combination of the current
month and the name of the specific service for each of them. A neat system. But
just in case they get forgotten in the fragments of this hectic lifestyle, they
need to be written down on a sticky note. Not to worry, nobody knows, it is
hidden under the keyboard, it is practically invisible.

These are all examples of poor password practices on the user's side; some
might have been circumstantially encouraged, such as by too-frequently forced
password rotation, while others can be ascribed to users not being sufficiently
well-versed in the intricacies of password hygiene.

Inevitably, such passwords are going to be misused, be it by a nosy colleague
who finds the sticky note or by an attacker targeting the user account; either
way, the password's \emph{only} role, protecting access, will likely not stand
much of a chance.

This thesis tangentially covers user-related issues like the ones described
above, but rather than attempting to go for prevention, it mainly focuses on
dealing with the acute consequence of such behaviour: a password breach. The
thesis consists of two parts. The theoretical one offers an overview of
password-related topics and frames passwords, as well as security topics, in
the web context in order to provide the necessary background for the second
part of the thesis. Cryptography topics such as hashing, encryption and entropy
are mentioned, and within the browser context a special spotlight is given to
the protocols powering the web: HTTP and TLS.

The practical part discusses the architecture, decision-making, implementation
details and validation methods utilised when building a web application that
enables users to monitor the breach status of their credentials by utilising an
online API service and local data imported into the program by the operators of
the tool. The program does not have many dependencies and is relatively
lightweight, which means that anybody with even a little experience should be
able to run their own private instance, if they so choose.

The purpose of the program is to allow users to learn if their credentials were
breached, while the reason for the breach might even be considered secondary in
importance. Breach data is not a publicly traded commodity and is relatively
hard to make sense of, given that we are talking about literal
\emph{terabytes} of data available to anyone with even the slightest interest
in finding it online. Breaches happen and can, of course, inform the decision
to stay with or leave a service, but there is not always an element of choice
involved, or only a limited one. Either way, knowledge precedes informed
decision-making. Abstracting away the ugly parts and offering users an
understandable interface would, if anything, likely improve their security
posture.

The author has been striving to utilise modern tooling and development
practices in an effort to build a maintainable and long-lasting piece of
software that serves its users well. When deployed, it could provide real
value.

Terminology is located in Appendix~\ref{appendix:terms}, feel free to give it a
read.

% =========================================================================== %
\part{Theoretical part}

\n{1}{Cryptography primer}\label{sec:cryptographyprimer}

\n{2}{Encryption}

\textbf{TODO:} add \emph{why} we care and how it's going to be used.

\n{2}{Hash functions}

Hash functions are algorithms used to help with a number of things: integrity
verification, password protection, digital signatures, public-key encryption
and others. Hashes are used in forensic analysis to prove the authenticity of
digital artifacts, to uniquely identify a change-set within revision-based
source code management systems such as Git, Subversion or Mercurial, by
anti-virus programs to detect known-malicious software, by advanced filesystems
to verify block integrity and enable repairs, and also in many other
applications that each person using a modern computing device has come across,
such as when connecting to a website protected by the famed HTTPS.

The popularity of hash functions stems from a common use case: the need to
simplify the reliable identification of a chunk of data. Of course, two chunks
of data, two files, frames or packets could always be compared bit by bit, but
that can get prohibitive from both a cost and an energy point of view
relatively quickly. That is where hash functions come in, since they are able
to take a long input and produce a short output, called a digest or a hash
value. The function also only works one way.

A file, or any original input data for that matter, cannot be reconstructed
from the hash digest alone by somehow \emph{reversing} the hashing operation,
since at the heart of any hash function there is essentially a compression
function.

Most alluringly, hashes are frequently used with the intent of
\emph{protecting} passwords by making them unreadable, while still being able
to verify that the user knows the password and therefore should be authorised.

As the hashing operation is irreversible, once the one-way function produces a
short digest, there is no way to reconstruct the original message from it.
That is, unless the input of the hash function is also known, in which case all
it takes is hashing the supposed input and comparing the digest with existing
digests that are known to be digests of passwords.

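The following sketch (illustrative only, not part of the program) demonstrates
this in Go using the standard-library \texttt{crypto/sha256} package: the
digest cannot be reversed, yet a weak, guessable password is recovered
instantly by simply hashing candidate inputs and comparing the results.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Recovering a weak password from a bare
SHA-256 digest by hashing guessed inputs (illustrative sketch)}]
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

func main() {
    // Digests of a few common passwords, precomputed by an attacker.
    common := []string{"password", "123456", "letmein"}
    precomputed := make(map[string]string, len(common))
    for _, p := range common {
        sum := sha256.Sum256([]byte(p))
        precomputed[hex.EncodeToString(sum[:])] = p
    }

    // A digest found in a breach dump; hashing is one-way, yet the original
    // is recovered immediately because the input was guessable.
    leaked := sha256.Sum256([]byte("letmein"))
    if original, ok := precomputed[hex.EncodeToString(leaked[:])]; ok {
        fmt.Println("digest cracked, original password:", original)
    }
}
\end{lstlisting}
\vspace*{-\baselineskip}
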
\textbf{TODO:} add more on \emph{why} we care and what types of hashes should be
used (with refs) and why.


\n{3}{Types and use cases}

Hash functions can be loosely categorised based on their intended use case into
\emph{password protection hashes}, \emph{integrity verification hashes},
\emph{message authentication codes} and \emph{cryptographic hashes}. Each of
these possesses unique characteristics and using the wrong type of hash
function for the wrong job can potentially result in a security breach.

As an example, consider \texttt{MD5}, a popular hash function internally built
on the \emph{Merkle-Damgård} construction, and \texttt{BLAKE3}, a modern
cryptographic hash function built on a binary \emph{Merkle tree}. While the
former produces 128 bit digests, the latter by default outputs a 256 bit digest
with no upper limit (thanks to its Merkle tree extensibility).

There is a list of differences that could further be mentioned, however, they
both have one thing in common: they are \emph{designed} to be \emph{fast}. The
latter, as a cryptographic hash function, is conjectured to be \emph{random
oracle indifferentiable}, secure against length extension, but it is also in
fact faster than all of \texttt{MD5}, \texttt{SHA3-256}, \texttt{SHA-1} and
even the \texttt{BLAKE2} family of functions.

The use case of both is to (quickly) verify the integrity of a given chunk of
data, in the case of \texttt{BLAKE3} with pre-image and collision resistance in
mind, not to secure a password by hashing it first, which poses a big issue
when they are used to\ldots{} secure passwords by hashing them first.

Password hashing functions, such as \texttt{argon2} or \texttt{bcrypt}, are
good choices for securely storing hashed passwords, namely because they place a
CPU and memory burden on the host computing the digest, as well as limit
potential parallelism, thus preventing the scale at which an exhaustive search
could be launched. Additionally, both functions automatically \emph{salt} the
passwords before hashing them, which means that two exact same passwords of two
different users will not end up hashing to the same digest value, making it
that much harder to recover the original, supposedly weak password.

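To illustrate the distinction, the following sketch derives a password digest
with Argon2id via the \texttt{golang.org/x/crypto/argon2} package. The
parameter values are illustrative assumptions only; note also that this
low-level Go binding expects the caller to supply the random salt explicitly,
whereas higher-level wrappers (and \texttt{bcrypt}) generate it themselves.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Deriving a salted, memory-hard Argon2id
digest (illustrative sketch with assumed parameters)}]
package main

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"

    "golang.org/x/crypto/argon2"
)

func main() {
    password := []byte("correct horse battery staple")

    // A unique, random salt per password ensures identical passwords do not
    // share a digest.
    salt := make([]byte, 16)
    if _, err := rand.Read(salt); err != nil {
        panic(err)
    }

    // One pass over 64 MiB of memory using 4 lanes, producing a 32-byte key;
    // the memory and time cost is what slows down exhaustive search.
    digest := argon2.IDKey(password, salt, 1, 64*1024, 4, 32)

    fmt.Println("salt:  ", hex.EncodeToString(salt))
    fmt.Println("digest:", hex.EncodeToString(digest))
}
\end{lstlisting}
\vspace*{-\baselineskip}
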
\n{3}{Why are hashes interesting}

As already mentioned, since hashes are often used to store a
representation of the password instead of the password itself, they become a
subject of interest when they get leaked. There have been enough instances of
leaked raw passwords that anyone with enough interest can put together a neat
list of hashes of the most popular passwords.

So while the service does not store plain text passwords, which is good, using
a hashing function not designed to protect passwords does not offer much
additional protection in the case of weak passwords, which are the most
commonly used ones.

It seems logical that a service that is not using cryptographic primitives
correctly is more likely to get hacked and have its users' passwords/hashes
leaked. Then, the Internet ends up serving as a storage of every data dump,
often exposing these passwords/hashes for everyone to access.


\n{2}{TLS}\label{sec:tls}

The Transport Layer Security protocol (or TLS) serves as an encryption and
\emph{authentication} protocol to secure internet communications. An important
part of the protocol is the \emph{handshake}, during which the two communicating
parties exchange messages that acknowledge each other's presence, verify each
other, choose what cryptographic algorithms will be used and decide session
keys. As there are multiple versions of the protocol in active duty even at the
moment, the server together with the client need to agree upon the version they
are going to use (these days it is recommended to use either 1.2 or 1.3) and
pick cipher suites; the client verifies the server's public key (and the
signature of the certificate authority that issued it) and they both generate
session keys for use after handshake completion.

TLSv1.3 dramatically reduced the number of available suites to only include the
ones deemed secure enough, which is why it is no longer needed to manually
specify what cipher suite should be used (or rely on the client/server to
choose wisely). While possibly facing compatibility issues with legacy devices,
the simplicity that enabling TLSv1.3 brings is a worthy trade-off.

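As a minimal sketch of what this looks like in practice (the address,
certificate paths and the exact version floor are illustrative assumptions, not
the project's configuration), a Go server can simply refuse anything older than
TLSv1.2 and leave cipher suite selection to the library:

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Serving HTTPS with a minimum TLS
version enforced (illustrative sketch)}]
package main

import (
    "crypto/tls"
    "log"
    "net/http"
)

func main() {
    srv := &http.Server{
        Addr: ":8443",
        // Anything below TLSv1.2 is rejected during the handshake; with
        // TLSv1.3 the cipher suites need no manual curation at all.
        TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
        Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello over TLS\n"))
        }),
    }

    // cert.pem and key.pem are hypothetical paths to the certificate chain
    // and the private key.
    log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
\end{lstlisting}
\vspace*{-\baselineskip}
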
\n{1}{Passwords}\label{sec:passwords}

Passwords have been in use since ancient times; apparently already the
Roman sentries used passwords or \textit{watchwords} to discern who was allowed
to enter an area. The Roman army had a special system of distributing passwords
among the encampment members on a wooden tablet. Fast forward a couple of
thousand years, during the days of the Prohibition Era in the United States, it
was the secret ``speakeasies'' that were protecting their illegitimate
alcohol-serving business using passwords~\cite{speakeasy}~\cite{nisthistory}.
During World War II, the US paratroopers' use of passwords evolved to even
include a counter-password.

According to McMillan, the first \textit{computer} passwords date back to
mid-1960s' Massachusetts Institute of Technology (MIT), when researchers at the
university built a massive time-sharing computer called CTSS. Apparently,
\textit{even then} the passwords did not protect the users as well as they were
expected to~\cite{mcmillan}.

Traditionally, passwords were expected to be memorised, but the large number of
password-protected \emph{services} these days can make this impractical. To
list a few common examples, access to a bank account, an electronic mailbox or
an encrypted disk of a personal computer are all protected by some form of a
password.

A password still often consists of a \textit{string} of characters typed into a
prompt and its function is still the same: as per NIST, it enables the
\textit{verifier} to infer the \textit{claimant}'s identity via a secret the
claimant holds.

There are always some arbitrary requirements applied to what the password can
be, only some turn out to be smarter than others.

Despite the impression given by the word ``password'', it does not need to be
an actual word, while a non-word (in the dictionary sense) may indeed be harder
to guess, which is a desirable property of passwords. A memorized secret
consisting of a sequence of words or other text separated by spaces is
sometimes called a passphrase. A passphrase is similar to a password in usage,
but the former is generally longer for added security.

\n{2}{Program-imposed constraints}

Some of the following examples might be a bit anecdotal and more of an
exception than a rule; nevertheless, when presented by a large-enough program
creator/service provider, their decisions reach a sufficient portion of the
population, enough that the author will call them influential. They shape how
users think when creating passwords and affect what users expect from other
services they happen to visit and use from that point on, as well.

\n{3}{Short arbitrary length}

It has been observed that a requirement for a ``strong'' password generally
represents that a password is:

\begin{itemize}
\item longer than 7 characters,
\item shorter than 11 characters,
\item begins with a letter and ends with a number OR
\item begins with a number and ends with a letter.
\end{itemize}

\obr{Short arbitrary password length
limit~\cite{larsklint}}{fig:arbitrarypasswdlengthlimit}{.8}{graphics/arbitrarypasswdlengthlimit.jpg}

This is wrong for multiple reasons, and it is a classic example of a short
arbitrary length requirement. It essentially prevents users from using
passphrases, makes using a password manager impractical and all of that has
apparently been done ``because of security''~\cite{etihad}. Moreover, this
might be indicative of the fact that instead of storing passwords hashed (as
they should be), they might be storing them in \textbf{plain text}.
Otherwise, what reason could exist for the limit to be 10 characters?
The recommendation of the US's National Institute of Standards and Technology
(NIST) in this regard is to accept passwords of at least 64 characters, with an
upper limit such as 256 characters being sufficient for most users' needs.

\n{3}{Restricting special characters}

Service providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords for as long as passwords have been
used to protect privileged access. Ways of achieving the same may vary but the
intent stays the same: preventing users from inputting characters into the
system, which the system cannot comfortably handle, for ``reasons'', which are
usually something dubious along the lines of ``an apostrophe may be used in SQL
injection attacks'' or ``angle brackets may be used in XSS attacks''. Instead,
the real message it announces points right to the serious shortcomings of
password handling of the site in question, as passwords should never be
re-displayed in a context that is prone to Cross Site Scripting (XSS), and the
passwords should always be hashed before being sent to the database anyway,
leaving only alphanumeric characters and rendering the SQLi fears baseless.

\obr{Forbidden special characters in
passwords~\cite{forbiddencharacters}}{fig:forbiddencharacters}{.8}{graphics/forbiddencharacters.jpg}

Note that ``Passw0rd!'' would have been a perfectly acceptable password for the
validator displayed in Figure~\ref{fig:forbiddencharacters}.
NIST's recommendations on this are that all printing ASCII~\cite{asciirfc20}
characters as well as the space character SHOULD be acceptable in memorized
secrets and Unicode~\cite{iso10646} characters SHOULD be accepted as well.

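In the spirit of that guidance, a password check can be reduced to verifying
the length and rejecting only control characters, accepting spaces and any
printable Unicode. The sketch below is illustrative; the concrete minimum and
maximum are assumptions for the example, not prescribed values.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={A permissive, length-focused password
validator (illustrative sketch)}]
package main

import (
    "errors"
    "fmt"
    "unicode"
    "unicode/utf8"
)

func validatePassword(s string) error {
    if !utf8.ValidString(s) {
        return errors.New("password is not valid UTF-8")
    }
    n := utf8.RuneCountInString(s)
    if n < 8 {
        return errors.New("password is too short")
    }
    if n > 256 {
        return errors.New("password is too long")
    }
    for _, r := range s {
        // Reject only control characters; apostrophes, angle brackets and
        // spaces are all fine, since the password is hashed, never echoed.
        if unicode.IsControl(r) {
            return errors.New("password contains control characters")
        }
    }
    return nil
}

func main() {
    fmt.Println(validatePassword("žluťoučký kůň úpěl <ďábelské> ódy!"))
}
\end{lstlisting}
\vspace*{-\baselineskip}
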
\n{3}{Character composition requirements}

There is a tendency to come up with bad passwords when there are character
composition requirements in place, too. The reality is that instead of
creating strong passwords directly, most users first try a basic version and
then keep tweaking characters until the password ends up fulfilling the minimum
requirement.
The \emph{problem} with that is that it has been shown that people use similar
patterns, i.e.\ starting with capital letters, putting a symbol last and a
number in the last two positions. This is also known to cyber criminals
cracking passwords and they run their dictionary attacks using the common
substitutions, such as ``\$'' for ``s'', ``3'' for ``E'', ``1'' for ``l'',
``@'' for ``a'', etc.
The password created in this manner will almost certainly be bad, so all that
is achieved is frustrating the user in order to still arrive at a bad password.

\n{3}{Other common issues}

Some services do not allow users to paste into password fields (disabling them
using JavaScript), thereby essentially breaking password manager
functionality, which is an issue because it encourages bad password practices
such as weak passwords and, likewise, password reuse.

Another frequent issue is forced frequent password rotation. Making frequent
password rotations mandatory contributes to users developing a password
creation pattern and is further a modern-day security anti-pattern;
according to the British NCSC, the practice ``carries no real benefits as stolen
passwords are generally exploited immediately''~\cite{ncsc}.


\n{1}{Web security}\label{sec:websecurity}

The internet, being the vast space of intertwined concepts and ideas, is a
superset of the Web, since not everything that is available on the internet can
be described as web \emph{resources}. But it is precisely that part of the
internet that is discussed in the next sections, which cover what browsers are,
what they do and how they relate to web security.


\n{2}{Site Isolation}

Modern browsers such as Firefox or Chromium come with a security focus in mind.
Their developers are acutely aware of the dangers that parsing untrusted code
from the internet, which is what websites, stylesheets and scripts are,
poses. This is reflected in the way these programs are architected: instead of
the main browser process running everything, the main privileged process spawns
de-privileged child processes for each website. This extra line of defence
makes it ``harder for untrustworthy websites to access or steal information''
from user accounts or other websites~\cite{siteisolation}, as even if the
misbehaving website manages to ``break some rules'' within its own process, it
should find it more difficult to steal data from other sites. Firefox's version
of Site Isolation is called Project Fission
(\url{https://hacks.mozilla.org/2021/05/introducing-firefox-new-site-isolation-security-architecture/})
but the two are very similar in both architecture and what they try to achieve.
Some Chromium users have complained in the past about its high memory usage,
which is partially due to the enabled Site Isolation features that are
protecting the user.


\n{2}{Cross-site scripting}\label{sec:xss}

As per the OWASP Top Ten list~\cite{owasptop10}, injection is the third most
observed issue across millions of websites. Cross-site scripting is a type of
attack in which scripts are injected into a website that would otherwise be
trusted. It occurs when a malicious user misuses the website to send malicious
code, such as a script, to another user. Since the misconfiguration or a flaw
of the application allowed this, the browser of the victim that trusts the
website simply executes the code provided by the attacker. This code thus gains
access to session tokens and any cookies associated with the website's origin,
apart from being able to rewrite the HTML content. The results of XSS can
range from account compromise to identity theft.


\n{2}{Content Security Policy}\label{sec:csp}

Content Security Policy (CSP) has been an important addition to the arsenal of
website operators, even though not everybody has necessarily been utilising it
properly or even taken notice. To understand what guarantees it provides and
what kind of protections it employs, it is first necessary to grok how websites
are parsed and displayed, which has been discussed in depth in previous
sections.

As per Weichselbaum et al.\ CSP is a mechanism designed to mitigate
XSS~\cite{longliveCSP}, a long-lived king of the vulnerability
lists~\cite{reininginCSP}. It is a declarative policy mechanism that allows the
website operator to decide what client-side resources can load and what origins
are permitted. Scripts can be restricted to only load from a list of trusted
domains and inline scripts can be blocked completely. Scripts can also be
allowed based on a cryptographic hash of their content, which should be
known information to website operators prior to or at the time the scripts are
served.

CSP3, which is the current version, can be served either as a header or inside
the website's \texttt{<meta>} tag. Directives are generally named after a topic
keyword and suffixed `-src', such as \texttt{script-src}, \texttt{img-src} or
\texttt{style-src}, although some directives don't follow this pattern,
\texttt{form-action}, \texttt{upgrade-insecure-requests} and \texttt{sandbox}
representing this group nicely. Sections are delimited using semicolons and
each section can only appear once. Special values exist for the origin itself
(\texttt{'self'}) and for disallowing any source (\texttt{'none'}). A good CSP
is targeted and not overly broad. To give an example, a website that loads no
JavaScript at all does not need to allow a popular CDN (Content Delivery
Network) origin in its \texttt{script-src}, instead it should be set to
\texttt{'none'}. CSP can also aid with clickjacking protection using its
\texttt{frame-ancestors} directive, which can limit the origins that are
allowed to embed the website, preventing an attacker from embedding it in
random places, such as malicious websites that masquerade as being legitimate.

Getting CSP right can be tricky at first but once grokked, it is relatively
straight-forward and can increase the security of the site greatly. While
testing, it is best to enable CSP in the report-only mode before turning it on
in production. There are many more directives and settings than mentioned in
this section, the author encourages anybody interested to give it a read, e.g.\
at \url{https://web.dev/csp/}.

\textbf{TODO}: add more concrete examples.

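As one concrete illustration (the policy values below are assumptions for a
hypothetical site that serves all of its own assets, runs no JavaScript and
never wants to be framed, not the policy of any particular deployment), such a
header can be attached by a small middleware wrapping any Go
\texttt{http.Handler}:

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Setting a targeted Content Security
Policy header from a Go middleware (illustrative sketch)}]
package main

import (
    "fmt"
    "log"
    "net/http"
)

func withCSP(next http.Handler) http.Handler {
    // Deny everything by default, then allow only same-origin images and
    // styles; forbid framing entirely.
    const policy = "default-src 'none'; " +
        "img-src 'self'; style-src 'self'; " +
        "form-action 'self'; frame-ancestors 'none'; " +
        "upgrade-insecure-requests"
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Security-Policy", policy)
        next.ServeHTTP(w, r)
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "hello")
    })
    log.Fatal(http.ListenAndServe(":8080", withCSP(mux)))
}
\end{lstlisting}
\vspace*{-\baselineskip}
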
\n{1}{Configuration}

Every non-trivial program usually offers at least \emph{some} way to
tweak/manage its behaviour, and these changes are usually persisted
\emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a
\emph{LocalStorage} key-value store in the browser, a binary or plain text
configuration file. These configuration files need to be read and checked at
least on program start-up and either stored into operating memory for the
duration of the runtime of the program, or loaded and parsed and the memory
subsequently \emph{freed} (initial configuration).

There is an abundance of configuration languages (or file formats used to craft
configuration files, whether they were intended for it or not) available: TOML,
INI, JSON and YAML, to name some of the popular ones (as of today).

Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), and in its apparent core being very similar
to JSON, which adds to the familiar feel. In fact, in Dhall's authors' own
words it is: ``a programmable configuration language that you can think of as:
JSON + functions + types + imports''~\cite{dhalllang}.

Among all of the listed features, the especially intriguing one to the author
was the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating for instance the declaration and
usage of custom types (that are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool}, \emph{Natural}
or \emph{List}, to name just a few), so it was not exceedingly hard to start
designing a custom configuration \emph{schema} for the program.
Dhall not being a Turing-complete language also guarantees that evaluation
\emph{always} terminates eventually, which is a good attribute to possess as a
configuration language.

\n{3}{Safety considerations}

With a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are some safety mechanisms employed, which can be
relied on by the user. Dhall offers this in multiple features: enforcing a
same-origin policy and (optionally) pinning a cryptographic hash of the value
of the expression being imported.

\n{3}{Possible alternatives}

While developing the program, the author has also
come across certain shortcomings of Dhall, namely long start-up with a
\emph{cold cache}, which can generally be observed when running the program
in an environment that does not allow writing the cache files (a read-only
filesystem), or does not keep the written cache files, such as a container that
is not configured to mount a persistent volume at the pertinent location.

To describe the way Dhall works when performing an evaluation, it resolves
every expression down to a combination of its most basic types (eliminating all
abstraction and indirection) in the process called
\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the
host's cache. The \texttt{dhall-haskell} binary attempts to resolve the
variable \texttt{\$\{XDG\_CACHE\_HOME\}} (have a look at the \emph{XDG Base
Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the
results of the normalisation will be written for repeated use. Do note that
this behaviour has been observed on a GNU/Linux host and the author has not
verified this behaviour on a non-GNU/Linux host, such as FreeBSD.

If normalisation is performed inside an ephemeral container (as opposed to, for
instance, an interactive desktop session), the results effectively get lost on
each container restart. That is both wasteful and not great for user
experience, since the normalisation of just a handful of imports (which
internally branch widely) can take upwards of two minutes, during which
the user is left waiting for the hanging application with no reporting on the
progress or current status.

While workarounds for the above-mentioned problem can be devised relatively
easily (such as bind mounting persistent volumes inside the container in place
of \texttt{\$\{XDG\_CACHE\_HOME\}/dhall} and
\texttt{\$\{XDG\_CACHE\_HOME\}/dhall-haskell} to preserve the cache between
restarts, or letting the cache be pre-computed during the container build,
since the application is only really expected to run together with a compatible
version of the configuration schema and this version \emph{is} known at
container build time), it would certainly feel better if there was no need to
work \emph{around} the configuration system of choice.

Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as a potentially almost drop-in replacement for Dhall feature-wise, while also
resolving the costly \emph{cold cache} normalisation operations, which is in
the author's view Dhall's titular issue.


\n{1}{Compromise Monitoring}

There are, of course, several ways one could approach the monitoring of
compromised credentials, some more \emph{manual} in nature than others. When
using a service that is suspected/expected to be breached in the future, one
can always create a unique username/password combination specifically for the
subject service and never use that combination anywhere else. That way, if the
credentials ever \emph{do} happen to appear in a data dump online in the
future, it is going to be a safe assumption as to where they came from.

Unfortunately, the task of actually \emph{monitoring} the credentials can prove
to be a little more arduous than one could expect at first. There are a couple
of points that can prove to pose a challenge in case the search is performed by
hand, namely:

\begin{itemize}
\item finding the breached data to look through
\item verifying the trustworthiness of the data
\item varying quality of the data
\item sifting through (possibly) unstructured data by hand
\end{itemize}

Of course, as this is a popular topic for a number of people, the above
mentioned work has already been packaged into neat and practical online
offerings. In case one decides in favour of using those, an additional range of
issues (with the previous ones still applicable) arises:

\begin{itemize}
\item the need to trust the provider with input credentials
\item relying on the goodwill of the provider to be able to access the data
\item hoping that the terms of service are kept
\end{itemize}

Besides that, there is a plethora of breaches floating around the Internet
available simply as zip files, which makes the job even harder.

The overarching goal of this thesis is devising and implementing a system in
which the user can \emph{monitor} whether their credentials have been
\emph{compromised} (at least as far as the data can tell), and allowing them to
do so without needing to entrust their sensitive data to a provider.


\n{2}{Data Sources}\label{sec:dataSources}

A data source in this place is considered anything that provides the
application with data that it understands.

Of course, the results of credential compromise verification/monitoring are
only going to be as good as the data underpinning them, which is why it is
imperative that high quality data sources be used, if at all possible. While
great care has to be taken to only choose the highest quality data sources, the
application must also offer a means to actually utilise them.

The sources from which breached data can be loaded into an application can be
split into two basic categories: \textbf{online} or \textbf{local}, and it is
possible to further discern between \emph{structured} and \emph{unstructured}
data.

An online source is generally a service that ideally exposes a programmatic
API, which an application can query and from which it can request the necessary
subsets of data.
These types of services often additionally front the data with a user-friendly
web interface for one-off searches, which is, however, not of use here.

Among the examples of such online services are:

\begin{itemize}
\item {Have I Been Pwned?} - \url{https://haveibeenpwned.com}
\item {DeHashed} - \url{https://dehashed.com}
\end{itemize}

Large lumps of unstructured data available on forums or shady web servers would
technically also count here, given that they provide data and are available
online. However, even though data is frequently found online precisely in this
form, it is also not of direct use for the application without manual
\emph{preprocessing}, as is attended to in
Section~\ref{sec:localDatasetPlugin}.

Another source is then simply any locally supplied data, which, of course,
could have been obtained from a breach available online beforehand.

Locally supplied data is specific in that it needs to be formatted in such a
way that it can be understood by the application. That is, the data is not in
its raw form anymore but has been morphed into the precise shape the
application needs for further processing. Once imported, the application can
query the data at will, as it knows exactly the shape of it.

This supposes the existence of a \emph{format} for importing, the schema of
which is devised in Section~\ref{sec:localDatasetPlugin}.


\n{3}{Local Dataset Plugin}\label{sec:localDatasetPlugin}

Unstructured breach data from locally available datasets can be imported into
the application by first making sure it adheres to the specified schema (have a
look at the \emph{Breach Data Schema} in Listing~\ref{breachDataGoSchema}). If
it does not (which is very likely with random breach data, as already mentioned
in Section~\ref{sec:dataSources}), it needs to be converted to a form that
\emph{does} before importing it into the application, e.g.\ using a Python
script or a similar method.

Attempting to import data that does not follow the outlined schema should
result in an error. Equally so, importing a dataset which is over a reasonable
size limit should by default be rejected by the program as a precaution.
Unmarshaling, for instance, a 1 TiB document would most likely result in an
out-of-memory (OOM) situation on the host running the application, assuming
contemporary consumer hardware conditions (not HPC).

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Breach Data Schema represented as a Go
struct with imports from the standard library assumed},
label=breachDataGoSchema]
type breachDataSchema struct {
    Name              string
    Time              time.Time
    IsVerified        bool
    ContainsPasswords bool
    ContainsHashes    bool
    HashType          string
    HashSalted        bool
    HashPeppered      bool
    ContainsUsernames bool
    ContainsEmails    bool
    Data              any
}
\end{lstlisting}
\vspace*{-\baselineskip}

The Go representation shown in Listing~\ref{breachDataGoSchema} will in
actuality translate to a YAML document written and supplied by an
administrative user of the program. The YAML format was chosen for several
reasons:

\begin{itemize}
\item relative ease of use (plain text, readability)
\item capability to store multiple \emph{documents} inside of a single file
\item most of the inputs being implicitly typed as strings
\item support for inclusion of comments
\item machine readability thanks to being a superset of JSON
\end{itemize}

The last point specifically should allow for documents similar to what can be
seen in Listing~\ref{breachDataYAMLSchema} to be ingested by the program, read
and written by humans and programs alike.

\smallskip
\begin{lstlisting}[language=YAML, caption={Example Breach Data Schema supplied
to the program as a YAML file, optionally containing multiple documents},
label=breachDataYAMLSchema]
---
name: Horrible breach
time: 2022-04-23T00:00:00+02:00
isVerified: false
containsPasswords: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    -
    - email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{lstlisting}
\vspace*{-\baselineskip}

Notice how the emails list in Listing~\ref{breachDataYAMLSchema} is missing one
record, perhaps because it was not supplied or was mistakenly omitted. This is
a valid scenario (mistakes happen) and the application needs to be able to
handle it. The alternative would be to require the user to prepare the data in
such a way that the empty/partial records would be dropped entirely.

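A defensive way of ingesting such a file is sketched below: the size of the
input is checked up front before any unmarshaling takes place, and the YAML
stream is then decoded document by document. The struct mirrors
Listing~\ref{breachDataGoSchema}; the size limit, file name and YAML field tags
are illustrative assumptions rather than the program's actual values.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Size-guarded import of a multi-document
YAML breach file (illustrative sketch)}]
package main

import (
    "errors"
    "fmt"
    "io"
    "os"
    "time"

    "gopkg.in/yaml.v3"
)

type breachDataSchema struct {
    Name              string    `yaml:"name"`
    Time              time.Time `yaml:"time"`
    IsVerified        bool      `yaml:"isVerified"`
    ContainsPasswords bool      `yaml:"containsPasswords"`
    ContainsHashes    bool      `yaml:"containsHashes"`
    HashType          string    `yaml:"hashType"`
    HashSalted        bool      `yaml:"hashSalted"`
    HashPeppered      bool      `yaml:"hashPeppered"`
    ContainsUsernames bool      `yaml:"containsUsernames"`
    ContainsEmails    bool      `yaml:"containsEmails"`
    Data              any       `yaml:"data"`
}

// maxImportSize is an arbitrary ceiling chosen for the example.
const maxImportSize = 512 << 20 // 512 MiB

func importBreaches(path string) ([]breachDataSchema, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    // Refuse oversized inputs before attempting to unmarshal anything.
    fi, err := f.Stat()
    if err != nil {
        return nil, err
    }
    if fi.Size() > maxImportSize {
        return nil, fmt.Errorf("refusing to import %q: %d bytes over limit", path, fi.Size())
    }

    // Decode every YAML document in the stream until EOF.
    var out []breachDataSchema
    dec := yaml.NewDecoder(f)
    for {
        var doc breachDataSchema
        if err := dec.Decode(&doc); errors.Is(err, io.EOF) {
            break
        } else if err != nil {
            return nil, err
        }
        out = append(out, doc)
    }
    return out, nil
}

func main() {
    breaches, err := importBreaches("breaches.yaml")
    if err != nil {
        panic(err)
    }
    fmt.Println("imported documents:", len(breaches))
}
\end{lstlisting}
\vspace*{-\baselineskip}
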
\n{3}{Have I Been Pwned? Integration}

Troy Hunt's \textbf{Have I Been Pwned?} online service
(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
compromised data. The service offers private APIs that are protected by API
keys. The application's \texttt{hibp} module and database representation model
the values returned by this API, which allows searching in large breaches using
email addresses.\\
The architecture there is relatively simple: the application administrator
configures an API key for HIBP, the user enters the query parameters, the
application constructs a query, calls the API and waits for a response. As
the API is rate-limited based on the key supplied, this can pose an issue and
it has not been fully resolved in the UI. The application then parses the
returned data and binds it to the local model for validation. If that goes
well, the data is saved into the database as a cache and the search query is
performed on the saved data. If it returns anything, it is displayed to the
user for browsing.


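For illustration, the call at the heart of this flow could look as follows.
The endpoint and the \texttt{hibp-api-key} header follow the public HIBP v3 API
documentation; the key, account and user-agent values are placeholders, and
rate-limit handling as well as JSON decoding into the local model are omitted.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Querying the HIBP v3 breachedaccount
endpoint (illustrative sketch)}]
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "os"
)

func breachesForAccount(apiKey, account string) ([]byte, error) {
    u := "https://haveibeenpwned.com/api/v3/breachedaccount/" +
        url.PathEscape(account) + "?truncateResponse=false"

    req, err := http.NewRequest(http.MethodGet, u, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("hibp-api-key", apiKey)
    req.Header.Set("user-agent", "pcmt-example")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    switch resp.StatusCode {
    case http.StatusOK:
        return io.ReadAll(resp.Body) // a JSON array of breach objects
    case http.StatusNotFound:
        return nil, nil // the account was not found in any breach
    default:
        return nil, fmt.Errorf("unexpected status: %s", resp.Status)
    }
}

func main() {
    raw, err := breachesForAccount(os.Getenv("HIBP_API_KEY"), "user@example.com")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(raw))
}
\end{lstlisting}
\vspace*{-\baselineskip}
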
\n{1}{Deployment recommendations}\label{sec:deploymentRecommendations}

It is, of course, recommended that the application runs in a secure environment
\allowbreak although definitions of that almost certainly differ depending on
who you ask. General recommendations would be either to effectively reserve a
machine for a single use case - running this program - so as to dramatically
decrease the potential attack surface of the host, or to run the program
isolated in a container or a virtual machine. Further, if the host does not
need management access (it is a deployed-to-only machine that is configured
out-of-band, such as with a \emph{golden} image/container or declaratively with
Nix), then an SSH \emph{daemon} should not be running on it, since it is not
needed. In an ideal scenario, the host machine would have as little software
installed as possible besides what the application absolutely requires.

System-wide cryptographic policies should target the highest feasible security
level, if at all available (such as by default on Fedora or RHEL), covering
SSH, DNSSec, IPsec, Kerberos and TLS protocols. Firewalls should be configured
and SELinux (kernel-level mandatory access control and security policy
mechanism) should be running in \emph{enforcing} mode, if available.

\n{2}{Transport security}

Users connecting to the application should rightfully expect their data to
be protected \textit{in transit} (i.e.\ on the way between their browser and
the server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims it is) and
negotiate a symmetric key for encryption in the process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications connection. The operator should
configure the program to either directly utilise TLS using configuration or
have it listen behind a TLS-terminating \emph{reverse proxy}.


\n{2}{Containerisation}

Whether the pre-built or a custom container image is used to deploy the
application, it still needs access to secrets, such as the database connection
string (containing database host, port, user, password/encrypted password,
authentication method and database name).

The application should be able to handle the most common Postgres
authentication methods~\cite{pgauthmethods}, namely \emph{peer},
\emph{scram-sha-256}, \emph{user name maps} and raw \emph{password}, although
the \emph{password} option should not be used in production, \emph{unless} the
connection to the database is protected by TLS.\ In any case, using the
\emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable. One of the
ways to verify in a development environment that everything works as intended
is the \emph{Password generator for PostgreSQL} tool~\cite{goscramsha256},
which allows retrieving the encrypted string from a raw user input.

If the application running in a container wants to use the \emph{peer}
authentication method, it is up to the operator to supply the Postgres socket
to the application (e.g.\ as a volume bind mount). This scenario was not
tested, however, and the author is also not entirely certain how \emph{user
namespaces} (on GNU/Linux) would influence the process (as in when the
\emph{ID}s of a user \textbf{outside} the container are mapped to a range of
\emph{UIDs} \textbf{inside} the container), for which the setup would likely
need to account.

Equally, if the application is running inside the container, the operator needs
to make sure that the database is either running in a network that is also
directly attached to the container or that there is a mechanism in place that
routes the requests for the database hostname to the destination.

One such mechanism is container name based routing inside \emph{pods}
(Podman/Kubernetes), where the resolution of container names is the
responsibility of a specially configured (often auto-configured) piece of
software called Aardvark for the former and CoreDNS for the latter.


\n{1}{Summary}

Passwords (and/or passphrases) are in use everywhere and quite probably will be
for the foreseeable future. If not as \textit{the} principal way to
authenticate, then at least as \textit{a} way to authenticate. As long as
passwords are going to be handled and stored by service/application providers,
they are going to get leaked, be it due to provider carelessness or the
attackers' resolve and wit. Of course, sifting through all the available
password breach data by hand is not a reasonable option, and therefore tools
providing assistance come in handy. The next part of this diploma thesis will
explore that issue and introduce a solution.


% =========================================================================== %
\part{Practical part}

\n{1}{Kudos}

The program that has been developed as part of this thesis utilised a
great deal of free (as in \textit{freedom}) and open-source software in the
process, either directly or as an outstanding work tool, and the author would
like to take this opportunity to recognise that fact\footnotemark.

In particular, the author acknowledges that this work would not be the same
without:

\begin{itemize}
\item vim (\url{https://www.vim.org/})
\item Arch Linux (\url{https://archlinux.org/})
\item ZSH (\url{https://www.zsh.org/})
\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
\item Nix (\url{https://nixos.org/explore.html})
\item pre-commit (\url{https://pre-commit.com/})
\item Podman (\url{https://podman.io/})
\item Go (\url{https://go.dev/})
\end{itemize}

All of the code written has been typed into VIM (\texttt{9.0}), the shell used
to run the commands was ZSH, both running in the author's terminal emulator of
choice, \texttt{kitty}. The development machines ran a recent installation of
\textit{Arch Linux (by the way)} and Fedora 38, both using a \texttt{6.3.x}
XanMod variant of the Linux kernel.

\footnotetext{\textbf{Disclaimer:} the author is not affiliated in any way with any
of the projects described on this page.}


\n{1}{Development}

The source code of the project has been versioned since the start, using the
popular and industry-standard git (\url{https://git-scm.com}) source code
management (SCM) tool. Commits were made frequently and, if at all possible,
for small and self-contained changes of code, trying to follow sane commit
message \emph{hygiene}, i.e.\ striving for meaningful and well-formatted commit
messages. The name of the default branch is \texttt{development}, since that is
what the author likes to choose for new projects that are not yet stable (it is
in fact the default in the author's \texttt{.gitconfig}).


\n{2}{Commit signing}

Since git allows cryptographically \emph{signing} all commits, it would be
unwise not to take advantage of this. For the longest time, GPG was the only
method available for signing commits in git; however, that is no longer the
case~\cite{agwagitssh}. These days, it is also possible to both sign and
verify one's git commits (and tags!) using SSH keys, namely those produced by
OpenSSH (the same ones that can be used to log in to remote systems). The
author has, of course, not reused the same key pair that is used to connect to
machines for signing commits. A different, \texttt{Ed25519} elliptic curve key
pair has been used specifically for signing. A public component of this key is
enclosed in this thesis as Appendix~\ref{appendix:signingkey} for future
reference.

The validity of a signature on a particular commit can be viewed with git using
the following commands (the \% sign denotes the shell prompt):

\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Verifying signature of a git commit},
label=gitverif, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
% cd <cloned project dir>
% git show --show-signature <commit>
% # alternatively:
% git verify-commit <commit>
\end{lstlisting}
\vspace*{-\baselineskip}

There is one caveat to this, though: git first needs some additional
configuration for the code in Listing~\ref{gitverif} to work as one would
expect. Namely, the public key used to verify the signature needs to be
stored in git's ``allowed signers file'', then git needs to be told where that
file is located using the configuration value
\texttt{gpg.ssh.allowedsignersfile}, and finally the configuration value of the
\texttt{gpg.format} field needs to be set to \texttt{ssh}.

Because git allows the configuration values to be local to each repository,
both of the mentioned issues can be solved by running the following commands
from inside of the cloned repository:

\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Prepare allowed signers file and signature format for git},
label=gitsshprep, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key.
% cat > ./.tmp-allowed_signers \
<<<'surtur <insert literal surtur pubkey>
leo <insert literal leo pubkey>'
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./.tmp-allowed_signers
\end{lstlisting}
\vspace*{-\baselineskip}

After the code in Listing~\ref{gitsshprep} is run, everything from
Listing~\ref{gitverif} should remain applicable for the lifetime of the
repository or until git changes its implementation of signature verification.
The git \texttt{user.name} that can be seen on the commits in the
\textbf{Author} field is named after the machine that was used to develop the
program, since the author uses different signing keys on each machine. That way
the committer machine can be determined post-hoc.

For future reference, git has been used in the version \texttt{git version
2.40.1}.


\n{2}{Continuous Integration}

To increase both the author's and public confidence in the atomic changes made
over time, it was attempted to thoroughly \emph{integrate} them using a
continuous integration (CI) service that was plugged into the main source code
repository since the early stages of development. This, of course, was again
self-hosted, including the workers. The tool of choice there was Drone
(\url{https://drone.io}) and the ``docker'' runner (in fact it runs any OCI
container) was used to run the builds.

The way this runner works is that it creates an ephemeral container for every
pipeline step and executes given \emph{commands} inside of it. At the end of
each step the container is discarded, while the repository, which is mounted
into each container's \texttt{/drone/src}, is persisted between steps, allowing
it to be cloned from \emph{origin} only at the start of the pipeline and
then shared for all of the following steps, saving bandwidth, time and disk
writes.

The entire configuration used to run the pipelines can be found in a file named
\texttt{.drone.yml} at the root of the main source code repository. The
workflow consists of four pipelines, which are run in parallel. Two main
pipelines are defined to build the frontend assets, the \texttt{pcmt} binary
and run tests on \texttt{x86\_64} GNU/Linux targets, one for each of Arch and
Alpine (version 3.17). These two pipelines are identical apart from OS-specific
bits such as installing a certain package, etc. For the record, other
OS-architecture combinations were not tested.

A third pipeline contains instructions to build a popular static analysis tool
called \texttt{golangci-lint} from source; it is a sort of meta-linter,
bundling a staggering amount of linters (a linter is a tool that performs
static code analysis and can raise awareness of programming errors, flag
potentially buggy code constructs, or \emph{mere} stylistic errors). The
freshly built binary is then used to perform the analysis of the project's
codebase. If the result of this step is successful, a handful of code analysis
services get pinged in the next steps to take notice of the changes to the
project's source code and update their metrics; details can be found in the
main Drone configuration file \texttt{.drone.yml}, and the configuration for
the \texttt{golangci-lint} tool itself (such as what linters are
enabled/disabled and with what settings) can be found in the root of the
repository in the file named \texttt{.golangci.yml}.

The fourth pipeline focuses on linting the Containerfile and building the
container, although the latter action is only performed on feature branches,
\emph{pull requests} or \emph{tag} events.

\obr{Drone CI median build
time}{fig:drone-median-build}{.84}{graphics/drone-median-build}

The median build time as of writing was 1 minute, which includes running all
four pipelines, and that is acceptable. Build times might of course vary
depending on the hardware; for reference, these builds were run on a machine
equipped with a Zen 3 Ryzen 5 5600 CPU with nominal clock times, DDR4 3200MHz
RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS) and a
400Mbps downlink, software-wise running Arch with an author-flavoured Xanmod
kernel version 6.3.x.


\n{2}{Source code repositories}\label{sec:repos}

All of the pertaining source code was published in repositories on a publicly
available git server operated by the author, the reasoning \emph{pro}
self-hosting being that it is the preferred way of guaranteeing autonomy over
one's source code, as opposed to large silos owned by big corporations having a
track record of arguably not always deciding with the user's best interest in
mind (although recourse has been observed~\cite{ytdl}). When these providers
act on impulse or under public pressure they can potentially, at least
temporarily, disrupt their users' operations, thus binding their users not only
to their lengthy \emph{terms of service} that \emph{can change at any time},
but also to factors outside their control. Granted, decentralisation can take a
toll on discoverability of the project, but that is not a concern here.

The git repository containing source code of the \texttt{pcmt} project:\\
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.

The git repository hosting the \texttt{pcmt} configuration schema:\\
\url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.

The repository containing the \LaTeX{} source code of this thesis:\\
\url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.


\n{2}{Toolchain}

Throughout the creation of this work, the \emph{then-current} version of the Go
programming language was used, i.e.\ \texttt{go1.20}.

To read more on why Go was chosen, see Appendix~\ref{appendix:whygo}.
Nix/\texttt{devenv} tools have also aided heavily during development, see
Appendix~\ref{appendix:whynix} to learn more.

\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
\textbf{Name} & \textbf{Usage} \\
Go programming language & program core \\
Dhall configuration language & program configuration \\
Echo & HTTP handlers, controllers, web server \\
ent & ORM using graph-based modelling \\
bluemonday & sanitising HTML \\
TailwindCSS & stylesheets using a utility-first approach \\
PostgreSQL & persistently storing data \\
}

Table~\ref{tab:depsversionmx} contains the names and versions of the most
important libraries and supporting software that were used to build the
application.

\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
\textbf{Name} & \textbf{Version} \\
\texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2\\
\texttt{ent} (\url{https://entgo.io/}) & 0.11.10 \\
\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.23 \\
\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
\texttt{PostgreSQL} (\url{https://www.postgresql.org/}) & 15.2 \\
}


\n{1}{Application architecture}

\n{2}{Package structure}

The source code of the main module is organised into smaller, self-contained Go
\emph{packages} appropriately along a couple of domains: logging, core
application, web routers, configuration and settings, etc. In Go, packages are
delimited by folder structure -- each folder can be a package.

Generally speaking, the program aggregates decision points into central places,
such as \texttt{run.go}, which then imports child packages that facilitate each
of the tasks of loading the configuration, connecting to the database and
running migrations, consolidating flag, environment variable and
configuration-based values into canonical \emph{settings}, setting up routes
and handling graceful shutdown.

\n{3}{Internal package}
|
|
|
|
The \texttt{internal} package was not used as of writing, but the author plans
|
|
to eventually migrate \emph{internal} logic of the program into the internal
|
|
package to prevent accidental imports.
|
|
|
|
|
|
\n{2}{Logging}

The program uses dependency injection to share a single logger instance, and a
similar approach applies to the database client. These are passed around as
pointers, so the underlying data stays the same. As a rule of thumb, every
larger \texttt{struct} that needs to be passed around is passed around as a
pointer.

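A minimal sketch of this pattern follows. The concrete types are simplified
here (the real application uses its own \texttt{slogging} package and an
\texttt{ent} database client); what matters is that the dependencies are
constructed once and then passed down as pointers.

\begin{lstlisting}[language=Go, caption={Illustrative sketch of dependency
injection of a shared logger}, label=diLoggerSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package app

import "log"

// DBClient is a placeholder for the real database client type.
type DBClient struct{ /* ... */ }

// App bundles the shared dependencies. Both fields are pointers, so every
// consumer sees the same underlying logger and database client.
type App struct {
    logger *log.Logger
    db     *DBClient
}

// NewApp wires the dependencies exactly once at start-up.
func NewApp(l *log.Logger, db *DBClient) *App {
    return &App{logger: l, db: db}
}

// handleSignup is an example consumer: it uses the injected logger instead
// of constructing or importing a global one.
func (a *App) handleSignup(username string) {
    a.logger.Printf("new signup attempt: %s", username)
    // ... use a.db here ...
}
\end{lstlisting}
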
\n{2}{Authentication}

The authentication logic is relatively simple and the author attempted to
isolate it into a custom \emph{middleware}. User passwords are hashed using a
secure KDF before being sent to the database. The KDF of choice is
\texttt{bcrypt} (with a sane \emph{Cost} of 10), which automatically includes
a \emph{salt} for the password and provides ``length-constant'' time hash
comparisons. The author plans to add support for the more modern
\texttt{scrypt} and the state-of-the-art \texttt{Argon2}, winner of the
Password Hashing Competition
(\url{https://github.com/P-H-C/phc-winner-argon2}), for flexibility.

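The following is an illustrative sketch of the hashing and verification flow
using the \texttt{golang.org/x/crypto/bcrypt} package; the function names are
for demonstration only and do not mirror the application's actual handlers.

\begin{lstlisting}[language=Go, caption={Illustrative bcrypt hashing and
verification}, label=bcryptSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package auth

import "golang.org/x/crypto/bcrypt"

// hashPassword derives a salted bcrypt hash of the plain-text password;
// bcrypt.DefaultCost corresponds to the Cost of 10 mentioned above.
func hashPassword(plain string) ([]byte, error) {
    return bcrypt.GenerateFromPassword([]byte(plain), bcrypt.DefaultCost)
}

// verifyPassword compares a stored hash against a login attempt without
// revealing where the comparison failed.
func verifyPassword(hash []byte, attempt string) bool {
    return bcrypt.CompareHashAndPassword(hash, []byte(attempt)) == nil
}
\end{lstlisting}

Only the resulting hash (which already embeds the salt and the cost parameter)
would ever be stored in the database; the plain-text password is discarded
right after hashing.
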
\n{2}{SQLi prevention}

No raw SQL queries are directly used to access the database, thus decreasing
the likelihood of SQL injection attacks. Instead, parametric queries are
constructed in code using the graph-like API of the \texttt{ent} library,
which is covered in depth in Section~\ref{sec:dbschema}.

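As a brief illustration, a lookup by email might look as follows; the example
assumes that the generated \texttt{ent} packages follow the library's usual
layout under the project's \texttt{ent} directory.

\begin{lstlisting}[language=Go, caption={Illustrative parametric query built
with ent}, label=entQuerySketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package example

import (
    "context"

    pcmtent "git.dotya.ml/mirre-mt/pcmt/ent"
    "git.dotya.ml/mirre-mt/pcmt/ent/user"
)

// findByEmail expresses the lookup with ent's generated, type-safe query
// builder; the email value is handed to the SQL driver as a bound parameter
// and never concatenated into the query string.
func findByEmail(ctx context.Context, db *pcmtent.Client, email string) (*pcmtent.User, error) {
    return db.User.
        Query().
        Where(user.EmailEQ(email)).
        Only(ctx)
}
\end{lstlisting}
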
\n{2}{Configurability}

Virtually every important value in the program has been made into a
configuration value, so that the operator can customise the experience as
needed. A choice of sane configuration defaults was attempted, which resulted
in the configuration file essentially only needing to contain secrets, unless
there is a need to override the defaults. It is not an entirely
\emph{zero-config} situation, rather a \emph{minimal-config} one. An example
can be seen in Section~\ref{sec:configuration}.

\n{2}{Embedded assets}

An important aspect worth mentioning is the embedding of assets and templates.
Go has native mechanisms for embedding arbitrary files directly into the
binary during the regular build process. The built-in \texttt{embed} package
was used to bundle all template files and web assets, such as images, logos
and stylesheets, at the package level, and these are then also passed around
the application as needed.

There is also a toggle in the application configuration, which can instruct
the program at start-up to either rely entirely on the embedded assets or pull
live files from the filesystem. The former option makes the application more
portable, while the latter allows for flexibility, and not only during
development.

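A condensed sketch of the mechanism is shown below; the directory names mirror
the \texttt{assets} and \texttt{templates} folders referenced by the
configuration defaults, while the helper function itself is illustrative only.

\begin{lstlisting}[language=Go, caption={Illustrative use of the embed package
with a live-mode fallback}, label=embedSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package assets

import (
    "embed"
    "io/fs"
    "os"
)

// Both directories are compiled into the binary at build time.
//
//go:embed assets templates
var embedded embed.FS

// Which returns either the live on-disk tree rooted at appPath or the
// embedded filesystem, mirroring the LiveMode/AppPath toggle described
// above.
func Which(live bool, appPath string) fs.FS {
    if live {
        return os.DirFS(appPath)
    }

    return embedded
}
\end{lstlisting}
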
\n{2}{Composability}

Templates used for rendering the web pages were created in a composable
manner, split into smaller, reusable parts, such as \texttt{footer.tmpl} and
\texttt{head.tmpl}. Specific functionality is then conditionally executed
based on the determined level of access of the user. The core templating
functionality was provided by the native Go \texttt{html/template} package. An
Echo renderer interface has been implemented, so that the rendering can be
performed ergonomically and directly using Echo's built-in facilities. The
popular HTML sanitiser \emph{bluemonday} has been employed to help battle XSS.

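The renderer is conceptually a thin adapter over \texttt{html/template}; a
simplified version could look like the following sketch (the application's
actual renderer is not reproduced here).

\begin{lstlisting}[language=Go, caption={Illustrative Echo renderer backed by
html/template}, label=echoRendererSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package render

import (
    "html/template"
    "io"

    "github.com/labstack/echo/v4"
)

// Tmpl adapts Go's html/template to Echo's Renderer interface.
type Tmpl struct {
    templates *template.Template
}

// Render satisfies echo.Renderer, letting handlers call
// c.Render(http.StatusOK, "home.tmpl", data) with the name of one of the
// composable templates.
func (t *Tmpl) Render(w io.Writer, name string, data interface{}, c echo.Context) error {
    return t.templates.ExecuteTemplate(w, name, data)
}
\end{lstlisting}

The renderer is then registered once during server set-up by assigning an
instance of it to the \texttt{Renderer} field of the Echo instance.
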
\n{2}{Server-side rendering}

The application constructs the web pages entirely server-side and runs without
a single line of JavaScript, of which the author is especially proud. This
improves load times, decreases the attack surface, increases maintainability
and reduces the cognitive load that comes with dealing with JavaScript. Of
course, it requires extensive usage of non-semantic \texttt{POST}s in forms
even for updates (where \texttt{PUT}s should be used) and the accompanying
frequent full-page refreshes, but that still is not enough to warrant the use
of JavaScript.

\n{2}{Frontend}

Frontend-side, the application was styled using TailwindCSS, which promotes
the use of flexible \emph{utility-first} classes directly in the markup (HTML)
instead of separating the specific styles out into all-encompassing classes.
The author understands this is somewhat of a preference issue and does not
hold hard opinions in either direction; Tailwind simply looked nice,
especially with its built-in support for dark/light mode. The templates
containing the CSS classes need to be parsed by Tailwind in order to construct
the final stylesheet, and Tailwind's first-party CLI tool,
\texttt{tailwindcss}, is used for exactly that. Overall, simple and accessible
layouts were preferred over convoluted ones.

\n{3}{Frontend experiments}

As an aside, the author briefly experimented with WebAssembly for this
project, but ultimately scrapped the functionality in favour of the entirely
server-side rendered one. It might get revisited if client-side dynamic
functionality were ever necessary and performance mattered. Even from the
short experiments it was obvious how much faster WebAssembly was compared to
JavaScript.

\newpage
\n{2}{User isolation}

Users are allowed into certain parts of the application based on the role they
currently possess. For the moment, two basic roles were envisioned, while this
list might get amended in the future, should the need arise:

\begin{itemize}
\item Administrator
\item User
\end{itemize}

\obr{Application use case diagram}{fig:usecasediagram}{.9}{graphics/pcmt-use-case.pdf}

It is paramount that the program protects itself from insider threats as well,
and therefore each role is only able to perform actions that it has been
explicitly assigned. While there definitely is a certain overlap between the
capabilities of the two outlined roles, each also possesses unique features
that the other one does not.

For example, the administrator role is not able to perform searches on the
breach data directly using their administrator account; for that, a separate
user account has to be created. Similarly, the regular user is not able to
manage breach lists and other users, because that is a privileged operation.

In-application administrators are not able to view sensitive (in fact, any)
user data and should therefore only be able to perform the following actions:

\begin{itemize}
\item Create user accounts
\item View list of users
\item View user email
\item Change user email
\item Toggle whether user is an administrator
\item Delete user accounts
\end{itemize}

Let us consider the case of a user managing their own account: while demoting
oneself from administrator to regular user is permitted, promoting oneself to
administrator would constitute a \emph{privilege escalation} and would likely
be a precursor to at least a \emph{denial of service} of sorts.

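The following is a minimal, hypothetical sketch of that rule; the type and
function names are illustrative and do not correspond to the application's
actual code.

\begin{lstlisting}[language=Go, caption={Illustrative guard against
self-promotion}, label=selfPromotionSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package user

import "errors"

// User is a stand-in for the application's user model; the field names are
// assumptions made for this sketch.
type User struct {
    ID      int
    IsAdmin bool
}

// setAdmin captures the self-management rule described above: demoting
// oneself is permitted, but promoting oneself to administrator is rejected
// as a privilege escalation attempt.
func setAdmin(actor, target *User, wantAdmin bool) error {
    if actor.ID == target.ID && wantAdmin && !actor.IsAdmin {
        return errors.New("privilege escalation attempt: cannot promote self")
    }

    target.IsAdmin = wantAdmin

    return nil
}
\end{lstlisting}
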
\n{2}{Zero trust principle}

\textit{Confidentiality, i.e.\ not trusting the provider}

There is no way for the application (and consequently, the in-application
administrator) to read the user's data. This is achieved by encrypting the
pertinent data with a state-of-the-art \emph{age} key~\cite{age} (backed by
X25519~\cite{x25519rfc7748}) before saving it in the database; the key is in
turn safely stored encrypted with a passphrase that only the user controls. Of
course, the user-supplied password is first run through a password-based key
derivation function (PBKDF: a key derivation function with a sliding
computational cost) before it is allowed to encrypt the \emph{age} key.

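To make the construction more tangible, the following sketch shows how a fresh
\emph{age} identity could be generated and immediately wrapped with a
passphrase-derived key using the \texttt{filippo.io/age} library; the
application's actual key-handling code and KDF parameters are not reproduced
here.

\begin{lstlisting}[language=Go, caption={Illustrative generation and
passphrase-wrapping of an age key}, label=ageKeySketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package keys

import (
    "bytes"
    "io"

    "filippo.io/age"
)

// newProtectedKey generates a fresh X25519 age identity and returns it
// encrypted under a passphrase-derived (scrypt) key, so that only the user
// who knows the passphrase can ever unwrap the identity again.
func newProtectedKey(passphrase string) (wrappedKey []byte, rcpt *age.X25519Recipient, err error) {
    id, err := age.GenerateX25519Identity()
    if err != nil {
        return nil, nil, err
    }

    // A scrypt recipient is a passphrase-based KDF with a tunable work
    // factor, i.e. a sliding computational cost.
    scrypt, err := age.NewScryptRecipient(passphrase)
    if err != nil {
        return nil, nil, err
    }

    var buf bytes.Buffer

    w, err := age.Encrypt(&buf, scrypt)
    if err != nil {
        return nil, nil, err
    }

    if _, err := io.WriteString(w, id.String()); err != nil {
        return nil, nil, err
    }

    if err := w.Close(); err != nil {
        return nil, nil, err
    }

    // The public recipient may be stored in plain text; it can only ever be
    // used to encrypt data *for* the user, never to decrypt it.
    return buf.Bytes(), id.Recipient(), nil
}
\end{lstlisting}
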
The \emph{age} key is only generated when the user changes their password for
the first time. This prevents scenarios such as an in-application
administrator with access to the physical database being able to both
\textbf{recover} the key from the database and \textbf{decrypt} it, given that
they already know the user's password (because they set it), which would
subsequently give them unbounded access to any future encrypted data, for as
long as they were able to maintain their database access. This is why the
\emph{age} key generation and protection are bound to the first password
change. Of course, an evil administrator could just perform the change
themselves; however, the user would at least be able to find those changes in
the activity logs and know not to use the application. And given the scenario
of a total database compromise, the author finds that all hope is already lost
at that point. At least when the database is dumped, it only contains
non-sensitive, functional information in plain text; everything else should be
encrypted.

Consequently, neither the application operators nor the in-application
administrators should ever be able to learn the details of what the user is
tracking, the same being applicable even to potential attackers with direct
access to the database. Thus the author maintains that every scenario that
could potentially lead to a data breach (apart from a compromised user machine
and the like) would have to entail some form of operating memory acquisition,
for instance using \texttt{LiME}~\cite{lime}, or perhaps targeting the
\emph{hypervisor} directly, if considering a virtualised (``cloud'')
environment.

\n{1}{Implementation}

\n{2}{Dhall Configuration Schema}\label{sec:configuration}

The configuration schema was initially developed as part of the main project's
repository, before it was determined that both the development and overall
clarity would benefit if the schema lived in its own repository (see
Section~\ref{sec:repos} for details). This now enables the schema to be
independently developed and versioned, and only pulled into the main
application whenever the application is determined to be ready for it.

% \vspace{\parskip}
\smallskip
% \vspace{\baselineskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration schema version 0.0.1-rc.2},
label=dhallschema, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
let Schema =
      { Type =
          { Host : Text
          , Port : Natural
          , HTTP :
              { Domain : Text
              , Secure : Bool
              , AutoTLS : Bool
              , TLSKeyPath : Text
              , TLSCertKeyPath : Text
              , HSTSMaxAge : Natural
              , ContentSecurityPolicy : Text
              , RateLimit : Natural
              , Gzip : Natural
              , Timeout : Natural
              }
          , Mailer :
              { Enabled : Bool
              , Protocol : Text
              , SMTPAddr : Text
              , SMTPPort : Natural
              , ForceTrustServerCert : Bool
              , EnableHELO : Bool
              , HELOHostname : Text
              , Auth : Text
              , From : Text
              , User : Text
              , Password : Text
              , SubjectPrefix : Text
              , SendPlainText : Bool
              }
          , LiveMode : Bool
          , DevelMode : Bool
          , AppPath : Text
          , Session :
              { CookieName : Text
              , CookieAuthSecret : Text
              , CookieEncrSecret : Text
              , MaxAge : Natural
              }
          , Logger : { JSON : Bool, Fmt : Optional Text }
          , Init : { CreateAdmin : Bool, AdminPassword : Text }
          , Registration : { Allowed : Bool }
          }
      }
\end{lstlisting}
\vspace*{-\baselineskip}

Full schema with type annotations can be seen in Listing~\ref{dhallschema}.

The \texttt{let} statement declares a variable called \texttt{Schema} and
assigns to it the result of the expression on the right side of the equals
sign, which has for practical reasons been trimmed and is displayed without
the \emph{default} block. The default block is instead shown in its own
Listing~\ref{dhallschemadefaults}.

The main configuration comprises both raw attributes and child records, which
allow for grouping of related functionality. For instance, configuration
settings pertaining to the mailserver setup are grouped in a record named
\textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool},
which was deemed appropriate for an on-off switch-like functionality, with the
only permissible values being either \emph{True} or \emph{False}. Do note that
in Dhall \texttt{true} $\neq$ \texttt{True}, since \textbf{True} is internally
a Bool constant built into Dhall (check out ``The
Prelude''~\cite{dhallprelude}), while \textbf{true} is evaluated as an
\emph{unbound} variable, that is, a variable \emph{not} defined in the current
\emph{scope}.

Another of Dhall's specialties is that the `$==$' and `$!=$' (in)equality
operators \textbf{only} work on values of type \texttt{Bool}, which for
example means that variables of type \texttt{Natural} (\texttt{uint}) or
\texttt{Text} (\texttt{string}) cannot be compared directly as in other
languages. This either leaves the comparison work to a higher-level language
(such as Go), or, from the perspective of the Dhall authors, promotes
\emph{enums} for when the exact value matters.

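For completeness, a minimal sketch of how such a configuration could be loaded
from Go follows; it assumes the \texttt{Unmarshal} helper of the
\texttt{dhall-golang} library and shows only a trimmed-down struct, not the
application's actual configuration types.

\begin{lstlisting}[language=Go, caption={Illustrative loading of a Dhall
configuration from Go}, label=dhallLoadSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package config

import (
    "os"

    dhall "github.com/philandstuff/dhall-golang/v6"
)

// Config is a trimmed-down Go counterpart of the Dhall schema above; only a
// couple of fields are shown for illustration.
type Config struct {
    Host string
    Port uint
    HTTP struct {
        Domain string
        Secure bool
    }
}

// Load evaluates the Dhall expression in the given file and unmarshals the
// result into a Config value.
func Load(path string) (*Config, error) {
    b, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }

    var c Config
    if err := dhall.Unmarshal(b, &c); err != nil {
        return nil, err
    }

    return &c, nil
}
\end{lstlisting}
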
\newpage
% \vspace{\parskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration defaults for
schema version 0.0.1-rc.2},
label=dhallschemadefaults, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
      , default =
          -- | have sane defaults.
          { Host = ""
          , Port = 3000
          , HTTP =
              { Domain = ""
              , Secure = False
              , AutoTLS = False
              , TLSKeyPath = ""
              , TLSCertKeyPath = ""
              , HSTSMaxAge = 0
              , ContentSecurityPolicy = ""
              , RateLimit = 0
              , Gzip = 0
              , Timeout = 0
              }
          , Mailer =
              { Enabled = False
              , Protocol = "smtps"
              , SMTPAddr = ""
              , SMTPPort = 465
              , ForceTrustServerCert = False
              , EnableHELO = False
              , HELOHostname = ""
              , Auth = ""
              , From = ""
              , User = ""
              , Password = ""
              , SubjectPrefix = "pcmt - "
              , SendPlainText = True
              }
          , LiveMode =
              -- | LiveMode controls whether the application looks for
              -- | directories "assets" and "templates" on the filesystem or
              -- | in its bundled Embed.FS.
              False
          , DevelMode = False
          , AppPath =
              -- | AppPath specifies where the program looks for "assets" and
              -- | "templates" in case LiveMode is True.
              "."
          , Session =
              { CookieName = "pcmt_session"
              , CookieAuthSecret = ""
              , CookieEncrSecret = ""
              , MaxAge = 3600
              }
          , Logger = { JSON = True, Fmt = None Text }
          , Init =
              { CreateAdmin =
                  -- | if this is True, attempt to create a user with admin
                  -- | privileges with the password specified below
                  False
              , AdminPassword =
                  -- | used for the first admin, forced change on first login.
                  "50ce50fd0e4f5894d74c4caecb450b00c594681d9397de98ffc0c76af5cff5953eb795f7"
              }
          , Registration.Allowed = True
          }
      }

in Schema
\end{lstlisting}
\vspace*{-\baselineskip}
\vspace*{-\baselineskip}
\vspace*{-\baselineskip}
\n{2}{Data integrity and authenticity}

The user can interact with the application via a web client, such as a
browser, and is required to authenticate for all sensitive operations. To not
only know \emph{who} the user is but also to make sure they are
\emph{permitted} to perform the action they are attempting, the program
employs an \emph{authorisation} mechanism in the form of sessions. These are
represented on the client side by cryptographically signed and encrypted
(using 256 bit AES) HTTP cookies. That lays the foundation for a few things:
the data saved into the cookies can be regarded as private, because short of
future \emph{quantum computers} only the program itself can decrypt and access
it, and the data can be trusted, since it is both signed using a key that only
the program controls and \emph{encrypted} with \emph{another} key that equally
only the program holds.

The cookie data is only ever written \emph{or} read on the server side,
solidifying the author's decision to let it be encrypted, as there is no point
in not encrypting it for some perceived client-side simplification. Users
navigating the website send their session cookie (if it exists) with
\textbf{every request} to the server, which subsequently verifies the
integrity of the data and, in case it is valid, determines the existence and
amount of user privilege that should be granted. Public endpoints by
definition do not mandate the presence of a valid session, while at protected
endpoints the user is authenticated at every request. When a session expires,
or if there is no session to begin with, the user is either shown a \emph{Not
found} error message, shown an \emph{Unauthorised} error message, or
redirected to \texttt{/signin}; this behaviour is deliberately not uniform and
depends on the resource and/or the endpoint.

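A minimal sketch of constructing such a cookie-backed session store is shown
below using the \texttt{gorilla/sessions} package; it illustrates the
mechanism only and does not reproduce the application's actual session code.
The two secrets correspond to the \texttt{CookieAuthSecret} and
\texttt{CookieEncrSecret} configuration values, and a 32-byte encryption key
selects AES-256.

\begin{lstlisting}[language=Go, caption={Illustrative signed and encrypted
cookie session store}, label=cookieStoreSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package session

import (
    "net/http"

    "github.com/gorilla/sessions"
)

// newStore builds a cookie store that both signs (HMAC) and encrypts (AES)
// the cookie payload.
func newStore(authSecret, encrSecret []byte) *sessions.CookieStore {
    store := sessions.NewCookieStore(authSecret, encrSecret)
    store.Options = &sessions.Options{
        Path:     "/",
        MaxAge:   3600, // mirrors the Session.MaxAge default
        HttpOnly: true,
        Secure:   true,
        SameSite: http.SameSiteStrictMode,
    }

    return store
}
\end{lstlisting}
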
Another aspect that contributes to data integrity from \emph{another} point of
view is utilising database \emph{transactions} for bundling together multiple
database operations that collectively change the \emph{state}. In
transactional jargon, the data is only \emph{committed} if each individual
change was successful. In case of any error, the database is instructed to
perform an atomic \emph{rollback}, which brings it back to the state before
the changes were ever attempted.

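In code, this is commonly expressed with a small helper that wraps a unit of
work in a transaction, committing only when every step succeeded; the sketch
below follows the pattern recommended by \texttt{ent}'s documentation and is
illustrative rather than a copy of the application's code.

\begin{lstlisting}[language=Go, caption={Illustrative transaction wrapper for
ent}, label=entTxSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
package store

import (
    "context"
    "fmt"

    pcmtent "git.dotya.ml/mirre-mt/pcmt/ent"
)

// withTx runs fn inside a database transaction, rolling back on any error
// and committing only if fn returned nil.
func withTx(ctx context.Context, db *pcmtent.Client, fn func(tx *pcmtent.Tx) error) error {
    tx, err := db.Tx(ctx)
    if err != nil {
        return err
    }

    if err := fn(tx); err != nil {
        // Roll back and surface the original error.
        if rerr := tx.Rollback(); rerr != nil {
            err = fmt.Errorf("%w: rolling back transaction: %v", err, rerr)
        }

        return err
    }

    return tx.Commit()
}
\end{lstlisting}
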
The author has additionally considered utilising an embedded immutable
database like immudb (\url{https://immudb.io}) for record keeping (verifiably
storing data change history) and additional data integrity checks, e.g.\ for
tamper protection purposes and similar; however, that work is yet to
materialise.

\n{2}{Database schema}\label{sec:dbschema}

The database schema is not created manually in the database. Instead, an
Object-relational Mapping (ORM) tool named ent is used, which allows defining
the table schema and relations entirely in Go. The application does not
require the database schema to be pre-created before it starts; it only
requires a connection string providing access to the database as a reasonably
privileged user.

The best part about \texttt{ent} is that there is no need to define
supplemental methods on the models, as with \texttt{ent} these are meant to be
\emph{code generated} (in the older sense of the word, not with Large Language
Models) into existence. Code generation creates files with actual Go models
based on the types of the attributes in the database schema model, and the
respective relations are transformed into methods on the receiver or functions
taking object attributes as arguments.

For instance, if the model's attribute is a string value \texttt{Email}, ent
can be used to generate code that contains methods on the user object like the
following:

\begin{itemize}
\item \texttt{EmailIn(pattern string)}
\item \texttt{EmailEQ(email string)}
\item \texttt{EmailNEQ(email string)}
\item \texttt{EmailHasSuffix(suffix string)}
\end{itemize}

These methods can further be imported into other packages, and this makes
working with the database a breeze.

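For illustration, a schema definition in \texttt{ent} is an ordinary Go type;
the minimal sketch below shows only a couple of fields and is not the
application's complete schema.

\begin{lstlisting}[language=Go, caption={Illustrative ent schema definition},
label=entSchemaSketch, basicstyle=\linespread{0.9}\small\ttfamily]
package schema

import (
    "entgo.io/ent"
    "entgo.io/ent/schema/field"
)

// User holds the schema definition for the User entity.
type User struct {
    ent.Schema
}

// Fields of the User.
func (User) Fields() []ent.Field {
    return []ent.Field{
        field.String("username").Unique(),
        field.String("email").Unique(),
        field.Bool("is_admin").Default(false),
    }
}
\end{lstlisting}

Running ent's code generation (typically via \texttt{go generate ./\ldots})
then produces the typed models and predicate helpers, such as the
\texttt{Email*} methods listed above.
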
\n{1}{Deployment}

\textbf{TODO}: mention how \texttt{systemd} aids in running the pod.

A deployment setup as suggested in Section~\ref{sec:deploymentRecommendations}
is already partially covered by the multi-stage \texttt{Containerfile} that is
available in the main sources. Once built, the resulting container image only
contains a handful of things it absolutely needs:

\begin{itemize}
\item a statically linked copy of the program
\item a default configuration file and corresponding Dhall expressions cached
at build time
\item a recent CA certs bundle
\end{itemize}

Since the program also needs a database to function properly, an example
scenario includes the application container being run in a Podman \textbf{pod}
together with the database. That results in not having to expose the database
to the entire host or out of the pod at all; it is only available over the
pod's \texttt{localhost}.

It goes without saying that the default values of any configuration secrets
should be substituted by the application operator with new, securely generated
ones.

\n{2}{Rootless Podman}

Assuming a rootless Podman set-up and the \texttt{just} tool installed on the
host, the application could be deployed by following a series of relatively
simple steps:

\begin{itemize}
\item build (or pull) the application container image
\item create a pod with user namespacing, exposing the application port
\item run the database container inside the pod
\item run the application inside the pod
\end{itemize}

In concrete terms, it would resemble something along the lines of
Listing~\ref{podmanDeployment}. Do note that all the commands are executed
under the unprivileged \texttt{user@containerHost} that is running rootless
Podman, i.e.\ it has \texttt{UID}/\texttt{GID} mapping entries in the
\texttt{/etc/subuid} and \texttt{/etc/subgid} files \textbf{prior} to running
any Podman commands.

% \newpage

\begin{lstlisting}[language=bash, caption={Example application deployment using
rootless Podman},
label=podmanDeployment, basicstyle=\linespread{0.9}\small\ttfamily]
# From inside the project folder, build the image locally using kaniko.
just kaniko

# Create a pod.
podman pod create --userns=keep-id -p5005:3000 --name pcmt

# Run the database in the pod.
podman run --pod pcmt --replace -d --name "pcmt-pg" --rm \
  -e POSTGRES_INITDB_ARGS="--auth-host=scram-sha-256 \
--auth-local=scram-sha-256" \
  -e POSTGRES_PASSWORD=postgres -v $PWD/tmp/db:/var/lib/postgresql/data \
  docker.io/library/postgres:15.2-alpine3.17

# Run the application in the pod.
podman run --pod pcmt --replace --name pcmt-og -d --rm \
  -e PCMT_LIVE=False \
  -e PCMT_DBTYPE="postgres" \
  -e PCMT_CONNSTRING="host=pcmt-pg port=5432 sslmode=disable \
user=postgres dbname=postgres password=postgres" \
  -v $PWD/config.dhall:/config.dhall:ro \
  docker.io/immawanderer/pcmt:testbuild -config /config.dhall
\end{lstlisting}

To summarise Listing~\ref{podmanDeployment}: first, the application container
is built from inside the project folder using \texttt{kaniko}. Alternatively,
the container image could be pulled from the container repository, but it
makes more sense to show the image being built from sources, since the listing
depicts a \texttt{:testbuild} tag being used.

Next, a \emph{pod} is created and given a name, setting the port binding for
the application. Then, the database container is started inside the pod.

As a final step, the application container itself is run inside the pod. The
application configuration named \texttt{config.dhall} located in
\texttt{\$PWD} is mounted as a volume into the container's
\texttt{/config.dhall}, providing the application with a default
configuration. The container does ship with a default configuration for
reference; however, running it as is without additional configuration would
fail, as the defaults do not contain the necessary secrets.

\n{3}{Sanity checks}

Do also note that the application connects to the database using its
\emph{container} name, i.e.\ not the IP address. That is possible thanks to
Podman setting up DNS inside the pod in such a way that all containers in the
pod can reach each other using their (container) names. Interestingly,
connecting via \texttt{localhost} would also work: from inside the pod, any
container can reach any other container in the same pod via the pod's
\texttt{localhost}.
In fact, \emph{pinging} the database or application containers from an ad-hoc
\texttt{alpine} container added to the pod yields:

\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Pinging pod containers using their
names}, label=podmanPing, basicstyle=\linespread{0.9}\small\ttfamily]
user@containerHost % podman run --rm -it --user=0 --pod=pcmt \
  docker.io/library/alpine:3.18
/ # ping -c2 pcmt-og
PING pcmt-og (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.072 ms
64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.118 ms

--- pcmt-og ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.072/0.095/0.118 ms
/ # ping -c2 pcmt-pg
PING pcmt-pg (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.045 ms
64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.077 ms

--- pcmt-pg ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.045/0.061/0.077 ms
/ #
\end{lstlisting}
\vspace*{-\baselineskip}

The pod created in Listing~\ref{podmanDeployment} only sets up a binding for
the port used by the application (\texttt{5005/tcp}). The Postgres default
port \texttt{5432/tcp} is not among the pod's port bindings, as can be seen in
the pod creation command. This can also easily be verified using the command
in Listing~\ref{podmanPortBindings}:

\begin{lstlisting}[language=bash, caption={Podman pod port bindings},
label=podmanPortBindings, basicstyle=\linespread{0.9}\small\ttfamily]
user@containerHost % podman pod inspect pcmt \
  --format="Port bindings: {{.InfraConfig.PortBindings}}\n\
Host network: {{.InfraConfig.HostNetwork}}"
Port bindings: map[3000/tcp:[{ 5005}]]
Host network: false
\end{lstlisting}
\vspace*{-\baselineskip}

To be absolutely sure, trying to connect to the database from outside of the
pod (i.e.\ from the container host) should \emph{fail}, unless, of course,
there is another process listening on that port:

\begin{lstlisting}[language=bash, caption={In-pod database is unreachable from
the host}, breaklines=true, label=podDbUnreachable,
basicstyle=\linespread{0.9}\small\ttfamily]
user@containerHost % curl localhost:5432
--> curl: (7) Failed to connect to localhost port 5432 after 0 ms: Couldn't connect to server
\end{lstlisting}
\vspace*{-\baselineskip}

The error in Listing~\ref{podDbUnreachable} is expected, as it is the result
of the database port not being exposed from the pod.

\n{1}{Validation}

\n{2}{Unit tests}

Unit testing is a hot topic for many people and the author does not count
himself a staunch supporter of either extreme. The ``no unit tests'' camp
seems to discount any benefit there is to unit testing, while a
``TDD-only''\footnotemark{} approach can be a little too much for some
people's taste. The author tends to prefer a \emph{middle ground} approach in
this particular case, i.e.\ writing enough tests where meaningful but not
necessarily testing everything or writing tests prior to business logic code.
Arguably, following the practice of TDD should result in \emph{better
designed} code, particularly because there needs to be prior thought about the
shape and function of the code, as it is tested for before it is even written,
but it adds a slight inconvenience to what is otherwise a straightforward
process.

Thanks to Go's built-in support for testing via its \texttt{testing} package
and the tooling in the \texttt{go} tool, writing tests is relatively simple.
Go looks for files in the form \texttt{<filename>\_test.go} in the present
working directory, but can be instructed to look for test files in packages
recursively found on any path using the ellipsis, like so: \texttt{go test
./path/to/package/\ldots}, which then \emph{runs} all the tests found and
reports some statistics, such as the time it took to run the tests or whether
they succeeded or failed. To be precise, the test files also need to contain
test functions, which are functions with the signature \texttt{func
TestWhatever(t *testing.T)\{\}}, and where the function name prefix ``Test''
is equally as important as the signature. Without it, the function is not
considered to be a testing function despite having the required signature and
is therefore \emph{not} executed during testing.

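A minimal, self-contained example of the convention is shown below; the
function under test is defined in the same file purely for brevity.

\begin{lstlisting}[language=Go, caption={Minimal example of a Go test
function}, label=minimalTestSketch,
basicstyle=\linespread{0.9}\small\ttfamily]
// sum_test.go
package sum

import "testing"

func add(a, b int) int { return a + b }

// TestAdd is picked up by `go test` because the file name ends in _test.go
// and the function name starts with "Test" while taking a *testing.T.
func TestAdd(t *testing.T) {
    if got, want := add(2, 3), 5; got != want {
        t.Errorf("add(2, 3) = %d, want %d", got, want)
    }
}
\end{lstlisting}
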
This test lookup behaviour, however, also has a neat side effect: all the test
files can be kept side by side with their regular source counterparts, there
is no need to segregate them into a specially blessed \texttt{tests} folder or
similar, which in the author's opinion improves readability. As a failsafe, in
case no actual tests are found, the current behaviour of the tool is to print
a note informing the developer that no tests were found, which is handy to
learn if that was not intended/expected. When compiling regular source code,
the Go files with \texttt{\_test} in the name are simply ignored by the build
tool.

\footnotetext{TDD, or Test Driven Development, is a development methodology
whereby tests are written \emph{first}, then a complementary piece of code
that is supposed to be tested is added, just enough to get past the compile
errors and to see the test \emph{fail}, and then the code is finally
refactored to make the test \emph{pass}. The code can then be fearlessly
extended because the test is the safety net catching the programmer when the
mind slips and alters the originally intended behaviour of the code.}

\n{2}{Integration tests}

Integrating with external software, namely the database in the case of this
program, is designed to utilise the same mechanism that was mentioned in the
previous section: Go's \texttt{testing} package. These tests verify that the
code changes can still perform the same actions with the external software
that were possible before the change, and they are run before every commit
locally and then after pushing to the remote in the CI.

\n{3}{func TestUserExists(t *testing.T)}

The example integration test shown in Listing~\ref{integrationtest} can be
seen to declare a helper function \texttt{getCtx() context.Context}, which
takes no arguments and returns a new\\ \texttt{context.Context} initialised
with the value of the global logger, which is how the logger gets injected
into the user module functions. The function \texttt{TestUserExists(t
*testing.T)} first declares a database connection string and attempts to open
a connection to the database. The database in use here is SQLite3 running in
memory mode, meaning no file is actually written to disk during this process.
Since the testing data is not needed after the test, this is deemed good
enough. Next, a defer statement calling the \texttt{Close()} method on the
database object is made, which is the idiomatic Go way of closing files and
network connections (which are also an abstraction over files on UNIX-like
operating systems such as GNU/Linux). The deferred call only runs once the
surrounding function returns, which makes sure no file descriptors (FDs) are
leaked and the connection is properly closed.

In the next step a database schema creation is attempted, handling the
potential error in the idiomatic Go way, which uses the return value from the
function in an assignment to a variable declared in the \texttt{if} statement,
and checks whether the \texttt{err} was \texttt{nil} or not. In case the
\texttt{err} was not \texttt{nil}, i.e.\ \emph{there was an error in the
callee function}, the condition evaluates to \texttt{true}, which is followed
by entering the inner block. Inside the inner block, the error is announced to
the user (likely a developer running the test in this case) and the testing
object's \texttt{FailNow()} method is called, which marks the test function as
having failed and thus stops its execution. In this case, that is the desired
outcome, since if the database schema creation call fails, there really is no
point in continuing testing of user creation.

Conversely, if the schema does get created without an error, the code
continues to declare a few variables: \texttt{username}, \texttt{email} and
\texttt{ctx}, where the context injected with the logger is saved. Some of
them are subsequently passed into the \texttt{UsernameExists} function:
\texttt{ctx} as the first argument, with the database pointer and username
being passed next, while the \texttt{email} variable is only used at a later
stage, but was declared here to give a sense of grouping. The error value
returned from this function is again checked, and if everything goes well, the
value of the \texttt{usernameFound} boolean is checked next.

\smallskip
\smallskip
\begin{lstlisting}[language=Go, caption={Example integration test.},
label=integrationtest,basicstyle=\linespread{0.9}\scriptsize\ttfamily]
// modules/user/user_test.go
package user

import (
    "context"
    "testing"

    "git.dotya.ml/mirre-mt/pcmt/ent/enttest"
    "git.dotya.ml/mirre-mt/pcmt/slogging"
    _ "github.com/xiaoqidun/entps"
)

func getCtx() context.Context {
    l := slogging.Init(false)
    ctx := context.WithValue(context.Background(), CtxKey{}, l)

    return ctx
}

func TestUserExists(t *testing.T) {
    connstr := "file:ent_tests?mode=memory&_fk=1"
    db := enttest.Open(t, "sqlite3", connstr)
    defer db.Close()

    if err := db.Schema.Create(context.Background()); err != nil {
        t.Errorf("failed to create schema resources: %v", err)
        t.FailNow()
    }

    username := "dude"
    email := "dude@b.cc"
    ctx := getCtx()

    usernameFound, err := UsernameExists(ctx, db, username)
    if err != nil {
        t.Errorf("error checking for username {%s} existence: %q",
            username,
            err,
        )
    }

    if usernameFound {
        t.Errorf("unexpected: user{%s} should not have been found",
            username,
        )
    }

    if _, err := EmailExists(ctx, db, email); err != nil {
        t.Errorf("unexpected: user email '%s' should not have been found",
            email,
        )
    }

    usr, err := CreateUser(ctx, db, email, username, "so strong")
    if err != nil {
        t.Errorf("failed to create user, error: %q", err)
        t.FailNow()
    } else if usr == nil {
        t.Error("got nil usr back")
        t.FailNow()
    }

    if usr.Username != username {
        t.Errorf("got back wrong username, want: %s, got: %s",
            username, usr.Username,
        )
    } // ...more checks...
}
\end{lstlisting}

Since the database has just been created, there should be no users, which is
checked in the next \texttt{if} statement. The same check is then performed
for the earlier-declared user email, which is also expected not to be found.

The final statements of the described test attempt a user creation call, which
is again checked for both an error and \emph{nilability}. The test continues
with more similar checks, but it has been cut short for brevity.

A neat thing about error handling in Go is that it allows for very easy
checking of all paths, not just the \emph{happy path} where there are no
issues.

\n{2}{Testing environment}

The application has been deployed in a testing environment on the author's
modest Virtual Private Server (VPS) at \texttt{https://testpcmt.dotya.ml},
protected by a \emph{Let's Encrypt}-issued, short-lived, ECDSA
\texttt{secp384r1} TLS certificate, and configured with a strict CSP. It is a
testing instance; therefore, limits to prevent abuse might be imposed.

\n{3}{Deployment validation}

TODO: show the results of testing the app in prod using
\url{https://testssl.sh/}.

% =========================================================================== %
\nn{Conclusion}

The objective of the thesis has been to create the Password Compromise
Monitoring Tool aimed at security-conscious users, in order to let them
validate their assumptions about the security of their credentials. The thesis
opened by diving into cryptography topics such as encryption and briefly
mentioned TLS.

Additionally, security mechanisms such as Site Isolation and Content Security
Policy, commonly employed by mainstream browsers of today, were introduced,
and the reader learnt how Content Security Policy is easily and dynamically
configured.

An extensive body of the thesis then revolved around the practical part,
describing everything from the tooling used, through a high-level view of the
application's architecture, to the implementation of specific parts of the
application across the stack.

Finally, the practical part concluded by broadly depicting the validation
methods used to verify that the application worked correctly.

The author would like to recognise that there are certain aspects of the
thesis in need of further development. It is necessary to admit that not
everything could have realistically been realised within the limited timespan
and scope imposed on the project to prevent it from diverging. The concerns
mentioned above constitute clear candidates for the author's future work, as
he intends to improve on the existing state, for example accessibility-wise.
The author's unfamiliarity with accessibility tooling sometimes compromised
the quality of this segment of the application, but it is a known deficiency.
Furthermore, the list of tasks for the future may also contain adding
\emph{fuzzing} tests for the program, producing a Software Bill of Materials,
utilising an additional immutable database, or unifying the frontend design
language across the pages.

The program does have a very solid core that, for instance, listens for OS
signals, handles graceful shutdown and supports structured logging, but it
still has room for improvement, despite the fact that its creation has been
best-effort. Due to a number of reasons mentioned earlier, it should not yet
be called an utterly \emph{finished} project, but it can already serve a clear
purpose.

% =========================================================================== %