% =========================================================================== %
\part{Theoretical part}
\n{1}{Vision}
The thesis consists of two main parts: a \emph{theoretical} part, which
provides background on the concepts and processes used throughout, and a
\emph{practical} part, which describes what exactly has been done and how.
The application developed as part of this thesis should enable users to quickly
and privately check their credentials' compromise status against configured
local and online data sources. Of course, compromise monitoring depends on
access to quality data, which is in the purview of the application
administrator.
An in-application administrative-level user is able to configure online and
local data sources and initially set up user accounts, but has no access to
users' search queries or credential entries, or, more broadly, is not able to
read sensitive user information. This is enabled by the application's
architectural decisions, whereby sensitive user data is encrypted in such a
way that not even administrative-level users are able to read it.
\n{1}{Cryptography primer}\label{sec:cryptographyprimer}
\n{2}{Encryption}
Encryption is the process of transforming data (the plaintext) into a form
(the ciphertext) that can only be read by the holders of the corresponding
key. In the context of this application, it is what allows sensitive user
data, such as search queries and credential entries, to be stored in a form
that not even administrative-level users can read, and it also underpins the
protection of data in transit discussed in Section~\ref{sec:tls}.
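To give a concrete, if simplified, impression of how such protection can be
achieved in code, the sketch in Listing~\ref{aesGcmSketch} demonstrates
authenticated symmetric encryption (AES-256-GCM) using only the Go standard
library. It is a minimal illustration of the principle, not the application's
actual implementation; the key handling and the example plaintext are
hypothetical.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={A minimal sketch of authenticated symmetric encryption (AES-256-GCM)
using the Go standard library},
label=aesGcmSketch,
backgroundcolor=\color{lstbg},
]
// Minimal sketch of authenticated symmetric encryption (AES-256-GCM)
// using only the standard library. Key management is out of scope here;
// the 32-byte key would in practice come from a secure source.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

func encrypt(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // AES-256 for a 32-byte key
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so that it is available for decryption.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func decrypt(key, ciphertext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce, rest := ciphertext[:gcm.NonceSize()], ciphertext[gcm.NonceSize():]
	return gcm.Open(nil, nonce, rest, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	ct, err := encrypt(key, []byte("a sensitive credential entry"))
	if err != nil {
		panic(err)
	}
	pt, err := decrypt(key, ct)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", pt)
}
\end{lstlisting}
\vspace*{-\baselineskip}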
\n{2}{Hash functions}
Hash functions are algorithms used to help with a number of things: integrity
verification, password protection, digital signatures, public-key encryption
and others. Hashes are used in forensic analysis to prove the authenticity of
digital artifacts, to uniquely identify a change-set within revision-based
source code management systems such as Git or Mercurial, by anti-virus
programs to detect known-malicious software, by advanced filesystems to verify
block integrity and enable repairs, and also in many other applications that
everyone using a modern computing device has come across, such as when
connecting to a website protected by the famed HTTPS.
The popularity of hash functions stems from a common use case: the need to
reliably and cheaply identify a chunk of data. Of course, two chunks of data,
two files, frames or packets could always be compared bit by bit, but that
quickly becomes prohibitive from both a cost and an energy point of view. That
is where hash functions come in, since they are able to take a long input and
produce a short output, called a digest or a hash value. The function also
only works one way.
A file, or any original input data for that matter, cannot be reconstructed
from the hash digest alone by somehow \emph{reversing} the hashing operation,
since at the heart of any hash function there is essentially a compression
function.
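To make the notion of a digest more tangible, the short sketch below (using
Go's standard library; the input strings are purely illustrative) computes
SHA-256 digests of two nearly identical inputs. Even a single-byte difference
yields completely unrelated 256-bit digests, which is what makes digests
useful as compact, reliable identifiers of data.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={Computing SHA-256 digests of two nearly identical inputs
(illustrative sketch)},
label=sha256DigestSketch,
backgroundcolor=\color{lstbg},
]
// Computing a fixed-length digest of arbitrary input with SHA-256.
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	a := sha256.Sum256([]byte("the quick brown fox"))
	b := sha256.Sum256([]byte("the quick brown fox."))

	// Two inputs differing by a single byte hash to entirely
	// different 32-byte (256-bit) digests.
	fmt.Printf("%x\n%x\n", a, b)
}
\end{lstlisting}
\vspace*{-\baselineskip}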
Most alluringly, hashes are frequently used with the intent of
\emph{protecting} passwords by making them unreadable, while still being able
to verify that the user knows the password and should therefore be authorised.
As the hashing operation is irreversible, once the one-way function produces a
short digest, there is no way to reconstruct the original message from it.
That is, unless the input of the hash function is also known, in which case
all it takes is hashing the supposed input and comparing the resulting digest
with digests that are already known to belong to particular passwords.
\n{3}{Types and use cases}
Hash functions can be loosely categorised based on their intended use case
into \emph{password protection hashes}, \emph{integrity verification hashes},
\emph{message authentication codes} and \emph{cryptographic hashes}. Each of
these possesses unique characteristics, and using the wrong type of hash
function for a job can potentially result in a security breach.
As an example, consider \texttt{MD5}, a popular hash function built on the
\emph{Merkle-Damgård} construction, and \texttt{BLAKE3}, a modern
cryptographic hash function built on a \emph{Merkle tree}. The former produces
128-bit digests, while the latter defaults to 256 bits of output but can
produce digests of practically arbitrary length (up to $2^{64}$ bytes) thanks
to its extensible tree structure. There is a list of differences that could
further be mentioned; however, the two have one thing in common: they are both
\emph{designed} to be \emph{fast}. The latter, as a cryptographic hash
function, is conjectured to be \emph{random oracle indifferentiable} and
secure against length extension, yet it is in fact faster than all of
\texttt{MD5}, \texttt{SHA3-256}, \texttt{SHA-1} and even the \texttt{BLAKE2}
family of functions~\cite{blake3}.
The use case of both is to (quickly) verify the integrity of a given chunk of
data, in the case of \texttt{BLAKE3} with pre-image and collision resistance
in mind, not to secure a password by hashing it first, which poses a big issue
when they are nevertheless used to do just that: secure passwords by hashing
them first.
Password hashing functions such as \texttt{argon2} or \texttt{bcrypt} are good
choices for \emph{securely} storing hashed passwords, namely because they
place a CPU and memory burden on the machine that is computing the digest. In
the case of the mentioned functions, the \emph{hardness} is even configurable
to satisfy the greatest possible array of scenarios. These functions also
forcefully limit potential parallelism, thereby restricting the scale at which
exhaustive searches performed using tools like \texttt{Hashcat} or \texttt{John
the Ripper} could be at all feasible, practically obviating old-school hash
cracking~\cite{hashcracking},~\cite{hashcracking2}. Additionally, both
functions can automatically add a random \emph{salt} to each password,
ensuring that no copies of the same password provided by different users will
end up hashing to the same digest value.
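To illustrate the difference in intent, the hedged sketch in
Listing~\ref{bcryptSketch} uses the \texttt{golang.org/x/crypto/bcrypt}
package: the cost parameter controls how expensive each hashing operation is,
and a random salt is generated and embedded in the resulting hash
automatically, so hashing the same (here purely illustrative) password twice
produces two different digests.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={Password hashing and verification with \texttt{bcrypt}
(illustrative sketch)},
label=bcryptSketch,
backgroundcolor=\color{lstbg},
]
// Password hashing with bcrypt: a configurable cost (hardness) and an
// automatically generated per-password salt embedded in the output.
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	password := []byte("correct horse battery staple")

	// A higher cost means more CPU work per hash; 12 is a reasonable
	// value at the time of writing.
	hash, err := bcrypt.GenerateFromPassword(password, 12)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(hash))

	// Verification re-derives the hash using the salt and cost stored
	// inside the hash itself and compares the results.
	if err := bcrypt.CompareHashAndPassword(hash, password); err == nil {
		fmt.Println("password verified")
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}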
\n{3}{Why are hashes interesting}
As already hinted, hashes are often used to store a \emph{logical proof of the
password}, rather than the password itself. Services storing hashed user
passwords also happen to leak them involuntarily. Using the wrong type of hash
for password hashing, weak hash function parameters, reusing \emph{salt}, or
inadvertently \emph{misusing} the hash function in some other way, is a sure
way to spark a lot of
interest~\cite{megatron},~\cite{linkedin1},~\cite{linkedin2}.
Historically, plain-text passwords have also leaked enough times (or weak
hashes have been cracked) that anyone with enough interest has had more than
enough time to put together neat lists of hashes of the most commonly used
passwords~\cite{rockyou},~\cite{plaintextpasswds1},~\cite{plaintextpasswds2},~\cite{plaitextpasswds3}.
So while a service might not be storing passwords in \emph{plain text}, which
is a good practice, using a hashing function not designed to protect passwords
does not offer much additional protection in the case of weak passwords, which
happen to be the ones most commonly used.
It would seem only logical that a service that is not using cryptographic
primitives like hash functions correctly is more likely to get hacked and have
its users' passwords or password hashes leaked. Those leaks are often exposed
publicly with no restrictions on access, with the internet serving as both the
storage and the distribution medium.
That incidentally also means that anyone interested in their own compromise
monitoring has at least \emph{some} chances of successfully learning about
their compromise, potentially a long time before it can be used to cause
greater harm.
\n{2}{TLS}\label{sec:tls}
The Transport Layer Security protocol (or TLS) serves as an encryption and
\emph{authentication} protocol to secure internet communications. An important
part of the protocol is the \emph{handshake}, during which the two
communicating parties exchange messages that acknowledge each other's
presence, verify each other, choose what cryptographic algorithms will be used
and derive session keys. As there are multiple versions of the protocol in
active use even at the moment, the server and the client first need to agree
upon the version they are going to use (it is recommended to use either v1.2
or v1.3 these days) and pick cipher suites (if applicable); the client then
verifies the server's public key (and the signature of the certificate
authority that issued it), and both sides generate session keys for use after
handshake completion. TLSv1.3 dramatically reduced the number of available
suites to only include the ones deemed secure enough, which is why it is no
longer necessary to manually specify which cipher suite should be used (or to
rely on the client/server to choose wisely). While possibly facing
compatibility issues with legacy devices, the simplicity brought by enabling
TLSv1.3 might be considered a worthy trade-off~\cite{tls13rfc8446}.
\n{1}{Passwords}\label{sec:passwords}
Passwords have been in use since ancient times; apparently, Roman sentries
already used passwords or \textit{watchwords} to discern who was allowed to
enter an area. The Roman army had a special system for distributing passwords
among the members of an encampment on a wooden tablet. Fast forward a couple
of thousand years: during the days of the Prohibition era in the United
States, it was the secret ``speakeasies'' that were protecting their
illegitimate alcohol-serving business using
passwords~\cite{speakeasy},~\cite{nisthistory}. During World War II, the use
of passwords by US paratroopers evolved to even include a counter-password.
According to McMillan, the first \textit{computer} passwords date back to
mid-1960s Massachusetts Institute of Technology (MIT), when researchers at the
university built a massive time-sharing computer called CTSS. Apparently,
\textit{even then} the passwords did not protect the users as well as they were
expected to~\cite{mcmillan}.
Traditionally, passwords were expected to be memorised, but the large number
of password-protected \emph{services} these days can make this impractical. To
list a few common examples, access to a bank account, an electronic mailbox or
a personal computer's encrypted disk is all protected by some form of
password. A password most often consists of a \textit{string} of characters
typed into a prompt, but its function remains the same: as per NIST, it
enables the \textit{verifier} to infer the \textit{claimant}'s identity via a
secret the claimant holds.
There are always some arbitrary requirements applied to what the password can
be; only some of them turn out to be smarter than others.
Despite the impression given by the word ``password'', it does not need to be
an actual word; indeed, a non-word (in the dictionary sense) may be harder to
guess, which is a desirable property of passwords. A memorized secret
consisting of a sequence of words or other text separated by spaces is
sometimes called a passphrase. A passphrase is similar to a password in usage,
but the former is generally longer for added security.
\n{2}{Program-imposed constraints}
Some of the following examples might be a bit anecdotal and more of an
exception than a rule; nevertheless, when they come from a large-enough
program creator or service provider, their decisions reach a sufficient share
of the population for the author to call them influential. They shape how
users think when creating passwords, and they also affect what users expect
from other services they happen to visit and use from that point on.
\n{3}{Short arbitrary length}
It has been observed that a requirement for a ``strong'' password generally
translates to a password that:
\begin{itemize}
\item is longer than 7 characters,
\item is shorter than 11 characters,
\item begins with a letter and ends with a number, OR
\item begins with a number and ends with a letter.
\end{itemize}
\obr{Short arbitrary password length
limit}{fig:arbitrarypasswdlengthlimit}{.8}{graphics/arbitrarypasswdlengthlimit.jpg}
The error message above the password input field depicted in
Figure~\ref{fig:arbitrarypasswdlengthlimit} is wrong for multiple reasons, and
it is a classic example of a short arbitrary length
requirement~\cite{larsklint}. It essentially prevents users from using
passphrases, makes using a password manager impractical, and all of that has
apparently been done ``because of security''~\cite{etihad}. Moreover, this
might be indicative of the fact that instead of storing passwords hashed (as
they should be), they might be storing them in \textbf{plain text}. Otherwise,
what reason could exist for the limit to be 10 characters?
The recommendation of the US National Institute of Standards and Technology
(NIST) in this regard is that the maximum password length a verifier imposes
should lie somewhere between 64 and 256 characters, which, as they put it,
\emph{should be sufficient for most users' needs}.
\n{3}{Restricting special characters}
Service providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords, for as long as passwords have been
used to protect privileged access. The ways of achieving this may vary, but
the intent stays the same: preventing users from inputting characters that the
system cannot comfortably handle, for ``reasons'' that are usually something
dubious along the lines of ``an apostrophe may be used in SQL injection
attacks'' or ``angle brackets may be used in XSS attacks''. Instead, the real
message this often unwittingly announces points right at serious shortcomings
in the site's password handling, since passwords should never be re-displayed
in a context that is prone to Cross-Site Scripting (XSS), and passwords should
always be hashed before being sent to the database anyway, leaving the
database to handle only the alphanumeric characters of the digest and
rendering the SQLi fears baseless.
\obr{Forbidden special characters in
passwords}{fig:forbiddencharacters}{.8}{graphics/forbiddencharacters.jpg}
Note that ``Passw0rd!'' would have been a perfectly acceptable password for the
validator displayed in
Figure~\ref{fig:forbiddencharacters}~\cite{forbiddencharacters}. NIST's
recommendations on this matter are that all printing ASCII characters as well
as the space character SHOULD be acceptable in memorized secrets, and Unicode
characters SHOULD be accepted as well~\cite{asciirfc20},~\cite{iso10646}.
\n{3}{Character composition requirements}
There is also a tendency to come up with bad passwords when character
composition requirements are in place. The reality is that instead of creating
strong passwords directly, most users first try a basic version and then keep
tweaking characters until the password fulfils the minimum requirement.
The \emph{problem} with this is that it has been shown that people use similar
patterns, i.e.\ starting with capital letters, putting a symbol last and a
number in the last two positions. This is also known to people cracking
password hashes, who run their dictionary attacks using the common
substitutions, such as ``\$'' for ``s'', ``3'' for ``e'', ``1'' for ``l'',
``@'' for ``a''
etc.~\cite{megatron},~\cite{hashcracking},~\cite{hashcracking2}. It is safe to
expect that a password created in this manner will almost certainly be bad;
the only achievement was to frustrate the user and still arrive at a bad
password.
\n{3}{Other common issues}
Some services do not allow users to paste into password fields (disabling
pasting using JavaScript), thereby essentially breaking password manager
functionality. This is an issue because it encourages bad password practices
such as weak passwords and, likewise, password reuse.
Forced frequent password rotation is another common issue. Apparently, making
frequent password rotations mandatory contributes to users developing password
creation \emph{patterns}. Moreover, according to the British NCSC, the subject
practice ``carries no real benefits as stolen passwords are generally
exploited immediately'', and the organisation calls it a modern-day security
anti-pattern~\cite{ncsc}.
\n{1}{Web security}\label{sec:websecurity}
The internet is a vast space full of intertwined concepts and ideas. It is a
superset of the Web, even though the two terms often get conflated; not
everything that is available on the internet can be accessed using web
protocols and \emph{resources}. This section delves into the concepts of web
security.
\n{2}{Site Isolation}\label{sec:siteisolation}
While website operators can take steps to secure their sites, it is often the
browser that holds the last line of defence when web servers are
misconfigured, allowing an attacker to exploit a vulnerability in various
ways.
Most users consume web content using web browsers. Modern browsers such as
Firefox or Chromium are built with a focus on security. Their developers are
acutely aware of the dangers posed by parsing untrusted code from the
internet, which is precisely what websites, stylesheets and the accompanying
scripts are.
This necessarily gets reflected in the way these programs are architected.
Instead of the main, privileged browser process running everything directly, it
spawns de-privileged child processes for each website. This extra line of
defence should make it \emph{harder for untrustworthy websites to access or
steal information} from user accounts or other websites. Even if the
misbehaving website does manage to ``break some rules'' within its own process,
it should find it more difficult to steal data from other
sites~\cite{siteisolation}.
Firefox calls its version of \emph{Site Isolation}-like functionality Project
Fission, but the two are very similar, both in internal architecture and in
what they try to achieve~\cite{projectfission}. Elements of the web page are
scanned to decide whether they are allowed according to \emph{same-site}
restrictions and are allocated shared or isolated memory based on the result.
Some Chromium users have complained about its high memory usage in the past,
which might have been partially caused by Site Isolation protecting them,
unbeknownst to them.
\n{2}{Cross-site scripting}\label{sec:xss}
As per the OWASP Top Ten list, injection is the third most observed issue
across millions of websites. Cross-site scripting is a type of attack in which
malicious code, such as an infected script, is injected into a website that
would otherwise be trusted. Since a misconfiguration or flaw of the
application allowed this, the browser of the victim, which trusts the website,
simply executes the code provided by the attacker. This code thus gains access
to session tokens and any cookies associated with the website's origin, apart
from being able to rewrite the HTML content. The results of XSS can range from
account compromise to identity theft~\cite{owasptop10}.
Solutions deployed against XSS vary. On the client side, it mainly comes down
to good browser patching hygiene, browser features such as Site Isolation (see
Section~\ref{sec:siteisolation}), and browsers correctly honouring website
directives such as the \texttt{X-Frame-Options},
\texttt{X-Content-Type-Options}, \texttt{X-Xss-Protection} and
\texttt{Cross-Origin-Opener-Policy} HTTP headers. However, the latter falls
flat if website operators do not configure their websites correctly.
On the server side, these options (indicating to the browser \emph{how} the
site should be parsed) can be manipulated and configured directly, and they
should be fine-tuned to fit the needs of each specific website.
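As an illustration of how little server-side code such configuration requires,
Listing~\ref{securityHeadersSketch} shows a hypothetical Go middleware that
attaches some of the mentioned headers to every response; the exact header
values would, of course, have to be tailored to the website in question.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={A hypothetical middleware setting security-related HTTP headers},
label=securityHeadersSketch,
backgroundcolor=\color{lstbg},
]
// A hypothetical HTTP middleware adding commonly recommended
// security headers to every response.
package main

import (
	"log"
	"net/http"
)

func securityHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		// Disallow MIME type sniffing.
		h.Set("X-Content-Type-Options", "nosniff")
		// Disallow embedding the site in frames on other origins.
		h.Set("X-Frame-Options", "DENY")
		// Isolate the browsing context group from cross-origin openers.
		h.Set("Cross-Origin-Opener-Policy", "same-origin")
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})
	log.Fatal(http.ListenAndServe(":8080", securityHeaders(mux)))
}
\end{lstlisting}
\vspace*{-\baselineskip}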
Furthermore, a new, powerful and comprehensive framework for controlling the
admissibility of content was devised more than ten years ago: the Content
Security Policy. Its capabilities superseded those of the previously mentioned
options, and it is discussed more in depth in the following section.
\n{2}{Content Security Policy}\label{sec:csp}
Content Security Policy (CSP) has been an important addition to the arsenal of
website operators, even though not everybody has necessarily been utilising it
properly or even taken notice. Once configured on the web server, it provides
guarantees and employs protections against most common attack vectors on
websites exactly where the websites are being parsed and displayed - in the
(compliant) browser.
As per Weichselbaum et al.\ CSP is a mechanism designed to mitigate
XSS~\cite{longliveCSP}, a long-lived king of the vulnerability
lists~\cite{reininginCSP}. It is a declarative policy mechanism that allows the
website operator to decide what client-side resources can load on their website and what origins
are permitted \emph{sources} of content.
For example, scripts can be restricted to only load from a list of trusted
domains, and inline scripts can be blocked entirely, which is a huge win
against popular XSS techniques.
Not only that, scripts and stylesheets can also be allowed based on a
cryptographic (SHA256, SHA384 or SHA512) hash of their content, which should
be information known to legitimate website operators prior to or at the time
the scripts are served, making sure no unauthorised script or stylesheet will
ever be run on the user's computer (running a compliant browser).
A policy of CSPv3, which is the current iteration of the concept, can be
served either as a header or inside the website's \texttt{<meta>} tag. The
configuration is either site-wide or specific to each page.
Directive names are generally derived from the \emph{sources} they are
covering, and are thus often suffixed `-src', as in \texttt{script-src},
\texttt{img-src} or \texttt{style-src}, although some directives do not follow
this pattern, with \texttt{form-action}, \texttt{upgrade-insecure-requests}
and \texttt{sandbox} representing this other group nicely.
Individual directives are delimited by semicolons, and each directive can only
appear once in the entire policy.
Special values exist for the origin website itself - \texttt{'self'} - and for
disallowing \emph{any} source - \texttt{'none'}.
A good policy is \emph{targeted} and not overly broad. To give an example, a
website that loads no JavaScript at all does not need to allow a popular CDN
(Content Delivery Network) origin in its \texttt{script-src}; instead, it
should be set to \texttt{'none'}. CSP can also aid with clickjacking
protection through its \texttt{frame-ancestors} directive, which can limit the
origins that are permitted to embed the website. This prevents the attacker
from embedding the website in random places, for example on malicious websites
that masquerade as legitimate, e.g.\ using a `Log in using xyz service' frame
that in actuality just harvests the credentials.
Getting CSP right can be tricky at first, but once grokked, it is relatively
straightforward and can greatly increase the security of a site.
The recommended way to \emph{test} a CSP is to enable it in \emph{report-only}
mode before turning it on in production.
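Listing~\ref{cspSketch} ties the above together: a hypothetical, deliberately
narrow policy for a site that serves all of its scripts and stylesheets itself
and is never meant to be embedded, served first via the report-only header so
that violations are only reported, not enforced. The directive values are
examples and would need to be tailored to the actual site.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={A hypothetical targeted CSP served in report-only mode},
label=cspSketch,
backgroundcolor=\color{lstbg},
]
// A hypothetical, deliberately narrow CSPv3 policy for a site that
// serves all of its scripts and styles itself and embeds nothing.
package main

import (
	"log"
	"net/http"
)

const policy = "default-src 'none'; " +
	"script-src 'self'; " +
	"style-src 'self'; " +
	"img-src 'self'; " +
	"form-action 'self'; " +
	"frame-ancestors 'none'"

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Switch to the "Content-Security-Policy" header once the
		// report-only run shows no unexpected violations.
		w.Header().Set("Content-Security-Policy-Report-Only", policy)
		w.Write([]byte("<!doctype html><title>csp demo</title>"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
\end{lstlisting}
\vspace*{-\baselineskip}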
There are many more directives and settings than those mentioned in this
section; the author encourages anybody interested to give it a read, e.g.\ at
\url{https://web.dev/csp/}.
\n{1}{Configuration}
Every non-trivial program usually offers at least \emph{some} way to tweak or
manage its behaviour, and these changes are usually persisted \emph{somewhere}
on the filesystem of the host: in a local SQLite3 database, a
\emph{LocalStorage} key-value store in the browser, or a binary or plain text
configuration file. These configuration files need to be read and checked at
least on program start-up, and either held in memory for the duration of the
program's runtime, or loaded, parsed and the memory subsequently \emph{freed}
again (initial configuration).
There is an abundance of configuration languages (or file formats used to
craft configuration files, whether they were intended for it or not)
available: TOML, INI, JSON and YAML, to name some of the popular ones (as of
today).
Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), and being at its core very similar to JSON,
which adds to a familiar feel. In fact, in Dhall's authors' own words it is:
``a programmable configuration language that you can think of as: JSON +
functions + types + imports''~\cite{dhalllang}.
Among all the listed features, the one especially intriguing to the author was
the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating, for instance, the declaration
and usage of custom types (which are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool},
\emph{Natural} or \emph{List}, to name just a few), so it was not exceedingly
hard to start designing a custom configuration \emph{schema} for the program.
Dhall, not being a Turing-complete language, also guarantees that evaluation
\emph{always} terminates, which is a good attribute for a configuration
language to possess.
\n{2}{Safety considerations}
Having a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are safety mechanisms in place that the user can rely
on. Dhall offers this in multiple features: enforcing a same-origin policy and
(optionally) pinning a cryptographic hash of the value of the expression being
imported.
\n{2}{Possible alternatives}
While developing the program, the author has also come across certain
shortcomings of Dhall, namely its long start-up time on a \emph{cold cache}.
This can generally be observed when running the program in an environment that
does not allow the cache files to be written persistently (a read-only
filesystem), or that does not keep the written cache files around, such as a
container that is not configured to mount persistent volumes at the pertinent
locations.
When performing an evaluation, Dhall resolves every expression down to a
combination of its most basic types (eliminating all abstraction and
indirection) in a process called \textbf{normalisation}~\cite{dhallnorm}, and
then saves this result in the host's cache. The \texttt{dhall-haskell} binary
attempts to resolve the \texttt{\$\{XDG\_CACHE\_HOME\}} variable (see the
\emph{XDG Base Directory Spec}~\cite{xdgbasedirspec} for details) to decide
\emph{where} the results of the normalisation will be written for repeated
use. Do note that this behaviour has been observed on a GNU/Linux host and the
author has not verified it on other platforms, such as FreeBSD.
If normalisation is performed inside an ephemeral container (as opposed to,
for instance, an interactive desktop session), the results effectively get
lost on each container restart. That is both wasteful and bad for user
experience, since the normalisation of just a handful of imports (which
internally branch widely) can take upwards of two minutes, during which the
user is left waiting on a seemingly hanging application with no reporting on
progress or current status.
Workarounds for the above-mentioned problem can be devised relatively easily,
but it would certainly \emph{feel} better if there was no need to work
\emph{around} the configuration system of choice. For instance, bind mounting
\emph{persistent} volumes to pertinent locations inside the container
(\texttt{\$\{XDG\_CACHE\_HOME\}/\{dhall,dhall-haskell\}}) would preserve cache
between restarts. Alternatively, the cache could be pre-computed on container
build (as the program is only expected to run with a compatible schema version,
and that version \emph{is} known at container build time for the supplied
configuration).
Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as an almost drop-in replacement for Dhall feature-wise, while also resolving
the costly \emph{cold cache} normalisation operations, which are, in the
author's view, Dhall's cardinal flaw. In a slightly contrasting approach,
another emerging project called \texttt{TySON}
(\url{https://github.com/jetpack-io/tyson}), which uses \emph{a subset} of
TypeScript to also create a programmable, strictly typed configuration
language, opted to take a well-known language instead of reinventing the
wheel, while still being able to retain feature parity with Dhall.
\n{1}{Compromise Monitoring}
There are, of course, several ways one could approach the monitoring of
compromised credentials, some more \emph{manual} in nature than others. When
using a service that is suspected or expected to be breached in the future,
one can always create a unique username/password combination specifically for
that service and never use that combination anywhere else. That way, if the
credentials ever \emph{do} happen to appear in a data dump online, it is safe
to assume where they came from.
Unfortunately, the task of actually \emph{monitoring} the credentials can
prove to be a little more arduous than one might expect at first. There are a
couple of points that can pose a challenge when the search is performed by
hand, namely:
\begin{itemize}
\item finding the breached data to look through
\item verifying the trustworthiness of the data
\item varying quality of the data
\item sifting through (possibly) unstructured data by hand
\end{itemize}
Of course, as this is a popular topic for a number of people, the
above-mentioned work has already been packaged into neat and practical online
offerings. In case one decides in favour of using those, an additional range
of issues arises (with the previous ones still applicable):
\begin{itemize}
\item the need to trust the provider of the service with input credentials
\item relying on the goodwill of the provider to be able to access the data
\item hoping that the terms of service are kept as promised
\item dependence on the quality and extent of their data sources
\end{itemize}
Besides that, there is a plethora of breaches floating around the Internet
available simply as zip files, which makes the job of password compromise
monitoring even harder.
The overarching goal of this thesis is devising and implementing a system in
which the user can \emph{monitor} whether their credentials have been
\emph{compromised} (at least as far as the data can tell), and allowing them to
do so without needing to entrust their sensitive data to a provider.
\n{2}{Data Sources}\label{sec:dataSources}
A data source is here considered to be anything that provides the application
with data that it understands.
Of course, the results of credential compromise verification/monitoring are
only going to be as good as the data underpinning them, which is why it is
imperative that high-quality data sources be used, if at all possible. And
while great care has to be taken to choose only the highest-quality data
sources, the application must also offer the means to actually utilise them.
The sources from which breached data can be loaded into an application can be
split into two basic categories: \textbf{online} or \textbf{local}, and it is
possible to further discern between them by whether the data they provide is
\emph{structured} or not.
An online source is generally a service that ideally exposes a programmatic
API, which an application can query and from which it can request the necessary
subsets of data.
These types of services often additionally front the data with a user-friendly
web interface for one-off searches, which is, however, not of use here.
Among examples of such online services are:
\begin{itemize}
\item {Have I Been Pwned?} - \url{https://haveibeenpwned.com}
\item {DeHashed} - \url{https://dehashed.com}
\end{itemize}
Large lumps of unstructured data available on forums or shady web servers
would technically also count here, given that they provide data and are
available online. However, even though data is frequently found online
precisely in this form, it is not of direct use to the application without
manual \emph{preprocessing}, as discussed in
Section~\ref{sec:localDatasetPlugin}.
Another source is then simply any locally supplied data, which, of course,
could have been obtained from a breach available online beforehand.
Locally supplied data is specific in that it needs to be formatted in such a
way that it is understood by the application. That is, the data supplied for
importing cannot be in its original raw form anymore, instead it has to have
been morphed into the precise shape the application needs for further
processing. Once imported, the application can query the data at will, as it
knows exactly the shape of it.
This supposes the existence of a \emph{format} for importing, the schema of
which is devised in Section~\ref{sec:localDatasetPlugin}.
\n{3}{Local Dataset Plugin}\label{sec:localDatasetPlugin}
Unstructured breach data from locally available datasets can be imported into
the application by first making sure it adheres to the specified schema (have a
look at the breach \texttt{ImportSchema} in Listing~\ref{breachImportSchema}).
If it does not (which is very likely with random breach data, as already
mentioned in Section~\ref{sec:dataSources}), it needs to be converted to a form
that \emph{does} before importing it to the application, e.g.\ using a Python
script or a similar method.
Attempting to import data that does not follow the outlined schema should
result in an error. Equally so, importing a dataset which is over a reasonable
size limit should by default be rejected by the program as a precaution.
Unmarshaling, for instance, a 1 TiB document would most likely result in an
out-of-memory (OOM) situation on the host running the application, assuming
contemporary consumer hardware conditions (not HPC).
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={Breach \texttt{ImportSchema} Go struct (imports from the standard
library assumed)},
label=breachImportSchema,
backgroundcolor=\color{lstbg},
morekeywords={any,time}
]
// ImportSchema is the model for importing locally available breach data.
type ImportSchema struct {
	Name              string
	Description       string
	Date              time.Time
	IsVerified        bool
	ContainsPasswords bool
	ContainsHashes    bool
	HashType          string
	HashSalted        bool
	HashPeppered      bool
	ContainsUsernames bool
	ContainsEmails    bool
	Data              *Data
}
\end{lstlisting}
\vspace*{-\baselineskip}
The Go \emph{struct} shown in Listing~\ref{breachImportSchema} will in
actuality translate to a YAML document written and supplied by an
administrative user of the program. And while the author is personally not the
greatest supporter of YAML, the format was still chosen for several reasons:
\begin{itemize}
\item relative ease of use (plain text, readability) for machines and people
alike
\item capability to store multiple \emph{documents} inside of a single file
\item most of the inputs being implicitly typed as strings
\item support for inclusion of comments
\item machine readability thanks to being a superset of JSON
\end{itemize}
The last point specifically should allow for documents similar to what can be
seen in Listing~\ref{breachDataYAMLSchema} to be ingested by the program, read
and written by humans and programs alike.
\smallskip
\begin{lstlisting}[style=yaml,
caption={A YAML file containing breach data formatted according to the
\texttt{ImportSchema},
optionally containing multiple documents},
label=breachDataYAMLSchema,
backgroundcolor=\color{lstbg},
]
---
name: Horrible breach
date: 2022-04-23T00:00:00+02:00
description: impacted X in 2022, it contains 10 000 unique emails...
isVerified: false
containsPasswds: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    - ""
    - email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{lstlisting}
\vspace*{-\baselineskip}
Notice how the emails list (\texttt{.data/emails}) in
Listing~\ref{breachDataYAMLSchema} is missing one record, perhaps because it
was mistakenly omitted due to either machine error or unfamiliarity with the
format. This is a valid scenario (mistakes do happen) and the application
needs to account for it. Alternatively, the program could start dropping
empty/partial records, but that behaviour could quickly lead to unhappy users.
The golden rule for the program is to \emph{always do the expected thing} (and
also not to be overly smart about it, i.e.\ the simpler program flow is often
the better one).
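A minimal sketch of how such multi-document YAML input could be ingested is
shown in Listing~\ref{yamlIngestSketch}. It uses the \texttt{gopkg.in/yaml.v3}
package; the struct mirrors only a subset of the \texttt{ImportSchema} fields
and its tags are illustrative, so the application's actual importer may well
differ.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={Ingesting multi-document YAML breach data (illustrative sketch)},
label=yamlIngestSketch,
backgroundcolor=\color{lstbg},
]
// A sketch of ingesting a multi-document YAML file of breach data.
// The struct mirrors a subset of ImportSchema; field names and tags
// are illustrative and may differ from the application's actual code.
package main

import (
	"errors"
	"fmt"
	"io"
	"os"

	"gopkg.in/yaml.v3"
)

type breach struct {
	Name        string `yaml:"name"`
	Description string `yaml:"description"`
	HashType    string `yaml:"hashType"`
	Data        struct {
		Hashes []string `yaml:"hashes"`
		Emails []string `yaml:"emails"`
	} `yaml:"data"`
}

func main() {
	f, err := os.Open("breaches.yaml")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	dec := yaml.NewDecoder(f)
	for {
		var b breach
		// Decode reads one YAML document at a time and returns io.EOF
		// once the file is exhausted.
		if err := dec.Decode(&b); errors.Is(err, io.EOF) {
			break
		} else if err != nil {
			panic(err) // reject malformed documents instead of guessing
		}
		fmt.Printf("%s: %d hashes, %d emails\n",
			b.Name, len(b.Data.Hashes), len(b.Data.Emails))
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}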
\n{3}{Have I Been Pwned? Integration}
Troy Hunt's \textbf{Have I Been Pwned?} online service
(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
compromised data. The service offers public APIs, which were originally
provided free of charge and with little-to-no rate-limiting (and it was the
intention of their author that they stay that way). A major overhaul in this
regard was revealed in November of 2022, when, in addition to a new rate-limit
system, symbolic fees of differing levels were introduced for obtaining API
keys. Apparently, the top consumers of the API utilised it orders of magnitude
more than the average person, which led Hunt to devise a new, tiered API
access system in which the \emph{little guys} would not be subsidising the
\emph{big guys}. Additionally, the symbolic fee of \$3.50 a month for the
entry-level 10 requests-per-minute API key was meant to serve as a small
barrier for (mis)users with nefarious purposes, while posing practically no
obstacle for \emph{legitimate} users, which is entirely
reasonable~\cite{hibpBillingChanges}.
The application's \texttt{hibp} module and database representation
(\texttt{schema.HIBPSchema}) attempts to model the values returned by this API
and declare actions to be performed upon the data, which is what facilitates
the breach search functionality in the program.
The architecture is relatively simple. Breach data, including title, date,
description and tags, is cached by the application on start-up, as this API is
not authenticated. In order for the authenticated API to be called, the
application administrator first needs to configure an API key for the HIBP
service via the management interface. The user can then enter the desired
query parameters, and the application constructs the API call, sends it to the
authenticated API and awaits the response. As the API is rate-limited
(individually, based on the API key supplied), sending requests immediately as
they arrive from users would likely pose an issue at times of high utilisation
and would result in the application ending up unnecessarily throttled. Request
sending thus needs to be handled by a request scheduler in the backend, and
reflected appropriately in the UI.
After a response from the API server arrives, the application attempts to
\emph{bind} the returned data to the pre-programmed \emph{model} for
validation, before finally parsing it. If the data can be successfully
validated, it is saved into the database as a cache and the search query is
performed on the saved data. The result is then displayed to the user for
browsing.
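For illustration, Listing~\ref{hibpRequestSketch} sketches what such an
authenticated call might look like. The endpoint and the
\texttt{hibp-api-key} and \texttt{User-Agent} headers follow the publicly
documented version 3 of the HIBP API; the surrounding types and the function
itself are illustrative and do not represent the application's actual
\texttt{hibp} module.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={Querying the authenticated HIBP v3 API (illustrative sketch)},
label=hibpRequestSketch,
backgroundcolor=\color{lstbg},
]
// A sketch of querying the authenticated HIBP v3 API for breaches
// affecting a single account.
package hibp

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

type breach struct {
	Name       string `json:"Name"`
	Title      string `json:"Title"`
	BreachDate string `json:"BreachDate"`
}

func breachedAccount(apiKey, account string) ([]breach, error) {
	u := "https://haveibeenpwned.com/api/v3/breachedaccount/" +
		url.PathEscape(account) + "?truncateResponse=false"
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("hibp-api-key", apiKey)
	req.Header.Set("User-Agent", "example-compromise-monitor")

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		var breaches []breach
		return breaches, json.NewDecoder(resp.Body).Decode(&breaches)
	case http.StatusNotFound:
		return nil, nil // no breach found for the account
	case http.StatusTooManyRequests:
		return nil, fmt.Errorf("rate limited, retry later")
	default:
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}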
\n{1}{Deployment recommendations}\label{sec:deploymentRecommendations}
It is, of course, recommended that the application run in a secure
environment, \allowbreak although definitions of that almost certainly differ
depending on who is asked. General recommendations would be either to
effectively reserve a
machine for a single use case - running this program - so as to dramatically
decrease the potential attack surface of the host, or run the program isolated
in a container or a virtual machine. Furthermore, if the host does not need
management access (it is a deployed-to-only machine that is configured
out-of-band, such as with a \emph{golden} image/container or declaratively with
Nix), then an SSH \emph{daemon} should not be running in it, since it is not
needed. In an ideal scenario, the host machine would have as little software
installed as possible besides what the application absolutely requires.
System-wide cryptographic policies should target the highest feasible security
level, if at all available (as is the case by default on e.g.\ Fedora),
covering the SSH, DNSSEC and TLS protocols. Firewalls should be configured,
and SELinux (a kernel-level mandatory access control and security policy
mechanism) should be running in \emph{enforcing} mode, if available.
\n{2}{Transport security}
Users connecting to the application should rightfully expect their data to be
protected \textit{in transit} (i.e.\ on the way between their browser and the
server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims to be) and
negotiate a symmetric key for encryption in a process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications channel. The operator should
configure the program to either directly utilise TLS via configuration, or
have it listen behind a TLS-terminating \emph{reverse proxy}.
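If the program is to terminate TLS itself, a minimal sketch of such a
configuration using Go's standard library could look like
Listing~\ref{tlsServeSketch}; the certificate and key paths are hypothetical,
and the minimum version is set to TLSv1.2, with TLSv1.3 negotiated
automatically whenever both sides support it.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={Serving the application over TLS with a minimum protocol version
(illustrative sketch)},
label=tlsServeSketch,
backgroundcolor=\color{lstbg},
]
// Serving the application directly over TLS, restricted to TLS 1.2+.
// The certificate and key paths are hypothetical.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello over TLS"))
	})

	srv := &http.Server{
		Addr:      ":8443",
		Handler:   mux,
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	// Cipher suites only need tuning for TLS 1.2; the TLS 1.3 suites
	// are fixed by the standard library.
	log.Fatal(srv.ListenAndServeTLS("/etc/app/tls/cert.pem",
		"/etc/app/tls/key.pem"))
}
\end{lstlisting}
\vspace*{-\baselineskip}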
\n{2}{Containerisation}
Whether containerised or not, the application needs runtime access to secrets
such as cookie encryption and authentication keys, or the database connection
string (containing database host, port, user, password/encrypted password,
authentication method and database name). It is a relatively common practice to
deliver secrets to programs in configuration files; however, environment
variables should be preferred. The program could go one step further and only
accept certain secrets as environment variables.
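A sketch of the environment-variable approach is shown in
Listing~\ref{dbDsnSketch}; the variable name is hypothetical and the
\texttt{pgx} driver is just one possible choice of Postgres driver.
\vspace{\parskip}
\begin{lstlisting}[language=Go,
caption={Reading the database connection string from the environment
(illustrative sketch)},
label=dbDsnSketch,
backgroundcolor=\color{lstbg},
]
// Taking the database connection string exclusively from an
// environment variable; scram-sha-256 authentication is negotiated by
// the driver based on the server's configuration.
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
	dsn := os.Getenv("APP_DB_DSN")
	if dsn == "" {
		log.Fatal("APP_DB_DSN must be set, e.g. " +
			"postgres://user:password@db:5432/app?sslmode=verify-full")
	}

	db, err := sql.Open("pgx", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("database reachable")
}
\end{lstlisting}
\vspace*{-\baselineskip}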
While it is not impossible to run a process scheduler (such as systemd) inside
a container, containers are best suited for single-program workloads. The fact
that the application needs persistent storage also raises the question of
\emph{how to run the database in the container}. Should the data be stored
inside the ephemeral container, it could end up being very short-lived (wiped
on container restart), and barring container root volume snapshotting, it
could turn backing up the data into a chore, which are likely not the desired
properties in this case. Moreover, it is the opinion of the author that
multi-process scheduling would inordinately complicate the container set-up:
instead of running a single program per container, which also provides a good
amount of isolation if done properly, running multiple programs in one
container would likely achieve the opposite.
As per the above, a more \emph{sane} thing to do is to store data externally
using a proper persistent storage method, such as a database. With Postgres
being the safe bet among database engines, the program should be able to handle
Postgres' most common authentication methods, namely \emph{peer},
\emph{scram-sha-256} and raw \emph{password}, although the \emph{password}
option should not be used in production, \emph{unless} the database connection
is protected by TLS~\cite{pgauthmethods}. In any case, using the
\emph{scram-sha-256} method is preferable~\cite{scramsha256rfc7677}. One way to
verify during development that authentication works as intended is the
\emph{Password generator for PostgreSQL} tool, which generates an encrypted
string from a raw user input~\cite{goscramsha256}.
If the application wants to use the \emph{peer} authentication method, it is up
to the operator to supply the Postgres socket to the container (e.g.\ as a
volume bind mount). Equally, the operator needs to make sure that the database
is either running in a network that is also directly attached to the container
or that there is a mechanism in place that routes the requests for the database
hostname to the destination, unless a static IP configuration is used, which is
also possible.
Practically every container runtime satisfies this use case with a
\emph{name-based routing} mechanism that enables the resolution of container
names inside \emph{pods} (in the case of Podman/Kubernetes) or common default
networks (that are both NAT-ted \emph{and} routed). This abstraction is the
responsibility of specially configured (most often auto-configured) pieces of
software - Aardvark in the case of Podman, CoreDNS for Kubernetes - and it
makes using short-lived containers in dynamic networks convenient.
\n{1}{Summary}
Passwords (and/or passphrases) are in use everywhere and will quite probably
continue to be for the foreseeable future. If not as \textit{the} principal way
to authenticate, then at least as \textit{a} way to authenticate. And for as
long as passwords are going to be handled and stored, they \emph{are} going to
get leaked, be it due to user or provider carelessness, or the attackers'
resolve and wit. Of course, sifting through the heaps of available password
breach data by hand is not a reasonable option, and therefore tools providing
assistance come in handy. The following part of this thesis will explore that
issue and suggest a solution.
% =========================================================================== %