
% =========================================================================== %
% Encoding: UTF-8 (žluťoučký kůň úpěl ďábelšké ódy)
% =========================================================================== %
\vspace*{\fill}
\begin{center}
\Large
\textit{This is a document draft.}
\end{center}
\vspace*{\fill}
\newpage
% =========================================================================== %
\nn{Introduction}
Introduce the goals and the methods attempted to achieve the goals.
% =========================================================================== %
\part{Theoretical part}
\n{1}{Purpose}
What this write-up is attempting to achieve.
\n{1}{Terminology}
\n{2}{Linux}
The term \emph{Linux} is used exclusively to refer to the Linux
kernel~\cite{linux}.
\n{2}{GNU/Linux}
When referring to a Linux-based operating system, the term ``GNU/Linux''
as defined by the Free Software Foundation~\cite{fsfgnulinux} is used. While it
is longer and arguably a little cumbersome, the author aligns with the
opinion that this term more accurately describes its actual target. Aware that
many people conflate the complete operating system with its (albeit core)
component, the kernel, the author takes care to distinguish the two, even
though, speaking from experience, this colloquially tends to cause more
confusion and usually requires a lengthy explanation.
\n{2}{Containers}
When the concepts of \emph{containerisation} and \emph{containers} are
mentioned throughout this work, the author has OCI
containers~\cite{ocicontainers} in mind, which are broadly a superset of
\emph{Linux Containers}: some set of processes is presented with a view of
kernel resources (there are multiple kinds of resources, such as IPC queues;
network devices, stacks and ports; mount points; process IDs; user and group
IDs; Cgroups and others) that differs for each such set of processes. The idea
is similar to FreeBSD \emph{jails}~\cite{freebsdjails}, with the distinction
that OCI containers are, of course, facilitated by the Linux kernel namespace
functionality~\cite{linuxnamespaces}, which is in turn regarded as being
\emph{inspired} by Plan 9's namespaces~\cite{plan9namespaces}, Plan 9 being a
Bell Labs successor to Unix 8th Edition, discontinued in 2015.
While there without a doubt \emph{is} specificity bound to using each of the
tools that enable creating (Podman vs.\ Buildah vs.\ Docker BuildX) or running
(ContainerD vs.\ runC vs.\ crun) container images, when describing an action
performed with or on a container, the process should generally be explained in
such a way that it is repeatable using any spec-conforming tool that is
available and \emph{intended for the job}.
\n{1}{Cryptography primer}\label{sec:cryptographyprimer}
Pre-requisites necessary for following up.
\n{2}{Hash functions}
Explanation. What are hash functions
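As a minimal illustration (in Go, the language used in the practical part of
this work), the snippet below hashes two messages that differ in a single
character; the resulting digests are of a fixed length and bear no obvious
relation either to the inputs or to each other.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    // Hash two messages differing in a single character.
    a := sha256.Sum256([]byte("password123"))
    b := sha256.Sum256([]byte("password124"))

    // Both digests are 32 bytes long and look completely unrelated.
    fmt.Printf("%x\n%x\n", a, b)
}
\end{verbatim}
\end{varwidth}
\caption{Hashing two similar messages with SHA-256 in Go}
\label{fig:sha256example}
\end{figure}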
\n{3}{Uses and \textit{mis}uses}
The good, the bad and the ugly of hash usage (including or in some cases
excluding salting, weak hashes, split hashes (Microsoft)).
\n{3}{Threats to hashes}
Rainbow tables, broken hash functions\ldots
\n{2}{Encryption}
Generally.
\n{3}{The key exchange problem}
\n{3}{The key protection problem}
\n{1}{Brief passwords history}\label{sec:history}
\n{2}{Purpose over time}
\n{2}{What is considered a password}
\n{2}{Problems with passwords}
\n{3}{Arbitrary length requirements (min/max)}
\n{3}{Arbitrary complexity requirements}
\n{3}{Restricting special characters}
For as long as passwords have been used to protect privileged access, service
providers have all too often been found to forbid the use of so-called
\textit{special characters} in them. The ways of achieving this may vary, but
the intent stays the same: prevent users from inputting characters that the
system cannot comfortably handle, for one reason or another.
\n{1}{Password strength validation}
Entropy, dictionaries, multiple factors.
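As a baseline, the entropy of a password of length $L$ whose characters are
chosen uniformly and independently from an alphabet of $N$ symbols can be
estimated as
\[
H = L \cdot \log_2 N,
\]
measured in bits; an 8-character password drawn from the 95 printable ASCII
characters thus carries roughly $8 \cdot \log_2 95 \approx 52.6$ bits, while
human-chosen passwords tend to contain considerably less entropy than this
estimate suggests, which is where dictionaries and additional factors come
into play.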
\n{1}{Web security}\label{sec:websecurity}
The internet, being a vast space of intertwined concepts and ideas, is a
superset of the Web, and it is the Web that this section focuses on.
\n{2}{Browsers}\label{sec:browsers}
What they are, what do they do, how they relate to the security aspect
(privileged process running untrusted code on user's computer), history,
present, security focus of the dev teams, user facing signalling (padlock
colours, scary warnings).
TODO: describe how browsers find out where the web page lives, get a webpage,
parse it, parse stylesheets, run scripts, apply SAMEORIGIN restrictions etc.
\n{2}{Cross-site scripting}\label{sec:xss}
\n{2}{Content Security Policy}\label{sec:csp}
Content Security Policy has been an important addition to the arsenal of
website administrators, even though not everybody has necessarily taken notice
or even utilised it properly. To understand what guarantees it provides and
what kind of protections it employs, it is first necessary to grok how websites
are parsed and displayed, which has been discussed in depth in
Section~\ref{sec:browsers}.
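As a concrete illustration of what deploying such a policy can look like in an
application built with the Echo framework (used later in the practical part),
the following sketch attaches a restrictive example policy to every response
via Echo's \texttt{Secure} middleware; the directives shown are illustrative
only and are not the policy used by \texttt{pcmt}, which takes its policy from
the configuration.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/labstack/echo/v4/middleware"
)

func main() {
    e := echo.New()

    // Attach an example Content-Security-Policy header to every
    // response; these directives only allow same-origin resources.
    e.Use(middleware.SecureWithConfig(middleware.SecureConfig{
        ContentSecurityPolicy: "default-src 'self'; frame-ancestors 'none'",
    }))

    e.GET("/", func(c echo.Context) error {
        return c.String(http.StatusOK, "hello")
    })

    e.Logger.Fatal(e.Start(":8080"))
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative sketch of setting a Content Security Policy header with
Echo middleware}
\label{fig:cspsketch}
\end{figure}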
\n{1}{Sandboxing}\label{sec:sandboxing}
\n{2}{User isolation}
Admin vs regular user, privilege escalation, least-privilege principle,
zero-trust principle.
\n{2}{Process isolation}
Sandbox escape.
\n{2}{Namespaced isolation}
Sandbox escape.
\n{1}{Data storage}
Among the key aspects of any security-minded system (application), the
following are certain to make the list:
\begin{enumerate}
\item data integrity
\item data authenticity
\item data confidentiality
\end{enumerate}
\n{2}{Integrity}
\n{2}{Authenticity}
\n{2}{Confidentiality}
\n{2}{Encryption-at-rest}
\n{1}{Compromise checking and prevention}
\n{2}{HIBP and similar tools}
\n{2}{OWASP Top 10 for the implementers}
\n{2}{Password best practices}
% =========================================================================== %
\part{Practical part}
\n{1}{Kudos}
\textbf{Disclaimer:} the author is not affiliated in any way with any of the
projects mentioned in this section.
The \textit{Password Compromise Monitoring Tool} (\texttt{pcmt}) program has
been developed using a great deal of free (as in Freedom) and open-source
software, either directly or as outstanding working tools, and the author
would like to take this opportunity to recognise that fact.
In particular, the author acknowledges that this work would not be the same
without:
\begin{itemize}
\item vim (\url{https://www.vim.org/})
\item Arch Linux (\url{https://archlinux.org/})
\item ZSH (\url{https://www.zsh.org/})
\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
\item Nix (\url{https://nixos.org/explore.html})
\item pre-commit (\url{https://pre-commit.com/})
\item Podman (\url{https://podman.io/})
\item Go (\url{https://go.dev/})
\end{itemize}
All of the code written has been typed into vim (\texttt{9.0}) and the shell
used to run the commands was ZSH, both running in the author's terminal
emulator of choice, \texttt{kitty}, on an (at the time of writing)
\raisebox{.8ex}{\texttildelow}8-month-old installation of \textit{Arch Linux
(by the way)} using a \texttt{6.3.1-wanderer-zfs-xanmod1} variant of the Linux
kernel.
\n{1}{Development}
The source code of the project has been versioned since its inception using
the popular and industry-standard git (\url{https://git-scm.com}) source code
management (SCM) tool. Commits were made frequently and, whenever possible,
for small and self-contained changes of code, striving for sane commit message
\emph{hygiene}, i.e.\ meaningful and well-formatted commit messages. The name
of the default branch is \texttt{development}, since that is what the author
likes to choose for new projects that are not yet stable (it is in fact the
default in the author's \texttt{.gitconfig}).
\n{2}{Commit signing}
Since git allows cryptographically \emph{signing} all commits, it would be
unwise not to take advantage of this. For the longest time, GPG was the only
method available for signing commits in git; however, that is no longer the
case~\cite{agwagitssh}. These days, it is also possible to both sign and
verify git commits (and tags!) using SSH keys, namely those produced by
OpenSSH (the same ones that can be used to log in to remote systems). The
author has, of course, not reused the key pair that is used to connect to
machines for signing commits; a different \texttt{Ed25519} elliptic-curve key
pair has been used specifically for signing. The public component of this key
is enclosed with this thesis as an attachment for future reference.
The validity of a signature on a particular commit can be viewed with git using
the following commands (the \% sign denotes the shell prompt):
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
% cd <cloned project dir>
% git show --show-signature <commit>
% # alternatively:
% git verify-commit <commit>
\end{verbatim}
\end{varwidth}
\caption{Verifying signature of a git commit}
\label{fig:gitverif}
\end{figure}
There is one caveat to this, though: git first needs some additional
configuration for the code in Figure~\ref{fig:gitverif} to work as one would
expect. Namely, the public key used to verify the signature needs to be stored
in git's ``allowed signers file'', git needs to be told where that file
resides using the configuration value \texttt{gpg.ssh.allowedsignersfile}, and
finally the configuration value of the \texttt{gpg.format} field needs to be
set to \texttt{ssh}.
Since git allows configuration values to be local to each repository, all of
the mentioned requirements can be satisfied by running the following commands
from inside the cloned repository:
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key.
% cat >./tmp/.allowed_signers \
<<<'leo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKwshTdBgLzwY4d8N7VainZCngH88OwvPGhZ6bm87rBO'
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./tmp/.allowed_signers
\end{verbatim}
\end{varwidth}
\caption{Prepare allowed signers file and signature format for git}
\label{fig:gitsshprep}
\end{figure}
After the code in Figure~\ref{fig:gitsshprep} is run, everything from
Figure~\ref{fig:gitverif} should remain applicable for the lifetime of the
repository or until git changes its implementation of signature verification.
For future reference, git has been used in version \texttt{2.40.1}.
\n{2}{Continuous Integration}
To increase both the author's and public confidence in the atomic changes made
over time, the changes have been thoroughly \emph{integrated} using a
continuous integration (CI) service plugged into the main source code
repository since the early stages of development. This, of course, was again
self-hosted, including the workers. The tool of choice was Drone
(\url{https://drone.io}), and the ``docker'' runner (in fact it runs any OCI
container) was used to run the builds.
The way this runner works is that it creates an ephemeral container for every
pipeline step and executes the given \emph{commands} inside of it. At the end
of each step the container is discarded, while the repository, which is
mounted into each container at \texttt{/drone/src}, is persisted between
steps, allowing it to be cloned from \emph{origin} only at the start of the
pipeline and then shared for all of the following steps, saving bandwidth,
time and disk writes.
The entire configuration used to run the pipelines can be found in a file
named \texttt{.drone.yml} at the root of the main source code repository. The
workflow consists of three pipelines, which are run in parallel. Two main
pipelines were defined to build the binary and run tests on \texttt{x86\_64}
GNU/Linux targets, one for each of Arch and Alpine (version 3.17).
These two pipelines were identical apart from OS-specific bits such as
installing a certain package, etc.
For the record, other OS-architecture combinations were not tested.
A third pipeline was defined to build a popular static analysis tool called
\texttt{golangci-lint}, which is a sort of meta-linter bundling a staggering
amount of linters (a linter is a tool that performs static code analysis and
can raise awareness of programming errors, flag potentially buggy code
constructs, or \emph{mere} stylistic errors), from sources and then perform
the analysis of the project's codebase using the freshly built binary. If this
step succeeds, a handful of code analysis services get pinged in the following
steps to take notice of the changes to the project's source code and update
their metrics. Details can be found in the main Drone configuration file
\texttt{.drone.yml}, and the configuration of \texttt{golangci-lint} can be
found in the root of the repository in the file named \texttt{.golangci.yml}.
The median build time as of writing was 1 minute, which includes running all
three pipelines, and that is considered acceptable.
\obr{Drone CI median
build}{fig:drone-median-build}{.77}{graphics/drone-median-build}
\n{2}{Source code repositories}\label{sec:repos}
All of the pertaining source code was published in repositories on a publicly
available git server operated by the author. The reasoning \emph{pro}
self-hosting is that it is the preferred way of guaranteeing autonomy over
one's source code, as opposed to large silos owned by big corporations, which
have a track record of arguably not always deciding with the user's best
interest in mind, acting on impulse or under public pressure (potentially at
least temporarily disrupting their users' operations), thus binding their
users to lengthy \emph{terms of service} that \emph{can change at any time}.
Granted, decentralisation can take a toll on the discoverability of the
project, but that is not of concern here.
The git repository containing source code of the \texttt{pcmt} project:\\
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.
The git repository hosting the \texttt{pcmt} configuration schema:\\
\url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.
The repository containing the \LaTeX{} source code of this thesis:\\
\url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.
\n{2}{Toolchain}
Throughout the creation of this work, the \emph{current} version of the Go
programming language was used, i.e.\ \texttt{go1.20}.
\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
\textbf{Name} & \textbf{Usage} \\
Go programming language & program core \\
Dhall configuration language & program configuration \\
Echo & HTTP handlers, controllers, web server \\
bluemonday & HTML sanitising \\
TailwindCSS & creating stylesheets using a utility-first approach \\
}
\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
\textbf{Name} & \textbf{Version} \\
\texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2\\
\texttt{ent} (\url{https://entgo.io/}) & 0.11.10 \\
\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.23 \\
\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
}
\n{2}{A word about Go}
First, a question of \textit{`Why pick Go for building a web
application?'} might arise, so the following few lines will try to address
that.
Go~\cite{golang} (or \emph{Golang} for SEO-friendliness) is a strongly typed, high-level
\emph{garbage collected} language where functions are first-class citizens and
errors are values.
The appeal for the author comes from a number of language features, such as
built-in support for concurrency and testing, sane \emph{zero} values, the
lack of pointer arithmetic, inheritance and implicit type conversions,
easy-to-read syntax, producing a statically linked binary by default, etc.; on
top of that, the language has a cute mascot.
Due to the foresight of the Go Authors regarding \emph{the formatting
question} (i.e.\ where to put the braces, tabs vs.\ spaces, etc.), most of the
discussions on this topic have been foregone. Every
\emph{gopher}\footnote{euph.\ a person writing in the Go programming language}
is expected to format their source code with the official formatter
(\texttt{gofmt}), which automatically ensures that the code adheres to the
official formatting standards.
\n{2}{A word about Nix}
\url{https://builtwithnix.org/}
\n{2}{Configuration}
Every non-trivial program usually offers at least \emph{some} way to
tweak/manage its behaviour, and these changes are usually persisted
\emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a
\emph{LocalStorage} key-value store in the browser, or a binary or plain-text
configuration file. Such configuration files need to be read and checked at
least on program start-up and either kept in operating memory for the duration
of the program's runtime, or loaded and parsed with the memory subsequently
\emph{freed} (initial configuration).
There is an abundance of configuration languages (or file formats used to
craft configuration files) available: TOML, INI, JSON and YAML, to name some
of the popular ones (as of today).
Dhall stood out as a language designed with both security and the needs of
dynamic configuration scenarios in mind, borrowing a concept or two from
Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a few of
its concepts from Haskell), and at its core being very similar to JSON, which
adds to the familiar feel. In fact, in the words of Dhall's authors, it is:
``a programmable configuration language that you can think of as: JSON +
functions + types + imports''~\cite{dhalllang}.
Among all of the listed features, the one especially intriguing to the author
was the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating, for instance, the declaration
and usage of custom types (which are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool},
\emph{Natural} or \emph{List}, to name just a few), so it was not exceedingly
hard to start designing a custom configuration \emph{schema} for the program.
Dhall not being a Turing-complete language also guarantees that evaluation
\emph{always} terminates, which is a desirable attribute for a configuration
language.
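To sketch how such a typed configuration can be consumed from Go, the
\texttt{dhall-golang} library (listed in Table~\ref{tab:depsversionmx}) is
able to evaluate, type-check and unmarshal a Dhall expression directly into a
Go struct; the struct below is deliberately tiny and illustrative and does not
mirror the full \texttt{pcmt} schema.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "fmt"

    "github.com/philandstuff/dhall-golang/v6"
)

// Config is a deliberately tiny, illustrative subset of a configuration.
type Config struct {
    Host string
    Port uint
}

func main() {
    var cfg Config

    // Evaluate and type-check the Dhall expression, then map the
    // resulting record onto the Go struct.
    err := dhall.Unmarshal([]byte(`{ Host = "::1", Port = 3000 }`), &cfg)
    if err != nil {
        panic(err)
    }

    fmt.Printf("%+v\n", cfg)
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative unmarshalling of a Dhall expression into a Go struct}
\label{fig:dhallgosketch}
\end{figure}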
\n{3}{Schema}
The configuration schema was at first being developed as part of the main
project's repository, before it was determined that it would benefit both the
development and overall clarity if the schema lived in its own repository (see
Section~\ref{sec:repos} for details).
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
let Schema =
      { Type =
          { Host : Text
          , Port : Natural
          , HTTP :
              { Domain : Text
              , Secure : Bool
              , AutoTLS : Bool
              , TLSKeyPath : Text
              , TLSCertKeyPath : Text
              , HSTSMaxAge : Natural
              , ContentSecurityPolicy : Text
              , RateLimit : Natural
              , Gzip : Natural
              , Timeout : Natural
              }
          , Mailer :
              { Enabled : Bool
              , Protocol : Text
              , SMTPAddr : Text
              , SMTPPort : Natural
              , ForceTrustServerCert : Bool
              , EnableHELO : Bool
              , HELOHostname : Text
              , Auth : Text
              , From : Text
              , User : Text
              , Password : Text
              , SubjectPrefix : Text
              , SendPlainText : Bool
              }
          , LiveMode : Bool
          , DevelMode : Bool
          , AppPath : Text
          , Session :
              { CookieName : Text
              , CookieAuthSecret : Text
              , CookieEncrSecret : Text
              , MaxAge : Natural
              }
          , Logger : { JSON : Bool, Fmt : Optional Text }
          , Init : { CreateAdmin : Bool, AdminPassword : Text }
          , Registration : { Allowed : Bool }
          }
      , default = {=}
      }

in  Schema
\end{verbatim}
\end{varwidth}
\caption{Dhall configuration schema version 0.0.1-rc.1}
\label{fig:dhallschema}
\end{figure}
\n{3}{Safety considerations}
With a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are safety mechanisms employed which the user can rely
on. Dhall offers these in multiple features: enforcing a same-origin policy
and (optionally) pinning a cryptographic hash of the value of the expression
being imported.
\n{3}{Possible alternatives}
While developing the program, the author also came across certain shortcomings
of Dhall, namely its long start-up with a \emph{cold cache}, which can
generally be observed when running the program in a \emph{container}.
When performing an evaluation, Dhall resolves every expression down to a
combination of its most basic types (eliminating all abstraction and
indirection) in a process called \textbf{normalisation}~\cite{dhallnorm} and
then saves this result in the host's cache. The \texttt{dhall-haskell} binary
attempts to resolve the variable \texttt{XDG\_CACHE\_HOME} (have a look at the
\emph{XDG Base Directory Spec}~\cite{xdgbasedirspec} for details) to decide
\emph{where} the results of the normalisation will be written for repeated
use. Do note that this behaviour has been observed on a GNU/Linux host and the
author has not verified it on a non-GNU/Linux host.
If normalisation is performed inside an ephemeral container (as opposed to,
for instance, an interactive desktop session), the results effectively get
lost on each container restart, which is both wasteful and bad for user
experience, since the normalisation of just a handful of imports can take
upwards of 2 minutes, during which the user is left waiting for the hanging
application.
While workarounds for the above-mentioned problem can be devised relatively
easily (bind-mount volumes inside the container in place of
\texttt{XDG\_CACHE\_HOME/dhall} and \texttt{XDG\_CACHE\_HOME/dhall-haskell} to
preserve the cache between restarts, or let the cache be pre-computed during
the container build, since the application is only really expected to run
together with a compatible version of the configuration schema and this
version \emph{is} known at container build time), it would certainly feel
better if there was no need to work \emph{around} the configuration system of
choice.
Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as a potentially almost drop-in replacement for Dhall feature-wise, while also
resolving the issue described above (costly normalisation operations with a
\emph{cold cache}).
\n{2}{Production}
It is, of course, recommended that the application runs in a secure
environment, although definitions of that almost certainly differ depending on
whom you ask. The general recommendation would be to effectively reserve a
machine for a single use case, namely running this program, so as to
dramatically decrease the potential attack surface of the host. If the host
does not need management access (it is a deployed-to-only machine that is
configured out-of-band, such as with a \emph{golden} image/container), then do
not run SSH on it. In an ideal scenario, the host machine would have as little
software installed as possible besides what the application absolutely
requires.
\n{1}{Application architecture}
\n{2}{Data integrity and authenticity}
The user can interact with the application via a web client, such as a
browser, and is required to authenticate for all sensitive operations. To not
only know \emph{who} the user is but also to make sure they are
\emph{permitted} to perform the action they are attempting, the program
employs an \emph{authorisation} mechanism in the form of sessions. On the
client side, these are represented by cryptographically signed and encrypted
(using 256-bit AES) cookies. That lays the foundations for a few things: the
data saved into the cookies can be regarded as private, because short of
future \emph{quantum computers} only the program itself can decrypt and access
it, and the data can be trusted, since it is both signed using a key that only
the program controls and \emph{encrypted} with \emph{another} key that equally
only the program holds.
The cookie data is only ever written \emph{or} read on the server side,
solidifying the author's decision to let it be encrypted, as there is no point
in not encrypting it for some perceived client-side simplification. Users
navigating the website send their session cookie with \textbf{every request}
(if it exists) to the server, which then verifies the integrity of the data
and, in case it is valid, determines the existence and potential level of user
privilege that should be granted. Public endpoints do not mandate the presence
of a valid session by definition, while at protected endpoints the user is
authenticated on every request. When a session expires or if there is no
session to begin with, the user is either shown a \emph{Not found} error
message, an \emph{Unauthorised} error message, or redirected to
\texttt{/signin}.
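A minimal sketch of how such signed-and-encrypted cookies can be produced in
Go is shown below, here using the \texttt{gorilla/sessions} cookie store
purely for illustration (the actual implementation may differ in its details);
supplying a 32-byte encryption key selects AES-256, while the separate
authentication key is used for the HMAC signature.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "net/http"

    "github.com/gorilla/securecookie"
    "github.com/gorilla/sessions"
)

func main() {
    // Two independent, randomly generated keys: one for signing (HMAC),
    // one for encryption (a 32-byte key selects AES-256).
    authKey := securecookie.GenerateRandomKey(64)
    encKey := securecookie.GenerateRandomKey(32)

    store := sessions.NewCookieStore(authKey, encKey)

    http.HandleFunc("/signin", func(w http.ResponseWriter, r *http.Request) {
        sess, _ := store.Get(r, "session")
        // Values are transparently encrypted and signed on save; the
        // client only ever sees an opaque blob.
        sess.Values["user"] = "alice"
        _ = sess.Save(r, w)
    })

    _ = http.ListenAndServe(":8080", nil)
}
\end{verbatim}
\end{varwidth}
\caption{Sketch of signed and encrypted session cookies using gorilla/sessions}
\label{fig:cookiesketch}
\end{figure}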
Another aspect that contributes to data integrity from a different point of
view is utilising database \emph{transactions} for bundling together multiple
database operations that collectively change the \emph{state}. In
transactional jargon, the data is only \emph{committed} if each individual
change was successful; in case of any errors, the database is instructed to
perform an atomic \emph{rollback}, which brings it back to the state before
the changes were ever attempted.
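With \texttt{ent} (see Table~\ref{tab:depsversionmx}), this pattern can be
captured by a small helper along the lines of the sketch below, which follows
the pattern recommended in ent's documentation (the function name and
surrounding code are illustrative): the wrapped function's changes are
committed only if it returns no error and rolled back otherwise.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
// WithTx runs fn inside a transaction, committing on success and
// rolling back on any error, so that partial changes never reach
// the database.
func WithTx(ctx context.Context, client *ent.Client,
    fn func(tx *ent.Tx) error,
) error {
    tx, err := client.Tx(ctx)
    if err != nil {
        return err
    }

    if err := fn(tx); err != nil {
        // Surface both errors if the rollback itself fails, too.
        if rerr := tx.Rollback(); rerr != nil {
            err = fmt.Errorf("%w: rolling back transaction: %v", err, rerr)
        }
        return err
    }

    return tx.Commit()
}
\end{verbatim}
\end{varwidth}
\caption{Transaction helper sketch following the pattern documented by ent}
\label{fig:enttxsketch}
\end{figure}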
The author has additionally considered the thought of utilising an embedded
immutable database like immudb (\url{https://immudb.io}) for record keeping
(verifiably storing data change history) and additional data integrity checks,
e.g.\ for tamper protection purposes and similar, however, that work remains
yet to be materialised.
\n{2}{Transport security}
Users connecting to the application should rightfully expect their data to be
protected \textit{in transit} (i.e.\ on the way between their browser and the
server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTP\emph{S}. It utilises the primitives of asymmetric cryptography to
first negotiate a key exchange and then encrypt/decrypt the data using the
negotiated symmetric key.
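On the server side, this translates to either terminating TLS at a reverse
proxy placed in front of the application or letting the application itself
serve HTTPS; with Echo, the latter amounts to starting the server with a
certificate and key, or using its Auto TLS (Let's Encrypt) support. The
following is a simplified sketch, not the project's actual start-up code, with
the mapping of the schema's \texttt{AutoTLS}, \texttt{TLSCertKeyPath} and
\texttt{TLSKeyPath} fields assumed.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
// serve is a simplified start-up sketch; HTTPConfig is a hypothetical
// struct mirroring the HTTP portion of the configuration schema.
func serve(e *echo.Echo, addr string, cfg HTTPConfig) error {
    if cfg.AutoTLS {
        // Certificates are obtained and renewed automatically
        // (Let's Encrypt).
        return e.StartAutoTLS(addr)
    }
    if cfg.Secure {
        // Serve HTTPS with an operator-supplied certificate and key.
        return e.StartTLS(addr, cfg.TLSCertKeyPath, cfg.TLSKeyPath)
    }
    // Plain HTTP, e.g. behind a TLS-terminating reverse proxy.
    return e.Start(addr)
}
\end{verbatim}
\end{varwidth}
\caption{Simplified sketch of serving HTTPS with Echo based on configuration}
\label{fig:tlssketch}
\end{figure}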
\n{2}{User isolation}
Users are allowed into certain parts of the application based on the role they
currently possess. For a start, two basic roles were envisioned:
\begin{itemize}
\item Administrator
\item User
\end{itemize}
Each role is only able to perform actions explicitly assigned to it, and while
there definitely is a certain overlap between the capabilities of the two
outlined roles, each also possesses unique features that the other does not.
For example, the administrator role is not able to perform HIBP API compromise
checks on behalf of users, and in-application administrators are not able to
view sensitive (any) user data.
Administrators are only able to perform the following actions:
\begin{itemize}
\item Create user accounts
\item View list of users
\item View user email
\item Change user email
\item Toggle whether user is an administrator
\item Delete user accounts
\end{itemize}
\n{2}{Least-privilege principle}
Every role only has access to what it absolutely needs for functioning.
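In practice, this maps onto route groups guarded by middleware. The following
sketch (the \texttt{isAdmin} helper and the handler names are illustrative,
not the project's actual code) shows how administrator-only endpoints can be
fenced off in Echo.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
// requireAdmin is an illustrative middleware rejecting requests whose
// session does not carry the administrator role.
func requireAdmin(next echo.HandlerFunc) echo.HandlerFunc {
    return func(c echo.Context) error {
        // isAdmin: hypothetical session/role lookup.
        if !isAdmin(c) {
            return echo.NewHTTPError(http.StatusForbidden)
        }
        return next(c)
    }
}

// registerAdminRoutes mounts administrator-only endpoints in their own
// group; regular users never reach the handlers behind it.
func registerAdminRoutes(e *echo.Echo) {
    admin := e.Group("/admin", requireAdmin)
    admin.GET("/users", listUsers)
    admin.POST("/users", createUser)
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative role-gated route group in Echo}
\label{fig:rolegroupsketch}
\end{figure}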
\n{2}{Zero trust principle}
There will be no way for the application to access the user data, since it
will be encrypted with a key whose passphrase only the user knows.
Consequently, both the application operators and the in-application
administrators will never be able to learn the details of what the user is
tracking, the same being applicable even to potential attackers with direct
access to the database.
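One possible realisation of this scheme, sketched below purely for
illustration (it is not the final implementation; the standard-library
\texttt{crypto} packages and \texttt{golang.org/x/crypto/argon2} are assumed),
derives an encryption key from the user's passphrase using Argon2id and seals
the tracked data with AES-256-GCM; the server would then only ever store the
salt, the nonce and the ciphertext.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
// encryptForUser derives a key from the user's passphrase (Argon2id)
// and seals the plaintext with AES-256-GCM. Illustrative only.
func encryptForUser(passphrase, plaintext []byte) (
    salt, nonce, ciphertext []byte, err error,
) {
    salt = make([]byte, 16)
    if _, err = rand.Read(salt); err != nil {
        return
    }

    // A 32-byte key selects AES-256; the Argon2id parameters here
    // are illustrative, not tuned values.
    key := argon2.IDKey(passphrase, salt, 1, 64*1024, 4, 32)

    block, err := aes.NewCipher(key)
    if err != nil {
        return
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return
    }

    nonce = make([]byte, gcm.NonceSize())
    if _, err = rand.Read(nonce); err != nil {
        return
    }

    // Only salt, nonce and ciphertext are stored; without the
    // passphrase, the key cannot be reconstructed.
    ciphertext = gcm.Seal(nil, nonce, plaintext, nil)
    return
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative passphrase-derived encryption using Argon2id and
AES-256-GCM}
\label{fig:zerotrustsketch}
\end{figure}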
\n{1}{Implementation}
\n{2}{Compromise checking}
\n{3}{Have I Been Pwned? Integration}
TODO
\n{3}{Local Dataset Plugin}
Breach data from locally available datasets can be imported into the
application by first making sure it adheres to the specified schema (have a
look at the \emph{breach data schema} in Figure~\ref{fig:breachDataGoSchema}).
If it does not (which is very likely with random breach data), it needs to be
converted into a conforming form before being imported into the application,
e.g.\ using a simple Python script or similar. Attempting to import data that
does not follow the outlined schema would result in an error.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
type breachDataSchema struct {
    Name              string
    Time              time.Time
    IsVerified        bool
    ContainsPasswords bool
    ContainsHashes    bool
    HashType          string
    HashSalted        bool
    HashPeppered      bool
    ContainsUsernames bool
    ContainsEmails    bool
    Data              any
}
\end{verbatim}
\end{varwidth}
\caption{Breach Data Schema represented as a Go struct (imports from the
standard library are assumed)}
\label{fig:breachDataGoSchema}
\end{figure}
The above Go representation will in actuality be written and supplied by the
user of the program as a YAML document. YAML was chosen for multiple reasons:
relative ease of use (plain text, readable, can include comments, most of the
inputs are implicitly typed as strings), and its capability to store multiple
\emph{documents} inside of a single file. That should allow for documents
similar to what can be seen in Figure~\ref{fig:breachDataYAMLSchema}.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
---
name: Horrible breach
time: 2022-04-23T00:00:00+02:00
isVerified: false
containsPasswds: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    -
    - email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{verbatim}
\end{varwidth}
\caption{Example Breach Data Schema supplied to the program as a YAML file, optionally
containing multiple documents}
\label{fig:breachDataYAMLSchema}
\end{figure}
Notice how the emails list in Figure~\ref{fig:breachDataYAMLSchema} is missing
one record, perhaps because it was not supplied. This is a valid scenario and
the application needs to be able to handle it. The alternative would be to
require the user to prepare the data in such a way that empty/partial records
are dropped entirely.
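On the implementation side, reading such a multi-document file can be done by
looping a YAML decoder until it reports the end of the input. The sketch below
uses \texttt{gopkg.in/yaml.v3} for illustration (which may differ from the
library ultimately used) and is also the place where partially filled records,
such as the missing email above, surface as zero values that the import code
has to tolerate or skip.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
// loadBreaches is a sketch of decoding every document of a
// multi-document YAML stream into the breach data schema.
func loadBreaches(r io.Reader) ([]breachDataSchema, error) {
    var breaches []breachDataSchema

    dec := yaml.NewDecoder(r)
    for {
        var b breachDataSchema
        if err := dec.Decode(&b); err != nil {
            if errors.Is(err, io.EOF) {
                break // no more documents in the stream.
            }
            return nil, err
        }
        // Empty or partial entries (e.g. a missing email) arrive
        // as zero values and are handled or skipped during import.
        breaches = append(breaches, b)
    }

    return breaches, nil
}
\end{verbatim}
\end{varwidth}
\caption{Sketch of decoding a multi-document YAML breach data file}
\label{fig:yamldecodesketch}
\end{figure}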
\n{2}{Best practices}
Describe built-in fail-overs and fallback mechanisms (request retries),
collecting exposed app metrics and alerting based on their status.
Observability: in-app tracing collected by e.g.\ Jaeger
(\url{https://jaegertracing.io}) or OpenTelemetry
(\url{https://opentelemetry.io/}) for superior performance insights on CPU,
storage, network latency and jitter.
\n{2}{Database configuration}
Describe tuning for performance while also properly securing the DB.
The plan is to have 2 databases: one for raw data (\url{https://postgresql.org})
and another one for checksum validation (\url{https://immudb.io/}).
\n{2}{Deployment recommendations}
Describe in detail the following:
\begin{itemize}
\item behind a reverse proxy (TLS termination) or inside of a private network
\item reasonable host security
\end{itemize}
\n{3}{Containerisation}
Whether the pre-built or a custom container image is used to deploy the
application, it still needs access to secrets, such as the database connection
string (containing the database host, port, user, password/encrypted password,
authentication method and database name).
Currently, the application is able to handle \emph{peer},
\emph{scram-sha-256}, \emph{user name maps} and raw \emph{password} as
Postgres authentication methods~\cite{pgauthmethods}, although the
\emph{password} option should not be used in production \emph{unless} the
connection to the database is protected by TLS.\ In any case, using the
\emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable, and one
way to verify in a development environment that everything works as intended
is the \emph{Password generator for PostgreSQL} tool~\cite{goscramsha256},
which allows one to obtain the encrypted string from a raw user input.
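For completeness, the sketch below shows what opening the database connection
could look like from Go when the connection string is supplied via an
environment variable; the driver (\texttt{github.com/lib/pq}) and the variable
name are illustrative choices, and \texttt{sslmode} should be set
appropriately whenever the \emph{password} method is used over an untrusted
network.
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "database/sql"
    "log"
    "os"

    _ "github.com/lib/pq" // illustrative driver choice
)

func main() {
    // e.g. "host=db user=pcmt password=... dbname=pcmt sslmode=verify-full"
    dsn := os.Getenv("PCMT_DB_DSN") // hypothetical variable name

    db, err := sql.Open("postgres", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // sql.Open alone does not connect; Ping actually reaches the server.
    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative database connection set-up using a DSN from the
environment}
\label{fig:dbconnsketch}
\end{figure}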
If the application running in a container wants to use the \emph{peer}
authentication method, it is up to the operator to supply the Postgres socket
to the application (e.g.\ as a volume bind mount). This scenario was not
tested, however, and the author is also not entirely certain how \emph{user
namespaces} (on GNU/Linux) would influence the process (given that the
\emph{ID}s of a user \textbf{outside} the container are mapped to a range of
\emph{UIDs} \textbf{inside} the container), for which the setup would likely
need to account.
Equally, if the application is running inside the container, the operator needs
to make sure that the database is either running in a network that is also
directly attached to the container or that there is a mechanism in place that
routes the requests for the database hostname to the destination.
One such mechanism is container name based routing inside \emph{pods}
(Podman/Kubernetes), where the resolution of container names is the
responsibility of a specially configured piece of software called Aardvark for
the former and CoreDNS for the latter.
\n{1}{Validation}
\n{2}{Unit tests}
\n{2}{Integration tests}
\n{2}{Click-ops}
% =========================================================================== %
\nn{Conclusion}
% =========================================================================== %