% =========================================================================== %
\part{Practical part}
\n{1}{Introduction}
A part of the task of this thesis was to build an actual application, which was
named Password Compromise Monitoring Tool, or \texttt{pcmt} for short.
Therefore, the development process, the general tools and practices, as well as
the specific outcome, are all described in the following sections. A whole
section is dedicated to application architecture, in which the relevant
engineering choices are justified and the motives behind the decisions are
explained. This part then flows into recommendations for a production
deployment and concludes by describing the validation methods chosen and used
to ensure correctness and stability of the program.
\n{2}{Kudos}
The program that has been developed as part of this thesis made use of a
great deal of free (as in \textit{freedom}) and open-source software in the
process, either directly or as an outstanding work tool, and the author would
like to take this opportunity to recognise that fact\footnotemark{}.
\footnotetext{\textbf{Disclaimer:} the author is not affiliated with any of the
projects mentioned on this page.}
In particular, the author acknowledges that this work would not be the same
without:
\begin{itemize}
\item vim (\url{https://www.vim.org/})
\item Arch Linux (\url{https://archlinux.org/})
\item ZSH (\url{https://www.zsh.org/})
\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
\item Nix (\url{https://nixos.org/explore.html})
\item pre-commit (\url{https://pre-commit.com/})
\item Podman (\url{https://podman.io/})
\item Go (\url{https://go.dev/})
\end{itemize}
All the code was typed into VIM, the shell used was ZSH, and the terminal
emulator of choice was \texttt{kitty}. The development machines ran a recent
installation of \textit{Arch Linux}\footnotemark{} and Fedora 38, both using a
\texttt{6.\{2,3,4\}.x} XanMod variant of the Linux kernel.
\footnotetext{(by the way) \url{https://i.redd.it/mfrfqy66ey311.jpg}.}
\n{1}{Development}
The source code of the project has been versioned since the start, using the
popular and industry-standard git (\url{https://git-scm.com}) source code
management (SCM) tool. Commits were made frequently and, where at all possible,
consisted of small and self-contained changes, striving for sane commit message
\emph{hygiene}, i.e.\ meaningful and well-formatted commit messages. The name
of the default branch is \texttt{development}, since that is what the author
likes to choose for new projects that are not yet stable (it is in fact the
default in the author's \texttt{.gitconfig}).
\n{2}{Commit signing}
Since git allows cryptographically \emph{signing} all commits, it would be
unwise not to take advantage of this. For the longest time, GPG was the only
method available for signing commits in git; however, that is no longer the
case~\cite{agwagitssh}. These days, it is also possible to both sign and
verify one's git commits (and tags!) using SSH keys, namely those produced by
OpenSSH, which \emph{can} be the same ones used to log in to remote
systems. The author has, of course, not reused the key pairs that are used
to connect to machines for signing commits. Instead, dedicated \texttt{Ed25519}
elliptic curve key pairs have been used specifically for signing. Public
components of these keys are enclosed in this thesis as
Appendix~\ref{appendix:signingkeys} for future reference.
The validity of a signature on a particular commit can be viewed with git using
the following commands (the \% sign denotes the shell prompt):
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Verifying the signature of a git commit},
label=gitverif, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
% cd <cloned project dir>
% git show --show-signature <commit>
% # alternatively:
% git verify-commit <commit>
\end{lstlisting}
\vspace*{-\baselineskip}
There is one caveat to this, though: git first needs some additional
configuration for the code in Listing~\ref{gitverif} to work as one would
expect. Namely, the public key used to verify the signature needs to be
stored in git's ``allowed signers file'', git then needs to be told where that
file is located using the configuration value
\texttt{gpg.ssh.allowedsignersfile}, and finally the configuration value of the
\texttt{gpg.format} field needs to be set to \texttt{ssh}. Luckily, because
git also allows the configuration values to be local to each repository, all
of the mentioned issues can be solved by running the following commands from
inside the cloned repository:
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Prepare allowed signers file and signature format for git},
label=gitsshprep, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key.
% cat > ./.tmp-allowed_signers \
<<<'surtur <insert literal surtur pubkey>
leo <insert literal leo pubkey>'
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./.tmp-allowed_signers
\end{lstlisting}
\vspace*{-\baselineskip}
After the code in Listing~\ref{gitsshprep} is run, everything from
Listing~\ref{gitverif} should remain applicable for the lifetime of the
repository, or until git changes its implementation of signature verification.
The git \texttt{user.name} that can be seen on the commits in the
\textbf{Author} field is named after the machine that was used to develop the
program, since the author uses different signing keys on each machine. That way
the committer machine can be determined post-hoc.
For future reference, git was used at versions \texttt{2.4\{0,1,2\}.x}.
\n{2}{Continuous Integration}
To increase both the author's and the public's confidence in the atomic changes
made over time, an effort was made to thoroughly \emph{integrate} them using a
continuous integration (CI) service that was plugged into the main source code
repository from the early stages of development. This, of course, was again
self-hosted, including the workers. The tool of choice was Drone
(\url{https://drone.io}), and the ``docker'' runner (in fact it runs any OCI
container) was used to run the builds.
The way this runner works is that it creates an ephemeral container for every
pipeline step and executes given \emph{commands} inside of it. At the end of
each step, the container is discarded while the repository clone, which is
mounted into each container's \texttt{/drone/src}, is persisted between steps,
allowing it to be cloned from \emph{origin} only at the start of the pipeline
and then shared for all the following steps, saving bandwidth, time and disk
writes.
The entire configuration used to run the pipelines can be found in a file named
\texttt{.drone.yml} at the root of the main source code repository. The
workflow consists of four pipelines, which are run in parallel. Two main
pipelines are defined to build the frontend assets, the \texttt{pcmt} binary
and run tests on \texttt{x86\_64} GNU/Linux targets, one for each of Alpine
(version 3.1\{7,8\}) and Arch. These two pipelines are identical apart from
OS-specific bits such as installing a certain package, etc. For the record,
other OS-architecture combinations were not tested.
A third pipeline contains instructions to build a popular static analysis tool
called \texttt{golangci-lint}, which is a sort of meta-linter, bundling a
staggering number of linters (a linter is a tool that performs static code
analysis and can raise awareness of programming errors, flag potentially buggy
code constructs, or \emph{mere} stylistic errors), from source, and then to
perform the analysis of the project's codebase using the freshly built binary.
If this step is successful, a handful of code analysis services get pinged in
the next steps to take notice of the changes to the project's source code and
update their metrics. Details can be found in the main Drone configuration
file \texttt{.drone.yml}, and the configuration for the \texttt{golangci-lint}
tool itself (such as which linters are enabled/disabled and their
configurations) can be found in the root of the repository in the file named
\texttt{.golangci.yml}.
The fourth pipeline focuses on linting the \texttt{Containerfile}, building
the container and pushing it to a public container registry, although the
latter action is only performed on feature branches, \emph{pull request} or
\emph{tag} events.
\obr{Drone CI median build
time}{fig:drone-median-build}{.84}{graphics/drone-median-build}
The median build time as of writing was 1 minute, which includes running all
four pipelines, and that is acceptable. Build times might of course vary
depending on the hardware; for reference, these builds were run on a machine
equipped with a Zen 3 Ryzen 5 5600 CPU at nominal clock speeds, DDR4 RAM @ 3200
MHz, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS) and
a 600 Mbps downlink, software-wise running Arch with an author-flavoured Xanmod
kernel version 6.\{2,3,4\}.x.
\n{2}{Source code repositories}\label{sec:repos}
The git repository containing source code of the \texttt{pcmt} project:\\
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.\\
The git repository hosting the \texttt{pcmt} configuration schema:\\
\url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.\\
The repository containing the \LaTeX{} source code of this thesis:\\
\url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.
All the pertaining source code was published in repositories on a publicly
available git server operated by the author, the reasoning \emph{pro}
self-hosting being that it is the preferred way of guaranteeing autonomy over
one's source code, as opposed to large silos owned by big corporations, which
have a track record of arguably not always deciding with users' best interests
in mind (although recourse has been observed~\cite{ytdl}). When these providers
act on impulse or under public pressure, they can potentially (at least
temporarily) disrupt the operations of their users. Thus, they not only hold
their users to lengthy \emph{terms of service} that \emph{are subject to change
at any given moment}, but also subject them to outside factors beyond their
control. Granted, decentralisation can take a toll on discoverability of the
project, but that is only a concern if rapid market penetration is a goal, not
when aiming for an organically grown community.
\n{2}{Toolchain}
Throughout the creation of this work, the \emph{then-current} version of the Go
programming language was used, i.e. \texttt{go1.20}.
To read more on why Go was chosen in particular, see
Appendix~\ref{appendix:whygo}. Equally, Nix and Nix-based tools such as
\texttt{devenv} have also aided heavily during development, more on those is
written in Appendix~\ref{appendix:whynix}.
\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
\textbf{Tool/Library} & \textbf{Usage} \\
Go programming language & program core \\
Dhall configuration language & program configuration \\
Echo & HTTP handlers, controllers \\
ent & ORM using graph-based modelling \\
pq & Pure-Go Postgres drivers \\
bluemonday & sanitising HTML \\
TailwindCSS & utility-first approach to Cascading Style Sheets \\
PostgreSQL & persistent data storage \\
}
Table~\ref{tab:depsversionmx} contains the names and versions of the most
important libraries and supporting software that were used to build the
application.
\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
\textbf{Name} & \textbf{version} \\
\texttt{echo} (\url{https://echo.labstack.com/}) & 4.11.1 \\
\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2\\
\texttt{ent} (\url{https://entgo.io/}) & 0.12.3 \\
\texttt{pq} (\url{https://github.com/lib/pq/}) & 1.10.9 \\
\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.25 \\
\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
\texttt{PostgreSQL} (\url{https://www.postgresql.org/}) & 15.3 \\
}
Additionally, the dependency-version mapping for the Go program can be inferred
from looking at the \texttt{go.mod}'s first \textit{require} block at any point
in time. The same can be achieved for the \emph{frontend} by glancing at the
\texttt{package-lock.json} file.
\n{1}{Application architecture}
The application is written in Go and uses \textit{gomodules}. The full name of
the module is \texttt{git.dotya.ml/mirre-mt/pcmt}.
\obr{Application class diagram}{fig:classdiagram}{.79}{graphics/pcmt-class-diagram.pdf}
\n{2}{Package structure}
The source code of the module is organised into smaller, self-contained Go
\emph{packages} appropriately along a couple of domains: logging, core
application, web routers, configuration and settings, etc. In Go, packages are
delimited by folder structure -- each folder can be a package.
Generally speaking, the program aggregates decision points into central places,
such as \texttt{run.go}, which then imports child packages that facilitate each
of the tasks of loading the configuration, connecting to the database and
running migrations, consolidating flag, environment variable and
configuration-based values into a canonical \emph{settings} \texttt{struct},
setting up web routes, authenticating requests, or handling \texttt{signals}
and performing graceful shutdowns.
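To illustrate the last point, a minimal sketch of signal handling followed by a
graceful shutdown is shown below; it uses only the standard library, and the
names and exact wiring are illustrative rather than lifted from \texttt{run.go}.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of signal handling and graceful shutdown},
label=gracefulShutdownSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Derive a context that gets cancelled on SIGINT/SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(),
		syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: ":3000"}

	go func() {
		if err := srv.ListenAndServe(); err != nil &&
			!errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until a termination signal arrives.
	<-ctx.Done()

	// Give in-flight requests a grace period to finish.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}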
\n{3}{Internal package}
The \texttt{internal} package was not used as of writing, but the author plans
to eventually migrate \emph{internal} logic of the program into the internal
package to prevent accidental imports.
\n{2}{Logging}
The program uses \emph{dependency injection} to share a single logger instance
(the same technique is also used to share the database client). This logger is
then passed around as a pointer, so that the underlying data stays the same or
is modified concurrently for all consumers. As a rule of thumb throughout the
application, every larger \texttt{struct} that needs to be passed around is
passed around as a pointer.
An experimental (note: not anymore, with \texttt{go1.21} it was brought into
Go's \textit{stdlib}) library for \textit{structured} logging \texttt{slog} was
used to facilitate every logging need that the program might have. It supports
both JSON and plain-text logging, which was made configurable by the program.
Either a configuration file value or an environment variable can be used to set
this.
There are four log levels available by default (\texttt{DEBUG}, \texttt{INFO},
\texttt{WARNING}, \texttt{ERROR}) and the pertinent library functions are
variadic. The first parameter of type \texttt{string} is the main message,
which is supplied as a \emph{value} to the \emph{key} named appropriately
`\texttt{msg}', a feature of structured loggers that can later be used for
filtering. Any other parameters need to be supplied in pairs, serving as key
and value, respectively.
This main \texttt{slog} interface has been extended in package
\texttt{slogging} to also provide the formatting functionality of the
\texttt{fmt} standard library package. This was achieved by directly embedding
\texttt{slog.Logger} in a custom \texttt{struct} type named \texttt{Slogger}
and implementing the additional methods on the custom type. The new type that
embeds the original \texttt{slog.Logger} gets to keep its methods thanks to
Go's support for composition via struct embedding. Thus, common formatting
directives like the one seen in Listing~\ref{goFmtExpression} are now supported
by the custom logger, in addition to anything the base \texttt{slog.Logger}
offers.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Example formatting expression supplied
to the logger}, label=goFmtExpression, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg},
otherkeywords={\%s, \%q, \%v},
]
slogger.Debugf("operation %q for user %q completed at %s", op, usr.ID, time.Now())
\end{lstlisting}
Furthermore, functionality was added to support changing the log level at
runtime, which is a convenient feature in certain situations.
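Purely for illustration, a trimmed-down sketch of the embedding approach and the
runtime level switch follows (shown against the \texttt{log/slog} import path
the library ended up under in \texttt{go1.21}; the actual \texttt{slogging}
package differs in detail):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of embedding the structured logger in a custom type},
label=sloggerSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package slogging

import (
	"fmt"
	"log/slog"
	"os"
)

// Slogger embeds the structured logger and adds fmt-style helpers.
type Slogger struct {
	*slog.Logger
	level *slog.LevelVar
}

// New constructs a JSON or plain-text logger, as configured.
func New(json bool) *Slogger {
	lvl := &slog.LevelVar{} // defaults to INFO.

	var h slog.Handler
	if json {
		h = slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: lvl})
	} else {
		h = slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: lvl})
	}

	return &Slogger{Logger: slog.New(h), level: lvl}
}

// Debugf formats the message first, then logs it at DEBUG level.
func (s *Slogger) Debugf(format string, args ...any) {
	s.Debug(fmt.Sprintf(format, args...))
}

// SetLevel changes the minimum logged level at runtime.
func (s *Slogger) SetLevel(l slog.Level) { s.level.Set(l) }
\end{lstlisting}
\vspace*{-\baselineskip}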
\n{2}{Authentication}
The authentication logic is relatively simple and its core has mostly been
isolated into a custom \emph{middleware}. User passwords are hashed using a
secure KDF before ever being sent to the database. The KDF of choice is
\texttt{bcrypt} (with a sane \emph{Cost} of 10), which automatically includes
\emph{salt} for the password and provides ``length-constant'' time hash
comparisons. The author plans to add support for the more modern
\texttt{scrypt} and the state-of-the-art, P-H-C (Password Hashing Competition)
winner algorithm \texttt{Argon2}
(\url{https://github.com/P-H-C/phc-winner-argon2}) for flexibility.
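A minimal sketch of how hashing and verification with \texttt{bcrypt} is
commonly done in Go using \texttt{golang.org/x/crypto/bcrypt} follows; the
helper names are illustrative, not the actual \texttt{pcmt} code.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of bcrypt password hashing and verification},
label=bcryptSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package auth

import "golang.org/x/crypto/bcrypt"

// HashPassword derives a salted bcrypt hash from the plain-text password.
func HashPassword(plain string) (string, error) {
	// A Cost of 10 mirrors the value mentioned above.
	hash, err := bcrypt.GenerateFromPassword([]byte(plain), 10)
	if err != nil {
		return "", err
	}
	return string(hash), nil
}

// CheckPassword compares a stored hash with a candidate password;
// a nil error means the two match.
func CheckPassword(hash, plain string) error {
	return bcrypt.CompareHashAndPassword([]byte(hash), []byte(plain))
}
\end{lstlisting}
\vspace*{-\baselineskip}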
\n{2}{SQLi prevention}
No raw SQL queries are directly used to access the database, thus decreasing
the likelihood of SQL injection attacks. Instead, parametric queries are
constructed in code using a graph-like API of the \texttt{ent} library, which
is attended to in depth in Section~\ref{sec:dbschema}.
\n{2}{Configurability}
Virtually any important value in the program has been made into a configuration
value, so that the operator can customise the experience as needed. A choice of
sane configuration defaults was attempted, which resulted in the configuration
file essentially only needing to contain secrets, unless there is a need to
override the defaults. It is not entirely a \emph{zero-config} situation,
rather a \emph{minimal-config} one. An example can be seen in
Section~\ref{sec:configuration}.
Certain options deemed important enough (this was largely subjective) were
additionally made into command-line \emph{flags}, using the standard library
package \texttt{flag}. Users wishing to display all available options can
run the program with the \texttt{-help} flag, a courtesy of the mentioned
\texttt{flag} package.
\vspace*{-\baselineskip}
\paragraph{\texttt{-host <hostname/IP>} (string)}{Takes one argument and specifies
the hostname, or the address to listen on.}
\vspace*{-\baselineskip}
\paragraph{\texttt{-port <port number>} (int)}{This flag takes one integer
argument and specifies the port to listen on. The argument is validated at
program start-up and the program has a fallback built in for the case that
the supplied value is bogus, such as a string or a number outside the allowed
TCP range $1-65535$.}
\vspace*{-\baselineskip}
\paragraph{\texttt{-printMigration}}{A boolean option that, if set, makes the
program print any \textbf{upcoming} database migrations (based on the current
state of the database) and exit. The connection string environment variable
still needs to be set in order to be able to connect to the database and perform
the schema \emph{diff}. This option is mainly useful during debugging.}
\vspace*{-\baselineskip}
\paragraph{\texttt{-devel}}{This flag instructs the program to enter
\textit{devel mode}, in which all templates are re-parsed and re-executed upon
each request, and the default log verbosity is changed to level
\texttt{DEBUG}. Should not be used in production.}
\vspace*{-\baselineskip}
\paragraph{\texttt{-import <path/to/file>} (string)}{This option tells the program
to perform an import of local breach data into program's main database.
Obviously, the database connection string environment variable also needs to
be present for this. The option takes one argument that is the path to file
formatted according to the \texttt{ImportSchema} (consult
Listing~\ref{breachImportSchema}). The program prints the result of the import
operation, indicating success or failure, and exits.}
\vspace*{-\baselineskip}
\paragraph{\texttt{-version}}{As could probably be inferred from its name, this
flag makes the program print its own version (embedded into the binary at build
time) and exit. A release binary would print something akin to a
\emph{semantic versioning}-compliant git tag string, while a development binary
might simply print the truncated commit ID (consult \texttt{Containerfile} and
\texttt{justfile}) of the sources used to build it.}
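For illustration, defining and parsing a couple of the above options with the
standard \texttt{flag} package could be sketched roughly as follows (simplified,
not the actual \texttt{pcmt} source):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of command-line flag handling},
label=flagSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package main

import (
	"flag"
	"fmt"
)

// version is embedded at build time, e.g. via -ldflags "-X main.version=...".
var version = "devel"

func main() {
	host := flag.String("host", "localhost", "hostname or address to listen on")
	port := flag.Int("port", 3000, "port to listen on")
	devel := flag.Bool("devel", false, "enable devel mode")
	printVersion := flag.Bool("version", false, "print version and exit")

	flag.Parse()

	if *printVersion {
		fmt.Println(version)
		return
	}

	// Fall back to a sane default when the supplied port is out of range.
	if *port < 1 || *port > 65535 {
		*port = 3000
	}

	fmt.Printf("would listen on %s:%d (devel mode: %v)\n", *host, *port, *devel)
}
\end{lstlisting}
\vspace*{-\baselineskip}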
\n{2}{Embedded assets}
An important thing to mention is embedded assets and templates. Go has multiple
mechanisms to natively embed arbitrary files directly into the binary during
the regular build process. \texttt{embed.FS} from the standard library
\texttt{embed} package was used to bundle all template files and web assets,
such as images, logos and stylesheets at the module level. These are then
passed around the program as needed, such as to the \texttt{handlers} package.
There is also a toggle in the application configuration (\texttt{LiveMode}),
which instructs the program at start-up to either rely entirely on embedded
assets, or pull live template and asset files from the filesystem. The former
option makes the application more portable as it is wholly self-contained, while
the latter allows for flexibility and customisation not only during
development. Where the program looks for assets and templates in \emph{live
mode} is determined by two further configuration options: \texttt{assetsPath}
and \texttt{templatePath}.
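A condensed sketch of the embedding mechanism and the \texttt{LiveMode} toggle
could look as follows (the package, variable and function names are
illustrative):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of embedded assets with a live-mode toggle},
label=embedSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package assets

import (
	"embed"
	"io/fs"
	"os"
)

// The go:embed directive bundles both directories into the binary at
// build time (the paths must exist relative to this package).
//
//go:embed assets templates
var embeddedFS embed.FS

// Select returns the filesystem the rest of the program should use:
// either the embedded copy, or a live view of the directory on disk.
func Select(liveMode bool, appPath string) fs.FS {
	if liveMode {
		return os.DirFS(appPath)
	}
	return embeddedFS
}
\end{lstlisting}
\vspace*{-\baselineskip}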
\n{2}{Composability}
The core templating functionality was provided by the \texttt{html/template} Go
standard library package. Echo's \texttt{Renderer} interface has been
implemented, so that template rendering could be performed directly using
Echo's built-in facilities in a more ergonomic manner using \texttt{return
c.Render(http.StatusOK, "home.tmpl", data)}.
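Echo's \texttt{Renderer} is a single-method interface; a minimal implementation
backed by \texttt{html/template} might look like the following sketch (not the
project's exact type):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of an Echo Renderer backed by html/template},
label=rendererSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package handlers

import (
	"html/template"
	"io"

	"github.com/labstack/echo/v4"
)

// TemplateRenderer satisfies echo.Renderer using html/template.
type TemplateRenderer struct {
	templates *template.Template
}

// Render executes the named template with the supplied data.
func (t *TemplateRenderer) Render(w io.Writer, name string, data interface{}, c echo.Context) error {
	return t.templates.ExecuteTemplate(w, name, data)
}

// Wiring it up on the Echo instance, e.g.:
//   e.Renderer = &TemplateRenderer{
//       templates: template.Must(template.ParseGlob("templates/*.tmpl")),
//   }
\end{lstlisting}
\vspace*{-\baselineskip}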
\vspace{\parskip}
\begin{lstlisting}[
caption={Conditionally enabling functionality inside a Go template based on user access level},
label=tmplConditionals, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg},
morekeywords={if,and,end},
]
{{ if and .User .User.IsLoggedIn .User.IsAdmin }}
...
{{ end }}
\end{lstlisting}
Templates used for rendering of the web pages were created in a composable
manner, split into smaller, reusable parts, such as \texttt{footer.tmpl} and
\texttt{head.tmpl}. Those could then be included e.g.\ using \texttt{\{\{
template "footer.tmpl" \}\}}. Specific functionality is conditionally
executed based on the determined level of access of the user, see
Listing~\ref{tmplConditionals} for reference.
A popular HTML sanitiser \texttt{bluemonday} has been employed to aid with
battling XSS. The program first runs every template through the sanitiser
before rendering it, so that any user-controlled inputs are handled safely.
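The sanitiser itself is straightforward to use; the following sketch shows the
general pattern (the policy actually configured in \texttt{pcmt} may differ):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of HTML sanitisation with bluemonday},
label=bluemondaySketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package handlers

import "github.com/microcosm-cc/bluemonday"

// sanitise strips any markup not allowed by the UGC policy,
// neutralising e.g. injected <script> elements.
func sanitise(untrusted string) string {
	p := bluemonday.UGCPolicy()
	return p.Sanitize(untrusted)
}
\end{lstlisting}
\vspace*{-\baselineskip}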
A dynamic web application should include a CSP configuration. The program
therefore has the ability to calculate the hashes (SHA256/SHA384) of its assets
(scripts, images) on the fly and is able to use them inside the templates.
This unlocks potentially using third-party assets without opening up the CSP
with directives like \texttt{script-src 'unsafe-hashes'}. It also means that
there is no need to maintain a set of customised \texttt{head} templates with
pre-computed hashes next to script sources, since the application can perform
the necessary calculations in the user's stead.
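Computing a CSP-compatible digest boils down to hashing the asset's bytes and
base64-encoding the raw digest; a small sketch of the idea follows (the helper
name is made up for illustration):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of computing a CSP hash source for an asset},
label=cspHashSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package handlers

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// cspHash returns a hash source usable inside a Content-Security-Policy
// directive, e.g. script-src 'sha256-...'.
func cspHash(asset []byte) string {
	sum := sha256.Sum256(asset)
	return fmt.Sprintf("'sha256-%s'", base64.StdEncoding.EncodeToString(sum[:]))
}
\end{lstlisting}
\vspace*{-\baselineskip}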
\n{2}{Server-side rendering}
The application constructs the web pages \emph{entirely} on the server side,
and it runs without a single line of JavaScript, of which the author is
especially proud. It improves load times, decreases the attack surface,
increases maintainability and reduces cognitive load that is required when
dealing with JavaScript. Of course, that requires extensive usage of
non-semantic \texttt{POST} requests in web forms even for data \emph{updates}
(where HTTP \texttt{PUT}s should be used) and the accompanying frequent
full-page refreshes, but that still is not enough to warrant the use of
JavaScript.
\n{2}{Frontend}
Frontend-wise, Tailwind was used for the application's CSS. It promotes the
usage of flexible \emph{utility-first} classes in the HTML markup instead of
separating out styles from content. Understandably, this is somewhat of a
preference issue and the author does not hold hard opinions in either
direction; however, it has to be noted that this approach empirically allows
for rather quick UI prototyping. Tailwind was chosen for its reasonably
detailed documentation and built-in support for dark/light mode, and
partially also because it \emph{looks} nice.
The Go templates containing the CSS classes need to be parsed by Tailwind in
order to produce the final stylesheet that can be bundled with the application.
The upstream provides an official CLI tool (\texttt{tailwindcss}), which can be
used exactly for that purpose. Simple and accessible layouts were generally
preferred; a single page was rather split into multiple pages when it became
convoluted. Data-backed efforts were made to create reasonably contrasting
pages.
\n{3}{Frontend experiments}
As an aside, the author has briefly experimented with WebAssembly to provide
client-side dynamic functionality for this project, but has ultimately scrapped
it in favour of the entirely server-side rendered approach. It is possible that
it would get revisited in the future if necessary. Even from the short
experiments it was obvious how much faster WebAssembly was when compared to
JavaScript.
% \newpage
\n{2}{User isolation}
\obr{Application use case diagram}{fig:usecasediagram}{.9}{graphics/pcmt-use-case.pdf}
Users are only allowed into specific parts of the application based on the role
they currently possess (Role-based Access Control).
While this short list might get amended in the future, initially only two basic
roles were envisioned:
\begin{itemize}
\item Administrator
\item User
\end{itemize}
It is paramount that the program protects itself from insider threats as
well, and therefore each role is only able to perform actions that it is
explicitly assigned. While there definitely is a certain overlap between the
capabilities of the two outlined roles, each also possesses unique features
that the other one does not.
For instance, the administrator role is not able to perform breach data
searches directly; for that, a separate \emph{user} account has to be devised.
Similarly, a regular user is not able to manage breach lists and other users,
because that is a privileged operation.
In-application administrators are not able to view (any) sensitive user data
and should therefore only be able to perform the following actions:
\begin{itemize}
\item Create user accounts
\item View user listing
\item View user details
\item Change user details, including administrative status
\item Delete user accounts
\item Refresh breach data from online sources
\end{itemize}
Let us consider a case when a user performs an operation on their own account.
While demoting from administrator to a regular user should be permitted,
promoting self to be an administrator would constitute a \emph{privilege
escalation} and likely be a precursor to at least a \emph{denial of service} of
sorts, as there would be nothing preventing the newly-\emph{admined} user from
disabling the accounts of all other administrators.
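In code, such a role check can be condensed into a small piece of Echo
middleware; the following sketch is purely illustrative (the \texttt{User} type
and the way it is stored in the request context are assumptions):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of a role-checking middleware},
label=rbacSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package handlers

import (
	"net/http"

	"github.com/labstack/echo/v4"
)

// User is a minimal stand-in for the session-backed user value.
type User struct {
	IsLoggedIn bool
	IsAdmin    bool
}

// RequireAdmin only lets requests of logged-in administrators through.
func RequireAdmin(next echo.HandlerFunc) echo.HandlerFunc {
	return func(c echo.Context) error {
		u, ok := c.Get("user").(*User)
		if !ok || !u.IsLoggedIn {
			return c.Redirect(http.StatusFound, "/signin")
		}
		if !u.IsAdmin {
			return echo.NewHTTPError(http.StatusForbidden)
		}
		return next(c)
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}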
\n{2}{Zero trust principle and confidentiality}
The program only sets generic page titles (Settings, Home, Search) and thus
foregoes disclosing information that would make it into the browser's history.
There is no way for the application (and consequently, the in-application
administrator) to read user's data (such as saved search queries). This is
achieved by virtue of encrypting the pertinent data before saving it in the
database, using the state-of-the-art \texttt{age} tool (backed by
X25519)~\cite{age},~\cite{x25519rfc7748}. The \texttt{age} \emph{identity}
itself is in turn encrypted with a passphrase that only the user controls. Of
course, the user-supplied password is first run through a password-based key
derivation function (\texttt{argon2}, version \emph{id} with the officially
{recommended} configuration parameters) before being allowed to encrypt
\emph{anything}.
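To make the mechanism more tangible, the following sketch outlines generating an
\texttt{age} identity and sealing its textual form with a key derived from the
user passphrase via \texttt{argon2id}; the concrete KDF parameters and the use
of AES-256-GCM for the symmetric step are illustrative assumptions, not a
verbatim excerpt of \texttt{pcmt}.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of sealing an age identity with a passphrase-derived key},
label=ageIdentitySketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily,
backgroundcolor=\color{lstbg}]
package crypto

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"

	"filippo.io/age"
	"golang.org/x/crypto/argon2"
)

// sealIdentity generates a fresh X25519 age identity and encrypts its
// textual form with a key derived from the user's passphrase.
func sealIdentity(passphrase []byte) (ciphertext, salt, nonce []byte, err error) {
	id, err := age.GenerateX25519Identity()
	if err != nil {
		return nil, nil, nil, err
	}

	// Random salt for the KDF; stored alongside the ciphertext.
	salt = make([]byte, 16)
	if _, err = rand.Read(salt); err != nil {
		return nil, nil, nil, err
	}

	// argon2id with commonly recommended parameters (illustrative).
	key := argon2.IDKey(passphrase, salt, 1, 64*1024, 4, 32)

	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, nil, nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, nil, nil, err
	}

	nonce = make([]byte, gcm.NonceSize())
	if _, err = rand.Read(nonce); err != nil {
		return nil, nil, nil, err
	}

	ciphertext = gcm.Seal(nil, nonce, []byte(id.String()), nil)
	return ciphertext, salt, nonce, nil
}
\end{lstlisting}
\vspace*{-\baselineskip}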
The \texttt{age} identity is only generated once the user changes their
password for the first time, in an attempt to prevent scenarios such as an
in-application administrator with physical access to the database being able to
both \textbf{recover} the key from the database and \textbf{decrypt} it, given
that they already know the user's password (because they set it when they
created the user), which would subsequently give them unbounded access to any
future encrypted data, for as long as they were able to maintain their database
access. This is why generating the \texttt{age} identity is bound to the first
password change.
Of course, the supposed evil administrator could simply perform the password
change themselves! However, the user would at least be able to find those
changes in the activity logs and know to \emph{not} use the application under
such circumstances. But given the scenario of a total database compromise, the
author finds that all hope is \emph{already} lost at that point (similar to
physical access to a computer). At least when the database is dumped, it should
only contain non-sensitive, functional information in plain text, everything
else should be encrypted.
Consequently, both the application operators and the in-application
administrators should never be able to learn the details of what the user is
searching for, the same being by extension partly applicable even to potential
attackers with direct access to the database. Thus, the author maintains that a
scenario which could potentially lead to a breach (apart from a compromised
actual password) would have to entail some form of operating memory acquisition
on the machine hosting the application, for instance using
\texttt{LiME}~\cite{lime}, or perhaps directly at the \emph{hypervisor} level,
if considering a virtualised (``cloud'') environment. Alternatively, all but
one (memory acquisition) of the above issues could perhaps be remedied by
simply not storing any user queries, turning off informative logging, and only
letting the program mediate access to the data sources.
\n{1}{Implementation}
\n{2}{Dhall Configuration Schema}\label{sec:configuration}
The configuration schema was at first being developed as part of the main
project's repository, before it was determined that both the development and
overall clarity would benefit from the schema living in its own repository (see
Section~\ref{sec:repos} for details). This enabled the schema to be
independently developed and versioned, and only be pulled into the main
application whenever it was determined to be ready.
% \vspace{\parskip}
\smallskip
% \vspace{\baselineskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration schema version 0.0.1-rc.2},
label=dhallschema, basicstyle=\linespread{0.9}\footnotesize\ttfamily,
backgroundcolor=\color{lstbg},
morekeywords={Text, Natural, Optional, Type}
]
let Schema =
{ Type =
{ Host : Text
, Port : Natural
, HTTP :
{ Domain : Text
, Secure : Bool
, AutoTLS : Bool
, TLSKeyPath : Text
, TLSCertKeyPath : Text
, HSTSMaxAge : Natural
, ContentSecurityPolicy : Text
, RateLimit : Natural
, Gzip : Natural
, Timeout : Natural
}
, Mailer :
{ Enabled : Bool
, Protocol : Text
, SMTPAddr : Text
, SMTPPort : Natural
, ForceTrustServerCert : Bool
, EnableHELO : Bool
, HELOHostname : Text
, Auth : Text
, From : Text
, User : Text
, Password : Text
, SubjectPrefix : Text
, SendPlainText : Bool
}
, LiveMode : Bool
, DevelMode : Bool
, AppPath : Text
, Session :
{ CookieName : Text
, CookieAuthSecret : Text
, CookieEncrSecret : Text
, MaxAge : Natural
}
, Logger : { JSON : Bool, Fmt : Optional Text }
, Init : { CreateAdmin : Bool, AdminPassword : Text }
, Registration : { Allowed : Bool }
}
}
\end{lstlisting}
\vspace*{-\baselineskip}
Full schema with type annotations can be seen in Listing~\ref{dhallschema}.
\newpage
The \texttt{let} statement declares a variable called \texttt{Schema} and
assigns to it the result of the expression on the right side of the equals
sign, which has for practical reasons been trimmed and is displayed without the
\emph{default} block. The default block is instead shown in its own
Listing~\ref{dhallschemadefaults}.
The main configuration comprises both raw attributes and child records, which
allow for grouping of related functionality. For instance, configuration
settings pertaining mailserver setup are grouped in a record named
\textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool},
which was deemed appropriate for an on-off switch-like functionality, with the
only permissible values being either \emph{True} or \emph{False}.
Do note that in Dhall \texttt{true} $\neq$ \texttt{True}, since internally
\textbf{True} is a \texttt{Bool} constant built directly into Dhall (see ``The
Prelude'' for reference), while \textbf{true} is evaluated as an \emph{unbound}
variable, that is, a variable \emph{not} defined in (and thus not \emph{present}
in) the current \emph{scope}~\cite{dhallprelude}.
Another one of Dhall's specialties is that the `$==$' and `$!=$' (in)equality
operators \textbf{only} work on values of type \texttt{Bool}, which for example
means that variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text}
(\texttt{string}) cannot be compared directly, as is the case in other
languages. That leaves the comparison work either to a higher-level language
(such as Go), or, as promoted by the Dhall authors when the value matters, to
\emph{enums}, i.e.\ deriving a custom \emph{named} type from a primitive type
and comparing \emph{that}.
\newpage
% \vspace{\parskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration defaults for
schema version 0.0.1-rc.2},
label=dhallschemadefaults, basicstyle=\linespread{0.9}\footnotesize\ttfamily,
backgroundcolor=\color{lstbg},
]
, default =
-- | have sane defaults.
{ Host = ""
, Port = 3000
, HTTP =
{ Domain = ""
, Secure = False
, AutoTLS = False
, TLSKeyPath = ""
, TLSCertKeyPath = ""
, HSTSMaxAge = 0
, ContentSecurityPolicy = ""
, RateLimit = 0
, Gzip = 0
, Timeout = 0
}
, Mailer =
{ Enabled = False
, Protocol = "smtps"
, SMTPAddr = ""
, SMTPPort = 465
, ForceTrustServerCert = False
, EnableHELO = False
, HELOHostname = ""
, Auth = ""
, From = ""
, User = ""
, Password = ""
, SubjectPrefix = "pcmt - "
, SendPlainText = True
}
, LiveMode =
-- | LiveMode controls whether the application looks for
-- | directories "assets" and "templates" on the filesystem or
-- | in its bundled Embed.FS.
False
, DevelMode = False
, AppPath =
-- | AppPath specifies where the program looks for "assets" and
-- | "templates" in case LiveMode is True.
"."
, Session =
{ CookieName = "pcmt_session"
, CookieAuthSecret = ""
, CookieEncrSecret = ""
, MaxAge = 3600
}
, Logger = { JSON = True, Fmt = None Text }
, Init =
{ CreateAdmin =
-- | if this is True, attempt to create a user with admin
-- | privileges with the password specified below
False
, AdminPassword =
-- | used for the first admin, forced change on first login.
"50ce50fd0e4f5894d74c4caecb450b00c594681d9397de98ffc0c76af5cff5953eb795f7"
}
, Registration.Allowed = True
}
}
in Schema
\end{lstlisting}
\vspace*{-\baselineskip}
\vspace*{-\baselineskip}
\vspace*{-\baselineskip}
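On the Go side, the evaluated Dhall expression is ultimately decoded into a
plain \texttt{struct}; a trimmed-down sketch using the \texttt{dhall-golang}
library (only a few of the schema's fields are shown and the helper name is
illustrative) might look like this:
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of loading the Dhall configuration in Go},
label=dhallLoadSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package config

import (
	"os"

	"github.com/philandstuff/dhall-golang/v6"
)

// Config mirrors a small subset of the Dhall schema shown above.
type Config struct {
	Host     string
	Port     uint
	LiveMode bool
}

// Load evaluates the Dhall expression in the given file and decodes it
// into the Config struct.
func Load(path string) (*Config, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}

	var c Config
	if err := dhall.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	return &c, nil
}
\end{lstlisting}
\vspace*{-\baselineskip}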
\n{2}{Data integrity and authenticity}
The user can interact with the application via a web client, such as a browser,
and is required to authenticate for all sensitive operations. To not only know
\emph{who} the user is but also make sure they are \emph{permitted} to perform
the action they are attempting, the program employs an \emph{authorisation}
mechanism in the form of sessions. These are on the client side represented by
cryptographically signed and encrypted (using 256-bit AES) HTTP cookies. That
lays foundations for a few things: the data saved into the cookies can be
regarded as private because short of future \emph{quantum computers} only the
program itself can decrypt and access the data, and the data can be trusted
since it is both signed using the key that only the program controls and
\emph{encrypted} with \emph{another} key that equally only the program holds.
The cookie data is only ever written \emph{or} read at the server side,
solidifying the author's decision to let it be encrypted, as there is no point
in not encrypting it for some perceived client-side simplification. Users
navigating the website send their session cookie (if it exists) with
\textbf{every request} to the server, which subsequently verifies the integrity
of the data and, in case it is valid, determines the existence and the amount
of user privilege that should be granted. Public endpoints do not mandate the
presence of a valid session by definition, while at protected endpoints the
user is authenticated on every request. When a session expires, or if there is
no session to begin with, the user is either shown a \emph{Not found} error
message, an \emph{Unauthorised} error message, or redirected to
\texttt{/signin}; this behaviour is not uniform and depends on the resource
and/or the endpoint.
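As an illustration of the signed-and-encrypted cookie concept, the widely used
\texttt{gorilla/securecookie} package implements exactly such a scheme and could
be used roughly as follows; this is a sketch of the idea, not necessarily the
library or names used inside \texttt{pcmt}.
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of signed and encrypted session cookies},
label=secureCookieSketch, basicstyle=\linespread{0.9}\footnotesize\ttfamily,
backgroundcolor=\color{lstbg}]
package session

import (
	"net/http"

	"github.com/gorilla/securecookie"
)

// Two independent secrets: one for the HMAC signature, one (32 bytes)
// for AES-256 encryption. In practice both come from the configuration.
var sc = securecookie.New(
	[]byte("hash-secret-from-configuration--"),
	[]byte("encr-secret-32-bytes-long-------"),
)

// writeSession seals the session data into a cookie.
func writeSession(w http.ResponseWriter, data map[string]string) error {
	encoded, err := sc.Encode("pcmt_session", data)
	if err != nil {
		return err
	}
	http.SetCookie(w, &http.Cookie{
		Name:     "pcmt_session",
		Value:    encoded,
		Path:     "/",
		HttpOnly: true,
		Secure:   true,
	})
	return nil
}

// readSession verifies and decrypts the cookie back into a map.
func readSession(r *http.Request) (map[string]string, error) {
	c, err := r.Cookie("pcmt_session")
	if err != nil {
		return nil, err
	}
	data := map[string]string{}
	err = sc.Decode("pcmt_session", c.Value, &data)
	return data, err
}
\end{lstlisting}
\vspace*{-\baselineskip}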
Another aspect that contributes to data integrity from \emph{another} point of
view is utilising database \emph{transactions} for bundling together multiple
database operations that collectively change the \emph{state}. Using the
transactional jargon, the data is only \emph{committed} if each individual
change was successful. In case of any errors, the database is instructed to
perform an atomic \emph{rollback}, which brings it back to a state before the
changes were ever attempted.
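With \texttt{ent}, this boils down to a commit-or-rollback flow similar to the
following sketch (the generated client, entity and setter names depend on the
concrete schema and are therefore illustrative):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of a database transaction with ent},
label=entTxSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package store

import (
	"context"
	"fmt"

	"git.dotya.ml/mirre-mt/pcmt/ent" // generated package path assumed.
)

// createUserTx bundles several writes into a single transaction:
// either all of them are committed, or none are.
func createUserTx(ctx context.Context, client *ent.Client) error {
	tx, err := client.Tx(ctx)
	if err != nil {
		return fmt.Errorf("starting transaction: %w", err)
	}

	if _, err := tx.User.Create().
		SetUsername("alice").
		SetEmail("alice@example.com").
		Save(ctx); err != nil {
		// Roll back everything attempted so far on any error.
		if rerr := tx.Rollback(); rerr != nil {
			err = fmt.Errorf("%w: rolling back: %v", err, rerr)
		}
		return err
	}

	return tx.Commit()
}
\end{lstlisting}
\vspace*{-\baselineskip}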
The author has additionally considered the thought of utilising an embedded
immutable database like immudb (\url{https://immudb.io}) for record keeping
(verifiably storing data change history) and additional data integrity checks,
e.g.\ for tamper protection purposes and similar; however, that work remains
yet to be materialised.
\n{2}{Database schema}\label{sec:dbschema}
The database schema is not being created by manually typing out SQL statements.
Instead, an Object-relational Mapping (ORM) tool named \texttt{ent} is used,
which allows defining the table schema and relations entirely in Go. The upside
of this approach is that the \emph{entity} types are natively understood by
code editors, and they also get type-checked by the compiler for correctness,
preventing all sorts of headaches and potential bugs.
Since \texttt{ent} encourages the usage of \emph{declarative migrations} at
early stages of a project, it is not required for the database schema to
exist on application start-up in the form of raw SQL (or HCL). Instead,
\texttt{ent} only requires a valid connection string providing reasonably
privileged access to the database, and it handles the database configuration by
auto-generating SQL with the help of the companion embedded library
\texttt{Atlas} (\url{https://atlasgo.io/}). The upstream project (\texttt{ent})
encourages moving to the otherwise more traditional \emph{versioned migrations}
for more mature projects, so that is on the roadmap for later.
The best part about using \texttt{ent} is that there is no need to define
supplemental methods on the models, as with \texttt{ent} these are meant to be
\emph{code generated} (in the older sense of the word, not with Large Language
Models) into existence. Code generation creates files with actual Go models
based on the types of the attributes in the database schema model, and the
respective relations are transformed into methods on the receiver or functions
taking object attributes as arguments.
For instance, if the model's attribute is a string value \texttt{Email}, ent
can be used to generate code that contains methods on the user object like the
following:
\begin{itemize}
\item \texttt{EmailIn(pattern string)}
\item \texttt{EmailEQ(email string)}
\item \texttt{EmailNEQ(email string)}
\item \texttt{EmailHasSuffix(suffix string)}
\end{itemize}
These methods can further be imported into other packages and this makes
working with the database a morning breeze.
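As an illustration of such generated predicates in use (the \texttt{ent.Client}
and the generated \texttt{user} package follow from the schema, so the exact
names are assumptions):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of a query using generated predicates},
label=entPredicateSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package store

import (
	"context"

	"git.dotya.ml/mirre-mt/pcmt/ent"      // generated package paths assumed.
	"git.dotya.ml/mirre-mt/pcmt/ent/user"
)

// usersWithDomain finds all users whose e-mail ends with the given
// suffix, relying solely on the code-generated predicate helpers.
func usersWithDomain(ctx context.Context, client *ent.Client, suffix string) ([]*ent.User, error) {
	return client.User.Query().
		Where(user.EmailHasSuffix(suffix)).
		All(ctx)
}
\end{lstlisting}
\vspace*{-\baselineskip}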
All the database \emph{entity} IDs were declared as type \texttt{UUID}
(\emph{universally unique ID, theoretically across space and time}), contrary
to the more traditional \emph{integer} IDs.
Support for \texttt{UUID}s was provided natively by the supported databases and
in Go via a popular and vetted open-source library
(\url{github.com/google/uuid}). Among the upsides of using \texttt{UUID}s over
integer IDs is that there is no need to manually increment the ID. But more
importantly, there is also the fact that compared to 32-bit\footnotemark{}
signed integers the \texttt{UUID} is a somewhat randomly generated 16 byte (128
bit) array, reducing chances of collision.
Apart from a higher chance of avoiding conflicts during imports of foreign
databases, this design decision might not provide any advantage for the current
system \emph{at the moment}. It could, however, prove important in the future,
should the database ever be deployed in a replicated, high-availability (HA)
manner with more than one concurrent \emph{writer} (replicated application
instances).
\footnotetext{In Go, integer size is architecture dependent, see
\url{https://go.dev/ref/spec#Numeric_types}.}
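A reduced sketch of how such a \texttt{UUID} primary key can be declared in an
\texttt{ent} schema (the field names here are illustrative):
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of an ent schema with a UUID primary key},
label=entUUIDSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
package schema

import (
	"entgo.io/ent"
	"entgo.io/ent/schema/field"
	"github.com/google/uuid"
)

// User holds the schema definition for the User entity.
type User struct {
	ent.Schema
}

// Fields of the User entity.
func (User) Fields() []ent.Field {
	return []ent.Field{
		// A randomly generated UUID instead of an auto-incremented integer.
		field.UUID("id", uuid.UUID{}).
			Default(uuid.New),
		field.String("email").
			Unique(),
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}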
The relations between entities as modelled with \texttt{ent} can be imagined as
the edges connecting the nodes of a directed \emph{graph}, with the nodes
representing the entities. This conceptualisation lends itself to a more
human-friendly querying language, where the directionality can be expressed
with words describing ownership, like so:
\vspace{\parskip}
\begin{lstlisting}[caption={Ent graph query},
label=entQuery,
backgroundcolor=\color{lstbg},
language=Go,
]
descr, err := users.Query().
Where().
LocalBreach.
Has(BreachDetailXyz).
Has(Description).
Only(ctx)
\end{lstlisting}
\n{1}{Deployment}
A deployment set-up, as suggested in
Section~\ref{sec:deploymentRecommendations}, is already \emph{partially}
covered by the multi-stage \texttt{Containerfile} that is available in the main
sources. Once built, the resulting container image only contains a handful of
things it absolutely needs:
\begin{itemize}
\item a self-contained statically linked copy of the program
\item a default configuration file and corresponding Dhall expressions cached
at build time
\item a recent CA certs bundle
\end{itemize}
Since the program also needs a database to function properly, an example
scenario includes the application container being run in a Podman \textbf{pod}
(as in a pea pod or a pod of whales) together with the database. That results in
not having to expose the database to the entire host or out of the pod at all;
it is only available over the pod's \texttt{localhost}. Hopefully it goes
without saying that the default values of any configuration secrets should be
substituted by the application operator with new, securely generated ones
(read: using \texttt{openssl rand} or \texttt{pwgen}).
\n{2}{Rootless Podman}
Assuming rootless Podman set up and the \texttt{just} tool installed on the
host, the application could be deployed by following a series of relatively
simple steps:
\begin{itemize}
\item build (or pull) the application container image
\item create a pod with user namespacing, exposing the application port
\item run the database container inside the pod
\item run the application inside the pod
\end{itemize}
In concrete terms, it would resemble something along the lines of
Listing~\ref{podmanDeployment}. Do note that all the commands are executed
under the unprivileged \texttt{user@containerHost} that is running rootless
Podman, i.e.\ it has \texttt{UID}/\texttt{GID} mapping entries in the
\texttt{/etc/subuid} and \texttt{/etc/subgid} files \textbf{prior} to running
any Podman commands.
% \newpage
\begin{lstlisting}[language=bash, caption={Example application deployment using
rootless Podman},
label=podmanDeployment, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}, commentstyle=\color{gray},
morekeywords={mkdir,podman,just},
]
# From inside the project folder, build the image locally using kaniko.
just kaniko
# Create a pod, limit the amount of memory/CPU available to its containers.
podman pod create --replace --name pcmt \
--memory=100m --cpus=2 \
--userns=keep-id -p5005:3000
# Create the database folder and run the database in the pod.
mkdir -pv ./tmp/db
podman run --pod pcmt --replace -d --name "pcmt-pg" --rm \
-e POSTGRES_INITDB_ARGS="--auth-host=scram-sha-256 \
--auth-local=scram-sha-256" \
-e POSTGRES_PASSWORD=postgres \
-v $PWD/tmp/db:/var/lib/postgresql/data:Z \
--health-cmd "sh -c 'pg_isready -U postgres -d postgres'" \
--health-on-failure kill \
--health-retries 3 \
--health-interval 10s \
--health-timeout 1s \
--health-start-period=5s \
docker.io/library/postgres:15.2-alpine3.17
# Run the application itself in the pod.
podman run --pod pcmt --replace --name pcmt-og -d --rm \
-e PCMT_LIVE=False \
-e PCMT_DBTYPE="postgres" \
-e PCMT_CONNSTRING="host=pcmt-pg port=5432 sslmode=disable \
user=postgres dbname=postgres password=postgres" \
-v $PWD/config.dhall:/config.dhall:Z,ro \
docker.io/immawanderer/mt-pcmt:testbuild -config /config.dhall
\end{lstlisting}
% \vspace*{-\baselineskip}
To summarise Listing~\ref{podmanDeployment}, first the application container is
built from inside the project folder using \texttt{kaniko}. The container
image could alternatively be pulled from the container repository, but it makes
more sense to show the image being built from sources, with the listing
depicting a \texttt{:testbuild} tag being used.
Next, a \emph{pod} is created and given a name, setting the port binding for
the application. Then, the database container is started inside the pod,
configured with a healthchecking mechanism.
As a final step, the application container itself is run inside the pod. The
application configuration named \texttt{config.dhall} located in \texttt{\$PWD}
is mounted as a volume into the container's \texttt{/config.dhall}, providing
the application with a default configuration. The container does ship with a
default configuration for reference; however, running the container without
additionally providing the necessary secrets would fail.
\n{3}{Sanity checks}
Also do note that the application connects to the database using its
\emph{container} name, i.e.\ not the IP address. This is possible thanks to
Podman setting up DNS resolution inside pods using default networks in such a
way that all containers in the pod can reach each other using their (container)
names.
Interestingly, connecting via \texttt{localhost} from containers inside the pod
would also work. Any container in the pod can reach any other container in the
same pod via the \emph{pod's} own \texttt{localhost}, thanks to a shared
network namespace~\cite{podmanNet}.
In fact, \emph{pinging} (sending ICMP packets using the \texttt{ping} command)
the database and application containers from an ad-hoc Alpine Linux container
that just joined the pod temporarily yields:
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Pinging pod containers using their
names}, label=podmanPing, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg},
morekeywords={podman,ping}
]
user@containerHost % podman run --rm -it \
--user=0 \
--pod=pcmt \
docker.io/library/alpine:3.18
/ % ping -c2 pcmt-og
PING pcmt-og (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.072 ms
64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.118 ms
--- pcmt-og ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.072/0.095/0.118 ms
/ % ping -c2 pcmt-pg
PING pcmt-pg (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.045 ms
64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.077 ms
--- pcmt-pg ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.045/0.061/0.077 ms
/ %
\end{lstlisting}
Had the application been deployed in a traditional manner instead of using
Podman, the use of FQDNs or IPs would probably be necessary, as there would be
no magic resolution of container names happening transparently in the
background.
\n{3}{Database isolation from the host}
A keen observer has undoubtedly noticed that the pod constructed in
Listing~\ref{podmanDeployment} only created a binding for the port used by
the application (\texttt{5005/tcp} on the host). The Postgres default port
\texttt{5432/tcp} is not among the pod's port bindings, as can be seen in the
pod creation command in the said listing. This can also easily be verified
using the command in Listing~\ref{podmanPortBindings}:
\begin{lstlisting}[language=bash, caption={Podman pod port binding inspection},
label=podmanPortBindings, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg},
morekeywords={podman},
]
user@containerHost % podman pod inspect pcmt \
--format="Port bindings: {{.InfraConfig.PortBindings}}\n\
Host network: {{.InfraConfig.HostNetwork}}"
Port bindings: map[3000/tcp:[{ 5005}]]
Host network: false
\end{lstlisting}
\vspace*{-\baselineskip}
To be absolutely sure that the database is available only internally in the pod
(unless, of course, there is another process listening on the subject port),
and that connecting to the database from outside the pod (i.e. from the
container host) really \emph{does} fail, the following commands can be issued:
\begin{lstlisting}[language=bash, caption={In-pod database is unreachable from
the host}, breaklines=true, label=podDbUnreachable,
basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg},
]
user@containerHost % curl localhost:5432
--> curl: (7) Failed to connect to localhost port 5432 after 0 ms: Couldn't connect to server
\end{lstlisting}
\vspace*{-\baselineskip}
The error in Listing~\ref{podDbUnreachable} is indeed expected, as it is the
result of the database port not being exposed from the pod.
Of course, since a volume (essentially a bind mount) from the host is used, the
actual data is still accessible on the host, both to privileged users and the
user running the pod. On a host with SELinux support, the \texttt{:Z} volume
addendum at least ensures that the content of the volume is directly
inaccessible to other containers, including the application container running
inside the same pod, via SELinux labelling.
\n{3}{Health checks}
Running the containers with health checks can be counted among the few crucial
settings. That way, the container runtime can periodically \emph{check} that
the application running inside the container is behaving correctly, and
instructions can be provided on what action should be taken should the
application's health evaluation turn out unsatisfactory. Furthermore, different
sets of health checking commands can be passed to Podman for start-up and
runtime.
\n{2}{Reverse proxy configuration}
If the application is deployed behind a reverse proxy, such as NGINX, the
configuration snippet in Listing~\ref{nginxSnip} might apply. Do note how the
named upstream server \texttt{pcmt} references the port that was exposed from
the pod created in Listing~\ref{podmanDeployment}.
\begin{lstlisting}[caption={Example reverse proxy configuration snippet},
breaklines=true, label=nginxSnip, basicstyle=\linespread{0.9}\scriptsize\ttfamily,
backgroundcolor=\color{lstbg},
morekeywords={upstream,server,return,listen,server_name,add_header,access_log,error_log,location,proxy_pass,proxy_set_header,allow,include,more_set_headers,ssl_buffer_size,ssl_dhparam,ssl_certificate,ssl_certificate_key,http2},
]
upstream pcmt {
server 127.0.0.1:5005;
}
server {
# requests that do not match the expected server name get a 404.
listen 80 default_server;
listen [::]:80 default_server;
return 404;
}
server {
listen 80;
listen [::]:80;
server_name <pcmt domain> www.<pcmt domain>;
add_header Referrer-Policy "no-referrer, origin-when-cross-origin";
return 301 https://<pcmt domain>$request_uri;
}
server {
server_name <pcmt domain>;
access_log /var/log/nginx/<pcmt domain>.access.log;
error_log /var/log/nginx/<pcmt domain>.error.log;
location / {
proxy_pass http://pcmt;
proxy_set_header X-Forwarded-Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
location /robots.txt {
allow all;
add_header Content-Type "text/plain; charset=utf-8";
add_header X-Robots-Tag "all, noarchive, notranslate";
return 200 "User-agent: *\nDisallow: /";
}
include sec-headers.conf;
add_header X-Real-IP $remote_addr;
add_header X-Forwarded-For $proxy_add_x_forwarded_for;
add_header X-Forwarded-Proto $scheme;
more_set_headers 'Early-Data: $ssl_early_data';
listen [::]:443 ssl http2;
listen 443 ssl http2;
ssl_certificate /etc/letsencrypt/live/<pcmt domain>/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/<pcmt domain>/privkey.pem;
include /etc/letsencrypt/options-ssl-nginx.conf;
ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
# reduce TTFB
ssl_buffer_size 4k;
}
\end{lstlisting}
\vspace*{-\baselineskip}
The snippet describes how traffic arriving at port \texttt{80/tcp} (IPv4 or
IPv6) that matches the domain name(s) \texttt{\{www.,\}<pcmt domain>}
(\texttt{<pcmt domain>} being the domain name that the program was configured
with, including appropriate DNS records) gets 301-redirected to the same
location (\texttt{\$request\_uri}), only over \texttt{HTTPS}. If the server
name does not match, a 404 is returned instead. In the main location block, all
traffic except for \texttt{/robots.txt} is forwarded to the named backend, with
headers added on top by the proxy in order to label the incoming requests as
\emph{not} originating at the proxy. The \emph{robots} route is treated
specially, immediately returning a directive that disallows crawling of any
resource on the page for all. The proxy is also instructed to log access and
error events to specific log files, finally load the domain's TLS certificates
(obtained out of band), reduce the \texttt{ssl\_buffer\_size} and listen on
port \texttt{443/tcp} (dual stack).
\n{1}{Validation}
\n{2}{Unit tests}
Unit testing is a hot topic for many people and the author does not count
himself as a staunch supporter of either extreme. The ``no unit tests''
opinion seems to discount any benefit there is to unit testing, while a
``TDD-only''\footnotemark{} approach can be a little too much for some people's
taste. The author tends to prefer a \emph{middle ground} approach in this
particular case, i.e.\ writing enough tests where meaningful, but not
necessarily testing everything or writing tests prior to business logic code.
Arguably, following the practice of TDD should result in better designed
code, particularly because there needs to be prior thought about the shape and
function of the code, as it is tested for before being even written, but it
adds a slight inconvenience to what is otherwise a straightforward process.
Thanks to Go's built in support for testing via its \texttt{testing} package
and the tooling in the \texttt{go} tool, writing tests is relatively simple. Go
looks for files in the form \texttt{<filename>\_test.go} in the present working
directory but can be instructed to look for test files in packages recursively
found on any path using the ellipsis, like so: \texttt{go test
./path/to/package/\ldots}, which then \emph{runs} all the tests found, and
reports some statistics, such as the time it took to run the test or whether it
succeeded or failed. To be precise, the test files also need to contain test
functions, which are functions with the signature \texttt{func TestWhatever(t
*testing.T)\{\}} and where the function prefix ``Test'' is just as important as
the signature. Without it, the function is not considered to be a testing
function despite having the required signature and is therefore \emph{not}
executed during testing.
This test lookup behaviour, however, also has a neat side effect: all the test
files can be kept side by side with their regular source counterparts; there is
no need to segregate them into a specially blessed \texttt{tests} folder or
similar, which in the author's opinion improves readability. As a failsafe, in
case no actual tests are found, the current behaviour of the tool is to print a
note informing the developer that no tests were found, which is handy to learn
if that was not intended/expected. When compiling regular source code, the Go
files with \texttt{\_test} in the name are simply ignored by the build tool.
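As an example of the convention, a trivial (hypothetical) \texttt{greet.go} and
its accompanying \texttt{greet\_test.go} kept in the same folder could look
like this:
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of a Go test file kept next to the tested code},
label=goTestSketch, basicstyle=\linespread{0.9}\small\ttfamily,
backgroundcolor=\color{lstbg}]
// greet.go
package greet

func Greet(name string) string { return "Hello, " + name + "!" }

// greet_test.go (kept side by side with greet.go)
package greet

import "testing"

// TestGreet is picked up by "go test" thanks to its name and signature.
func TestGreet(t *testing.T) {
	if got, want := Greet("Alice"), "Hello, Alice!"; got != want {
		t.Errorf("Greet() = %q, want %q", got, want)
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}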
\footnotetext{TDD, or Test Driven Development, is a development methodology
whereby tests are written \emph{first}, then a complementary piece of code
that is supposed to be tested is added, just enough to get past the compile
errors and to see the test \emph{fail}, and then the code is finally
refactored to make the test \emph{pass}. The code can then be fearlessly
extended because the test is the safety net catching the programmer when the
mind slips and alters the originally intended behaviour of the code.}
\n{2}{Integration tests}
Testing the integration with external software, namely the database in the case
of this program, is designed to utilise the same mechanism that was mentioned
in the previous section: Go's \texttt{testing} package. These tests verify that
after a code change the program can still perform the same actions with the
external software that were possible before the change; they are run locally
before every commit and then again in the CI after pushing to the remote.
\n{3}{func TestUserExists(t *testing.T)}
The integration test shown in Listing~\ref{integrationtest} is prefaced at line
10 by the declaration of a helper function \texttt{getCtx() context.Context},
which takes no arguments and returns a new \texttt{context.Context} initialised
with the value of the global logger. As previously mentioned, that is how the
logger gets injected into the user module functions. The actual test function
with the signature \texttt{TestUserExists(t *testing.T)} defines a database
connection string at line 21 and attempts to open a connection to the database.
The database in use here is SQLite3 running in in-memory mode, meaning no file
is actually written to disk during this process. Since the testing data is not
needed after the test, this is desirable. Next, a \emph{defer} statement
schedules a call to the \texttt{Close()} method of the database object, which
is the Go-idiomatic way of closing files and network connections (the latter
being just an abstraction over files on UNIX-like operating systems such as
GNU/Linux). Regardless of where it is declared, the deferred call only runs
after all the other statements in the surrounding function have finished, which
makes sure that no file descriptors (FDs) are leaked and that the resource is
properly closed when the function returns.
In the next step, at line 25, a database schema creation is attempted, handling
the potential error in the Go-idiomatic way: the return value of the function
is assigned to a variable declared directly in the \texttt{if} statement, and
it is checked whether \texttt{err} was \texttt{nil} or not. In case
\texttt{err} was not \texttt{nil}, i.e.\ \emph{there was an error in the called
function}, the condition evaluates to \texttt{true} and the inner block is
entered. Inside it, the error is announced to the user (likely a developer
running the test in this case) and the testing object's \texttt{FailNow()}
method is called, which marks the test function as having failed and stops its
execution. In this case, that is the desired outcome, since if the database
schema creation call fails, there really is no point in continuing the testing
of user creation.
\\
Conversely, if the schema \emph{does} get created without an error, the code
continues by declaring a few variables (lines 30--32): \texttt{username},
\texttt{email} and \texttt{ctx}, the last of which holds the context injected
with the logger. Two of them are subsequently (line 33) passed into the
\texttt{UsernameExists} function, \texttt{ctx} being the first argument,
followed by the database pointer and \texttt{username}, while the
\texttt{email} variable is only used at a later stage (line 46). The point of
declaring them together is to give a sense of relatedness. The error value
returned from this function is again checked (line 33) and, if everything goes
well, the \texttt{usernameFound} boolean value is checked next at line 38.
\smallskip
\smallskip
\begin{lstlisting}[language=Go, caption={User existence integration test},
label=integrationtest,basicstyle=\linespread{0.9}\scriptsize\ttfamily,
backgroundcolor=\color{lstbg},
numbers=left,
numberstyle=\linespread{0.9}\scriptsize\ttfamily,
frame=l,
framesep=18.5pt,
framerule=0.1pt,
xleftmargin=18.7pt,
otherkeywords={\%s, \%q, \%v},
]
// modules/user/user_test.go
package user
import (
"context"
"testing"
"git.dotya.ml/mirre-mt/pcmt/ent/enttest"
"git.dotya.ml/mirre-mt/pcmt/slogging"
_ "github.com/xiaoqidun/entps"
)
func getCtx() context.Context {
l := slogging.Init(false)
ctx := context.WithValue(context.Background(), CtxKey{}, l)
return ctx
}
func TestUserExists(t *testing.T) {
connstr := "file:ent_tests?mode=memory&_fk=1"
db := enttest.Open(t, "sqlite3", connstr)
defer db.Close()
if err := db.Schema.Create(context.Background()); err != nil {
t.Errorf("failed to create schema resources: %v", err)
t.FailNow()
}
username := "dude"
email := "dude@b.cc"
ctx := getCtx()
usernameFound, err := UsernameExists(ctx, db, username)
if err != nil {
t.Errorf("error checking for username {%s} existence: %q", username, err)
}
if usernameFound {
t.Errorf("unexpected: user{%s} should not have been found", username)
}
if _, err := EmailExists(ctx, db, email); err != nil {
t.Errorf("unexpected: user email '%s' should not have been found", email)
}
usr, err := CreateUser(ctx, db, email, username, "so strong")
if err != nil {
t.Errorf("failed to create user, error: %q", err)
t.FailNow()
} else if usr == nil {
t.Error("got nil usr back")
t.FailNow()
}
if usr.Username != username {
t.Errorf("got back wrong username, want: %s, got: %s",
username, usr.Username,
)
} // ...more checks...
}
\end{lstlisting}
Since the database has just been created, there should be no users, which is
what the \texttt{if} statement (line 35) checks for. The same check is then
performed using an email address (line 42), which is likewise expected not to
find an existing user.
The final statements of the described test attempt to create a user by calling
the function \texttt{CreateUser(...)} at line 46, whose return values are again
checked for an error and for \emph{nillability}, respectively. The test
continues with more checks similar to what has been described so far, but the
rest has been omitted for brevity.
As was just demonstrated in the test, a neat thing about error handling in Go
is that it makes it very easy to exercise all code paths, not just the
\emph{happy path} where there are no issues. The recommended approach of
immediately and explicitly handling (or consciously ignoring) each error is, in
the author's view, superior to wrapping hundreds of lines in \texttt{try}
blocks and then \emph{catching} (or not catching) \emph{all the} exceptions, as
is the practice in some other languages.
\n{2}{Test environment}
The application has been deployed in a test environment on the author's modest
Virtual Private Server (VPS) at \texttt{https://testpcmt.dotya.ml}, protected
by a \emph{Let's Encrypt}-issued, short-lived, ECDSA \texttt{secp384r1} curve
TLS certificate, and configured with a strict CSP. Since it is a test instance,
limits (and rate limits) might be imposed to prevent abuse.
\\
The test environment makes the program available over both the modern IPv6 and
the legacy IPv4 protocol, to maximise accessibility. Redirects were set up from
plain HTTP to HTTPS, as well as from the \texttt{www} to the non-\texttt{www}
domain. The domain configuration is hardened by setting a \texttt{CAA} record,
which limits the certificate authorities (CAs) that are allowed to issue TLS
certificates for it (and have them be trusted by validating clients).
Additionally, \textit{HTTP Strict Transport Security} (HSTS) had been enabled
for the main domain (\texttt{dotya.ml}), including the subdomains, quite some
time ago (consult the preload lists in Firefox/Chrome), which mandates that
clients speaking HTTP only ever connect to it (and its subdomains) using TLS.
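For illustration, a \texttt{CAA} record limiting issuance to a single CA could
be published as zone entries along the following lines; the exact records and
the reporting address shown here are assumptions made for the example, not
necessarily the live \texttt{dotya.ml} records.

\begin{lstlisting}[caption={Illustrative CAA records},
label=caasketch,basicstyle=\linespread{0.9}\scriptsize\ttfamily,
backgroundcolor=\color{lstbg}]
; illustrative CAA entries - not necessarily the live dotya.ml records
dotya.ml.  3600  IN  CAA  0 issue "letsencrypt.org"
dotya.ml.  3600  IN  CAA  0 iodef "mailto:hostmaster@dotya.ml"
\end{lstlisting}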
\n{3}{Deployment validation}
% TODO: show the results of testing the app in prod using:
% \url{https://testssl.sh/} and
% \url{https://gtmetrix.com/reports/testpcmt.dotya.ml/}.
The deployed application has been validated using the \textit{Security Headers}
tool (see \url{https://securityheaders.com/?q=https%3A%2F%2Ftestpcmt.dotya.ml}),
the results of which can be seen in Figure~\ref{fig:secheaders}.
It shows that the application sets the \texttt{Cross-Origin-Opener-Policy}
header to \texttt{same-origin}, which isolates the browsing context exclusively
to \textit{same-origin} documents, preventing \textit{cross-origin} documents
from being loaded into the same browsing context.
\obr{Security Headers scan}{fig:secheaders}{.89}{graphics/screen-securityHeaders}
Furthermore, a \texttt{Content Security Policy} of
\texttt{upgrade-insecure-requests; default-src 'none'; manifest-src 'self';
font-src 'self'; img-src 'self' https://*; script-src 'self'; style-src 'self';
object-src 'self'; form-action 'self'; frame-ancestors 'self'; base-uri 'self'}
is set by the program in a response header.
This policy essentially pronounces the application (whatever domain it happens
to be hosted at, i.e.\ \texttt{'self'}) the only \textit{permissible} source of
any scripts, styles and frames, and the only allowed destination of web forms.
One exception is the \texttt{img-src 'self' https://*} directive, which more
leniently also permits images from any \textit{secure} (HTTPS) source. This
measure ensures that no unvetted content is ever loaded from elsewhere.
The \texttt{Referrer-Policy} header setting of \texttt{no-referrer,
strict-origin-when-cross-origin} reduces user tracking, since no referrer is
included (the \texttt{Referer} header is omitted) when the user navigates away
from the site or otherwise sends requests outside the application. The
\texttt{Permissions-Policy} header, set to \texttt{geolocation=(), midi=(),
sync-xhr=(), microphone=(), camera=(), gyroscope=(), magnetometer=(),
fullscreen=(self), payment=()}, declares that the application is, for instance,
never going to request access to payment information, the user's microphone or
camera, or geolocation.
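To illustrate the mechanism (and only the mechanism),
Listing~\ref{secheaderssketch} shows how headers like these could be attached
to every response using nothing but Go's standard \texttt{net/http} package. It
is a self-contained sketch, not the program's actual middleware, and the
handler and listen address are made up for the example.

\begin{lstlisting}[language=Go, caption={Sketch of attaching security headers to every response},
label=secheaderssketch,basicstyle=\linespread{0.9}\scriptsize\ttfamily,
backgroundcolor=\color{lstbg}]
// hypothetical illustration - not pcmt's actual middleware
package main

import (
	"log"
	"net/http"
)

// secureHeaders wraps any handler and stamps the hardening headers
// discussed above onto every outgoing response.
func secureHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("Content-Security-Policy",
			"upgrade-insecure-requests; default-src 'none'; manifest-src 'self'; "+
				"font-src 'self'; img-src 'self' https://*; script-src 'self'; "+
				"style-src 'self'; object-src 'self'; form-action 'self'; "+
				"frame-ancestors 'self'; base-uri 'self'")
		h.Set("Cross-Origin-Opener-Policy", "same-origin")
		h.Set("Referrer-Policy", "no-referrer, strict-origin-when-cross-origin")
		h.Set("Permissions-Policy",
			"geolocation=(), midi=(), sync-xhr=(), microphone=(), camera=(), "+
				"gyroscope=(), magnetometer=(), fullscreen=(self), payment=()")

		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, _ *http.Request) {
		_, _ = w.Write([]byte("hello")) // placeholder handler
	})

	// plain HTTP on a local port for brevity; the real deployment sits
	// behind a TLS-terminating reverse proxy
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", secureHeaders(mux)))
}
\end{lstlisting}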
\texttt{gobuster} was used in fuzzing mode to aid in uncovering potential
application misconfigurations. The wordlists used include:
\begin{itemize}
\item Anton Lopanitsyn's \texttt{fuzz.txt} (\url{https://github.com/Bo0oM/fuzz.txt/tree/master})
\item Daniel Miessler's \texttt{SecLists} (\url{https://github.com/danielmiessler/SecLists})
\item Sam's \texttt{samlists} (\url{https://github.com/the-xentropy/samlists})
\end{itemize}
Many requests yielded 404s for non-existent pages, or possibly for pages
requiring authentication (\emph{NotFound} is returned so as not to disclose the
page's existence). The program initially also issued quite a few 503s as a
result of rate-limiting, until \texttt{gobuster} was tamed using the
\texttt{--delay} parameter. Anti-CSRF measures employed by the program caused
most of the remaining requests to yield 400s (missing CSRF token), or 403s when
a CSRF token was present.
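The fuzzing runs were driven by invocations along the following lines; the
wordlist path and the delay value shown are illustrative rather than a verbatim
record of every run.

\begin{lstlisting}[caption={Illustrative gobuster invocation},
label=gobusterinvocation,basicstyle=\linespread{0.9}\scriptsize\ttfamily,
backgroundcolor=\color{lstbg}]
gobuster dir \
    --url https://testpcmt.dotya.ml \
    --wordlist fuzz.txt \
    --delay 1500ms
\end{lstlisting}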
% A Burp test would perhaps be more telling.
The deployed application was scanned with Qualys' \textit{SSL Labs} scanner and
the results can be seen in Figure~\ref{fig:ssllabs}, confirming that HSTS
(including subdomains) is deployed, that the server runs TLS 1.3 and that a DNS
Certification Authority Authorisation (CAA) record is configured for the
domain, with the overall grade being A+.
\obr{Qualys SSL Labs scan}{fig:ssllabs}{.90}{graphics/screen-sslLabs}
\n{1}{Application screenshots}
Figure~\ref{fig:homepage} depicts the initial page that a logged-out user is
greeted with when they load the application.
\obr{Homepage}{fig:homepage}{.83}{graphics/screen-homepage}
Figure~\ref{fig:signup} shows the registration page with input fields turned
green after basic validation. Visiting this page with registration disabled in
the settings would yield a 404.
\obr{Registration page}{fig:signup}{.40}{graphics/screen-signup}
% \newpage
\obr{Registration page email
error}{fig:signupEmailError}{.40}{graphics/screen-signup-emailError}
A sign-up form error telling the user to provide a valid email address is shown
in Figure~\ref{fig:signupEmailError}.
\obr{Sign-in page}{fig:signin}{.40}{graphics/screen-signin}
Figure~\ref{fig:signin} depicts a sign-in form similar to the sign-up one.
\obr{Short password error on
sign-in}{fig:signinShortPasswd}{.40}{graphics/screen-signin-shortPasswordError}
An error in Figure~\ref{fig:signinShortPasswd} prompts the user to lengthen the
content of the password field from 3 to at least 20 characters.
\obr{User management screen}
{fig:adminUserManagement}{.90}
{graphics/screen-adminUserManagement}
Figure~\ref{fig:adminUserManagement} shows the user management screen, which
provides links to view each user's details page and to start creating a new
user.
% \newpage
% \obr{User creation screen}
% {fig:adminUserCreate}{.35}
% {graphics/screen-adminUserCreate}
\obr{User creation: `username not unique' error}
{fig:adminUserCreateErrorUsernameNotUnique}{.65}
{graphics/screen-adminUserCreateErrorUsernameNotUnique}
The user creation form can be seen in
Figure~\ref{fig:adminUserCreateErrorUsernameNotUnique}. Both regular and
admin-level users can be created here. In this case, an error is shown, telling
the user that there is an issue with username uniqueness. The user experience
of this process could in the future be improved by using a bit of JavaScript
(or WebAssembly) to check the uniqueness of the username on the user's
\emph{key-up} events.
\newpage
\obr{`demo' user creation post-hoc}
{fig:adminUserCreateDemo}{.85}
{graphics/screen-adminUserCreateDemo}
The user management screen is shown again in
Figure~\ref{fig:adminUserCreateDemo} after the user `demo' was created. An
informative \emph{flash} message is printed near the top of the page
immediately after the action and is not shown on subsequent page loads.
\obr{User details screen}
{fig:adminUserDetail}{.75}
{graphics/screen-adminUserDetail}
The user details page is depicted in Figure~\ref{fig:adminUserDetail}. The
interface presents key information about the user such as ID, username and
admin status. Additionally, it provides a link back to the previous page and
two buttons: one for editing the user and one for user deletion.
\newpage
\obr{User edit screen}
{fig:adminUserEdit}{.45}
{graphics/screen-adminUserEdit}
Figure~\ref{fig:adminUserEdit} shows the form for user editing, with an
`Update' button at the bottom for submitting and a couple of checkboxes for
toggling the `admin' and `active' state of the user. Above those, there are
input fields for `username', `email', `password' and the password confirmation.
\obr{User deletion confirmation}
{fig:adminUserDeleteConfirm}{.55}
{graphics/screen-adminUserDeleteConfirmation}
When attempting to delete a user, the administrator is presented with the
screen shown in Figure~\ref{fig:adminUserDeleteConfirm}, which asks them
whether they are absolutely sure they want to perform an action with permanent
consequences. The `Confirm permanent deletion' button is highlighted in an
intense
red colour, while the `Cancel' button is displayed in a light blue tone. There
are two additional links: the `All users' one that points to the user
management page, and the `Back to detail' one that simply brings the
administrator one step back to the user details page.
\obr{User deletion post-hoc}
{fig:adminUserDeletePostHoc}{.65}
{graphics/screen-adminUserDemoDeletion}
\newpage
After successful user deletion, the administrator is redirected back to the
user management page and a flash message confirming the deletion is printed
near the
top of the page, as shown in Figure~\ref{fig:adminUserDeletePostHoc}.
\obr{Manage API keys}
{fig:manageAPIKeys}{.65}
{graphics/screen-manageAPIKeys}
Figure~\ref{fig:manageAPIKeys} shows a page that allows administrators to
manage instance-wide API keys for external services, such as \emph{Have I Been
Pwned?} or \emph{DeHashed.com}. Do note that these keys are never distributed
to clients in any way and are only ever used by the application itself to make
the requests on \emph{behalf} of the users.
\obr{Import of locally available breach data from the CLI}
{fig:localImport}{.99}
{graphics/screen-localImport}
Figure~\ref{fig:localImport} depicts how formatted breach data can be imported
into the program's database using the CLI.
Figure~\ref{fig:hibp-nobreach} displays the result of a search using the online
data source. The account was not found to be a part of any of the available
breaches.
\newpage
\obr{Compromise monitoring using online sources - no breach found}
{fig:hibp-nobreach}{.60}{graphics/screen-hibp-nobreach}
Figure~\ref{fig:hibp-breach3Accts} depicts the result of a search using
the online API, providing a message and list of links to breach details.
\obr{Compromise listing using online API (and a test account)}
{fig:hibp-breach3Accts}{.54}{graphics/screen-hibp-breach3Accts}
\obr{Stratfor breach details page}
{fig:hibp-breachDetails}{.74} {graphics/screen-hibp-breachDetails}
Figure~\ref{fig:hibp-breachDetails} shows the \emph{Stratfor} breach details
page, with the data sourced from an online API.
% =========================================================================== %