% =========================================================================== %
% Encoding: UTF-8 (žluťoučký kůň úpěl ďábelské ódy)
% =========================================================================== %
\vspace*{\fill}
\begin{center}
\Large
\textit{This is a document draft.}
\end{center}
\vspace*{\fill}
\newpage


% =========================================================================== %
\nn{Introduction}
Introduce the goals and the methods used to achieve them.


% =========================================================================== %
\part{Theoretical part}

\n{1}{Purpose}

What this write-up is attempting to achieve.
\n{1}{Terminology}

\n{2}{Linux}

The term \emph{Linux} is used exclusively to mean the Linux
kernel~\cite{linux}.
\n{2}{GNU/Linux}

As far as a Linux-based operating system is concerned, the term ``GNU/Linux''
as defined by the Free Software Foundation~\cite{fsfgnulinux} is used. While
it is longer and arguably a little cumbersome, the author aligns with the
opinion that this term more accurately describes its actual target. Aware
that many people conflate the complete operating system with its (core)
component, the kernel, the author takes care to distinguish the two,
although, writing from experience, this colloquially tends to bring more
confusion, and a lengthy explanation is usually required.
\n{2}{Containers}

When the concepts of \emph{containerisation} and \emph{containers} are
mentioned throughout this work, the author has OCI
containers~\cite{ocicontainers} in mind, which are broadly a superset of
\emph{Linux Containers}: some set of processes is presented with a view of
kernel resources (there are multiple kinds of resources, such as IPC queues;
network devices, stacks and ports; mount points; process IDs; user and group
IDs; Cgroups and others) that differs for each different set of processes.
This is similar in thought to FreeBSD \emph{jails}~\cite{freebsdjails}, with
the distinction being that they are, of course, facilitated by the Linux
kernel namespace functionality~\cite{linuxnamespaces}, which can in turn be
regarded as \emph{inspired} by Plan 9's namespaces~\cite{plan9namespaces},
Plan 9 being a Bell Labs successor to Unix 8th Edition, discontinued in 2015.

While there without a doubt \emph{is} specificity bound to using each of the
tools that enable creating (Podman vs.\ Buildah vs.\ Docker BuildX) or
running (containerd vs.\ runC vs.\ crun) container images, when describing an
action performed with or on a container, the process should generally be
explained in such a way that it is repeatable using any spec-conforming tool
that is available and \emph{intended for the job}.
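The namespace mechanism just described can be observed directly from a
running process: the Linux kernel exposes the namespaces a process belongs to
under \texttt{/proc/<pid>/ns}. A minimal Go sketch (assuming a GNU/Linux host
with \texttt{procfs} mounted; unrelated to the \texttt{pcmt} codebase)
listing them:

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// listNamespaces returns the namespace types visible for the current
// process, as exposed by the Linux kernel under /proc/self/ns.
func listNamespaces() ([]string, error) {
	entries, err := os.ReadDir("/proc/self/ns")
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	ns, err := listNamespaces()
	if err != nil {
		// Not a Linux host, or /proc is not mounted.
		fmt.Println("namespace listing unavailable:", err)
		return
	}
	for _, n := range ns {
		fmt.Println(n)
	}
}
```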
\n{2}{The program}

By \emph{the program} or \emph{the application} without any additional
context, the author usually means the Password Compromise Monitoring Tool
program.


\n{1}{Cryptography primer}\label{sec:cryptographyprimer}

Prerequisites necessary for following up.
\n{2}{Encryption}

\n{3}{Symmetric cryptography}

\n{3}{Asymmetric cryptography}

\n{3}{The key exchange problem}

\n{3}{The key protection problem}

\n{3}{TLS}\label{sec:tls}


\n{2}{Hash functions}

Explanation of what hash functions are.
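The following minimal Go sketch illustrates two defining traits of a
cryptographic hash function, a fixed-size output and drastically different
digests for near-identical inputs, using SHA-256 from the standard library
(the inputs are arbitrary examples):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// digestHex returns the hex-encoded SHA-256 digest of s. The digest is
// always 32 bytes (64 hex characters), no matter how long s is.
func digestHex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return fmt.Sprintf("%x", sum)
}

func main() {
	// A single flipped character yields a completely unrelated digest.
	fmt.Println(digestHex("password"))
	fmt.Println(digestHex("Password"))
}
```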
\n{3}{Uses and \textit{mis}uses}

The good, the bad and the ugly of hash usage (including, or in some cases
excluding, salting; weak hashes; split hashes (Microsoft)).

\n{3}{Threats to hashes}

Rainbow tables, broken hash functions\ldots
\n{1}{Brief passwords history}\label{sec:history}

\n{2}{Purpose over time}

\n{2}{What is considered a password}

\n{2}{Problems with passwords}

\n{3}{Arbitrary length requirements (min/max)}

\n{3}{Arbitrary complexity requirements}

\n{3}{Restricting special characters}

For as long as passwords have been used to protect privileged access, service
providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords. Ways of achieving this may vary,
but the intent stays the same: prevent users from inputting characters that
the system cannot comfortably handle, for one reason or another.
\n{1}{Password strength validation}

Entropy, dictionaries, multiple factors.
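A common (if naive) first approximation of password strength is the entropy
upper bound $E = L \cdot \log_2 C$ for password length $L$ and character-pool
size $C$. The sketch below is an illustration of that formula only, not the
validation logic used by \texttt{pcmt}:

```go
package main

import (
	"fmt"
	"math"
)

// naiveEntropyBits estimates password entropy as length * log2(poolSize),
// assuming each character is drawn uniformly and independently from a pool
// of poolSize symbols. Real-world strength estimators (dictionaries,
// keyboard patterns) are far stricter; this is a theoretical upper bound.
func naiveEntropyBits(length, poolSize int) float64 {
	if length <= 0 || poolSize <= 1 {
		return 0
	}
	return float64(length) * math.Log2(float64(poolSize))
}

func main() {
	// 8 lowercase letters vs. 12 characters from a 94-symbol printable pool.
	fmt.Printf("%.1f bits\n", naiveEntropyBits(8, 26))
	fmt.Printf("%.1f bits\n", naiveEntropyBits(12, 94))
}
```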
\n{1}{Web security}\label{sec:websecurity}

The internet, being a vast space of intertwined concepts and ideas, is a
superset of the Web, since not everything that is available on the internet
can be described as a web \emph{resource}. But it is precisely that part of
the internet that the next sections discuss, covering what browsers are, what
they do and how they relate to web security.
\n{2}{Browsers}\label{sec:browsers}

TODO: describe how browsers find out where the web page lives, get a webpage,
parse it, parse stylesheets, run scripts, apply SAMEORIGIN restrictions etc.
TODO: (privileged process running untrusted code on user's computer),
history, present, security focus of the development teams, user-facing
signalling (padlock colours, scary warnings).

Browsers, sometimes used together with the word that can serve as a real tell
for their specialisation (\emph{web} browsers), are programs intended for
\emph{browsing} of \emph{the web}. In more technical terms, browsers are
programs that facilitate (directly or via intermediary tools) domain name
lookups, connecting to web servers, optionally establishing a secure
connection, requesting the web page in question, determining its
\emph{security policy}, resolving what accompanying resources the web page
specifies and, depending on the applicable security policy, requesting those
from their respective origins, applying stylesheets and running scripts.
Constructing a program that speaks many protocols and securely runs untrusted
code from the internet is no easy task.
\n{3}{Complexity}

Browsers these days are also quite ubiquitous programs, running on
\emph{billions} of consumer-grade mobile devices (which are also notorious
for bad update hygiene) and desktop devices all over the world. Regular users
usually expect them to work flawlessly under a multitude of network
conditions and network scenarios (café WiFi, cellular data in a remote
location, home broadband that is DNS-poisoned by the ISP), with differently
tuned (or commonly misconfigured) web servers, a combination of modern and
\emph{legacy} encryption schemes, and different levels of conformance to web
standards from both web server and website developers. Of course, if a
website is broken, it is the browser's fault. Browsers are also expected to
detect whether \emph{captive portals} (a type of access control that usually
tries to force the user through a webpage with terms of use) are active and
offer redirects. All of this amounts to immense complexity, and the
combination of ubiquity and great exposure this type of software gets is, in
the author's opinion, the cause behind the staggering number of
vulnerabilities found, reported and fixed in browsers every year.
\n{3}{Standardisation}

Over the years, a consortium of parties interested in promoting and
developing the web (also due to its potential as a digital marketplace, i.e.\
financial incentives) and browser vendors (of which the most neutral
participant is perhaps \emph{Mozilla}, with Chrome being run by Google, Edge
by Microsoft and Safari/WebKit by Apple) has evolved a great volume of web
standards, which are also relatively frequently updated or deprecated and
replaced by revised or new ones, rendering browser maintenance essentially a
cat-and-mouse game.

It is the web's extensibility that enabled this build-up, and it has
ironically been proclaimed by some to be the web's greatest asset. It has
also ostensibly been criticised~\cite{ddvweb} in the past, and the
frustration with the status quo of web standards has relatively recently
prompted a group of people to create ``\textit{a new application-level
internet protocol for the distribution of arbitrary files, with some special
consideration for serving a lightweight hypertext format which facilitates
linking between files}'': Gemini~\cite{gemini}\cite{geminispec}, which in the
words of its authors can be thought of as ``\textit{the web, stripped right
back to its essence}'' or as ``\textit{Gopher, souped up and modernised just
a little}'', depending upon the reader's perspective, noting that the latter
view is probably more accurate.
\n{3}{HTTP}

Originally, HTTP was designed just for fetching hypertext \emph{resources},
but it has evolved since then, particularly due to its extensibility, to
allow fetching all sorts of web resources a modern website provides, such as
scripts or images, and even to \emph{post} content back to servers.

HTTP relies on TCP (Transmission Control Protocol), one of the
\emph{reliable} (as mandated by HTTP) protocols used to send data across
contemporary IP (Internet Protocol) networks, to deliver the data it requests
or sends. When Tim Berners-Lee invented the World Wide Web (WWW) in 1989
while working at CERN (the European Organization for Nuclear Research), with
the rather noble intent of a ``\emph{wide-area hypermedia information
retrieval initiative to give universal access to a large universe of
documents}''~\cite{wwwf}, he also invented the HyperText Markup Language
(HTML) to serve as a formatting method for these new hypermedia documents.
The first website was written in roughly the same way as today's websites,
using HTML, although the markup language has changed since, with the current
version being HTML5.
It has been mentioned that the client \textbf{requests} a \textbf{resource}
and receives a \textbf{response}, so those terms should probably be defined.
A request is what the client sends to the server. A resource is what it
requests, and a response is the answer provided by the server.

HTTP follows the classic client-server model, whereby it is \textbf{always}
the client that initiates the request.
A web page is, to put it bluntly, a chunk of \emph{hypertext}. To display a
web page, a browser first needs to send a request to fetch the HTML
representing the page, which is then parsed, and additional requests for
sub-resources are made. If a page defines layout information in the form of
CSS, that is parsed as well.

A web page needs to be present on the local computer \emph{before} it can be
parsed by the browser, and since websites are usually still served by
programs called \emph{web servers}, as in the \emph{early days}, that
presents the problem of how to tell the browser where the resource should be
pulled from. In today's browsers, the issue is solved (short of the CLI) by
the \emph{address bar}, a place into which the user types what they wish the
browser to fetch for them.

The formal name of this input is a \emph{Uniform Resource Locator}, or URL,
and it contains the scheme (or the protocol, such as \texttt{http://}), the
host address or a domain name, and a (TCP) port number.
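The components of a URL can be illustrated with Go's standard
\texttt{net/url} parser; the URL used here is only an example:

```go
package main

import (
	"fmt"
	"net/url"
)

// parseParts splits a URL into the components mentioned above: the scheme,
// the host (a domain name or an IP address) and the TCP port, using Go's
// standard net/url parser.
func parseParts(raw string) (scheme, host, port string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", err
	}
	return u.Scheme, u.Hostname(), u.Port(), nil
}

func main() {
	scheme, host, port, err := parseParts("http://example.com:8080/index.html")
	if err != nil {
		panic(err)
	}
	fmt.Println("scheme:", scheme) // http
	fmt.Println("host:", host)     // example.com
	fmt.Println("port:", port)     // 8080
}
```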
Since a TCP connection needs to be established first, to connect to a server
whose URL only contains a domain name, the browser needs to perform a domain
name \emph{lookup} using system facilities, or, as was the case for a couple
of notorious Chromium versions, send some additional and unrelated queries,
which (with Chromium-based derivatives' numbers) ended up placing unnecessary
load directly on the root DNS servers~\cite{chromiumrootdns}.

If a raw IP address and port combination is used, the browser attempts to
connect to it directly and requests the user-requested page, by default using
the \texttt{GET} \emph{method}. The \emph{well-known} HTTP port 80 is
assumed unless another port is explicitly specified, and it can be omitted
whether the host is a domain name or an IP address.
The method is a way for the user agent to define what operation it wants to
perform. \texttt{GET} is used for fetching resources, while \texttt{POST} is
used to send data to the server, such as to post the values of an HTML form.

A server response is comprised of a \textbf{status code}, a status message,
HTTP \textbf{headers} and an optional \textbf{body} containing the content.
The status code indicates whether the original request was successful, and
the browser is generally there to interpret these status codes for the user.
There are enough status codes to be confused by the sheer numbers, but
luckily there is a method to the madness and they can be divided into
groups/classes:

\begin{itemize}
\item 1xx: Informational responses
\item 2xx: Successful responses
\item 3xx: Redirection responses
\item 4xx: Client error responses
\item 5xx: Server error responses
\end{itemize}
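The grouping above can be expressed as a trivial classification function; the
following Go sketch is merely illustrative:

```go
package main

import "fmt"

// statusClass maps an HTTP status code to the class it belongs to,
// mirroring the five groups of the status code registry.
func statusClass(code int) string {
	switch {
	case code >= 100 && code <= 199:
		return "informational"
	case code >= 200 && code <= 299:
		return "successful"
	case code >= 300 && code <= 399:
		return "redirection"
	case code >= 400 && code <= 499:
		return "client error"
	case code >= 500 && code <= 599:
		return "server error"
	default:
		return "unknown"
	}
}

func main() {
	for _, c := range []int{200, 301, 404, 503} {
		fmt.Println(c, statusClass(c))
	}
}
```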
In case the \emph{user agent} (a web \emph{client}), such as a browser,
receives a response with content, it has to parse it.

A header is additional information sent by both the server and the client.
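The request-response cycle described in this section can be observed in a
self-contained way using Go's \texttt{net/http/httptest} package, which
stands up a throwaway local server; the handler and its output are, of
course, made up for the illustration:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// serveAndFetch starts a throwaway loopback HTTP server, performs a GET
// request against it and returns the status code, a response header and
// the body, demonstrating the full request-response round trip.
func serveAndFetch() (int, string, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			w.Header().Set("Content-Type", "text/html; charset=utf-8")
			fmt.Fprint(w, "<!DOCTYPE html><p>hello</p>")
		}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return 0, "", "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", "", err
	}
	return resp.StatusCode, resp.Header.Get("Content-Type"), string(body), nil
}

func main() {
	status, ctype, body, err := serveAndFetch()
	if err != nil {
		panic(err)
	}
	fmt.Println("status:", status)
	fmt.Println("content-type:", ctype)
	fmt.Println("body:", body)
}
```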
\n{2}{Cross-site scripting}\label{sec:xss}

\n{2}{Content Security Policy}\label{sec:csp}

Content Security Policy (CSP) has been an important addition to the arsenal
of website operators, even though not everybody has necessarily been
utilising it properly, or even taken notice. To understand what guarantees it
provides and what kind of protections it employs, it is first necessary to
grok how websites are parsed and displayed, which has been discussed in depth
in the previous sections.
\n{1}{Sandboxing}\label{sec:sandboxing}

\n{2}{User isolation}

Admin vs.\ regular user, privilege escalation, least-privilege principle,
zero-trust principle.

\n{2}{Process isolation}

Sandbox escape.

\n{2}{Namespaced isolation}

Sandbox escape.


\n{1}{Data storage}

Among the key aspects of any security-minded system (application), the
following are certain to make the count:
\begin{enumerate}
\item data integrity
\item data authenticity
\item data confidentiality
\end{enumerate}
\n{2}{Integrity}

\n{2}{Authenticity}

\n{2}{Confidentiality}

\n{2}{Encryption-at-rest}


\n{1}{Compromise checking and prevention}

\n{2}{HIBP and similar tools}

\n{2}{OWASP Top 10 for the implementers}

\n{2}{Password best practices}

% =========================================================================== %
\part{Practical part}

\n{1}{Kudos}

\textbf{Disclaimer:} the author is not affiliated in any way with any of the
projects described on this page.

The \textit{Password Compromise Monitoring Tool} (\texttt{pcmt}) program has
been developed using and utilising a great deal of free (as in Freedom) and
open-source software, either directly or as an outstanding work tool, and the
author would like to take this opportunity to recognise that fact.

In particular, the author acknowledges that this work would not be the same
without:
\begin{itemize}
\item vim (\url{https://www.vim.org/})
\item Arch Linux (\url{https://archlinux.org/})
\item ZSH (\url{https://www.zsh.org/})
\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
\item Nix (\url{https://nixos.org/explore.html})
\item pre-commit (\url{https://pre-commit.com/})
\item Podman (\url{https://podman.io/})
\item Go (\url{https://go.dev/})
\end{itemize}

All of the code written has been typed into VIM (\texttt{9.0}), the shell
used to run the commands was ZSH, both running in the author's terminal
emulator of choice, \texttt{kitty}, on a
\raisebox{.8ex}{\texttildelow}8-month-old (at the time of writing)
installation of \textit{Arch Linux (by the way)} using the
\texttt{6.3.1-wanderer-zfs-xanmod1} variant of the Linux kernel.
\n{1}{Development}

The source code of the project has been versioned since the start using the
popular and industry-standard git (\url{https://git-scm.com}) source code
management (SCM) tool. Commits were made frequently and, if at all possible,
for small and self-contained changes of code, striving for sane commit
message \emph{hygiene}, i.e.\ meaningful and well-formatted commit messages.
The name of the default branch is \texttt{development}, since that is what
the author likes to choose for new projects that are not yet stable (it is in
fact the default in the author's \texttt{.gitconfig}).
\n{2}{Commit signing}

Since git allows cryptographically \emph{signing} all commits, it would be
unwise not to take advantage of this. For the longest time, GPG was the only
method available for signing commits in git; however, that is no longer the
case~\cite{agwagitssh}. These days, it is also possible to both sign and
verify one's git commits (and tags!) using SSH keys, namely those produced by
OpenSSH (the same ones that can be used to log in to remote systems). The
author has, of course, not reused the key pair that is used to connect to
machines for signing commits. A different \texttt{Ed25519} elliptic curve key
pair has been used specifically for signing. The public component of this key
is enclosed with this thesis as an attachment for future reference.

The validity of a signature on a particular commit can be viewed with git
using the following commands (the \% sign denotes the shell prompt):
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
% cd <cloned project dir>
% git show --show-signature <commit>
% # alternatively:
% git verify-commit <commit>
\end{verbatim}
\end{varwidth}
\caption{Verifying signature of a git commit}
\label{fig:gitverif}
\end{figure}
There is one caveat to this, though: git first needs some additional
configuration for the code in Figure~\ref{fig:gitverif} to work as one would
expect. Namely, the public key used to verify the signature needs to be
stored in git's ``allowed signers file'', then git needs to be told where
that file is located using the configuration value
\texttt{gpg.ssh.allowedsignersfile}, and finally the configuration value of
the \texttt{gpg.format} field needs to be set to \texttt{ssh}.

Since git allows the configuration values to be local to each repository,
both of the mentioned issues can be solved by running the following commands
from inside the cloned repository:
\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key.
% cat >./tmp/.allowed_signers \
<<<'leo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKwshTdBgLzwY4d8N7VainZCngH88OwvPGhZ6bm87rBO'
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./tmp/.allowed_signers
\end{verbatim}
\end{varwidth}
\caption{Prepare allowed signers file and signature format for git}
\label{fig:gitsshprep}
\end{figure}
After the code in Figure~\ref{fig:gitsshprep} is run, everything from
Figure~\ref{fig:gitverif} should remain applicable for the lifetime of the
repository, or until git changes its implementation of signature
verification.

For future reference, git has been used in version \texttt{git version
2.40.1}.
\n{2}{Continuous Integration}

To increase both the author's and the public's confidence in the atomic
changes made over time, the author attempted to thoroughly \emph{integrate}
them using a continuous integration (CI) service that was plugged into the
main source code repository from the early stages of development. This, of
course, was again self-hosted, including the workers. The tool of choice was
Drone (\url{https://drone.io}), and the ``docker'' runner (in fact, it runs
any OCI container) was used to run the builds.

The way this runner works is that it creates an ephemeral container for every
pipeline step and executes the given \emph{commands} inside of it. At the end
of each step the container is discarded, while the repository, which is
mounted into each container's \texttt{/drone/src}, is persisted between
steps, allowing it to be cloned from \emph{origin} only at the start of the
pipeline and then shared for all of the following steps, saving bandwidth,
time and disk writes.
The entire configuration used to run the pipelines can be found in a file
named \texttt{.drone.yml} at the root of the main source code repository. The
workflow consists of three pipelines, which are run in parallel. Two main
pipelines are defined to build the binary and run tests on \texttt{x86\_64}
GNU/Linux targets, one for each of Arch and Alpine (version 3.17). These two
pipelines are identical apart from OS-specific bits such as installing a
certain package. For the record, other OS-architecture combinations were not
tested.

A third pipeline was defined to build a popular static analysis tool called
\texttt{golangci-lint}, which is a sort of meta-linter, bundling a staggering
number of linters (a linter is a tool that performs static code analysis and
can raise awareness of programming errors, flag potentially buggy code
constructs, or \emph{mere} stylistic errors), from sources, and to then
perform the analysis of the project's codebase using the freshly built
binary. If this step is successful, a handful of code analysis services get
pinged in the next steps to take notice of the changes to the project's
source code and update their metrics; details can be found in the main Drone
configuration file \texttt{.drone.yml}, and the configuration of
\texttt{golangci-lint} can be found in the root of the repository in the file
named \texttt{.golangci.yml}. The median build time as of writing was 1
minute, which includes running all three pipelines, and that is acceptable.
\obr{Drone CI median
build}{fig:drone-median-build}{.77}{graphics/drone-median-build}
\n{2}{Source code repositories}\label{sec:repos}

All of the pertaining source code was published in repositories on a publicly
available git server operated by the author, the reasoning \emph{pro}
self-hosting being that it is the preferred way of guaranteeing autonomy over
one's source code. The alternative, large silos owned by big corporations,
have a track record of arguably not always deciding with the users' best
interest in mind, acting on impulse or under public pressure (potentially at
least temporarily disrupting their users' operations), thus binding their
users to lengthy \emph{terms of service} that \emph{can change at any time}.
Granted, decentralisation can take a toll on the discoverability of the
project, but that is not of concern here.

The git repository containing the source code of the \texttt{pcmt}
project:\\
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.

The git repository hosting the \texttt{pcmt} configuration schema:\\
\url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.

The repository containing the \LaTeX{} source code of this thesis:\\
\url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.
\n{2}{Toolchain}

Throughout the creation of this work, the \emph{current} version of the Go
programming language was used, i.e.\ \texttt{go1.20}.

\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
\textbf{Name} & \textbf{Usage} \\
Go programming language & program core \\
Dhall configuration language & program configuration \\
Echo & HTTP handlers, controllers, web server \\
bluemonday & HTML sanitising \\
TailwindCSS & creating stylesheets using a utility-first approach \\
}

\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
\textbf{Name} & \textbf{Version} \\
\texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2 \\
\texttt{ent} (\url{https://entgo.io/}) & 0.11.10 \\
\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.23 \\
\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
}
\n{2}{A word about Go}

First, the question of \textit{`Why pick Go for building a web
application?'} might arise, so the following few lines will try to address
that.

Go~\cite{golang} (or \emph{Golang}, for SEO-friendliness) is a strongly
typed, high-level, \emph{garbage-collected} language where functions are
first-class citizens and errors are values.

The appeal for the author comes from a number of language features, such as
built-in support for concurrency and testing, sane \emph{zero} values, lack
of pointer arithmetic, inheritance and implicit type conversions,
easy-to-read syntax, and producing a statically linked binary by default; on
top of that, the language has got a cute mascot.
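The \emph{errors are values} point deserves a small illustration; the
function below is a made-up example, not part of \texttt{pcmt}:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePort illustrates the "errors are values" idiom: failures are
// returned as ordinary values and handled explicitly at the call site,
// rather than thrown as exceptions.
func parsePort(s string) (int, error) {
	p, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parsing port %q: %w", s, err)
	}
	if p < 1 || p > 65535 {
		return 0, errors.New("port out of range")
	}
	return p, nil
}

func main() {
	for _, s := range []string{"8080", "http", "70000"} {
		if p, err := parsePort(s); err != nil {
			fmt.Println("error:", err)
		} else {
			fmt.Println("port:", p)
		}
	}
}
```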
Due to the foresight of the Go Authors regarding \emph{the formatting
question} (i.e.\ where to put the braces, tabs vs.\ spaces, etc.), most
discussions on this topic have been foregone. Every
\emph{gopher}\footnote{euph.\ a person writing in the Go programming
language} is expected to format their source code with the official
formatter (\texttt{gofmt}), which automatically ensures that the code
adheres to the official formatting standards.
\n{2}{A word about Nix}

\url{https://builtwithnix.org/}
\n{2}{Configuration}

Every non-trivial program usually offers at least \emph{some} way to tweak
or manage its behaviour, and these changes are usually persisted
\emph{somewhere} on the filesystem of the host: in a local SQLite3 database,
a \emph{LocalStorage} key-value store in the browser, or a binary or plain
text configuration file. These configuration files need to be read and
checked at least on program start-up, and either stored in operating memory
for the duration of the runtime of the program, or loaded and parsed with
the memory subsequently \emph{freed} (initial configuration).

There is an abundance of configuration languages (or file formats used to
craft configuration files) available: TOML, INI, JSON and YAML, to name some
of the popular ones (as of today).

Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), while at its core being very similar to
JSON, which adds to the familiar feel. In fact, in Dhall's authors' own
words, it is ``a programmable configuration language that you can think of
as: JSON + functions + types + imports''~\cite{dhalllang}.
Among all of the listed features, the especially intriguing one to the author
was the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating, for instance, the declaration
and usage of custom types (which are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool},
\emph{Natural} or \emph{List}, to name just a few), so it was not exceedingly
hard to start designing a custom configuration \emph{schema} for the
program. Dhall not being a Turing-complete language also guarantees that
evaluation \emph{always} terminates eventually, which is a good attribute
for a configuration language to possess.
\n{3}{Schema}

The configuration schema was at first developed as part of the main
project's repository, before it was determined that it would benefit both
the development and overall clarity if the schema lived in its own
repository (see Section~\ref{sec:repos} for details).
\begin{figure}[h]
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
let Schema =
      { Type =
          { Host : Text
          , Port : Natural
          , HTTP :
              { Domain : Text
              , Secure : Bool
              , AutoTLS : Bool
              , TLSKeyPath : Text
              , TLSCertKeyPath : Text
              , HSTSMaxAge : Natural
              , ContentSecurityPolicy : Text
              , RateLimit : Natural
              , Gzip : Natural
              , Timeout : Natural
              }
          , Mailer :
              { Enabled : Bool
              , Protocol : Text
              , SMTPAddr : Text
              , SMTPPort : Natural
              , ForceTrustServerCert : Bool
              , EnableHELO : Bool
              , HELOHostname : Text
              , Auth : Text
              , From : Text
              , User : Text
              , Password : Text
              , SubjectPrefix : Text
              , SendPlainText : Bool
              }
          , LiveMode : Bool
          , DevelMode : Bool
          , AppPath : Text
          , Session :
              { CookieName : Text
              , CookieAuthSecret : Text
              , CookieEncrSecret : Text
              , MaxAge : Natural
              }
          , Logger : { JSON : Bool, Fmt : Optional Text }
          , Init : { CreateAdmin : Bool, AdminPassword : Text }
          , Registration : { Allowed : Bool }
          }
      , default = {=}
      }

in Schema
\end{verbatim}
\end{varwidth}
\caption{Dhall configuration schema version 0.0.1-rc.1}
\label{fig:dhallschema}
\end{figure}
\n{3}{Safety considerations}

With a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code,
it is important that there are some safety mechanisms employed that the user
can rely on. Dhall offers this in multiple features: enforcing a same-origin
policy and (optionally) pinning a cryptographic hash of the value of the
expression being imported.
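A frozen import might look as follows; the URL and the hash below are
placeholders for illustration only. With a real import, Dhall refuses to
evaluate the expression if its normalised form does not match the pinned
hash:

```dhall
-- Import a remote expression pinned to a sha256 hash of its normal form.
-- Both the URL and the hash shown here are illustrative placeholders.
let Schema =
      https://example.com/pcmt/schema.dhall
        sha256:0000000000000000000000000000000000000000000000000000000000000000

in  Schema
```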
\n{3}{Possible alternatives}

While developing the program, the author has also come across certain
shortcomings of Dhall, namely its long start-up with a \emph{cold cache},
which can generally be observed when running the program in a
\emph{container}.

When performing an evaluation, Dhall resolves every expression down to a
combination of its most basic types (eliminating all abstraction and
indirection) in a process called \textbf{normalisation}~\cite{dhallnorm} and
then saves this result in the host's cache. The \texttt{dhall-haskell} binary
attempts to resolve the variable \texttt{XDG\_CACHE\_HOME} (see the \emph{XDG
Base Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where}
the results of the normalisation will be written for repeated use. Note that
this behaviour has been observed on a GNU/Linux host and the author has not
verified it on a non-GNU/Linux host.
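The lookup just described follows the usual XDG fallback logic, which can be
reproduced in a small shell sketch (the \texttt{dhall} and
\texttt{dhall-haskell} subdirectory names are the ones observed on GNU/Linux):

```shell
# Resolve the directory dhall-haskell writes normalisation results to:
# $XDG_CACHE_HOME if it is set, otherwise the XDG default of ~/.cache.
cache_root="${XDG_CACHE_HOME:-$HOME/.cache}"
echo "dhall cache:         $cache_root/dhall"
echo "dhall-haskell cache: $cache_root/dhall-haskell"
```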

If normalisation is performed inside an ephemeral container (as opposed to,
for instance, an interactive desktop session), the results effectively get
lost on each container restart, which is both wasteful and bad for user
experience, since the normalisation of just a handful of imports can take
upwards of two minutes, during which the user is left waiting for the hanging
application.

Workarounds for the above problem can be devised relatively easily: bind
mount volumes inside the container in place of
\texttt{XDG\_CACHE\_HOME/dhall} and \texttt{XDG\_CACHE\_HOME/dhall-haskell}
to preserve the cache between restarts, or pre-compute the cache during the
container build, since the application is only really expected to run
together with a compatible version of the configuration schema and this
version \emph{is} known at container build time. Still, it would certainly
feel better if there was no need to work \emph{around} the configuration
system of choice.

Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves as a
potentially almost drop-in replacement for Dhall feature-wise, while also
avoiding the issue described above (costly normalisation operations with a
\emph{cold cache}).

\n{2}{Production}

It is, of course, recommended that the application runs in a secure
environment, although definitions of that almost certainly differ depending
on who you ask. A general recommendation is to effectively reserve the
machine for a single use case, running this program, so as to dramatically
decrease the potential attack surface of the host. If the host does not need
management access (it is a deployed-to-only machine that is configured
out-of-band, such as with a \emph{golden} image/container), then do not run
SSH on it. In an ideal scenario, the host machine would have as little
software installed as possible besides what the application absolutely
requires.

\n{1}{Application architecture}

\n{2}{Data integrity and authenticity}

The user can interact with the application via a web client, such as a
browser, and is required to authenticate for all sensitive operations. To not
only know \emph{who} the user is but also make sure they are \emph{permitted}
to perform the action they are attempting, the program employs an
\emph{authorisation} mechanism in the form of sessions. On the client side,
these are represented by cryptographically signed and encrypted (using
256-bit AES) cookies. That lays the foundation for a few things: the data
saved into the cookies can be regarded as private, because short of future
\emph{quantum computers} only the program itself can decrypt and access it,
and the data can be trusted, since it is both signed using a key that only
the program controls and \emph{encrypted} with \emph{another} key that
equally only the program holds.

The cookie data is only ever written \emph{or} read on the server side, which
supports the author's decision to encrypt it, as there is no point in leaving
it unencrypted for some perceived client-side simplification. Users
navigating the website send their session cookie (if it exists) with
\textbf{every request} to the server, which then verifies the integrity of
the data and, in case it is valid, determines the existence and extent of the
user privileges that should be granted. Public endpoints by definition do not
mandate the presence of a valid session, while at protected endpoints the
user is authenticated on every request. When a session expires, or if there
is no session to begin with, the user is either shown a \emph{Not found} or
\emph{Unauthorised} error message, or redirected to \texttt{/signin}.

Another aspect that contributes to data integrity is the use of database
\emph{transactions} to bundle together multiple database operations that
collectively change the \emph{state}. In transactional jargon, the data is
only \emph{committed} if each individual change was successful. In case of
any error, the database is instructed to perform an atomic \emph{rollback},
which brings it back to the state before the changes were ever attempted.

The author has additionally considered utilising an embedded immutable
database like immudb (\url{https://immudb.io}) for record keeping (verifiably
storing data change history) and additional data integrity checks, e.g.\ for
tamper protection purposes; however, that work remains yet to be
materialised.

\n{2}{Transport security}

Users connecting to the application should rightfully expect their data to be
protected \textit{in transit} (i.e.\ on the way between their browser and the
server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims to be) and
negotiate a symmetric key for encryption in a process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications channel. The operator should
configure the program to either directly utilise TLS via its configuration or
have it listen behind a TLS-terminating \emph{reverse proxy}.

\n{2}{User isolation}

Users are allowed into certain parts of the application based on the role
they currently possess. For the moment, two basic roles were envisioned,
while this list might get amended in the future, if the need arises:

\begin{itemize}
\item Administrator
\item User
\end{itemize}

It is paramount that the program protects itself from insider threats as
well, and therefore each role is only able to perform the actions that it is
explicitly assigned. While there definitely is a certain overlap between the
capabilities of the two outlined roles, each also possesses unique features
that the other does not.

For example, the administrator role is not able to perform searches on the
breach data directly using their administrator account; for that, a separate
user account has to be devised. Similarly, the regular user is not able to
manage breach lists and other users, because those are privileged operations.

In-application administrators are not able to view sensitive (in fact, any)
user data and should therefore only be able to perform the following actions:

\begin{itemize}
\item Create user accounts
\item View the list of users
\item View user email
\item Change user email
\item Toggle whether a user is an administrator
\item Delete user accounts
\end{itemize}

Consider the case of users managing their own accounts: while demoting
oneself from administrator to regular user is permitted, promoting oneself to
administrator would constitute a \emph{privilege escalation} and would likely
be a precursor to at least a \emph{denial of service} of sorts.
\n{2}{Zero trust principle}

\textit{Data confidentiality, i.e.\ not trusting the provider}

There is no way for the application (and consequently, the in-application
administrator) to read users' data. This is achieved by encrypting the
pertinent data before saving it in the database with a state-of-the-art
\emph{age} key~\cite{age} (backed by X25519~\cite{x25519rfc7748}), which is
in turn safely stored encrypted with a passphrase that only the user
controls. Of course, the user-supplied password is run through a
password-based key derivation function (PBKDF) before it is allowed to
encrypt the \emph{age} key.

The \emph{age} key is only generated when the user changes their password for
the first time, to prevent scenarios such as an in-application administrator
with access to the physical database being able to both \textbf{recover} the
key from the database and \textbf{decrypt} it, given that they already know
the user's password (because they set it), which would subsequently give them
unbounded access to any future encrypted data for as long as they were able
to maintain their database access. This is why the \emph{age} key generation
and protection are bound to the first password change. Of course, an evil
administrator could just perform the change themselves; however, the user
would at least be able to find those changes in the activity logs and know
not to use the application. And given the scenario of a total database
compromise, the author finds that all hope is already lost at that point.

Consequently, neither the application operators nor the in-application
administrators should ever be able to learn the details of what the user is
tracking, the same being applicable even to potential attackers with direct
access to the database. Thus the author maintains that every scenario that
could potentially lead to a data breach (apart from a compromised user
machine and the like) would have to entail some form of operating memory
acquisition, for instance using \texttt{LiME}~\cite{lime}, or perhaps
directly the \emph{hypervisor}, if considering virtualised (``cloud'')
environments.

\n{1}{Implementation}

\n{2}{Compromise checking}

\n{3}{Have I Been Pwned? Integration}

TODO

\n{3}{Local Dataset Plugin}

Breach data from locally available datasets can be imported into the
application by first making sure it adheres to the specified schema (see the
\emph{breach data schema} in Figure~\ref{fig:breachDataGoSchema}). If it does
not (which is very likely with random breach data), it needs to be converted
into a conforming form before being imported into the application, e.g.\
using a simple Python script or similar. Attempting to import data that does
not follow the outlined schema results in an error.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
type breachDataSchema struct {
    Name              string
    Time              time.Time
    IsVerified        bool
    ContainsPasswords bool
    ContainsHashes    bool
    HashType          string
    HashSalted        bool
    HashPeppered      bool
    ContainsUsernames bool
    ContainsEmails    bool
    Data              any
}
\end{verbatim}
\end{varwidth}

\caption{Breach Data Schema represented as a Go struct; imports from the
standard library are assumed}
\label{fig:breachDataGoSchema}
\end{figure}

The above Go representation will in actuality be written and supplied by the
user of the program as a YAML document. YAML was chosen for multiple reasons:
relative ease of use (plain text, readable, can include comments, most of the
inputs are implicitly typed as strings) and its capability to store multiple
\emph{documents} inside a single file. That should allow for documents
similar to what can be seen in Figure~\ref{fig:breachDataYAMLSchema}.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
---
name: Horrible breach
time: 2022-04-23T00:00:00+02:00
isVerified: false
containsPasswords: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    -
    - email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{verbatim}
\end{varwidth}

\caption{Example Breach Data Schema supplied to the program as a YAML file,
optionally containing multiple documents}
\label{fig:breachDataYAMLSchema}
\end{figure}

Notice how the emails list in Figure~\ref{fig:breachDataYAMLSchema} misses
one record, perhaps because it was not supplied. This is a valid scenario and
the application needs to be able to handle it. The alternative would be to
require the user to prepare the data in such a way that empty/partial records
are dropped entirely.

\n{2}{Best practices}

Describe built-in fail-overs and fallback mechanisms (request retries),
collecting exposed application metrics and alerting based on their status.

Observability: in-app tracing collected by e.g.\ Jaeger
(\url{https://jaegertracing.io}) or OpenTelemetry
(\url{https://opentelemetry.io/}) for superior performance insights on CPU,
storage, network latency and jitter.
\n{2}{Database configuration}

Describe tuning for performance while also properly securing the DB.

The plan is to have two databases: one for raw data
(\url{https://postgresql.org}) and another one for checksum validation
(\url{https://immudb.io/}).
\n{2}{Deployment recommendations}

Describe in detail the following:
\begin{itemize}
\item running behind a reverse proxy (TLS termination) or inside a private
network
\item reasonable host security
\end{itemize}

\n{3}{Containerisation}

Whether the pre-built or a custom container image is used to deploy the
application, it still needs access to secrets, such as the database
connection string (containing the database host, port, user,
password/encrypted password, authentication method and database name).

Currently, the application is able to handle \emph{peer},
\emph{scram-sha-256}, \emph{user name maps} and raw \emph{password} as
Postgres authentication methods~\cite{pgauthmethods}, although the
\emph{password} option should not be used in production \emph{unless} the
connection to the database is protected by TLS.\ In any case, using the
\emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable, and one
way to verify in a development environment that everything works as intended
is the \emph{Password generator for PostgreSQL} tool~\cite{goscramsha256},
which allows obtaining the encrypted string from raw user input.
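On the server side, the method in effect is selected per connection type in
\texttt{pg\_hba.conf}; the database name, user and address in this sketch are
illustrative:

```
# TYPE  DATABASE  USER     ADDRESS       METHOD
host    app       appuser  10.88.0.0/16  scram-sha-256
# peer authentication only applies to local (Unix socket) connections:
local   app       appuser                peer
```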

If the application running in a container wants to use the \emph{peer}
authentication method, it is up to the operator to supply the Postgres socket
to the application (e.g.\ as a volume bind mount). This scenario was not
tested, however, and the author is also not entirely certain how \emph{user
namespaces} (on GNU/Linux) would influence the process (given that the
\emph{ID}s of a user \textbf{outside} the container are mapped to a range of
\emph{UID}s \textbf{inside} the container), which the setup would likely need
to account for.

Equally, if the application is running inside a container, the operator needs
to make sure that the database is either running in a network that is also
directly attached to the container, or that there is a mechanism in place
that routes requests for the database hostname to the destination.

One such mechanism is container-name-based routing inside \emph{pods}
(Podman/Kubernetes), where the resolution of container names is the
responsibility of a specially configured piece of software: Aardvark for the
former and CoreDNS for the latter.

\n{1}{Validation}

\n{2}{Unit tests}

\n{2}{Integration tests}

\n{2}{Click-ops}

% =========================================================================== %
\nn{Conclusion}

% =========================================================================== %