
tex: add stuff, reword stuff, move sections around

This commit is contained in:
leo 2023-05-24 16:47:18 +02:00
parent 2708d9509b
commit 966261f799
Signed by: wanderer
SSH Key Fingerprint: SHA256:Dp8+iwKHSlrMEHzE3bJnPng70I7LEsa3IJXRH/U+idQ
2 changed files with 177 additions and 139 deletions

@ -17,6 +17,7 @@ TCP & Transmission Control Protocol \\
SSH & Secure Shell \\
DNS & Domain Name System \\
ZSTD & Zstandard \\
ZFS & The Zettabyte File System \\
ISP & Internet Service Provider \\
GPG & GNU Privacy Guard \\
GNU & GNU's Not Unix! \\
@ -35,6 +36,10 @@ INI & Initialization file \\
CPU & Central Processing Unit \\
RAM & Random Access Memory \\
NVMe & Non-Volatile Memory Express \\
PCIe & Peripheral Component Interconnect Express \\
HPC & High Performance Computing \\
OOM & Out of Memory \\
OWASP & Open Web Application Security Project \\
NIST & National Institute of Standards and Technology \\

@ -353,7 +353,7 @@ All of the code written has been typed into VIM (\texttt{9.0}), the shell used
to run the commands was ZSH, both running in the author's terminal emulator of
choice - \texttt{kitty} on a \raisebox{.8ex}{\texttildelow}8 month (at the time
of writing) installation of \textit{Arch Linux (by the way)} using a
\texttt{6.3.1-wanderer-zfs-xanmod1} variant of the Linux kernel.
\texttt{6.3.x-wanderer-zfs-xanmod1} variant of the Linux kernel.
\n{1}{Development}
@ -430,7 +430,11 @@ inside of the cloned repository:
After the code in Figure~\ref{fig:gitsshprep} is run, everything from the
Figure~\ref{fig:gitverif} should remain applicable for the lifetime of the
repository or until git changes implementation of signature verification.
repository or until git changes its implementation of signature verification.
The git \texttt{user.name} that can be seen on the commits in the
\textbf{Author} field is named after the machine that was used to develop the
program, since the author uses a different signing key on each machine. That
way, the committing machine can be determined post hoc.
For future reference, the git version used was \texttt{git version 2.40.1}.
@ -482,9 +486,11 @@ container, although the latter action is only performed on feature branches,
The median build time as of writing was 1 minute, which includes running all
four pipelines, and that is acceptable. Build times might of course vary
depending on the hardware, for reference, these builds are run on a machine
equipped with a Zen 3 Ryzen 5 5600 CPU with nominal clock times, DDR4@3200MHz
RAM, a couple of PCIe Gen 4 SSDs in a mirror setup and a 400MiB downlink.
depending on the hardware; for reference, these builds were run on a machine
equipped with a Zen 3 Ryzen 5 5600 CPU at stock clock speeds, DDR4
3200MHz RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS)
and a 400MiB downlink, software-wise running Arch with an author-flavoured
Xanmod kernel version 6.3.x.
\obr{Drone CI median build
time}{fig:drone-median-build}{.77}{graphics/drone-median-build}
@ -496,12 +502,13 @@ All of the pertaining source code was published in repositories on a publicly
available git server operated by the author, the reasoning \emph{pro}
self-hosting being that it is the preferred way of guaranteed autonomy over
one's source code, as opposed to large silos owned by big corporations having a
track record of arguably not always deciding with user's best interest in mind,
acting on impulse or under public pressure (potentially at least temporarily
disrupting their user's operations), thus beholding their user to their lengthy
\emph{terms of service} that \emph{can change at any time}. Granted,
decentralisation can take a toll on discoverability of the project, but that is
not of concern here.
track record of arguably not always deciding with users' best interests in mind
(although recourse has been observed~\cite{ytdl}), acting on impulse or under
public pressure (potentially at least temporarily disrupting their users'
operations), thus not only binding their users to their lengthy \emph{terms of
service} that \emph{can change at any time}, but also leaving them exposed to
factors outside their control. Granted, decentralisation can take a toll on the
discoverability of the project, but that is not a concern here.
The git repository containing source code of the \texttt{pcmt} project:\\
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.
@ -563,6 +570,11 @@ formatting standards.
\n{2}{A word about Nix}
\url{https://builtwithnix.org/}
\n{1}{Application architecture}
\n{1}{Implementation}
\n{2}{Configuration}
Every non-trivial program usually offers at least \emph{some} way to
@ -575,16 +587,16 @@ duration of the runtime of the program, or loaded and parsed and the memory
subsequently \emph{freed} (initial configuration).
There is an abundance of configuration languages (or file formats used to craft
configuration files) available, TOML, INI, JSON, YAML, to name some of the
popular ones (as of today).
configuration files, whether they were intended for it or not) available:
TOML, INI, JSON, YAML, to name some of the popular ones (as of today).
Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), and in core being very similar to JSON,
which adds to familiar feel. In fact, in Dhall's authors' own words it is: ``a
programmable configuration language that you can think of as: JSON + functions
+ types + imports''~\cite{dhalllang}.
few of its concepts from Haskell), and at its apparent core being very similar
to JSON, which adds to the familiar feel. In fact, in Dhall's authors' own
words, it is: ``a programmable configuration language that you can think of as:
JSON + functions + types + imports''~\cite{dhalllang}.
Among all of the listed features, the especially intriguing one to the author
was the promise of \emph{types}. There are multiple examples directly on the
@ -614,27 +626,7 @@ side of the equals sign, which has for practical reasons been trimmed and is
displayed without the \emph{default} block, which is instead shown in its own
Figure~\ref{fig:dhallschemadefaults}.
The main configuration is comprised of both raw attributes and child records,
which allows for grouping of related functionality. For instance, configuration
settings pertaining mailserver setup are grouped in a record named
\textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool},
which was deemed appropriate for a on-off switch-like functionality, with the
only permissible values being either \emph{True} or \emph{False}. Do note that
in Dhall $true != True$, since \textbf{True} is internally a Bool constant,
which is built into Dhall (check out ``The Prelude''~\cite{dhallprelude}),
while \textbf{true} is evaluated as an \emph{unbound} variable, that is, a
variable \emph{not} defined in the current \emph{scope} and thus not
\emph{present} in the current scope.
Another one of specialties of Dhall is that $==$ and $!=$ equality operators
only work on values of type \texttt{Bool}, which for example means that
variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text}
(\texttt{string}) cannot be compared directly as in other languages, which
either leaves the work for a higher-level language (such as Go), or from the
perspective of the Dhall authors, \emph{enums} are promoted when the value
matters.
\begin{figure}[h]
\begin{figure}[!h]
\begin{varwidth}
\scriptsize
\begin{verbatim}
@ -688,7 +680,39 @@ matters.
\caption{Dhall configuration schema version 0.0.1-rc.2}
\label{fig:dhallschema}
\end{figure}
\newpage
The main configuration is comprised of both raw attributes and child records,
which allow for grouping of related functionality. For instance, configuration
settings pertaining to mailserver setup are grouped in a record named
\textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool},
which was deemed appropriate for an on-off switch-like functionality, with the
only permissible values being either \emph{True} or \emph{False}. Do note that
in Dhall $true \neq True$, since \textbf{True} is internally a Bool constant
built into Dhall (check out ``The Prelude''~\cite{dhallprelude}), while
\textbf{true} is evaluated as an \emph{unbound} variable, that is, a variable
\emph{not} defined and thus not \emph{present} in the current \emph{scope}.

Another of Dhall's specialties is that the \texttt{==} and \texttt{!=} equality
operators only work on values of type \texttt{Bool}, which for example means
that variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text}
(\texttt{string}) cannot be compared directly as in other languages, which
either leaves the work to a higher-level language (such as Go) or, from the
perspective of Dhall's authors, promotes \emph{enums} when the value matters.
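Once the configuration has been decoded into Go, such comparisons become trivial. The following sketch assumes a hypothetical, trimmed-down version of the configuration schema; the field names are illustrative only, not the actual schema:

```go
package main

import "fmt"

// Mailer mirrors a hypothetical child record grouping mail settings.
type Mailer struct {
	Enabled bool // Dhall Bool decodes onto a Go bool
}

// Config is an illustrative, trimmed-down view of the main configuration.
type Config struct {
	Port   uint64 // Dhall Natural maps onto an unsigned integer
	Host   string // Dhall Text maps onto a string
	Mailer Mailer
}

func main() {
	cfg := Config{Port: 3000, Host: "localhost", Mailer: Mailer{Enabled: true}}

	// What Dhall itself cannot express with == / != (those operators are
	// defined for Bool only) is done here in Go after decoding:
	if cfg.Port != 443 {
		fmt.Println("not listening on the default HTTPS port")
	}
	if cfg.Mailer.Enabled {
		fmt.Println("mailer is enabled")
	}
}
```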
\n{3}{Safety considerations}

With a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that some safety mechanisms the user can rely on are employed.
Dhall offers this in multiple features: enforcing a same-origin policy and
(optionally) pinning a cryptographic hash of the value of the expression being
imported.
\begin{figure}[!h]
\begin{varwidth}
\scriptsize
@ -763,43 +787,36 @@ matters.
\end{figure}
\n{3}{Safety considerations}
Having a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are some safety mechanisms employed, which can be
relied on by the user. Dhall offers this in multiple features: enforcing a
same-origin policy and (optionally) pinning a cryptographic hash of the value
of the expression being imported.
\n{3}{Possible alternatives}
While developing the program, the author has also come across certain
shortcomings of Dhall, namely long start-up with \emph{cold cache}, which can
generally be observed in the scenario of running the program in a
\emph{container}.
If we want to describe the way Dhall works when performing an evaluation, it
resolves every expression down to a combination of its most basic types
(eliminating all abstraction and indirection) in the process called
\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the hosts
cache. The \texttt{dhall-haskell} binary attempts to resolve the variable
\texttt{XDG\_CACHE\_HOME} (have a look at \emph{XDG Base Directory
While developing the program, the author has also come across certain
shortcomings of Dhall, namely a long start-up with a \emph{cold cache}, which
can generally be observed when running the program in an environment that does
not allow writing the cache files (a read-only filesystem), or does not keep
the written cache files, such as a container that is not configured to mount a
persistent volume at the pertinent location.
When performing an evaluation, Dhall resolves every expression down to a
combination of its most basic types (eliminating all abstraction and
indirection) in a process called
\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the
host's cache. The \texttt{dhall-haskell} binary attempts to resolve the
variable \texttt{XDG\_CACHE\_HOME} (have a look at \emph{XDG Base Directory
Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the results of
the normalisation will be written for repeated use. Do note that this
behaviour has been observed on a GNU/Linux host and the author has not verified
this behaviour on a non-GNU/Linux host.
this behaviour on a non-GNU/Linux host, such as FreeBSD.
If normalisation is performed inside an ephemeral container (as opposed to, for
instance, an interactive desktop session), the results effectively get lost on
each container restart, which is both wasteful and not great for user
experience, since the normalisation of just a handful of imports can take an
upside of 2 minutes, during which the user is left waiting for the hanging
application.
experience, since the normalisation of just a handful of imports (which
internally branch widely) can take upwards of two minutes, during which the
user is left waiting for the hanging application with no reporting on the
progress or current status.
While workarounds for the above-mentioned problem can be devised relatively
easily (bind mount volumes inside the container in place of the
easily (bind mount persistent volumes inside the container in place of the
\texttt{XDG\_CACHE\_HOME/dhall} and \texttt{XDG\_CACHE\_HOME/dhall-haskell} to
preserve the cache between restarts, or let the cache be pre-computed during
container build, since the application is only really expected to run together
@ -809,25 +826,10 @@ there was not need to work \emph{around} the configuration system of choice.
Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as a potentially almost drop-in replacement for Dhall feature-wise, while also
resolving the subject issue that was described with Dhall (costly normalisation
operations with \emph{cold cache}).
resolving the costly \emph{cold cache} normalisation operations, which is, in
the author's view, Dhall's principal issue.
\n{2}{Production}
It is, of course, recommended that the application runs in a secure
environment, although definitions of that almost certainly differ depending on
who you ask. General recommendations would be to effectively reserve a machine
for a single use case - running this program - so as to dramatically decrease
the potential attack surface of the host. If the host does not need management
access (it is a deployed-to-only machine that is configured out-of-band, such
as with a \emph{golden} image/container), then do not run SSH on it. In an
ideal scenario, the host machine would have as little software installed as
possible besides what the application absolutely requires.
\n{1}{Application architecture}
\n{2}{Data integrity and authenticity}
The user can interact with the application via a web client, such as a browser,
@ -835,12 +837,12 @@ and is required to authenticate for all sensitive operations. To not only know
\emph{who} the user is but also make sure they are \emph{permitted} to perform
the action they are attempting, the program employs an \emph{authorisation}
mechanism in the form of sessions. These are on the client side represented by
cryptographically signed and encrypted (using 256 bit AES) cookies. That lays
foundations for a few things: the data saved into the cookies can be regarded
as private because short of future \emph{quantum computers} only the program
itself can decrypt and access the data, and the data can be trusted since it is
both signed using the key that only the program controls and \emph{encrypted}
with \emph{another} key that equally only the program holds.
cryptographically signed and encrypted (using 256-bit AES) HTTP cookies. That
lays the foundations for a few things: the data saved into the cookies can be
regarded as private, because short of future \emph{quantum computers} only the
program itself can decrypt and access it, and the data can be trusted, since it
is both signed using a key that only the program controls and \emph{encrypted}
with \emph{another} key that equally only the program holds.
The cookie data is only ever written \emph{or} read at the server side,
solidifying the author's decision to let it be encrypted, as there is no point
@ -869,21 +871,6 @@ e.g.\ for tamper protection purposes and similar, however, that work remains
yet to be materialised.
\n{2}{Transport security}
Users connecting to the application should rightfully expect their data to
be protected \textit{in transit} (i.e.\ on the way between their browser and
the server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims it is) and
negotiate a symmetric key for encryption in the process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications connection. The operator should
configure the program to either directly utilise TLS using configuration or
have it listen behind a TLS-terminating \emph{reverse proxy}.
\n{2}{User isolation}
Users are allowed into certain parts of the application based on the role they
@ -960,21 +947,22 @@ for instance using \texttt{LiME}~\cite{lime}, or perhaps directly the
\emph{hypervisor}, if considering a virtualised (``cloud'') environments.
\n{1}{Implementation}
\n{2}{Compromise Monitoring}
\n{3}{Have I Been Pwned? Integration}
TODO
\n{3}{Local Dataset Plugin}
Breach data from locally available datasets can be imported into the
application by first making sure it adheres to the specified schema (have a
look at the \emph{breach data schema} in Figure~\ref{fig:breachDataGoSchema}).
If it doesn't (which is very likely with random breach data), it needs to be
converted to a form that does before importing it to the application, e.g.\
using a simple Python script or similar. Attempting to import data that does
not follow the outlined schema would result in an error.
\n{3}{Local Dataset Plugin} Breach data from locally available datasets can be
imported into the application by first making sure it adheres to the specified
schema (have a look at the \emph{breach data schema} in
Figure~\ref{fig:breachDataGoSchema}). If it doesn't (which is very likely with
random breach data), it needs to be converted to a form that does before
importing it to the application, e.g.\ using a Python script or similar.
Attempting to import data that does not follow the outlined schema would result
in an error. Also, importing a dataset that is over a reasonable size limit
would by default be rejected by the program as a precaution, since
unmarshalling e.g.\ a 1 TiB document would likely result in an OOM situation on
the host, assuming regular consumer hardware conditions, not HPC.
\begin{figure}[h]
\centering
@ -1001,12 +989,15 @@ not follow the outlined schema would result in an error.
\label{fig:breachDataGoSchema}
\end{figure}
The above Go representation will in actuality be written and supplied by the
user of the program as a YAML document. YAML was chosen for multiple reasons:
relative ease of use (plain text, readable, can include comments, most of the
inputs are implicitly typed as strings), and its capability to store multiple
\emph{documents} inside of a single file. That should allow for documents
similar to what can be seen in Figure~\ref{fig:breachDataYAMLSchema}.
The Go representation shown in Figure~\ref{fig:breachDataGoSchema} will in
actuality be written and supplied by the user of the program as a YAML
document. YAML was chosen for multiple reasons: relative ease of use (plain
text, readable, with support for comments), most of the inputs being implicitly
typed as strings, its capability to store multiple \emph{documents} inside of a
single file and, thanks to being a superset of JSON, machine readability. That
should allow for documents similar to what can be seen in
Figure~\ref{fig:breachDataYAMLSchema} to be ingested by the program, and read
and written by humans and programs alike.
\begin{figure}[h]
\centering
@ -1045,32 +1036,74 @@ similar to what can be seen in Figure~\ref{fig:breachDataYAMLSchema}.
\end{figure}
Notice how the emails list in Figure~\ref{fig:breachDataYAMLSchema} misses one
record, perhaps because it was not supplied. This is a valid scenario and the
application needs to be able to handle it. The alternative would be to require
the user to prepare the data in such a way that the empty/partial records would
be dropped entirely.
record, perhaps because it was not supplied or mistakenly omitted. This is a
valid scenario (mistakes happen) and the application needs to be able to handle
it. The alternative would be to require the user to prepare the data in such a
way that the empty/partial records would be dropped entirely.
\n{2}{Best practices}
Describe built-in fail-overs and fallback mechanisms (request retries),
collecting exposed app metrics and alerting based on their status.
Observability - in-app tracing collected by e.g.\ Jaeger
(\url{https://jaegertracing.io}) or \url{https://opentelemetry.io/} - for
superior performance insights on CPU, storage, network latency and jitter.
\n{2}{Database configuration}
Describe tuning for performance while also properly securing the DB.
The plan is to have two databases: one for raw data
(\url{https://postgresql.org}) and another for checksum validation
(\url{https://immudb.io/}).
The database schema is not created manually in the database; instead, an
Object-Relational Mapping (ORM) tool named ent is used. This allows defining
the table schema and relations entirely in Go. The best part about ent is that
there is no need to define supplemental methods on the models, since ent
employs \emph{code generation}, which creates these based on the types of the
attributes in the model and the respective relations. For instance, if an
attribute is a string value \texttt{Email}, ent can be used to generate code
that contains methods on the user object like the following:
\begin{itemize}
\item EmailIn
\item EmailEQ
\item EmailNEQ
\item EmailHasSuffix
\end{itemize}
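To give an idea of what such generated code amounts to, the sketch below imitates the predicates that ent derives from a string attribute named \texttt{Email}; the real generated code lives in the ent-managed packages and differs in detail:

```go
package main

import (
	"fmt"
	"strings"
)

// predicate imitates ent's per-field predicate functions; the real
// generated predicates operate on SQL query builders, not plain strings.
type predicate func(email string) bool

func EmailEQ(v string) predicate {
	return func(email string) bool { return email == v }
}

func EmailNEQ(v string) predicate {
	return func(email string) bool { return email != v }
}

func EmailIn(vs ...string) predicate {
	return func(email string) bool {
		for _, v := range vs {
			if email == v {
				return true
			}
		}
		return false
	}
}

func EmailHasSuffix(v string) predicate {
	return func(email string) bool { return strings.HasSuffix(email, v) }
}

func main() {
	p := EmailHasSuffix("@example.com")
	fmt.Println(p("user@example.com"), p("user@example.org"))
}
```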
\n{1}{Production}
It is, of course, recommended that the application runs in a secure
environment, although definitions of that almost certainly differ depending on
who you ask. General recommendations would be either to effectively reserve a
machine for a single use case - running this program - so as to dramatically
decrease the potential attack surface of the host, or run the program isolated
in a container or a virtual machine. Further, if the host does not need
management access (it is a deployed-to-only machine that is configured
out-of-band, such as with a \emph{golden} image/container or declaratively with
Nix), then an SSH \emph{daemon} should not be running in it, since it is not
needed. In an ideal scenario, the host machine would have as little software
installed as possible besides what the application absolutely requires.
A demonstration of the above can be found in the multi-stage Containerfile that
is available in the main sources. The resulting container image only contains a
statically linked copy of the program, a default configuration file and
corresponding Dhall expressions cached at build time, which only support the
main configuration file. Since the program also needs a database, an example
scenario could include the container being run in a Podman pod together with
the database, which would not have to be exposed from the pod and would
therefore only be available over \texttt{localhost}.
It goes without saying that the operator should substitute values of any
default configuration secrets with new ones that were securely generated.
\n{2}{Deployment recommendations}
Describe in detail the following:
\begin{itemize}
\item behind a reverse proxy (TLS termination) or inside of a private network
\item reasonable host security
\end{itemize}
\n{3}{Transport security}
Users connecting to the application should rightfully expect their data to
be protected \textit{in transit} (i.e.\ on the way between their browser and
the server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims it is) and
negotiate a symmetric key for encryption in the process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications connection. The operator should
configure the program to either directly utilise TLS using configuration or
have it listen behind a TLS-terminating \emph{reverse proxy}.
\n{3}{Containerisation}
Whether the pre-built or a custom container image is used to deploy the