diff --git a/tex/abbreviations.tex b/tex/abbreviations.tex index 123378b..78323e4 100644 --- a/tex/abbreviations.tex +++ b/tex/abbreviations.tex @@ -17,6 +17,7 @@ TCP & Transmission Control Protocol \\ SSH & Secure Shell \\ DNS & Domain Name System \\ ZSTD & Z Standard \\ +ZFS & The Zettabyte File System \\ ISP & Internet Service Provider \\ GPG & GNU Privacy Guard \\ GNU & GNU's Not Unix! \\ @@ -35,6 +36,10 @@ INI & Initialization file \\ CPU & Central Processing Unit \\ RAM & Random Access Memory \\ +NVMe & Non-Volatile Memory Express \\ +PCIe & Peripheral Component Interconnect Express \\ +HPC & High Performance Computing \\ +OOM & Out of Memory \\ OWASP & Open Web Application Security Project \\ NIST & National Institute of Standards and Technology \\ diff --git a/tex/text.tex b/tex/text.tex index 7c3e95f..f84c869 100644 --- a/tex/text.tex +++ b/tex/text.tex @@ -353,7 +353,7 @@ All of the code written has been typed into VIM (\texttt{9.0}), the shell used to run the commands was ZSH, both running in the author's terminal emulator of choice - \texttt{kitty} on a \raisebox{.8ex}{\texttildelow}8 month (at the time of writing) installation of \textit{Arch Linux (by the way)} using a -\texttt{6.3.1-wanderer-zfs-xanmod1} variant of the Linux kernel. +\texttt{6.3.x-wanderer-zfs-xanmod1} variant of the Linux kernel. \n{1}{Development} @@ -430,7 +430,11 @@ inside of the cloned repository: After the code in Figure~\ref{fig:gitsshprep} is run, everything from the Figure~\ref{fig:gitverif} should remain applicable for the lifetime of the -repository or until git changes implementation of signature verification. +repository or until git changes implementation of signature verification. The +git \texttt{user.name} that can be seen on the commits in the \textbf{Author} +field is named after the machine that was used to develop the program, since +the author uses different signing keys on each machine. That way the committer +machine can be determined post-hoc. 
For future reference, git has been used in the version \texttt{git version 2.40.1}. @@ -482,9 +486,11 @@ container, although the latter action is only performed on feature branches, The median build time as of writing was 1 minute, which includes running all four pipelines, and that is acceptable. Build times might of course vary -depending on the hardware, for reference, these builds are run on a machine -equipped with a Zen 3 Ryzen 5 5600 CPU with nominal clock times, DDR4@3200MHz -RAM, a couple of PCIe Gen 4 SSDs in a mirror setup and a 400MiB downlink. +depending on the hardware, for reference, these builds were being run on a +machine equipped with a Zen 3 Ryzen 5 5600 CPU with nominal clock times, DDR4 +3200MHz RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS) +and a 400MiB downlink, software-wise running Arch with an author-flavoured +Xanmod kernel version 6.3.x. \obr{Drone CI median build time}{fig:drone-median-build}{.77}{graphics/drone-median-build} @@ -496,12 +502,13 @@ All of the pertaining source code was published in repositories on a publicly available git server operated by the author, the reasoning \emph{pro} self-hosting being that it is the preferred way of guaranteed autonomy over one's source code, as opposed to large silos owned by big corporations having a -track record of arguably not always deciding with user's best interest in mind, -acting on impulse or under public pressure (potentially at least temporarily -disrupting their user's operations), thus beholding their user to their lengthy -\emph{terms of service} that \emph{can change at any time}. Granted, -decentralisation can take a toll on discoverability of the project, but that is -not of concern here. 
+track record of arguably not always deciding with user's best interest in mind +(although recourse has been observed~\cite{ytdl}), acting on impulse or under +public pressure (potentially at least temporarily disrupting their user's +operations), thus not only beholding their user to their lengthy \emph{terms of +service} that \emph{can change at any time}, but also factors outside their +control. Granted, decentralisation can take a toll on discoverability of the +project, but that is not a concern here. The git repository containing source code of the \texttt{pcmt} project:\\ \url{https://git.dotya.ml/mirre-mt/pcmt.git}. @@ -563,6 +570,11 @@ formatting standards. \n{2}{A word about Nix} \url{https://builtwithnix.org/} + +\n{1}{Application architecture} + +\n{1}{Implementation} + \n{2}{Configuration} Every non-trivial program usually offers at least \emph{some} way to @@ -575,16 +587,16 @@ duration of the runtime of the program, or loaded and parsed and the memory subsequently \emph{freed} (initial configuration). There is an abundance of configuration languages (or file formats used to craft -configuration files) available, TOML, INI, JSON, YAML, to name some of the -popular ones (as of today). +configuration files, whether they were intended for it or not) available, TOML, +INI, JSON, YAML, to name some of the popular ones (as of today). Dhall stood out as a language that was designed with both security and the needs of dynamic configuration scenarios in mind, borrowing a concept or two from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a -few of its concepts from Haskell), and in core being very similar to JSON, -which adds to familiar feel. In fact, in Dhall's authors' own words it is: ``a -programmable configuration language that you can think of as: JSON + functions -+ types + imports''~\cite{dhalllang}. +few of its concepts from Haskell), and in its apparent core being very similar +to JSON, which adds to familiar feel. 
In fact, in Dhall's authors' own words it +is: ``a programmable configuration language that you can think of as: JSON + +functions + types + imports''~\cite{dhalllang}. Among all of the listed features, the especially intriguing one to the author was the promise of \emph{types}. There are multiple examples directly on the @@ -614,27 +626,7 @@ side of the equals sign, which has for practical reasons been trimmed and is displayed without the \emph{default} block, which is instead shown in its own Figure~\ref{fig:dhallschemadefaults}. -The main configuration is comprised of both raw attributes and child records, -which allows for grouping of related functionality. For instance, configuration -settings pertaining mailserver setup are grouped in a record named -\textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool}, -which was deemed appropriate for a on-off switch-like functionality, with the -only permissible values being either \emph{True} or \emph{False}. Do note that -in Dhall $true != True$, since \textbf{True} is internally a Bool constant, -which is built into Dhall (check out ``The Prelude''~\cite{dhallprelude}), -while \textbf{true} is evaluated as an \emph{unbound} variable, that is, a -variable \emph{not} defined in the current \emph{scope} and thus not -\emph{present} in the current scope. - -Another one of specialties of Dhall is that $==$ and $!=$ equality operators -only work on values of type \texttt{Bool}, which for example means that -variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text} -(\texttt{string}) cannot be compared directly as in other languages, which -either leaves the work for a higher-level language (such as Go), or from the -perspective of the Dhall authors, \emph{enums} are promoted when the value -matters. - -\begin{figure}[h] +\begin{figure}[!h] \begin{varwidth} \scriptsize \begin{verbatim} @@ -688,7 +680,39 @@ matters. 
\caption{Dhall configuration schema version 0.0.1-rc.2} \label{fig:dhallschema} \end{figure} -\newpage + +The main configuration is comprised of both raw attributes and child records, +which allow for grouping of related functionality. For instance, configuration +settings pertaining to mailserver setup are grouped in a record named +\textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool}, +which was deemed appropriate for an on-off switch-like functionality, with the +only permissible values being either \emph{True} or \emph{False}. Do note that +in Dhall $true != True$, since \textbf{True} is internally a Bool constant, +which is built into Dhall (check out ``The Prelude''~\cite{dhallprelude}), +while \textbf{true} is evaluated as an \emph{unbound} variable, that is, a +variable \emph{not} defined in the current \emph{scope} and thus not +\emph{present} in the current scope. + +Another one of the specialties of Dhall is that $==$ and $!=$ equality operators +only work on values of type \texttt{Bool}, which for example means that +variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text} +(\texttt{string}) cannot be compared directly as in other languages, which +either leaves the work for a higher-level language (such as Go), or from the +perspective of the Dhall authors, \emph{enums} are promoted when the value +matters. + + +\n{3}{Safety considerations} + +Having a programmable configuration language that understands functions and +allows importing not only arbitrary text from random internet URLs, but also +importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it +is important that there are some safety mechanisms employed, which can be +relied on by the user. Dhall offers this in multiple features: enforcing a +same-origin policy and (optionally) pinning a cryptographic hash of the value +of the expression being imported. + + \begin{figure}[!h] \begin{varwidth} \scriptsize @@ -763,43 +787,36 @@ matters.
\end{figure} -\n{3}{Safety considerations} - -Having a programmable configuration language that understands functions and -allows importing not only arbitrary text from random internet URLs, but also -importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it -is important that there are some safety mechanisms employed, which can be -relied on by the user. Dhall offers this in multiple features: enforcing a -same-origin policy and (optionally) pinning a cryptographic hash of the value -of the expression being imported. - - \n{3}{Possible alternatives} -While developing the program, the author has also come across certain -shortcomings of Dhall, namely long start-up with \emph{cold cache}, which can -generally be observed in the scenario of running the program in a -\emph{container}. -If we want to describe the way Dhall works when performing an evaluation, it -resolves every expression down to a combination of its most basic types -(eliminating all abstraction and indirection) in the process called -\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the hosts -cache. The \texttt{dhall-haskell} binary attempts to resolve the variable -\texttt{XDG\_CACHE\_HOME} (have a look at \emph{XDG Base Directory +While developing the program, the author has also +come across certain shortcomings of Dhall, namely long start-up with \emph{cold +cache}, which can generally be observed in the scenario of running the program +in an environment that does not allow writing the cache files (a read-only +filesystem), or does not keep the written cache files, such as a container that +is not configured to mount a persistent volume at the pertinent location.
+To describe the way Dhall works when performing an evaluation, it resolves +every expression down to a combination of its most basic types (eliminating all +abstraction and indirection) in the process called +\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the +host's cache. The \texttt{dhall-haskell} binary attempts to resolve the +variable \texttt{XDG\_CACHE\_HOME} (have a look at \emph{XDG Base Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the results of the normalisation will be written for repeated use. Do note that this behaviour has been observed on a GNU/Linux host and the author has not verified -this behaviour on a non-GNU/Linux host. +this behaviour on a non-GNU/Linux host, such as FreeBSD. If normalisation is performed inside an ephemeral container (as opposed to, for instance, an interactive desktop session), the results effectively get lost on each container restart, which is both wasteful and not great for user -experience, since the normalisation of just a handful of imports can take an -upside of 2 minutes, during which the user is left waiting for the hanging -application. +experience, since the normalisation of just a handful of imports (which +internally branches widely) can take upwards of two minutes, during which +the user is left waiting for the hanging application with no reporting on the +progress or current status. While workarounds for the above mentioned problem can be devised relatively -easily (bind mount volumes inside the container in place of the +easily (bind mount persistent volumes inside the container in place of the \texttt{XDG\_CACHE\_HOME/dhall} and \texttt{XDG\_CACHE\_HOME/dhall-haskell} to preserve the cache between restarts, or let the cache be pre-computed during container build, since the application is only really expected to run together @@ -809,25 +826,10 @@ there was not need to work \emph{around} the configuration system of choice.
Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely as a potentially almost drop-in replacement for Dhall feature-wise, while also -resolving the subject issue that was described with Dhall (costly normalisation -operations with \emph{cold cache}). +resolving costly \emph{cold cache} normalisation operations, which is in +author's view Dhall's titular issue. -\n{2}{Production} - -It is, of course, recommended that the application runs in a secure -environment, although definitions of that almost certainly differ depending on -who you ask. General recommendations would be to effectively reserve a machine -for a single use case - running this program - so as to dramatically decrease -the potential attack surface of the host. If the host does not need management -access (it is a deployed-to-only machine that is configured out-of-band, such -as with a \emph{golden} image/container), then do not run SSH on it. In an -ideal scenario, the host machine would have as little software installed as -possible besides what the application absolutely requires. - - -\n{1}{Application architecture} - \n{2}{Data integrity and authenticity} The user can interact with the application via a web client, such as a browser, @@ -835,12 +837,12 @@ and is required to authenticate for all sensitive operations. To not only know \emph{who} the user is but also make sure they are \emph{permitted} to perform the action they are attempting, the program employs an \emph{authorisation} mechanism in the form of sessions. These are on the client side represented by -cryptographically signed and encrypted (using 256 bit AES) cookies. 
That lays -foundations for a few things: the data saved into the cookies can be regarded -as private because short of future \emph{quantum computers} only the program -itself can decrypt and access the data, and the data can be trusted since it is -both signed using the key that only the program controls and \emph{encrypted} -with \emph{another} key that equally only the program holds. +cryptographically signed and encrypted (using 256 bit AES) HTTP cookies. That +lays foundations for a few things: the data saved into the cookies can be +regarded as private because short of future \emph{quantum computers} only the +program itself can decrypt and access the data, and the data can be trusted +since it is both signed using the key that only the program controls and +\emph{encrypted} with \emph{another} key that equally only the program holds. The cookie data is only ever written \emph{or} read at the server side, solidifying the authors decision to let it be encrypted, as there is not point @@ -869,21 +871,6 @@ e.g.\ for tamper protection purposes and similar, however, that work remains yet to be materialised. -\n{2}{Transport security} - -User connecting to the application should rightfully expect for their data to -be protected \textit{in transit} (i.e.\ on the way between their browser and -the server), which is what \emph{Transport Layer Security} family of -protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning -of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the -client authenticate the server (verify that it is who it claims it is) and -negotiate a symmetric key for encryption in the process named the \emph{TLS -handshake} (see Section~\ref{sec:tls} for more details), the final purpose of -which is establishing a secure communications connection. The operator should -configure the program to either directly utilise TLS using configuration or -have it listen behind a TLS-terminating \emph{reverse proxy}. 
- - \n{2}{User isolation} Users are allowed into certain parts of the application based on the role they @@ -960,21 +947,22 @@ for instance using \texttt{LiME}~\cite{lime}, or perhaps directly the \emph{hypervisor}, if considering a virtualised (``cloud'') environments. -\n{1}{Implementation} - \n{2}{Compromise Monitoring} \n{3}{Have I Been Pwned? Integration} TODO -\n{3}{Local Dataset Plugin} -Breach data from locally available datasets can be imported into the -application by first making sure it adheres to the specified schema (have a -look at the \emph{breach data schema} in Figure~\ref{fig:breachDataGoSchema}). -If it doesn't (which is very likely with random breach data), it needs to be -converted to a form that does before importing it to the application, e.g.\ -using a simple Python script or similar. Attempting to import data that does -not follow the outlined schema would result in an error. +\n{3}{Local Dataset Plugin} Breach data from locally available datasets can be +imported into the application by first making sure it adheres to the specified +schema (have a look at the \emph{breach data schema} in +Figure~\ref{fig:breachDataGoSchema}). If it doesn't (which is very likely with +random breach data), it needs to be converted to a form that does before +importing it to the application, e.g.\ using a Python script or similar. +Attempting to import data that does not follow the outlined schema would result +in an error. Also, importing a dataset which is over a reasonable size limit +would by default be rejected by the program as a precaution, since marshaling +e.g.\ a 1 TiB document would likely result in an OOM situation on the host, +assuming regular consumer hardware conditions, not HPC. \begin{figure}[h] \centering @@ -1001,12 +989,15 @@ not follow the outlined schema would result in an error. 
\label{fig:breachDataGoSchema} \end{figure} -The above Go representation will in actuality be written and supplied by the -user of the program as a YAML document. YAML was chosen for multiple reasons: -relative ease of use (plain text, readable, can include comments, most of the -inputs are implicitly typed as strings), and its capability to store multiple -\emph{documents} inside of a single file. That should allow for documents -similar to what can be seen in Figure~\ref{fig:breachDataYAMLSchema}. +The Go representation shown in Figure~\ref{fig:breachDataGoSchema} will in +actuality be written and supplied by the user of the program as a YAML +document. YAML was chosen for multiple reasons: relative ease of use (plain +text, readable, support for inclusion of comments), its capability to store +multiple \emph{documents} inside of a single file with most of the inputs +implicitly typed as strings, while thanks to being a superset of JSON it sports +machine readability. That should allow for documents similar to what can be +seen in Figure~\ref{fig:breachDataYAMLSchema} to be ingested by the program, +read and written by humans and programs alike. \begin{figure}[h] \centering @@ -1045,32 +1036,74 @@ similar to what can be seen in Figure~\ref{fig:breachDataYAMLSchema}. \end{figure} Notice how the emails list in Figure~\ref{fig:breachDataYAMLSchema} misses one -record, perhaps because it was not supplied. This is a valid scenario and the -application needs to be able to handle it. The alternative would be to require -the user to prepare the data in such a way that the empty/partial records would -be dropped entirely. +record, perhaps because it was not supplied or mistakenly omitted. This is a +valid scenario (mistakes happen) and the application needs to be able to handle +it. The alternative would be to require the user to prepare the data in such a +way that the empty/partial records would be dropped entirely.
-\n{2}{Best practices} -Describe built-in fail-overs and fallback mechanisms (request retries), -collecting exposed app metrics and alerting based on their status. - -Observability - in-app tracing collected by e.g. Jaeger -(\url{https://jaegertracing.io}) or \url{https://opentelemetry.io/} - for -superior performance insights on CPU, storage, network latency and jitter. - \n{2}{Database configuration} -Describe tuning for performance while also properly securing the DB. -The plan is to have 2 databases: one for raw data (\url{https://postgresql.org}) and -another one for checksum validation (\url{https://immudb.io/}) +The database schema is not created manually in the database; instead, an +Object-relational Mapping (ORM) tool named ent is used. This allows defining +the table schema and relations entirely in Go. The best part about ent is that +there is no need to define supplemental methods on the models, since ent +employs \emph{code generation}, which creates these based on the types of the +attributes in the model and the respective relations. For instance, if an +attribute is a string value \texttt{Email}, ent can be used to generate code +that contains methods on the user object like the following: + +\begin{itemize} +\item EmailIn +\item EmailEQ +\item EmailNEQ +\item EmailHasSuffix +\end{itemize} + + +\n{1}{Production} + +It is, of course, recommended that the application runs in a secure +environment, although definitions of that almost certainly differ depending on +who you ask. General recommendations would be either to effectively reserve a +machine for a single use case - running this program - so as to dramatically +decrease the potential attack surface of the host, or run the program isolated +in a container or a virtual machine.
Further, if the host does not need +management access (it is a deployed-to-only machine that is configured +out-of-band, such as with a \emph{golden} image/container or declaratively with +Nix), then an SSH \emph{daemon} should not be running in it, since it is not +needed. In an ideal scenario, the host machine would have as little software +installed as possible besides what the application absolutely requires. + +A demonstration of the above can be found in the multi-stage Containerfile that +is available in the main sources. The resulting container image only contains a +statically linked copy of the program, a default configuration file and +corresponding Dhall expressions cached at build time, which only support the +main configuration file. Since the program also needs a database, an example +scenario could include the container being run in a Podman pod together with +the database, which would not have to be exposed from the pod and would +therefore only be available over \texttt{localhost}. + +It goes without saying that the operator should substitute values of any +default configuration secrets with new ones that were securely generated. + \n{2}{Deployment recommendations} -Describe in detail the following: -\begin{itemize} - \item behind a reverse proxy (TLS termination) or inside of a private network - \item reasonable host security -\end{itemize} + +\n{3}{Transport security} + +User connecting to the application should rightfully expect for their data to +be protected \textit{in transit} (i.e.\ on the way between their browser and +the server), which is what \emph{Transport Layer Security} family of +protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning +of HTTPS. 
TLS utilises the primitives of asymmetric cryptography to let the +client authenticate the server (verify that it is who it claims it is) and +negotiate a symmetric key for encryption in the process named the \emph{TLS +handshake} (see Section~\ref{sec:tls} for more details), the final purpose of +which is establishing a secure communications connection. The operator should +configure the program to either directly utilise TLS using configuration or +have it listen behind a TLS-terminating \emph{reverse proxy}. + \n{3}{Containerisation} Whether the pre-built or a custom container image is used to deploy the