% =========================================================================== %
% Encoding: UTF-8 (žluťoučký kůň úpěl ďábelské ódy)
% =========================================================================== %

\vspace*{\fill}
\begin{center}
\Large \textit{This is a document draft.}
\end{center}
\vspace*{\fill}
\newpage

% =========================================================================== %

\nn{Introduction}

Introduce the goals and the methods attempted to achieve the goals.

% =========================================================================== %

\part{Theoretical part}

\n{1}{Purpose}

What this write-up is attempting to achieve.

\n{1}{Terminology}

\n{2}{Linux}

The term \emph{Linux} is exclusively used in the meaning of the Linux kernel~\cite{linux}.

\n{2}{GNU/Linux}

As far as a Linux-based operating system is concerned, the term ``GNU/Linux'' as defined by the Free Software Foundation~\cite{fsfgnulinux} is used. While it is longer and arguably a little bit cumbersome, the author aligns with the opinion that this term more correctly describes its actual target. Being aware that many people conflate the complete operating system with its (albeit core) component, the kernel, the author takes care to distinguish the two, although, writing from experience, this colloquially tends to bring more confusion and a lengthy explanation is usually required.

\n{2}{Containers}

When the concept of \emph{containerisation} and \emph{containers} is mentioned throughout this work, the author has OCI containers~\cite{ocicontainers} in mind, which is broadly a superset of \emph{Linux Containers}, where some set of processes is presented with a view of kernel resources (there are multiple kinds of resources, such as IPC queues; network devices, stacks, ports; mount points, process IDs, user and group IDs, Cgroups and others) that differs for each set of processes. The idea is similar in thought to FreeBSD \emph{jails}~\cite{freebsdjails}, with the distinction being that they are, of course, facilitated by the Linux kernel namespace functionality~\cite{linuxnamespaces}, which can in turn be regarded as \emph{inspired} by Plan 9's namespaces~\cite{plan9namespaces}, Plan 9 being a Bell Labs successor to Unix 8th Edition, discontinued in 2015.

While there without a doubt \emph{is} specificity bound to using each of the tools that enable creating (Podman vs.\ Buildah vs.\ Docker BuildX) or running (ContainerD vs.\ runC vs.\ crun) container images, when describing an action that gets performed with or onto a container, the process should generally be explained in such a way that it is repeatable using any spec-conforming tool that is available and \emph{intended for the job}.

\n{2}{The program}

By \emph{the program} or \emph{the application} without any additional context the author usually means the Password Compromise Monitoring Tool program.

\n{1}{Cryptography primer}\label{sec:cryptographyprimer}

Prerequisites necessary for following the rest of the text.

\n{2}{Encryption}

\n{3}{Symmetric cryptography}

\n{3}{Asymmetric cryptography}

\n{3}{The key exchange problem}

\n{3}{The key protection problem}

\n{3}{TLS}\label{sec:tls}

\n{2}{Hash functions}

An explanation of what hash functions are and how they are used.

\n{3}{Uses and \textit{mis}uses}

The good, the bad and the ugly of hash usage (including or in some cases excluding salting, weak hashes, split hashes (Microsoft)).
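To make the difference between proper and improper hash usage more tangible, Figure~\ref{fig:gohashmisuse} shows a short, self-contained sketch in Go (the language later used for the practical part of this work); the snippet is purely illustrative and not part of any codebase discussed here. It contrasts a single round of a fast, unsalted, general-purpose hash with a salted variant and with a purpose-built, deliberately slow password hashing function (bcrypt), which handles salting on its own.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "crypto/rand"
    "crypto/sha256"
    "fmt"

    "golang.org/x/crypto/bcrypt"
)

func main() {
    password := []byte("hunter2")

    // Misuse: one round of a fast hash, no salt. Identical passwords always
    // yield identical digests and every guess is extremely cheap to compute.
    unsalted := sha256.Sum256(password)
    fmt.Printf("sha256, unsalted: %x\n", unsalted)

    // A random salt at least defeats precomputed (rainbow) tables...
    salt := make([]byte, 16)
    if _, err := rand.Read(salt); err != nil {
        panic(err)
    }
    salted := sha256.Sum256(append(salt, password...))
    fmt.Printf("sha256, salted:   %x\n", salted)

    // ...but a dedicated, slow password hash additionally makes every single
    // guess expensive and embeds the salt in its output.
    hash, err := bcrypt.GenerateFromPassword(password, bcrypt.DefaultCost)
    if err != nil {
        panic(err)
    }
    fmt.Printf("bcrypt: %s\n", hash)

    // Verification compares a candidate password against the stored hash.
    fmt.Println("match:", bcrypt.CompareHashAndPassword(hash, password) == nil)
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative comparison of unsalted, salted and purpose-built password hashing in Go}
\label{fig:gohashmisuse}
\end{figure}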
\n{3}{Threats to hashes}

Rainbow tables, broken hash functions\ldots

\n{1}{Brief passwords history}\label{sec:history}

\n{2}{Purpose over time}

\n{2}{What is considered a password}

\n{2}{Problems with passwords}

\n{3}{Arbitrary length requirements (min/max)}

\n{3}{Arbitrary complexity requirements}

\n{3}{Restricting special characters}

Service providers have too often been found forbidding the use of so-called \textit{special characters} in passwords for as long as passwords have been used to protect privileged access. Ways of achieving this may vary, but the intent stays the same: prevent users from entering characters that the system cannot comfortably handle, for one reason or another.

\n{1}{Password strength validation}

Entropy, dictionaries, multiple factors.

\n{1}{Web security}\label{sec:websecurity}

The internet, being a vast space of intertwined concepts and ideas, is a superset of the Web, since not everything that is available on the internet can be described as web \emph{resources}. It is precisely that part of the internet that is discussed in the following sections, covering what browsers are, what they do and how they relate to web security.

\n{2}{Browsers}\label{sec:browsers}

TODO: describe how browsers find out where the web page lives, get a webpage, parse it, parse stylesheets, run scripts, apply SAMEORIGIN restrictions etc.

TODO: (privileged process running untrusted code on user's computer), history, present, security focus of the development teams, user facing signalling (padlock colours, scary warnings).

Browsers, sometimes referred to together with the word that serves as a real tell for their specialisation, \emph{web} browsers, are programs intended for \emph{browsing} \emph{the web}. In more technical terms, browsers are programs that facilitate (directly or via intermediary tools) domain name lookups, connecting to web servers, optionally establishing a secure connection, requesting the web page in question, determining its \emph{security policy}, resolving what accompanying resources the web page specifies and, depending on the applicable security policy, requesting those from their respective origins, applying stylesheets and running scripts. Constructing a program that speaks many protocols and securely runs untrusted code from the internet is no easy task.

\n{3}{Complexity}

Browsers these days are also quite ubiquitous programs running on \emph{billions} of consumer-grade mobile devices (which are also notorious for bad update hygiene) or desktop devices all over the world. Regular users usually expect them to work flawlessly with a multitude of network conditions and scenarios (café WiFi, cellular data in a remote location, home broadband that is DNS-poisoned by the ISP), differently tuned (or commonly misconfigured) web servers, a combination of modern and \emph{legacy} encryption schemes and different levels of conformance to web standards from both web server and website developers. Of course, if a website is broken, it is the browser's fault. Browsers are also expected to detect whether \emph{captive portals} (a type of access control that usually tries to force the user through a webpage with terms of use) are active and offer redirects. All of this amounts to immense complexity, and the combination of ubiquity and great exposure this type of software gets is, in the author's opinion, the cause of the staggering number of vulnerabilities found, reported and fixed in browsers every year.
\n{3}{Standardisation}

Over the years, a consortium of parties interested in promoting and developing the web (also due to its potential as a digital marketplace, i.e.\ financial incentives) and browser vendors (of which the most neutral participant is perhaps \emph{Mozilla}, with Chrome being run by Google, Edge by Microsoft and Safari/WebKit by Apple) have evolved a great volume of web standards, which are also relatively frequently updated or deprecated and replaced by revised or new ones, turning browser maintenance into what is essentially a cat-and-mouse game. It is the web's extensibility that enabled this build-up, and it has ironically been proclaimed by some to be the web's greatest asset. It has also ostensibly been criticised~\cite{ddvweb} in the past, and the frustration with the status quo of web standards has relatively recently prompted a group of people to even create ``\textit{a new application-level internet protocol for the distribution of arbitrary files, with some special consideration for serving a lightweight hypertext format which facilitates linking between files}'': Gemini~\cite{gemini}\cite{geminispec}, which in the words of its authors can be thought of as ``\textit{the web, stripped right back to its essence}'' or as ``\textit{Gopher, souped up and modernised just a little}'', depending upon the reader's perspective, noting that the latter view is probably more accurate.

\n{3}{HTTP}

Originally, HTTP was designed just for fetching hypertext \emph{resources}, but it has evolved since then, particularly due to its extensibility, to allow for fetching of all sorts of web resources a modern website of today provides, such as scripts or images, or even to \emph{post} content back to servers. HTTP relies on TCP (Transmission Control Protocol), which is one of the \emph{reliable} (mandated by HTTP) protocols used to send data across contemporary IP (Internet Protocol) networks, to deliver the data it requests or sends.

When Tim Berners-Lee invented the World Wide Web (WWW) in 1989 while working at CERN (The European Organization for Nuclear Research), with the rather noble intent of a ``\emph{wide-area hypermedia information retrieval initiative to give universal access to a large universe of documents}''~\cite{wwwf}, he also invented the HyperText Markup Language (HTML) to serve as a formatting method for these new hypermedia documents. The first website was written roughly the same way as today's websites are, using HTML, although the markup language has changed since, with the current version being HTML5.

It has been mentioned that the client \textbf{requests} a \textbf{resource} and receives a \textbf{response}, so those terms should probably be defined. A request is what the client sends to the server. A resource is what it requests and a response is the answer provided by the server. HTTP follows a classic client-server model whereby it is \textbf{always} the client that initiates the request.

A web page is, to be blunt, a chunk of \emph{hypertext}. To display a web page, a browser first needs to send a request to fetch the HTML representing the page, which is then parsed and additional requests for sub-resources are made. If a page defines layout information in the form of CSS, that is parsed as well.
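The request-response exchange described above can also be illustrated programmatically; the sketch in Figure~\ref{fig:gofetch} (written in Go, the language later used for the practical part, and purely illustrative) performs the very first step a browser takes when displaying a page: it requests a resource, receives a response and obtains the hypertext that would then be parsed.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // ask the server at example.com for the resource "/" using GET.
    resp, err := http.Get("https://example.com/")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()

    // the response carries a status, headers and (optionally) a body.
    fmt.Println("status:", resp.Status)
    fmt.Println("content type:", resp.Header.Get("Content-Type"))

    // the body holds the hypertext a browser would go on to parse.
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("reading the body failed:", err)
        return
    }
    fmt.Println("received", len(body), "bytes of hypertext")
}
\end{verbatim}
\end{varwidth}
\caption{Minimal illustrative fetch of a web page in Go}
\label{fig:gofetch}
\end{figure}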
A web page needs to be present on the local computer first \emph{before} it can be parsed by the browser, and since websites are usually still served by programs called \emph{web servers} as in the \emph{early days}, that presents a problem of how to tell the browser where the resource should be fetched from. In today's browsers, the issue is sorted (short of the CLI) by the \emph{address bar}, a place into which the user types what they wish the browser to fetch for them. The formal name of this input is the \emph{Uniform Resource Locator}, or URL, and it contains the scheme (or the protocol, such as \texttt{http://}), the host address or a domain name and a (TCP) port number.

Since a TCP connection needs to be established first, to connect to a server whose URL only contains a domain name, the browser needs to perform a domain name \emph{lookup} using system facilities, or, as was the case for a couple of notorious Chromium versions, send some additional and unrelated queries which (multiplied by the number of Chromium-based derivatives) ended up placing unnecessary load directly on the root DNS servers~\cite{chromiumrootdns}. If a raw IP address+port combination is used, the browser attempts to connect to it directly and requests the desired page, by default using the \texttt{GET} \emph{method}. The \emph{well-known} HTTP port 80 is assumed unless another port is explicitly specified, and it can be omitted whether the host is a domain name or an IP address. The method is a way for the user-agent to define what operation it wants to perform. \texttt{GET} is used for fetching resources while \texttt{POST} is used to send data to the server, such as to post the values of an HTML form.

A server response is comprised of a \textbf{status code}, a status message, HTTP \textbf{headers} and an optional \textbf{body} containing the content. The status code indicates if the original request was successful or not, and the browser is generally there to interpret these status codes for the user. There are enough status codes to be confused by their sheer number, but luckily there is a method to the madness and they can be divided into groups/classes:

\begin{itemize}
	\item 1xx: Informational responses
	\item 2xx: Successful responses
	\item 3xx: Redirection responses
	\item 4xx: Client error responses
	\item 5xx: Server error responses
\end{itemize}

In case the \emph{user agent} (a web \emph{client}), such as a browser, receives a response with content, it has to parse it. A header is additional information sent by both the server and the client.

\n{2}{Cross-site scripting}\label{sec:xss}

\n{2}{Content Security Policy}\label{sec:csp}

Content Security Policy has been an important addition to the arsenal of website operators, even though not everybody has necessarily been utilising it properly or even taken notice. To understand what guarantees it provides and what kind of protections it employs, it is first necessary to grok how websites are parsed and displayed, which has been discussed in depth in the previous sections.

\n{1}{Sandboxing}\label{sec:sandboxing}

\n{2}{User isolation}

Admin vs regular user, privilege escalation, least-privilege principle, zero-trust principle.

\n{2}{Process isolation}

Sandbox escape.

\n{2}{Namespaced isolation}

Sandbox escape.
\n{1}{Data storage}

Among the key aspects of any security-minded system (application), the following are certain to make the list:

\begin{enumerate}
	\item data integrity
	\item data authenticity
	\item data confidentiality
\end{enumerate}

\n{2}{Integrity}

\n{2}{Authenticity}

\n{2}{Confidentiality}

\n{2}{Encryption-at-rest}

\n{1}{Compromise checking and prevention}

\n{2}{HIBP and similar tools}

\n{2}{OWASP Top 10 for the implementers}

\n{2}{Password best practices}

% =========================================================================== %

\part{Practical part}

\n{1}{Kudos}

\textbf{Disclaimer:} the author is not affiliated in any way with any of the projects described on this page.

The \textit{Password Compromise Monitoring Tool} (\texttt{pcmt}) program has been developed using and utilising a great deal of free (as in Freedom) and open-source software in the process, either directly or as an outstanding work tool, and the author would like to take this opportunity to recognise that fact.

In particular, the author acknowledges that this work would not be the same without:

\begin{itemize}
	\item vim (\url{https://www.vim.org/})
	\item Arch Linux (\url{https://archlinux.org/})
	\item ZSH (\url{https://www.zsh.org/})
	\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
	\item Nix (\url{https://nixos.org/explore.html})
	\item pre-commit (\url{https://pre-commit.com/})
	\item Podman (\url{https://podman.io/})
	\item Go (\url{https://go.dev/})
\end{itemize}

All of the code written has been typed into VIM (\texttt{9.0}), the shell used to run the commands was ZSH, both running in the author's terminal emulator of choice, \texttt{kitty}, on a \raisebox{.8ex}{\texttildelow}8-month-old (at the time of writing) installation of \textit{Arch Linux (by the way)} using a \texttt{6.3.x-wanderer-zfs-xanmod1} variant of the Linux kernel.

\n{1}{Development}

The source code of the project has been versioned since the start, using the popular and industry-standard git (\url{https://git-scm.com}) source code management (SCM) tool. Commits were made frequently and, if at all possible, for small and self-contained changes of code, trying to follow sane commit message \emph{hygiene}, i.e.\ striving for meaningful and well-formatted commit messages. The name of the default branch is \texttt{development}, since that is what the author likes to choose for new projects that are not yet stable (it is in fact the default in the author's \texttt{.gitconfig}).

\n{2}{Commit signing}

Since git allows cryptographically \emph{signing} all commits, it would be unwise not to take advantage of this. For the longest time, GPG was the only method available for signing commits in git; however, that is no longer the case~\cite{agwagitssh}. These days, it is also possible to both sign and verify one's git commits (and tags!) using SSH keys, namely those produced by OpenSSH (the same ones that can be used to log in to remote systems). The author has, of course, not reused the same key pair that is used to connect to machines for signing commits. A different, \texttt{Ed25519} elliptic curve key pair has been used specifically for signing. A public component of this key is enclosed in this thesis as Appendix~\ref{appendix:signingkey} for future reference.
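For completeness, enabling SSH-based signing in a repository can look roughly like the commands in Figure~\ref{fig:gitsignsetup}; as in the following figures, the \% sign denotes the shell prompt, and the key path shown is purely illustrative.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
% # produce SSH (instead of GPG) signatures.
% git config --local gpg.format ssh
% # point git at the public half of the signing key pair.
% git config --local user.signingkey ~/.ssh/id_ed25519_sign.pub
% # sign all commits (and tags) by default.
% git config --local commit.gpgsign true
% git config --local tag.gpgsign true
\end{verbatim}
\end{varwidth}
\caption{Example configuration of SSH-based commit signing in git}
\label{fig:gitsignsetup}
\end{figure}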
The validity of a signature on a particular commit can be viewed with git using the following commands (the \% sign denotes the shell prompt):

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
% cd <repository>
% git show --show-signature <commit>
% # alternatively:
% git verify-commit <commit>
\end{verbatim}
\end{varwidth}
\caption{Verifying signature of a git commit}
\label{fig:gitverif}
\end{figure}

There is one caveat to this, though: git first needs some additional configuration for the code in Figure~\ref{fig:gitverif} to work as one would expect. Namely, the public key used to verify the signature needs to be stored in git's ``allowed signers file'', then git needs to be told where that file is using the configuration value \texttt{gpg.ssh.allowedsignersfile}, and finally the configuration value of the \texttt{gpg.format} field needs to be set to \texttt{ssh}.

Since git allows the configuration values to be local to each repository, the mentioned requirements can be satisfied by running the following commands from inside of the cloned repository:

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key.
% cat >./tmp/.allowed_signers \
  <<<'leo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKwshTdBgLzwY4d8N7VainZCngH88OwvPGhZ6bm87rBO'
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./tmp/.allowed_signers
\end{verbatim}
\end{varwidth}
\caption{Prepare allowed signers file and signature format for git}
\label{fig:gitsshprep}
\end{figure}

After the code in Figure~\ref{fig:gitsshprep} is run, everything from Figure~\ref{fig:gitverif} should remain applicable for the lifetime of the repository or until git changes its implementation of signature verification. The git \texttt{user.name} that can be seen on the commits in the \textbf{Author} field is named after the machine that was used to develop the program, since the author uses different signing keys on each machine. That way the committer machine can be determined post-hoc. For future reference, git has been used in the version \texttt{git version 2.40.1}.

\n{2}{Continuous Integration}

To increase both the author's and the public's confidence in the atomic changes made over time, an attempt was made to thoroughly \emph{integrate} them using a continuous integration (CI) service that was plugged into the main source code repository since the early stages of development. This, of course, was again self-hosted, including the workers. The tool of choice there was Drone (\url{https://drone.io}) and the ``docker'' runner (in fact it runs any OCI container) was used to run the builds. The way this runner works is that it creates an ephemeral container for every pipeline step and executes the given \emph{commands} inside of it. At the end of each step the container is discarded, while the repository, which is mounted into each container at \texttt{/drone/src}, is persisted between steps, allowing it to be cloned from \emph{origin} only at the start of the pipeline and then shared for all of the following steps, saving bandwidth, time and disk writes.

The entire configuration used to run the pipelines can be found in a file named \texttt{.drone.yml} at the root of the main source code repository. The workflow consists of four pipelines, which are run in parallel.
Two main pipelines are defined to build the frontend assets, the \texttt{pcmt} binary and run tests on \texttt{x86\_64} GNU/Linux targets, one for each of Arch and Alpine (version 3.17). These two pipelines are identical apart from OS-specific bits such as installing a certain package, etc. For the record, other OS-architecture combinations were not tested.

A third pipeline contains instructions to build a popular static analysis tool called \texttt{golangci-lint}, which is sort of a meta-linter, bundling a staggering number of linters (a linter is a tool that performs static code analysis and can raise awareness of programming errors, flag potentially buggy code constructs, or \emph{mere} stylistic errors), from sources, and then perform the analysis of the project's codebase using the freshly built binary. If the result of this step is successful, a handful of code analysis services get pinged in the next steps to take notice of the changes to the project's source code and update their metrics; details can be found in the main Drone configuration file \texttt{.drone.yml}, and the configuration for the \texttt{golangci-lint} tool itself (what linters are enabled/disabled and with what settings) can be found in the root of the repository in the file named \texttt{.golangci.yml}.

The fourth pipeline focuses on linting the Containerfile and building the container, although the latter action is only performed on feature branches, \emph{pull requests} or \emph{tag} events.

The median build time as of writing was 1 minute, which includes running all four pipelines, and that is acceptable. Build times might of course vary depending on the hardware; for reference, these builds were run on a machine equipped with a Zen 3 Ryzen 5 5600 CPU with nominal clock times, DDR4 3200MHz RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS) and a 400MiB downlink, software-wise running Arch with an author-flavoured Xanmod kernel version 6.3.x.

\obr{Drone CI median build time}{fig:drone-median-build}{.77}{graphics/drone-median-build}

\n{2}{Source code repositories}\label{sec:repos}

All of the pertaining source code was published in repositories on a publicly available git server operated by the author, the reasoning \emph{pro} self-hosting being that it is the preferred way of guaranteeing autonomy over one's source code, as opposed to large silos owned by big corporations, which have a track record of arguably not always deciding with users' best interests in mind (although recourse has been observed~\cite{ytdl}) and of acting on impulse or under public pressure (potentially at least temporarily disrupting their users' operations), thus binding their users not only to lengthy \emph{terms of service} that \emph{can change at any time}, but also to factors outside their control. Granted, decentralisation can take a toll on discoverability of the project, but that is not a concern here.

The git repository containing the source code of the \texttt{pcmt} project:\\ \url{https://git.dotya.ml/mirre-mt/pcmt.git}.

The git repository hosting the \texttt{pcmt} configuration schema:\\ \url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.

The repository containing the \LaTeX{} source code of this thesis:\\ \url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.

\n{2}{Toolchain}

Throughout the creation of this work, the \emph{current} version of the Go programming language was used, i.e.\ \texttt{go1.20}.
\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
	\textbf{Name} & \textbf{Usage} \\
	Go programming language & program core \\
	Dhall configuration language & program configuration \\
	Echo & HTTP handlers, controllers, web server \\
	ent & ORM using graph-based modelling \\
	bluemonday & HTML sanitising \\
	TailwindCSS & stylesheets using a utility-first approach \\
	PostgreSQL & storing data \\
}

\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
	\textbf{Name} & \textbf{Version} \\
	\texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
	\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2 \\
	\texttt{ent} (\url{https://entgo.io/}) & 0.11.10 \\
	\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.23 \\
	\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
	\texttt{PostgreSQL} (\url{https://www.postgresql.org/}) & 15.2 \\
}

\n{2}{A word about Go}

First, a question of \textit{`Why pick Go for building a web application?'} might arise, so the following few lines will try to address that.

Go~\cite{golang} (or \emph{Golang} for SEO-friendliness) is a strongly typed, high-level, \emph{garbage-collected} language where functions are first-class citizens and errors are values. The appeal for the author comes from a number of features of the language, such as built-in support for concurrency and unit testing, sane \emph{zero} values, lack of pointer arithmetic, inheritance and implicit type conversions, easy-to-read syntax, producing a statically linked binary by default, etc.; on top of that, the language has got a cute mascot. Due to the foresight of the Go Authors regarding \emph{the formatting question} (i.e.\ where to put the braces, tabs vs.\ spaces, etc.), most of the discussions on this topic have been forgone. Every \emph{gopher}~\footnote{euph.\ a person writing in the Go programming language} is expected to format their source code with the official formatter (\texttt{gofmt}), which automatically ensures the code adheres to the official formatting standards.

\n{2}{A word about Nix}

\url{https://builtwithnix.org/}

\n{1}{Application architecture}

\n{1}{Implementation}

\n{2}{Configuration}

Every non-trivial program usually offers at least \emph{some} way to tweak/manage its behaviour, and these changes are usually persisted \emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a \emph{LocalStorage} key-value store in the browser, a binary or plain text configuration file. These configuration files need to be read and checked at least on program start-up and either stored into operating memory for the duration of the runtime of the program, or loaded and parsed and the memory subsequently \emph{freed} (initial configuration).

There is an abundance of configuration languages (or file formats used to craft configuration files, whether they were intended for it or not) available: TOML, INI, JSON, YAML, to name some of the popular ones (as of today).

Dhall stood out as a language that was designed with both security and the needs of dynamic configuration scenarios in mind, borrowing a concept or two from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a few of its concepts from Haskell), and at its apparent core being very similar to JSON, which adds to its familiar feel. In fact, in Dhall's authors' own words it is: ``a programmable configuration language that you can think of as: JSON + functions + types + imports''~\cite{dhalllang}.
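To give a concrete sense of how such a configuration is consumed from Go, the sketch in Figure~\ref{fig:godhall} decodes a tiny Dhall record into a Go struct using the \texttt{go-dhall} library listed in Table~\ref{tab:depsversionmx}. The snippet is purely illustrative (the struct mirrors only a small slice of the real schema presented below) and is not the program's actual configuration-loading code.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "fmt"

    "github.com/philandstuff/dhall-golang/v6"
)

// HTTPConfig mirrors a small part of the configuration schema.
type HTTPConfig struct {
    Domain string
    Secure bool
}

func main() {
    src := `{ Domain = "example.com", Secure = True }`

    var cfg HTTPConfig
    // Unmarshal evaluates the Dhall expression and decodes the resulting
    // record into the struct, failing if the types do not line up.
    if err := dhall.Unmarshal([]byte(src), &cfg); err != nil {
        panic(err)
    }

    fmt.Printf("%+v\n", cfg)
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative decoding of a Dhall expression into a Go struct}
\label{fig:godhall}
\end{figure}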
Among all of the features listed above, the especially intriguing one to the author was the promise of \emph{types}. There are multiple examples directly on the project's documentation webpage demonstrating, for instance, the declaration and usage of custom types (that are, of course, merely combinations of the primitive types that the language provides, such as \emph{Bool}, \emph{Natural} or \emph{List}, to name just a few), so it was not exceedingly hard to start designing a custom configuration \emph{schema} for the program. Dhall not being a Turing-complete language also guarantees that evaluation \emph{always} terminates, which is a desirable attribute for a configuration language.

\n{3}{Dhall Schema}

The configuration schema was at first developed as part of the main project's repository, before it was determined that it would benefit both the development and overall clarity if the schema lived in its own repository (see Section~\ref{sec:repos} for details). This enabled it to be independently developed and versioned, and only pulled into the main application whenever it is determined the application is ready for it.

The full schema with type annotations can be seen in Figure~\ref{fig:dhallschema}. The \texttt{let} statement declares a variable called \texttt{Schema} and assigns it the result of the expression on the right side of the equals sign, which has for practical reasons been trimmed and is displayed without the \emph{default} block, which is instead shown in its own Figure~\ref{fig:dhallschemadefaults}.

\begin{figure}[!h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
let Schema =
      { Type =
          { Host : Text
          , Port : Natural
          , HTTP :
              { Domain : Text
              , Secure : Bool
              , AutoTLS : Bool
              , TLSKeyPath : Text
              , TLSCertKeyPath : Text
              , HSTSMaxAge : Natural
              , ContentSecurityPolicy : Text
              , RateLimit : Natural
              , Gzip : Natural
              , Timeout : Natural
              }
          , Mailer :
              { Enabled : Bool
              , Protocol : Text
              , SMTPAddr : Text
              , SMTPPort : Natural
              , ForceTrustServerCert : Bool
              , EnableHELO : Bool
              , HELOHostname : Text
              , Auth : Text
              , From : Text
              , User : Text
              , Password : Text
              , SubjectPrefix : Text
              , SendPlainText : Bool
              }
          , LiveMode : Bool
          , DevelMode : Bool
          , AppPath : Text
          , Session :
              { CookieName : Text
              , CookieAuthSecret : Text
              , CookieEncrSecret : Text
              , MaxAge : Natural
              }
          , Logger : { JSON : Bool, Fmt : Optional Text }
          , Init : { CreateAdmin : Bool, AdminPassword : Text }
          , Registration : { Allowed : Bool }
          }
      }
\end{verbatim}
\end{varwidth}
\caption{Dhall configuration schema version 0.0.1-rc.2}
\label{fig:dhallschema}
\end{figure}

The main configuration is comprised of both raw attributes and child records, which allow for grouping of related functionality. For instance, configuration settings pertaining to mail server setup are grouped in a record named \textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool}, which was deemed appropriate for an on-off switch-like functionality, with the only permissible values being either \emph{True} or \emph{False}.

Do note that in Dhall $true \neq True$, since \textbf{True} is internally a Bool constant, which is built into Dhall (check out ``The Prelude''~\cite{dhallprelude}), while \textbf{true} is evaluated as an \emph{unbound} variable, that is, a variable \emph{not} defined in the current \emph{scope}.
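For illustration, a user-facing configuration file written against this schema could then be as terse as the expression in Figure~\ref{fig:dhallusage} (the import path is hypothetical): Dhall's record completion operator \texttt{::} fills in every omitted field from the schema's \emph{default} record, which is shown in Figure~\ref{fig:dhallschemadefaults}.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
-- config.dhall (illustrative): every field not listed here
-- falls back to the schema's defaults.
let Schema = ./schema.dhall

in  Schema::{ Host = "0.0.0.0", Registration.Allowed = False }
\end{verbatim}
\end{varwidth}
\caption{Example configuration expression completed from the schema defaults}
\label{fig:dhallusage}
\end{figure}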
Another of Dhall's specialties is that the \texttt{==} and \texttt{!=} equality operators only work on values of type \texttt{Bool}, which for example means that variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text} (\texttt{string}) cannot be compared directly as in other languages, which either leaves the work to a higher-level language (such as Go), or, from the perspective of the Dhall authors, promotes the use of \emph{enums} when the value matters.

\n{3}{Safety considerations}

With a programmable configuration language that understands functions and allows importing not only arbitrary text from random internet URLs, but also importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it is important that some safety mechanisms are employed, which the user can rely on. Dhall offers this in multiple features: enforcing a same-origin policy and (optionally) pinning a cryptographic hash of the value of the expression being imported.

\begin{figure}[!h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
, default =
    -- | have sane defaults.
    { Host = ""
    , Port = 3000
    , HTTP =
        { Domain = ""
        , Secure = False
        , AutoTLS = False
        , TLSKeyPath = ""
        , TLSCertKeyPath = ""
        , HSTSMaxAge = 0
        , ContentSecurityPolicy = ""
        , RateLimit = 0
        , Gzip = 0
        , Timeout = 0
        }
    , Mailer =
        { Enabled = False
        , Protocol = "smtps"
        , SMTPAddr = ""
        , SMTPPort = 465
        , ForceTrustServerCert = False
        , EnableHELO = False
        , HELOHostname = ""
        , Auth = ""
        , From = ""
        , User = ""
        , Password = ""
        , SubjectPrefix = "pcmt - "
        , SendPlainText = True
        }
    , LiveMode =
        -- | LiveMode controls whether the application looks for
        -- | directories "assets" and "templates" on the filesystem or
        -- | in its bundled Embed.FS.
        False
    , DevelMode = False
    , AppPath =
        -- | AppPath specifies where the program looks for "assets" and
        -- | "templates" in case LiveMode is True.
        "."
    , Session =
        { CookieName = "pcmt_session"
        , CookieAuthSecret = ""
        , CookieEncrSecret = ""
        , MaxAge = 3600
        }
    , Logger = { JSON = True, Fmt = None Text }
    , Init =
        { CreateAdmin =
            -- | if this is True, attempt to create a user with admin
            -- | privileges with the password specified below (or better -
            -- | overriden); it fails if users already exist in the DB.
            False
        , AdminPassword =
            -- | used for the first admin, forced change on first login.
            "50ce50fd0e4f5894d74c4caecb450b00c594681d9397de98ffc0c76af5cff5953eb795f7"
        }
    , Registration.Allowed = True
    }
  }

in Schema
\end{verbatim}
\end{varwidth}
\caption{Dhall configuration defaults for schema version 0.0.1-rc.2}
\label{fig:dhallschemadefaults}
\end{figure}

\n{3}{Possible alternatives}

While developing the program, the author has also come across certain shortcomings of Dhall, namely the long start-up with a \emph{cold cache}, which can generally be observed in the scenario of running the program in an environment that does not allow writing the cache files (a read-only filesystem), or does not keep the written cache files, such as a container that is not configured to mount a persistent volume at the pertinent location. When performing an evaluation, Dhall resolves every expression down to a combination of its most basic types (eliminating all abstraction and indirection) in a process called \textbf{normalisation}~\cite{dhallnorm} and then saves this result in the host's cache.
The \texttt{dhall-haskell} binary attempts to resolve the variable \texttt{XDG\_CACHE\_HOME} (have a look at the \emph{XDG Base Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the results of the normalisation will be written for repeated use. Do note that this behaviour has been observed on a GNU/Linux host and the author has not verified it on a non-GNU/Linux host, such as FreeBSD.

If normalisation is performed inside an ephemeral container (as opposed to, for instance, an interactive desktop session), the results effectively get lost on each container restart, which is both wasteful and not great for user experience, since the normalisation of just a handful of imports (which internally branch widely) can take upwards of two minutes, during which the user is left waiting for the hanging application with no reporting on the progress or current status.

While workarounds for the above-mentioned problem can be devised relatively easily (bind mount persistent volumes inside the container in place of \texttt{XDG\_CACHE\_HOME/dhall} and \texttt{XDG\_CACHE\_HOME/dhall-haskell} to preserve the cache between restarts, or let the cache be pre-computed during container build, since the application is only really expected to run together with a compatible version of the configuration schema and this version \emph{is} known at container build time), it would certainly feel better if there was no need to work \emph{around} the configuration system of choice.

Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely as a potentially almost drop-in replacement for Dhall feature-wise, while also resolving the costly \emph{cold cache} normalisation operations, which are in the author's view Dhall's principal issue.

\n{2}{Data integrity and authenticity}

The user can interact with the application via a web client, such as a browser, and is required to authenticate for all sensitive operations. To not only know \emph{who} the user is but also make sure they are \emph{permitted} to perform the action they are attempting, the program employs an \emph{authorisation} mechanism in the form of sessions. These are on the client side represented by cryptographically signed and encrypted (using 256-bit AES) HTTP cookies. That lays the foundations for a few things: the data saved into the cookies can be regarded as private, because, short of future \emph{quantum computers}, only the program itself can decrypt and access the data, and the data can be trusted, since it is both signed using a key that only the program controls and \emph{encrypted} with \emph{another} key that equally only the program holds.

The cookie data is only ever written \emph{or} read at the server side, solidifying the author's decision to let it be encrypted, as there is no point in leaving it unencrypted for some perceived client-side simplification. Users navigating the website send their session cookie in \textbf{every request} (if it exists) to the server, which then verifies the integrity of the data and, in case it is valid, determines the existence and potential amount of user privilege that should be granted. Public endpoints do not mandate the presence of a valid session by definition, while at protected endpoints the user is authenticated at every request. When a session expires or if there is no session to begin with, the user is either shown a \emph{Not found} error message, the \emph{Unauthorised} error message, or redirected to \texttt{/signin}.
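The program's session handling itself is not reproduced here, but the general signed-and-encrypted cookie mechanism can be sketched with the well-known \texttt{gorilla/securecookie} package, as in Figure~\ref{fig:gosecurecookie}; the snippet is an illustration under assumed names and keys, not \texttt{pcmt}'s actual code.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "crypto/rand"
    "fmt"

    "github.com/gorilla/securecookie"
)

func mustRandom(n int) []byte {
    b := make([]byte, n)
    if _, err := rand.Read(b); err != nil {
        panic(err)
    }
    return b
}

func main() {
    // hashKey authenticates (signs) the value, blockKey encrypts it;
    // a 32-byte block key selects AES-256.
    s := securecookie.New(mustRandom(64), mustRandom(32))

    // the session payload that ends up inside the cookie.
    value := map[string]string{"user": "leo"}

    encoded, err := s.Encode("pcmt_session", value)
    if err != nil {
        panic(err)
    }
    fmt.Println("cookie value sent to the client:", encoded)

    // only the holder of both keys (the server) can read the data back.
    decoded := map[string]string{}
    if err := s.Decode("pcmt_session", encoded, &decoded); err != nil {
        panic(err)
    }
    fmt.Println("decoded on the server:", decoded)
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative sketch of signed and encrypted cookie values}
\label{fig:gosecurecookie}
\end{figure}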
Another aspect that contributes to data integrity from another point of view is utilising database \emph{transactions} for bundling together multiple database operations that collectively change the \emph{state}. Using the transactional jargon, the data is only \emph{committed} if each individual change was successful. In case of any errors, the database is instructed to perform an atomic \emph{rollback}, which brings it back to a state before the changes were ever attempted.

The author has additionally considered the thought of utilising an embedded immutable database like immudb (\url{https://immudb.io}) for record keeping (verifiably storing data change history) and additional data integrity checks, e.g.\ for tamper protection purposes and similar; however, that work is yet to materialise.

\n{2}{User isolation}

Users are allowed into certain parts of the application based on the role they currently possess. For the moment, two basic roles were envisioned, while this list might get amended in the future, if the need arises:

\begin{itemize}
	\item Administrator
	\item User
\end{itemize}

It is paramount that the program protects itself from insider threats as well, and therefore each role is only able to perform actions that are explicitly assigned to it. While there definitely is a certain overlap between the capabilities of the two outlined roles, each also possesses unique features that the other does not. For example, the administrator role is not able to perform searches on the breach data directly using their administrator account; for that, a separate user account has to be devised. Similarly, the regular user is not able to manage breach lists and other users, because that is a privileged operation.

In-application administrators are not able to view sensitive (any) user data and should therefore only be able to perform the following actions:

\begin{itemize}
	\item Create user accounts
	\item View list of users
	\item View user email
	\item Change user email
	\item Toggle whether user is an administrator
	\item Delete user accounts
\end{itemize}

Let us consider a case when a user manages themselves: while demoting from administrator to a regular user is permitted, promoting oneself to administrator would constitute a \emph{privilege escalation} and would likely be a precursor to at least a \emph{denial of service} of sorts.

\n{2}{Zero trust principle}

\textit{Data confidentiality, i.e.\ not trusting the provider}

There is no way for the application (and consequently, the in-application administrator) to read the user's data. This is possible by virtue of encrypting the pertinent data before saving it in the database with a state-of-the-art \emph{age} key~\cite{age} (backed by X25519~\cite{x25519rfc7748}), which is in turn safely stored encrypted by a passphrase that only the user controls. Of course, the user-supplied password is run through a password-based key derivation function (PBKDF) before being used to encrypt the \emph{age} key.

The \emph{age} key is only generated when the user changes their password for the first time, to prevent scenarios such as an in-application administrator with access to the physical database being able to both \textbf{recover} the key from the database and \textbf{decrypt} it, given that they already know the user password (because they set it), which would subsequently give them unbounded access to any future encrypted data, as long as they would be able to maintain their database access.
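A condensed sketch of this generate-then-protect flow, using the \texttt{filippo.io/age} Go library, is shown in Figure~\ref{fig:goage}; it is an illustration only, with names chosen by the author for brevity, and does not claim to mirror \texttt{pcmt}'s actual implementation.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\scriptsize
\begin{verbatim}
package main

import (
    "bytes"
    "fmt"
    "io"

    "filippo.io/age"
)

func main() {
    // 1. generate a fresh X25519 age identity for the user.
    identity, err := age.GenerateX25519Identity()
    if err != nil {
        panic(err)
    }

    // 2. protect the identity with the user's passphrase; scrypt acts as
    //    the password-based key derivation function here.
    passRecipient, err := age.NewScryptRecipient("correct horse battery staple")
    if err != nil {
        panic(err)
    }

    var protectedKey bytes.Buffer
    w, err := age.Encrypt(&protectedKey, passRecipient)
    if err != nil {
        panic(err)
    }
    if _, err := io.WriteString(w, identity.String()); err != nil {
        panic(err)
    }
    if err := w.Close(); err != nil {
        panic(err)
    }
    // protectedKey now holds what can safely be stored in the database.

    // 3. user data is encrypted to the identity's public recipient and can
    //    only be read back by whoever can first unlock the identity, i.e.
    //    the user who knows the passphrase.
    var userData bytes.Buffer
    w, err = age.Encrypt(&userData, identity.Recipient())
    if err != nil {
        panic(err)
    }
    if _, err := io.WriteString(w, "tracked email: dude@b.cc"); err != nil {
        panic(err)
    }
    if err := w.Close(); err != nil {
        panic(err)
    }

    fmt.Println("encrypted key bytes: ", protectedKey.Len())
    fmt.Println("encrypted data bytes:", userData.Len())
}
\end{verbatim}
\end{varwidth}
\caption{Illustrative sketch of protecting an age key with a passphrase}
\label{fig:goage}
\end{figure}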
This is why the \emph{age} key generation and protection are bound to the first password change. Of course, the evil administrator could just perform the change themselves; however, the user would at least be able to find those changes in the activity logs and know not to use the application. But given the scenario of a total database compromise, the author finds all hope is already lost at that point.

Consequently, both the application operators and the in-application administrators should never be able to learn the details of what the user is tracking, the same being applicable even to potential attackers with direct access to the database. Thus the author maintains that every scenario that could potentially lead to a data breach (apart from a compromised user machine and the like) would have to entail some form of operating memory acquisition, for instance using \texttt{LiME}~\cite{lime}, or perhaps directly the \emph{hypervisor}, if considering a virtualised (``cloud'') environment.

\n{2}{Compromise Monitoring}

\n{3}{Have I Been Pwned? Integration}

TODO

\n{3}{Local Dataset Plugin}

Breach data from locally available datasets can be imported into the application by first making sure it adheres to the specified schema (have a look at the \emph{breach data schema} in Figure~\ref{fig:breachDataGoSchema}). If it does not (which is very likely with random breach data), it needs to be converted to a conforming form before importing it into the application, e.g.\ using a Python script or similar. Attempting to import data that does not follow the outlined schema would result in an error. Also, importing a dataset which is over a reasonable size limit would by default be rejected by the program as a precaution, since unmarshalling e.g.\ a 1 TiB document would likely result in an OOM situation on the host, assuming regular consumer hardware conditions, not HPC.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
type breachDataSchema struct {
    Name              string
    Time              time.Time
    IsVerified        bool
    ContainsPasswords bool
    ContainsHashes    bool
    HashType          string
    HashSalted        bool
    HashPepperred     bool
    ContainsUsernames bool
    ContainsEmails    bool
    Data              any
}
\end{verbatim}
\end{varwidth}
\caption{Breach Data Schema represented as a Go struct (imports from the standard library are assumed)}
\label{fig:breachDataGoSchema}
\end{figure}

The Go representation shown in Figure~\ref{fig:breachDataGoSchema} will in actuality be written and supplied by the user of the program as a YAML document. YAML was chosen for multiple reasons: relative ease of use (plain text, readable, support for inclusion of comments), its capability to store multiple \emph{documents} inside of a single file with most of the inputs implicitly typed as strings, while thanks to being a superset of JSON it sports machine readability. That should allow for documents similar to what can be seen in Figure~\ref{fig:breachDataYAMLSchema} to be ingested by the program, read and written by humans and programs alike.

\begin{figure}[h]
\centering
\begin{varwidth}{\linewidth}
\begin{verbatim}
---
name: Horrible breach
time: 2022-04-23T00:00:00Z+02:00
isVerified: false
containsPasswds: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    -
    - email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{verbatim}
\end{varwidth}
\caption{Example Breach Data Schema supplied to the program as a YAML file, optionally containing multiple documents}
\label{fig:breachDataYAMLSchema}
\end{figure}

Notice how the emails list in Figure~\ref{fig:breachDataYAMLSchema} misses one record, perhaps because it was not supplied or mistakenly omitted. This is a valid scenario (mistakes happen) and the application needs to be able to handle it. The alternative would be to require the user to prepare the data in such a way that the empty/partial records would be dropped entirely.

\n{2}{Database configuration}

The database schema is not created manually in the database; instead, an Object-relational Mapping (ORM) tool named ent is used. This allows defining the table schema and relations entirely in Go. The best part about ent is that there is no need to define supplemental methods on the models, since ent employs \emph{code generation}, which creates these based on the types of the attributes in the model and the respective relations. For instance, if an attribute is a string value \texttt{Email}, ent can be used to generate code that contains methods on the user object like the following:

\begin{itemize}
	\item EmailIn
	\item EmailEQ
	\item EmailNEQ
	\item EmailHasSuffix
\end{itemize}

\n{1}{Production}

It is, of course, recommended that the application runs in a secure environment, although definitions of that almost certainly differ depending on who you ask. General recommendations would be either to effectively reserve a machine for a single use case (running this program) so as to dramatically decrease the potential attack surface of the host, or to run the program isolated in a container or a virtual machine. Further, if the host does not need management access (it is a deployed-to-only machine that is configured out-of-band, such as with a \emph{golden} image/container or declaratively with Nix), then an SSH \emph{daemon} should not be running in it, since it is not needed. In an ideal scenario, the host machine would have as little software installed as possible besides what the application absolutely requires.

A demonstration of the above can be found in the multi-stage Containerfile that is available in the main sources. The resulting container image only contains a statically linked copy of the program, a default configuration file and the corresponding Dhall expressions cached at build time, which only support the main configuration file. Since the program also needs a database, an example scenario could include the container being run in a Podman pod together with the database, which would not have to be exposed from the pod and would therefore only be available over \texttt{localhost}. It goes without saying that the operator should substitute the values of any default configuration secrets with new ones that were securely generated.

\n{2}{Deployment recommendations}

\n{3}{Transport security}

Users connecting to the application should rightfully expect their data to be protected \textit{in transit} (i.e.\ on the way between their browser and the server), which is what the \emph{Transport Layer Security} family of protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning of HTTPS.
TLS utilises the primitives of asymmetric cryptography to let the client authenticate the server (verify that it is who it claims to be) and negotiate a symmetric key for encryption in the process named the \emph{TLS handshake} (see Section~\ref{sec:tls} for more details), the final purpose of which is establishing a secure communications connection. The operator should configure the program to either directly utilise TLS using configuration or have it listen behind a TLS-terminating \emph{reverse proxy}.

\n{3}{Containerisation}

Whether the pre-built or a custom container image is used to deploy the application, it still needs access to secrets, such as the database connection string (containing the database host, port, user, password/encrypted password, authentication method and database name). Currently, the application is able to handle \emph{peer}, \emph{scram-sha-256}, \emph{user name maps} and raw \emph{password} as Postgres authentication methods~\cite{pgauthmethods}, although the \emph{password} option should not be used in production, \emph{unless} the connection to the database is protected by TLS.\ In any case, using the \emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable, and one way to verify in a development environment that everything works as intended is the \emph{Password generator for PostgreSQL} tool~\cite{goscramsha256}, which allows obtaining the encrypted string from raw user input.

If the application running in a container wants to use the \emph{peer} authentication method, it is up to the operator to supply the Postgres socket to the application (e.g.\ as a volume bind mount). This scenario was not tested, however, and the author is also not entirely certain how \emph{user namespaces} (on GNU/Linux) would influence the process (given that the \emph{ID}s of a user \textbf{outside} the container are mapped to a range of \emph{UID}s \textbf{inside} the container), for which the setup would likely need to account.

Equally, if the application is running inside the container, the operator needs to make sure that the database is either running in a network that is also directly attached to the container or that there is a mechanism in place that routes the requests for the database hostname to the destination. One such mechanism is container name based routing inside \emph{pods} (Podman/Kubernetes), where the resolution of container names is the responsibility of a specially configured piece of software called Aardvark for the former and CoreDNS for the latter.

\n{1}{Validation}

\n{2}{Unit tests}

Unit testing is a hot topic for many people and the author does not count himself a staunch supporter of either extreme. The ``no unit tests'' camp seems to discount any benefit there is to unit testing, while the ``TDD-only'' (TDD, or Test Driven Development, is a development methodology whereby tests are written first, then a complementary piece of code that is supposed to be tested, just enough to get past the compile errors and to see the test fail; then the code is refactored to make the test pass, and then it can be fearlessly extended because the test is the safety net catching us when the developer slips and alters the originally intended behaviour) approach can be a little too much for some people's taste.
The author tends to sport a \emph{middle ground} approach here, writing enough tests where meaningful but not necessarily testing everything or writing tests prior to code, although arguably that practice should result in \emph{better} designed code, particularly because the code has to be thought about in advance if it needs to be tested \emph{first}.

Thanks to Go's built-in support for testing in its \texttt{testing} package and the tooling in the \texttt{go} tool, writing tests is relatively simple. Go looks for files whose names end with \texttt{\_test.go} in the present working directory, but can be instructed to look for test files in packages recursively found on any path using the ellipsis, like so: \texttt{go test ./path/to/package/\ldots}, which then \emph{runs} all the tests found and reports some statistics, such as the time it took to run the test or whether it succeeded or failed. To be precise, the test files also need to contain test functions, which are functions with the signature \texttt{func TestWhatever(t *testing.T)\{\}} and where the function prefix ``Test'' is just as important as the signature. Without it, the function is not considered to be a testing function despite having the required signature and is therefore \emph{not} executed during testing.

This test lookup behaviour, however, also has a neat side-effect: all the test files can be kept side by side with their regular source counterparts; there is no need to segregate them into a specially blessed \texttt{tests} folder or similar, which in the author's opinion improves readability. As a failsafe, in case no actual tests are found, the current behaviour of the tool is to print a note informing the developer that no tests were found, which is handy to learn if it was not intended/expected. When compiling regular source code, the Go files with \texttt{\_test} in the name are simply ignored by the build tool.

\n{2}{Integration tests}

Integrating with external software, namely the database in the case of this program, is designed to utilise the same mechanism that was mentioned in the previous section: Go's \texttt{testing} package. These tests verify that the code changes can still perform the same actions with the external software that were possible before the change, and they are run locally before every commit, and then in the CI after pushing to the remote.

\n{3}{func TestUserExists(t *testing.T)}

An example integration test shown in Figure~\ref{fig:integrationtest} can be seen to declare a helper function \texttt{getCtx() context.Context}, which takes no arguments and returns a new \texttt{context.Context} initialised with the value of the global logger, which is how the logger gets injected into the user module functions. The function \texttt{TestUserExists(t *testing.T)} first declares a database connection string and attempts to open a connection to the database. The database in use here is SQLite3 running in memory mode, meaning no file is actually written to disk during this process. Since the testing data is not needed after the test, this is deemed good enough. Next, a defer statement calling the \texttt{Close()} method on the database object is made, which is the idiomatic Go way of closing files and network connections (which are also an abstraction over files on UNIX-like operating systems such as GNU/Linux).
The \emph{defer} statement gets called after all of the other statements in the surrounding function have run, which makes sure no file descriptors (FDs) are leaked and the file is properly closed when the function returns.

In the next step a database schema creation is attempted, handling the potential error in a Go idiomatic way, which uses the return value from the function in an assignment to a variable declared in the \texttt{if} statement, and checks whether the \texttt{err} was \texttt{nil} or not. In case the \texttt{err} was not \texttt{nil}, i.e.\ \emph{there was an error in the callee function}, the condition evaluates to \texttt{true}, which is followed by entering the inner block. Inside the inner block the error is announced to the user (likely a developer running the test in this case) and the testing object's \texttt{FailNow()} method is called, which marks the test function as having failed and stops its execution, which in this case is the desired outcome, since if the database schema creation call fails there really is no point in continuing testing of user creation.

Conversely, if the schema does get created without an error, the code continues to declare a few variables: \texttt{username}, \texttt{email} and \texttt{ctx}, to which the context injected with the logger is saved. Some of them are subsequently passed into the \texttt{UsernameExists} function, context as the first argument, with the database pointer and username being passed next, while the \texttt{email} variable is only used at a later stage, but was declared here to give a sense of grouping. The error value returned from this function is again checked and, if everything goes well, the value of the \texttt{usernameFound} boolean is checked next. Since the database has just been created, there should be no users, which is checked in the next \texttt{if} statement. The same check is then performed for the earlier-declared user email, which is likewise expected not to be found.

The final statements of the described test attempt a user creation call, which is again checked for both error and \emph{nilability}. The test continues with more similar checks but it has been cut short for brevity. A neat thing about error handling in Go is that it allows for very easy checking of all paths, not just the \emph{happy path} where there are no issues.
\begin{figure}[!h]
\centering
\scriptsize
\begin{varwidth}{\linewidth}
\begin{verbatim}
// modules/user/user_test.go
package user

import (
    "context"
    "testing"

    "git.dotya.ml/mirre-mt/pcmt/ent/enttest"
    "git.dotya.ml/mirre-mt/pcmt/slogging"
    _ "github.com/xiaoqidun/entps"
)

func getCtx() context.Context {
    l := slogging.Init(false)
    ctx := context.WithValue(context.Background(), CtxKey{}, l)

    return ctx
}

func TestUserExists(t *testing.T) {
    connstr := "file:ent_tests?mode=memory&_fk=1"
    db := enttest.Open(t, "sqlite3", connstr)

    defer db.Close()

    if err := db.Schema.Create(context.Background()); err != nil {
        t.Errorf("failed to create schema resources: %v", err)
        t.FailNow()
    }

    username := "dude"
    email := "dude@b.cc"
    ctx := getCtx()

    usernameFound, err := UsernameExists(ctx, db, username)
    if err != nil {
        t.Errorf("error checking for username {%s} existence: %q",
            username, err,
        )
    }

    if usernameFound {
        t.Errorf("unexpected: user{%s} should not have been found",
            username,
        )
    }

    if _, err := EmailExists(ctx, db, email); err != nil {
        t.Errorf("unexpected: user email '%s' should not have been found",
            email,
        )
    }

    usr, err := CreateUser(ctx, db, email, username, "so strong")
    if err != nil {
        t.Errorf("failed to create user, error: %q", err)
        t.FailNow()
    } else if usr == nil {
        t.Error("got nil usr back")
        t.FailNow()
    }

    if usr.Username != username {
        t.Errorf("got back wrong username, want: %s, got: %s",
            username, usr.Username,
        )
    }

    // ...more checks...
}
\end{verbatim}
\end{varwidth}
\caption{Example integration test}
\label{fig:integrationtest}
\end{figure}

\n{2}{Click-ops}

% =========================================================================== %

\nn{Conclusion}

% =========================================================================== %