% =========================================================================== %
% Encoding: UTF-8 (žluťoučký kůň úpěl ďábelské ódy)
% =========================================================================== %

% =========================================================================== %
\nn{Introduction}

Passwords. Everybody reading this text most assuredly recalls at least
\emph{some} of their own. The security-minded person perhaps even dozens. They
are complex and at least twelve characters long. They are only ever used in
the one place they were created for. And they are definitely getting rotated
at least once a year. Or are they?

A token so ubiquitous that it becomes tiring for a human being to keep track
of all the places where it is required in some form or another. At some point,
it almost feels easier to stop caring and use the password intended for
\emph{the other site} for this one, too. What harm could that possibly do? The
answer depends on the services in question, their relevance to the person
being discussed, and also on \emph{how many other} services share this
password.

A service requires a registration? No problem, the password will be the name
of the cat plus the current year, so as to make it more secure. It is the
password rotation day again this month, and a handful of logins will be
disabled if their passwords are not changed in the next couple of hours. No
worries, it is already covered by a combination of the current month and the
name of the specific service for each of them. A neat system. But just in case
they get forgotten in the fragments of this hectic lifestyle, they need to be
written down on a sticky note. Not to worry, nobody knows, it is hidden under
the keyboard, it is practically invisible.

These are all examples of poor password practices on the user's side; some
might have been encouraged by circumstance, such as the too frequently forced
password rotation, while others can be ascribed to users not being
sufficiently well-versed in the intricacies of password hygiene. Inevitably,
these passwords are going to get treated accordingly, that is, misused, be it
by a nosy colleague who finds the sticky note, or, should the user account
ever become the target of an attack, the password's \emph{only} role, to
protect the access, will likely not stand much of a chance.

This thesis tangentially covers user-related issues like the ones described
above, but rather than attempting prevention, it mainly focuses on dealing
with the acute consequence of such behaviour: a password breach.

The thesis consists of two parts. The theoretical one offers an overview of
password-related topics and frames passwords and security in the web context,
in order to provide the necessary background for the second part of the
thesis. Cryptography topics such as hashing, encryption and entropy are
mentioned, and within the browser context a special spotlight is given to the
protocols powering the web: HTTP and TLS. The practical part discusses the
architecture, decision making, implementation details and validation methods
utilised when building a web application that enables users to monitor the
breach status of their credentials, utilising both an online API service and
local data imported into the program by the operators of the tool. The program
does not have many dependencies and is relatively lightweight, which means
that anybody with even a little experience should be able to run their own
private instance, if they so choose.
The purpose of the program is to allow users to learn whether their
credentials have been breached, while the reason for the breach might even be
considered secondary in importance. Breach data is not a publicly traded
commodity and is relatively hard to make sense of, given that we are talking
about literal \emph{terabytes} of data available, if there is even the
slightest interest in finding it online. Breaches happen and can, of course,
inform the decision to stay with or leave a service, but there is not always a
choice involved, or only a limited one. Either way, knowledge is light and as
such precedes informed decision-making. Abstracting away the ugly parts and
offering users an understandable interface would likely result in an improved
security posture, if anything.

The author has been striving to utilise modern tooling and development
practices in an effort to build a maintainable and long-lasting piece of
software that serves its users well. When deployed, it could provide real
value.

Terminology is located in Appendix~\ref{appendix:terms}, feel free to give it
a read.

% =========================================================================== %
\part{Theoretical part}

\n{1}{Cryptography primer}\label{sec:cryptographyprimer}

\n{2}{Encryption}

\textbf{TODO:} add \emph{why} we care and how it's going to be used.

\n{2}{Hash functions}

Hash functions are algorithms used to help with a number of things: integrity
verification, password protection, digital signatures, public-key encryption
and others. Hashes are used in forensic analysis to prove the authenticity of
digital artifacts, to uniquely identify a change-set within revision-based
source code management systems such as Git, Subversion or Mercurial, to detect
known-malicious software by anti-virus programs or by advanced filesystems in
order to verify block integrity and enable repairs, and also in many other
applications that each person using a modern computing device has come across,
such as when connecting to a website protected by the famed HTTPS.

The popularity of hash functions stems from a common use case: the need to
simplify reliably identifying a chunk of data. Of course, two chunks of data,
two files, frames or packets could always be compared bit by bit, but that can
get prohibitive from both a cost and an energy point of view relatively
quickly. That is where hash functions come in, since they are able to take a
long input and produce a short output, named a digest or a hash value. The
function also only works one way.

A file, or any original input data for that matter, cannot be reconstructed
from the hash digest alone by somehow \emph{reversing} the hashing operation,
since at the heart of any hash function there is essentially a compression
function.

Most alluringly, hashes are frequently used with the intent of
\emph{protecting} passwords by making them unreadable, while still being able
to verify that the user knows the password and therefore should be authorised.
As the hashing operation is irreversible, once the one-way function produces a
short digest, there is no way to reconstruct the original message from it.
That is, unless the input of the hash function is also known, in which case
all it takes is hashing the supposed input and comparing the resulting digest
with existing digests that are known to be digests of passwords.

\\ \textbf{TODO:} add more on \emph{why} we care and what types of hashes
should be used (with refs) and why.
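To illustrate the basic contract described above, the following minimal Go
sketch (standard library only, with made-up inputs) computes SHA-256 digests:
identical inputs always yield the same fixed-length digest, while even a
one-character change produces a completely different value.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Minimal sketch of computing and
comparing SHA-256 digests}, label=sha256sketch]
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Hashing the same input twice always yields the same digest...
	a := sha256.Sum256([]byte("hello world"))
	b := sha256.Sum256([]byte("hello world"))
	fmt.Printf("a: %x\n", a)
	fmt.Printf("equal: %v\n", a == b) // true

	// ...while a tiny change in the input produces a completely
	// different, fixed-length (256-bit) digest.
	c := sha256.Sum256([]byte("hello world!"))
	fmt.Printf("c: %x\n", c)
}
\end{lstlisting}
\vspace*{-\baselineskip}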
\n{3}{Types and use cases}

Hash functions can be loosely categorised based on their intended use case
into \emph{password protection hashes}, \emph{integrity verification hashes},
\emph{message authentication codes} and \emph{cryptographic hashes}. Each of
these possesses unique characteristics, and using the wrong type of hash
function for the wrong job can potentially result in a security breach.

As an example, consider \texttt{MD5}, a popular hash function built on the
\emph{Merkle-Damgård} construction, and \texttt{BLAKE3}, a modern function
built on a \emph{Merkle tree}. While the former produces 128 bit digests, the
latter by default outputs a 256 bit digest with no upper limit (thanks to the
Merkle tree extensibility). There is a list of differences that could further
be mentioned, however, they both have one thing in common: they are
\emph{designed} to be \emph{fast}. The latter, as a cryptographic hash
function, is conjectured to be \emph{random oracle indifferentiable} and
secure against length extension, but it is also in fact faster than all of
\texttt{MD5}, \texttt{SHA3-256}, \texttt{SHA-1} and even the \texttt{BLAKE2}
family of functions. The use case of both is to (quickly) verify the integrity
of a given chunk of data, in the case of \texttt{BLAKE3} with pre-image and
collision resistance in mind, not to secure a password by hashing it first,
which poses a big issue when used to\ldots{} secure passwords by hashing them
first.

Password hash functions such as \texttt{argon2} or \texttt{bcrypt} are good
choices for securely storing hashed passwords, namely because they place a CPU
and memory burden on the host computing the digest, as well as limit potential
parallelism, thus preventing the scale at which an exhaustive search could be
launched. Additionally, both functions automatically \emph{salt} the passwords
before hashing them, which means that two identical passwords of two different
users will not end up hashing to the same digest value, making it that much
harder to recover the original, supposedly weak password.

\n{3}{Why are hashes interesting}

As already mentioned, since hashes are often used to store a representation of
the password instead of the password itself, they become a subject of interest
when they get leaked. There have been enough instances of leaked raw passwords
that anyone with enough interest can put together a neat list of hashes of the
most popular passwords. So while the service does not store plain text
passwords, which is good, using a hashing function not designed to protect
passwords does not offer much additional protection in the case of weak
passwords, which are the most commonly used ones.

It seems logical that a service that is not using cryptographic primitives
correctly is more likely to get hacked and have its users' passwords/hashes
leaked. Then, the Internet ends up serving as a storage of every data dump,
often exposing these passwords/hashes for everyone to access.

\n{2}{TLS}\label{sec:tls}

The Transport Layer Security protocol (or TLS) serves as an encryption and
\emph{authentication} protocol to secure internet communications. An important
part of the protocol is the \emph{handshake}, during which the two
communicating parties exchange messages that acknowledge each other's
presence, verify each other, choose what cryptographic algorithms will be used
and derive session keys.
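The handshake parameters touched upon here and in the following paragraphs
(protocol version, cipher suites, certificates) are typically pinned in server
configuration. As a minimal, hedged Go sketch of such a configuration
(standard library only; the listening address and certificate paths are
placeholders), a server can refuse anything older than TLS 1.2 and leave the
TLS 1.3 suite selection entirely to the library:

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of pinning the minimum TLS
version of a Go HTTPS server}, label=tlsminversionsketch]
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello over TLS")
	})

	srv := &http.Server{
		Addr:    ":8443",
		Handler: mux,
		TLSConfig: &tls.Config{
			// Refuse anything older than TLS 1.2; TLS 1.3 suites
			// are fixed by the protocol and need no configuration.
			MinVersion: tls.VersionTLS12,
		},
	}

	// The certificate and key paths below are placeholders.
	if err := srv.ListenAndServeTLS("cert.pem", "key.pem"); err != nil {
		panic(err)
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}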
As there are multiple versions of the protocol in active use even at the
moment, the server together with the client need to agree upon the version
they are going to use (these days it is recommended to use either 1.2 or 1.3)
and pick cipher suites, the client verifies the server's public key (and the
signature of the certificate authority that issued it) and they both generate
session keys for use after handshake completion.

TLSv1.3 dramatically reduced the number of available suites to only include
the ones deemed secure enough, which is why it is no longer needed to manually
specify what cipher suite should be used (or rely on the client/server to
choose wisely). While possibly facing compatibility issues with legacy
devices, the simplicity that enabling TLSv1.3 brings is a worthy trade-off.

\n{1}{Passwords}\label{sec:passwords}

Passwords have been in use since ancient times; apparently already the Roman
sentries used passwords or \textit{watchwords} to discern who was allowed to
enter an area. The Roman army had a special system of distributing passwords
among the encampment members on a wooden tablet. Fast forward a couple of
thousand years, to the days of the Prohibition Era in the United States, and
it was the secret ``speakeasies'' that were protecting their illegitimate
alcohol-serving business using passwords~\cite{speakeasy}~\cite{nisthistory}.
During World War II, the US paratroopers' use of passwords evolved to even
include a counter-password.

According to McMillan, the first \textit{computer} passwords date back to the
mid-1960s' Massachusetts Institute of Technology (MIT), when researchers at
the university built a massive time-sharing computer called CTSS. Apparently,
\textit{even then} the passwords did not protect the users as well as they
were expected to~\cite{mcmillan}.

Traditionally, passwords were expected to be memorised, but the large number
of password-protected \emph{services} these days can make this impractical. To
list a few common examples, access to a bank account, an electronic mailbox or
the encrypted disk of a personal computer is each protected by some form of a
password.

A password still often consists of a \textit{string} of characters typed into
a prompt and its function is still the same: as per NIST, it enables the
\textit{verifier} to infer the \textit{claimant}'s identity via a secret the
claimant holds. There are always some arbitrary requirements applied to what
the password can be, only some of which turn out to be smarter than others.

Despite the impression given by the word ``password'', it does not need to be
an actual word, while a non-word (in the dictionary sense) may indeed be
harder to guess, which is a desirable property of passwords. A memorized
secret consisting of a sequence of words or other text separated by spaces is
sometimes called a passphrase. A passphrase is similar to a password in usage,
but the former is generally longer for added security.

\n{2}{Program-imposed constraints}

Some of the following examples might be a bit anecdotal and more of an
exception than a rule; nevertheless, when presented by a large-enough program
creator/service provider, their decisions reach a sufficient part of the
population, enough for the author to call them influential. They shape how
users think when creating passwords and affect what users expect from other
services they happen to visit and use from that point on, as well.
\n{3}{Short arbitrary length}

It has been observed that a requirement for a ``strong'' password generally
represents that a password is:
\begin{itemize}
	\item longer than 7 characters,
	\item shorter than 11 characters,
	\item begins with a letter and ends with a number OR
	\item begins with a number and ends with a letter.
\end{itemize}

\obr{Short arbitrary password length
limit~\cite{larsklint}}{fig:arbitrarypasswdlengthlimit}{.8}{graphics/arbitrarypasswdlengthlimit.jpg}

This is wrong for multiple reasons, and it is a classic example of a short
arbitrary length requirement. It essentially prevents users from using
passphrases, makes using a password manager impractical, and all of that has
apparently been done ``because of security''~\cite{etihad}. Moreover, this
might be indicative of the fact that instead of storing passwords hashed (as
they should be), the provider might be storing them in \textbf{plain text}.
Otherwise, what reason could exist for the limit to be 10 characters?

The recommendation of the US's National Institute of Standards and Technology
(NIST) in this regard is to require a minimum length of 8 characters and to
permit a maximum length of at least 64 characters, which should be sufficient
for most users' needs.

\n{3}{Restricting special characters}

Service providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords for as long as passwords have been
used to protect privileged access. Ways of achieving the same may vary but the
intent stays the same: preventing users from inputting characters into the
system which the system cannot comfortably handle, for ``reasons'', which are
usually something dubious along the lines of ``an apostrophe may be used in
SQL injection attacks'' or ``angle brackets may be used in XSS attacks''.
Instead, the real message this announces points right to the serious
shortcomings of the password handling of the site in question, as passwords
should never be re-displayed in a context that is prone to Cross Site
Scripting (XSS), and the passwords should always be hashed before being sent
to the database anyway, leaving only alphanumeric characters to be stored and
rendering the SQLi fears baseless.

\obr{Forbidden special characters in
passwords~\cite{forbiddencharacters}}{fig:forbiddencharacters}{.8}{graphics/forbiddencharacters.jpg}

Note that ``Passw0rd!'' would have been a perfectly acceptable password for
the validator displayed in Figure~\ref{fig:forbiddencharacters}. NIST's
recommendations on this are that all printing ASCII~\cite{asciirfc20}
characters as well as the space character SHOULD be acceptable in memorized
secrets, and Unicode~\cite{iso10646} characters SHOULD be accepted as well.

\n{3}{Character composition requirements}

There is a tendency to come up with bad passwords when character composition
requirements are in place, too. The reality is that instead of creating strong
passwords directly, most users first try a basic version and then keep
tweaking characters until the password ends up fulfilling the minimum
requirement. The \emph{problem} with that is that it has been shown that
people use similar patterns, i.e.\ starting with capital letters, putting a
symbol last and a number in the last two positions. This is also known to
cyber criminals cracking passwords, and they run their dictionary attacks
using the common substitutions, such as ``\$'' for ``s'', ``3'' for ``e'',
``1'' for ``l'', ``@'' for ``a'' etc.
The password created in this manner will almost certainly be bad, so all that
is achieved is frustrating the user in order to still arrive at a bad
password.

\n{3}{Other common issues}

Some services do not allow users to paste into password fields (disabling them
using JavaScript), thereby essentially breaking password manager
functionality, which is an issue because it encourages bad password practices
such as weak passwords and, likewise, password reuse.

Another frequent issue is forced frequent password rotation. Making frequent
password rotations mandatory contributes to users developing a password
creation pattern and is by now considered a security anti-pattern; according
to the British NCSC the practice ``carries no real benefits as stolen
passwords are generally exploited immediately''~\cite{ncsc}.

\n{1}{Web security}\label{sec:websecurity}

The internet, being the vast space of intertwined concepts and ideas, is a
superset of the Web, since not everything that is available on the internet
can be described as a web \emph{resource}. But it is precisely that part of
the internet that is discussed in the next sections, which cover what browsers
are, what they do and how they relate to web security.

\n{2}{Site Isolation}

Modern browsers such as Firefox or Chromium are built with a security focus in
mind. Their developers are acutely aware of the dangers posed by parsing
untrusted code from the internet, which is what websites, stylesheets and
scripts are. This is reflected in the way these programs are architected:
instead of the main browser process running everything, the main privileged
process spawns de-privileged child processes for each website. This extra line
of defence makes it ``harder for untrustworthy websites to access or steal
information'' from user accounts or other websites~\cite{siteisolation}, as
even if a misbehaving website manages to ``break some rules'' within its own
process, it should find it more difficult to steal data from other sites.

Firefox's version of Site Isolation is called Project Fission
(\url{https://hacks.mozilla.org/2021/05/introducing-firefox-new-site-isolation-security-architecture/}),
but the two are very similar in both architecture and what they try to
achieve. Some Chromium users have complained in the past about its high memory
usage, which is partially due to the Site Isolation features that protect the
user being enabled.

\n{2}{Cross-site scripting}\label{sec:xss}

As per the OWASP Top Ten list~\cite{owasptop10}, injection is the third most
observed issue across millions of websites. Cross-site scripting is a type of
attack in which scripts are injected into a website that would otherwise be
trusted. It occurs when a malicious user misuses the website to send malicious
code, such as a script, to another user. Since a misconfiguration or a flaw of
the application allowed this, the browser of the victim, which trusts the
website, simply executes the code provided by the attacker. This code thus
gains access to session tokens and any cookies associated with the website's
origin, apart from being able to rewrite the HTML content. The results of XSS
can range from account compromise to identity theft.

\n{2}{Content Security Policy}\label{sec:csp}

Content Security Policy (CSP) has been an important addition to the arsenal of
website operators, even though not everybody has necessarily been utilising it
properly or even taken notice.
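Before the mechanism is discussed in detail below, the following minimal Go
sketch shows the gist of deploying a policy: a standard-library
\texttt{net/http} middleware that attaches a CSP header to every response. The
policy string is only an illustrative placeholder, not the policy of the
application described in the practical part.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of a middleware attaching a
Content-Security-Policy header to every response}, label=cspmiddlewaresketch]
package main

import (
	"fmt"
	"net/http"
)

// withCSP wraps a handler and attaches a Content-Security-Policy header
// to every response it serves. The policy below is only an example.
func withCSP(next http.Handler) http.Handler {
	const policy = "default-src 'self'; script-src 'none'; " +
		"frame-ancestors 'none'"
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Security-Policy", policy)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	// Listen on localhost only; this is just a demonstration server.
	_ = http.ListenAndServe("127.0.0.1:8080", withCSP(mux))
}
\end{lstlisting}
\vspace*{-\baselineskip}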
To understand what guarantees it provides and what kind of protections it
employs, it is first necessary to grok how websites are parsed and displayed,
which has been discussed in depth in previous sections.

As per Weichselbaum et al.,\ CSP is a mechanism designed to mitigate
XSS~\cite{longliveCSP}, a long-lived king of the vulnerability
lists~\cite{reininginCSP}. It is a declarative policy mechanism that allows
the website operator to decide what client-side resources can load and what
origins are permitted. Scripts can be restricted to only load from a list of
trusted domains and inline scripts can be blocked completely. Scripts can also
be allowed based on a cryptographic hash of their content, which should be
information known to website operators prior to or at the time the scripts are
served.

CSP3, which is the current version, can be served either as a header or inside
the website's \texttt{<meta>} tag. Directives are generally named after a
topic keyword and suffixed with `-src', such as \texttt{script-src},
\texttt{img-src} or \texttt{style-src}, although some directives do not follow
this pattern, with \texttt{form-action}, \texttt{upgrade-insecure-requests}
and \texttt{sandbox} representing this group nicely. Sections are delimited
using semicolons and each section can only appear once. Special values exist
for the origin itself (\texttt{'self'}) and for disallowing any source
(\texttt{'none'}).

A good CSP is targeted and not overly broad. To give an example, a website
that loads no JavaScript at all does not need to allow a popular CDN (Content
Delivery Network) origin in its \texttt{script-src}; instead, the directive
should be set to \texttt{'none'}. CSP can also aid with clickjacking
protection using its \texttt{frame-ancestors} directive, which can limit the
origins allowed to embed the website, preventing attackers from embedding it
in arbitrary places, such as malicious websites that masquerade as legitimate
ones. A complete policy for a simple site that serves all of its assets itself
and runs no scripts could thus look like: \texttt{default-src 'none'; img-src
'self'; style-src 'self'; form-action 'self'; frame-ancestors 'none'}.

Getting CSP right can be tricky at first but once grokked, it is relatively
straightforward and can increase the security of the site greatly. While
testing, it is best to enable CSP in the report-only mode before turning it on
in production. There are many more directives and settings than mentioned in
this section; the author encourages anybody interested to give them a read,
e.g.\ at \url{https://web.dev/csp/}.

\n{1}{Configuration}

Every non-trivial program usually offers at least \emph{some} way to
tweak/manage its behaviour, and these changes are usually persisted
\emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a
\emph{LocalStorage} key-value store in the browser, or a binary or plain text
configuration file. These configuration files need to be read and checked at
least on program start-up and either stored in operating memory for the
duration of the runtime of the program, or loaded and parsed and the memory
subsequently \emph{freed} (initial configuration).

There is an abundance of configuration languages (or file formats used to
craft configuration files, whether they were intended for it or not)
available: TOML, INI, JSON, YAML, to name some of the popular ones (as of
today).
Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), and in its apparent core being very similar
to JSON, which adds to the familiar feel. In fact, in Dhall's authors' own
words it is: ``a programmable configuration language that you can think of as:
JSON + functions + types + imports''~\cite{dhalllang}.

Among all of the listed features, the especially intriguing one to the author
was the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating, for instance, the declaration
and usage of custom types (which are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool},
\emph{Natural} or \emph{List}, to name just a few), so it was not exceedingly
hard to start designing a custom configuration \emph{schema} for the program.
Dhall not being a Turing-complete language also guarantees that evaluation
\emph{always} terminates, which is a good attribute for a configuration
language to possess.

\n{3}{Safety considerations}

With a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are some safety mechanisms employed which can be
relied on by the user. Dhall offers this in multiple features: enforcing a
same-origin policy and (optionally) pinning a cryptographic hash of the value
of the expression being imported.

\n{3}{Possible alternatives}

While developing the program, the author has also come across certain
shortcomings of Dhall, namely a long start-up with a \emph{cold cache}, which
can generally be observed when running the program in an environment that
either does not allow writing the cache files (a read-only filesystem), or
does not keep the written cache files, such as a container that is not
configured to mount a persistent volume at the pertinent location.

To describe the way Dhall works when performing an evaluation: it resolves
every expression down to a combination of its most basic types (eliminating
all abstraction and indirection) in a process called
\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the
host's cache. The \texttt{dhall-haskell} binary attempts to resolve the
variable \texttt{\$\{XDG\_CACHE\_HOME\}} (have a look at the \emph{XDG Base
Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the
results of the normalisation will be written for repeated use. Do note that
this behaviour has been observed on a GNU/Linux host and the author has not
verified it on a non-GNU/Linux host, such as FreeBSD.

If normalisation is performed inside an ephemeral container (as opposed to,
for instance, an interactive desktop session), the results effectively get
lost on each container restart. That is both wasteful and not great for user
experience, since the normalisation of just a handful of imports (which
internally branch widely) can take upwards of two minutes, during which the
user is left waiting for the hanging application with no reporting on the
progress or current status.
While workarounds for the above-mentioned problem can be devised relatively
easily (such as bind mounting persistent volumes inside the container in place
of \texttt{\$\{XDG\_CACHE\_HOME\}/dhall} and
\texttt{\$\{XDG\_CACHE\_HOME\}/dhall-haskell} to preserve the cache between
restarts, or letting the cache be pre-computed during the container build,
since the application is only really expected to run together with a
compatible version of the configuration schema and this version \emph{is}
known at container build time), it would certainly feel better if there was no
need to work \emph{around} the configuration system of choice.

Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as a potentially almost drop-in replacement for Dhall feature-wise, while also
avoiding the costly \emph{cold cache} normalisation operations, which are, in
the author's view, Dhall's chief issue.

\n{1}{Compromise Monitoring}

There are, of course, several ways one could approach the monitoring of
compromised credentials, some more \emph{manual} in nature than others. When
using a service that is suspected/expected to be breached in the future, one
can always create a unique username/password combination specifically for the
subject service and never use that combination anywhere else. That way, if the
credentials ever \emph{do} happen to appear in a data dump online in the
future, it is going to be a safe assumption as to where they came from.

Unfortunately, the task of actually \emph{monitoring} the credentials can
prove to be a little more arduous than one could expect at first. There are a
couple of points that can pose a challenge in case the search is performed by
hand, namely:
\begin{itemize}
	\item finding the breached data to look through
	\item verifying the trustworthiness of the data
	\item varying quality of the data
	\item sifting through (possibly) unstructured data by hand
\end{itemize}

Of course, as this is a popular topic for a number of people, the
above-mentioned work has already been packaged into neat and practical online
offerings. In case one decides in favour of using those, an additional range
of issues (the previous ones still applicable) arises:
\begin{itemize}
	\item the need to trust the provider with input credentials
	\item relying on the goodwill of the provider to be able to access the data
	\item hoping that the terms of service are kept
\end{itemize}

Besides that, there is a plethora of breaches floating around the Internet
available simply as zip files, which makes the job even harder.

The overarching goal of this thesis is devising and implementing a system in
which the user can \emph{monitor} whether their credentials have been
\emph{compromised} (at least as far as the data can tell), and allowing them
to do so without needing to entrust their sensitive data to a provider.

\n{2}{Data Sources}\label{sec:dataSources}

A data source is here considered to be anything that provides the application
with data that it understands. Of course, the results of credential compromise
verification/monitoring are only going to be as good as the data underpinning
them, which is why it is imperative that high quality data sources be used, if
at all possible. While great care has to be taken to choose only the highest
quality data sources, the application must also offer the means to utilise
them.
The sources from which breached data can be loaded into an application can be
split into two basic categories: \textbf{online} or \textbf{local}, and it is
possible to further discern between \emph{structured} and \emph{unstructured}
data.

An online source is generally a service that ideally exposes a programmatic
API, which an application can query and from which it can request the
necessary subsets of data. These types of services often additionally front
the data with a user-friendly web interface for one-off searches, which is,
however, not of use here.

Among some examples of online services could be named:
\begin{itemize}
	\item {Have I Been Pwned?} - \url{https://haveibeenpwned.com}
	\item {DeHashed} - \url{https://dehashed.com}
\end{itemize}

Large lumps of unstructured data available on forums or shady web servers
would technically also count here, given that they provide data and are
available online. However, even though data is frequently found online
precisely in this form, it is also not of direct use for the application
without manual \emph{preprocessing}, as is attended to in
Section~\ref{sec:localDatasetPlugin}.

Another source is then simply any locally supplied data, which, of course,
could have been obtained from a breach available online beforehand. Locally
supplied data is specific in that it needs to be formatted in such a way that
it can be understood by the application. That is, the data is not in its raw
form anymore but has been morphed into the precise shape the application needs
for further processing. Once imported, the application can query the data at
will, as it knows exactly the shape of it. This supposes the existence of a
\emph{format} for importing, the schema of which is devised in
Section~\ref{sec:localDatasetPlugin}.

\n{3}{Local Dataset Plugin}\label{sec:localDatasetPlugin}

Unstructured breach data from locally available datasets can be imported into
the application by first making sure it adheres to the specified schema (have
a look at the \emph{Breach Data Schema} in Listing~\ref{breachDataGoSchema}).
If it does not (which is very likely with random breach data, as already
mentioned in Section~\ref{sec:dataSources}), it needs to be converted to a
form that \emph{does} before importing it to the application, e.g.\ using a
Python script or a similar method. Attempting to import data that does not
follow the outlined schema should result in an error. Equally so, importing a
dataset which is over a reasonable size limit should by default be rejected by
the program as a precaution. Unmarshaling, for instance, a 1 TiB document
would most likely result in an out-of-memory (OOM) situation on the host
running the application, assuming contemporary consumer hardware conditions
(not HPC).

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Breach Data Schema represented as a
Go struct with imports from the standard library assumed},
label=breachDataGoSchema]
type breachDataSchema struct {
	Name              string
	Time              time.Time
	IsVerified        bool
	ContainsPasswords bool
	ContainsHashes    bool
	HashType          string
	HashSalted        bool
	HashPeppered      bool
	ContainsUsernames bool
	ContainsEmails    bool
	Data              any
}
\end{lstlisting}
\vspace*{-\baselineskip}

The Go representation shown in Listing~\ref{breachDataGoSchema} will in
actuality translate to a YAML document written and supplied by an
administrative user of the program.
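The exact import code is part of the application itself; purely as a hedged
sketch of the size-limit precaution and multi-document ingestion just
described, such a YAML file could be read with \texttt{gopkg.in/yaml.v3}
roughly as follows. The 64~MiB limit, the file name and the struct tags are
illustrative placeholders only.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of a size-limited,
multi-document YAML import of the Breach Data Schema},
label=yamlimportsketch]
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"time"

	"gopkg.in/yaml.v3"
)

type breachDataSchema struct {
	Name              string    `yaml:"name"`
	Time              time.Time `yaml:"time"`
	IsVerified        bool      `yaml:"isVerified"`
	ContainsPasswords bool      `yaml:"containsPasswds"`
	ContainsHashes    bool      `yaml:"containsHashes"`
	HashType          string    `yaml:"hashType"`
	HashSalted        bool      `yaml:"hashSalted"`
	HashPeppered      bool      `yaml:"hashPeppered"`
	ContainsUsernames bool      `yaml:"containsUsernames"`
	ContainsEmails    bool      `yaml:"containsEmails"`
	Data              any       `yaml:"data"`
}

func main() {
	f, err := os.Open("breach.yaml") // placeholder file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Refuse to read more than 64 MiB as a crude OOM precaution.
	const limit = 64 << 20
	dec := yaml.NewDecoder(io.LimitReader(f, limit))

	// A single file may hold multiple YAML documents; decode them all.
	for {
		var d breachDataSchema
		if err := dec.Decode(&d); err != nil {
			if errors.Is(err, io.EOF) {
				break
			}
			panic(err)
		}
		fmt.Printf("imported breach %q from %s\n", d.Name, d.Time)
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}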
The YAML format was chosen for several reasons:
\begin{itemize}
	\item relative ease of use (plain text, readability)
	\item capability to store multiple \emph{documents} inside of a single file
	\item most of the inputs being implicitly typed as strings
	\item support for inclusion of comments
	\item machine readability thanks to being a superset of JSON
\end{itemize}

The last point specifically should allow for documents similar to what can be
seen in Listing~\ref{breachDataYAMLSchema} to be ingested by the program, read
and written by humans and programs alike.

\smallskip
\begin{lstlisting}[language=YAML, caption={Example Breach Data Schema supplied
to the program as a YAML file, optionally containing multiple documents},
label=breachDataYAMLSchema]
---
name: Horrible breach
time: 2022-04-23T00:00:00+02:00
isVerified: false
containsPasswds: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    -
    - email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{lstlisting}
\vspace*{-\baselineskip}

Notice how the emails list in Listing~\ref{breachDataYAMLSchema} is missing
one record, perhaps because it was not supplied or was mistakenly omitted.
This is a valid scenario (mistakes happen) and the application needs to be
able to handle it. The alternative would be to require the user to prepare the
data in such a way that the empty/partial records would be dropped entirely.

\n{3}{Have I Been Pwned? Integration}

Troy Hunt's \textbf{Have I Been Pwned?} online service
(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
compromised data. The service offers private APIs that are protected by API
keys. The application's \texttt{hibp} module and database representation model
the values returned by this API, which allows searching in large breaches
using email addresses.

The architecture there is relatively simple: the application administrator
configures an API key for HIBP, the user enters the query parameters, the
application constructs a query, calls the API and waits for a response. As the
API is rate-limited based on the key supplied, this can pose an issue, and it
has not been fully resolved in the UI. The application then parses the
returned data and binds it to the local model for validation. If that goes
well, the data is saved into the database as a cache and the search query is
performed on the saved data. If it returns anything, it is displayed to the
user for browsing.
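The application's own \texttt{hibp} module wraps this flow; as a rough sketch
of the underlying call, the HIBP v3 \texttt{breachedaccount} endpoint can be
queried as below. The API key value and the looked-up account are
placeholders, and the error handling is deliberately minimal.

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of querying the HIBP v3
breachedaccount endpoint}, label=hibpquerysketch]
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	account := "test@example.com" // placeholder account to look up
	apiKey := "REDACTED"          // placeholder HIBP API key

	u := "https://haveibeenpwned.com/api/v3/breachedaccount/" +
		url.PathEscape(account) + "?truncateResponse=false"

	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		panic(err)
	}
	// HIBP authenticates requests via the hibp-api-key header and
	// requires clients to identify themselves with a user agent.
	req.Header.Set("hibp-api-key", apiKey)
	req.Header.Set("User-Agent", "pcmt-example-sketch")

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		body, _ := io.ReadAll(resp.Body)
		fmt.Println(string(body)) // JSON list of breaches
	case http.StatusNotFound:
		fmt.Println("account not found in any breach")
	case http.StatusTooManyRequests:
		fmt.Println("rate limited; retry later")
	default:
		fmt.Println("unexpected status:", resp.Status)
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}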
System-wide cryptographic policies should target the highest feasible security
level, if at all available (such as by default on Fedora or RHEL), covering
the SSH, DNSSec, IPsec, Kerberos and TLS protocols. Firewalls should be
configured and SELinux (a kernel-level mandatory access control and security
policy mechanism) should be running in \emph{enforcing} mode, if available.

\n{2}{Transport security}

Users connecting to the application should rightfully expect their data to be
protected \textit{in transit} (i.e.\ on the way between their browser and the
server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims to be) and
negotiate a symmetric key for encryption in the process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications connection. The operator should
configure the program to either directly utilise TLS using configuration or
have it listen behind a TLS-terminating \emph{reverse proxy}.

\n{2}{Containerisation}

Whether the pre-built or a custom container image is used to deploy the
application, it still needs access to secrets, such as the database connection
string (containing the database host, port, user, password/encrypted password,
authentication method and database name). The application should be able to
handle the most common Postgres authentication
methods~\cite{pgauthmethods}, namely \emph{peer}, \emph{scram-sha-256},
\emph{user name maps} and raw \emph{password}, although the \emph{password}
option should not be used in production, \emph{unless} the connection to the
database is protected by TLS.\ In any case, using the
\emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable. One of
the ways to verify in a development environment that everything works as
intended is the \emph{Password generator for PostgreSQL}
tool~\cite{goscramsha256}, which allows retrieving the encrypted string from a
raw user input.

If the application running in a container wants to use the \emph{peer}
authentication method, it is up to the operator to supply the Postgres socket
to the application (e.g.\ as a volume bind mount). This scenario was not
tested, however, and the author is also not entirely certain how \emph{user
namespaces} (on GNU/Linux) would influence the process (as in when the
\emph{ID}s of a user \textbf{outside} the container are mapped to a range of
\emph{UID}s \textbf{inside} the container), for which the setup would likely
need to account.

Equally, if the application is running inside the container, the operator
needs to make sure that the database is either running in a network that is
also directly attached to the container or that there is a mechanism in place
that routes the requests for the database hostname to the destination. One
such mechanism is container-name-based routing inside \emph{pods}
(Podman/Kubernetes), where the resolution of container names is the
responsibility of a specially configured (often auto-configured) piece of
software, called Aardvark in the case of the former and CoreDNS in the case of
the latter.

\n{1}{Summary}

Passwords (and/or passphrases) are in use everywhere and quite probably will
be for the foreseeable future. If not as \textit{the} principal way to
authenticate, then at least as \textit{a} way to authenticate.
As long as passwords are going to be handled and stored by service/application
providers, they are going to get leaked, be it due to provider carelessness or
the attackers' resolve and wit. Of course, sifting through all the available
password breach data by hand is not a reasonable option, and therefore tools
providing assistance come in handy. The next part of this diploma thesis will
explore that issue and introduce a solution.

% =========================================================================== %
\part{Practical part}

\n{1}{Kudos}

The program that has been developed as part of this thesis relied on a great
deal of free (as in \textit{freedom}) and open-source software, either
directly or as an outstanding work tool, and the author would like to take
this opportunity to recognise that fact\footnotemark.

In particular, the author acknowledges that this work would not be the same
without:
\begin{itemize}
	\item vim (\url{https://www.vim.org/})
	\item Arch Linux (\url{https://archlinux.org/})
	\item ZSH (\url{https://www.zsh.org/})
	\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
	\item Nix (\url{https://nixos.org/explore.html})
	\item pre-commit (\url{https://pre-commit.com/})
	\item Podman (\url{https://podman.io/})
	\item Go (\url{https://go.dev/})
\end{itemize}

All of the code written has been typed into vim (\texttt{9.0}), the shell used
to run the commands was ZSH, both running in the author's terminal emulator of
choice, \texttt{kitty}. The development machines ran a recent installation of
\textit{Arch Linux (by the way)} and Fedora 38, both using a \texttt{6.3.x}
XanMod variant of the Linux kernel.

\footnotetext{\textbf{Disclaimer:} the author is not affiliated in any way
with any of the projects described on this page.}

\n{1}{Development}

The source code of the project has been versioned since the start, using the
popular and industry-standard git (\url{https://git-scm.com}) source code
management (SCM) tool. Commits were made frequently and, if at all possible,
for small and self-contained changes of code, trying to follow sane commit
message \emph{hygiene}, i.e.\ striving for meaningful and well-formatted
commit messages. The name of the default branch is \texttt{development}, since
that is what the author likes to choose for new projects that are not yet
stable (it is in fact the default in the author's \texttt{.gitconfig}).

\n{2}{Commit signing}

Since git allows cryptographically \emph{signing} all commits, it would be
unwise not to take advantage of this. For the longest time, GPG was the only
method available for signing commits in git; however, that is no longer the
case~\cite{agwagitssh}. These days, it is also possible to both sign and
verify one's git commits (and tags!) using SSH keys, namely those produced by
OpenSSH (the same ones that can be used to log in to remote systems).

The author has, of course, not reused the same key pair that is used to
connect to machines for signing commits. A different, \texttt{Ed25519}
elliptic curve key pair has been used specifically for signing. A public
component of this key is enclosed in this thesis as
Appendix~\ref{appendix:signingkey} for future reference.
The validity of a signature on a particular commit can be viewed with git
using the following commands (the \% sign denotes the shell prompt):
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Verifying signature of a git
commit}, label=gitverif,
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
% cd <repository>
% git show --show-signature <commit>
% # alternatively:
% git verify-commit <commit>
\end{lstlisting}
\vspace*{-\baselineskip}

There is one caveat to this, though: git first needs some additional
configuration for the code in Listing~\ref{gitverif} to work as one would
expect. Namely, the public key used to verify the signature needs to be stored
in git's ``allowed signers file'', then git needs to be told where that file
is located using the configuration value
\texttt{gpg.ssh.allowedsignersfile}, and finally the configuration value of
the \texttt{gpg.format} field needs to be set to \texttt{ssh}.

Because git allows the configuration values to be local to each repository,
both of the mentioned issues can be solved by running the following commands
from inside of the cloned repository:
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Prepare allowed signers file and
signature format for git}, label=gitsshprep,
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key.
% cat > ./.tmp-allowed_signers \
  <<<'surtur leo '
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./.tmp-allowed_signers
\end{lstlisting}
\vspace*{-\baselineskip}

After the code in Listing~\ref{gitsshprep} is run, everything from
Listing~\ref{gitverif} should remain applicable for the lifetime of the
repository or until git changes its implementation of signature verification.

The git \texttt{user.name} that can be seen on the commits in the
\textbf{Author} field is named after the machine that was used to develop the
program, since the author uses different signing keys on each machine. That
way the committer machine can be determined post-hoc. For future reference,
git has been used in the version \texttt{git version 2.40.1}.

\n{2}{Continuous Integration}

To increase both the author's and the public's confidence in the atomic
changes made over time, it was attempted to thoroughly \emph{integrate} them
using a continuous integration (CI) service that was plugged into the main
source code repository since the early stages of development. This, of course,
was again self-hosted, including the workers. The tool of choice there was
Drone (\url{https://drone.io}) and the ``docker'' runner (in fact it runs any
OCI container) was used to run the builds.

The way this runner works is that it creates an ephemeral container for every
pipeline step and executes the given \emph{commands} inside of it. At the end
of each step the container is discarded, while the repository, which is
mounted into each container's \texttt{/drone/src}, is persisted between steps,
allowing it to be cloned from \emph{origin} only at the start of the pipeline
and then shared for all of the following steps, saving bandwidth, time and
disk writes.

The entire configuration used to run the pipelines can be found in a file
named \texttt{.drone.yml} at the root of the main source code repository. The
workflow consists of four pipelines, which are run in parallel.
Two main pipelines are defined to build the frontend assets and the
\texttt{pcmt} binary and to run tests on \texttt{x86\_64} GNU/Linux targets,
one for each of Arch and Alpine (version 3.17). These two pipelines are
identical apart from OS-specific bits such as installing a certain package,
etc. For the record, other OS-architecture combinations were not tested.

A third pipeline contains instructions to build a popular static analysis tool
called \texttt{golangci-lint} from source and to then perform the analysis of
the project's codebase using the freshly built binary. The tool is a sort of
meta-linter, bundling a staggering amount of linters (a linter is a tool that
performs static code analysis and can raise awareness of programming errors,
flag potentially buggy code constructs, or \emph{mere} stylistic errors). If
the result of this step is successful, a handful of code analysis services get
pinged in the next steps to take notice of the changes to the project's source
code and update their metrics. Details can be found in the main Drone
configuration file \texttt{.drone.yml}, and the configuration for the
\texttt{golangci-lint} tool itself (such as what linters are enabled/disabled
and with what settings) can be found in the root of the repository in the file
named \texttt{.golangci.yml}.

The fourth pipeline focuses on linting the Containerfile and building the
container, although the latter action is only performed on feature branches,
\emph{pull requests} or \emph{tag} events.

\obr{Drone CI median build
time}{fig:drone-median-build}{.84}{graphics/drone-median-build}

The median build time as of writing was 1 minute, which includes running all
four pipelines, and that is acceptable. Build times might of course vary
depending on the hardware; for reference, these builds were run on a machine
equipped with a Zen 3 Ryzen 5 5600 CPU at nominal clock speeds, DDR4 3200 MHz
RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS) and a
400 Mbps downlink, software-wise running Arch with an author-flavoured XanMod
kernel, version 6.3.x.

\n{2}{Source code repositories}\label{sec:repos}

All of the pertaining source code was published in repositories on a publicly
available git server operated by the author, the reasoning in favour of
self-hosting being that it is the preferred way of guaranteeing autonomy over
one's source code, as opposed to large silos owned by big corporations, which
have a track record of arguably not always deciding with the user's best
interest in mind (although recourse has been observed~\cite{ytdl}). When these
providers act on impulse or under public pressure, they can potentially at
least temporarily disrupt their users' operations, thus binding their users
not only to lengthy \emph{terms of service} that \emph{can change at any
time}, but also to factors outside their control. Granted, decentralisation
can take a toll on the discoverability of the project, but that is not a
concern here.

The git repository containing the source code of the \texttt{pcmt} project:\\
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.

The git repository hosting the \texttt{pcmt} configuration schema:\\
\url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.

The repository containing the \LaTeX{} source code of this thesis:\\
\url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.

\n{2}{Toolchain}

Throughout the creation of this work, the \emph{then-current} version of the
Go programming language was used, i.e.\ \texttt{go1.20}. To read more on why
Go was chosen, see Appendix~\ref{appendix:whygo}.
Nix/\texttt{devenv} tools have also aided heavily during development, see
Appendix~\ref{appendix:whynix} to learn more.

\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
	\textbf{Name} & \textbf{Usage} \\
	Go programming language & program core \\
	Dhall configuration language & program configuration \\
	Echo & HTTP handlers, controllers, web server \\
	ent & ORM using graph-based modelling \\
	bluemonday & sanitising HTML \\
	TailwindCSS & stylesheets using a utility-first approach \\
	PostgreSQL & persistently storing data \\
}

Table~\ref{tab:depsversionmx} contains the names and versions of the most
important libraries and supporting software that were used to build the
application.

\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
	\textbf{Name} & \textbf{Version} \\
	\texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
	\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2\\
	\texttt{ent} (\url{https://entgo.io/}) & 0.11.10 \\
	\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.23 \\
	\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
	\texttt{PostgreSQL} (\url{https://www.postgresql.org/}) & 15.2 \\
}

\n{1}{Application architecture}

\n{2}{Package structure}

The source code of the main module is organised into smaller, self-contained
Go \emph{packages}, split appropriately along a couple of domains: logging,
core application, web routers, configuration and settings, etc. In Go,
packages are delimited by folder structure -- each folder can be a package.

Generally speaking, the program aggregates decision points into central
places, such as \texttt{run.go}, which then imports child packages that
facilitate each of the tasks of loading the configuration, connecting to the
database and running migrations, consolidating flag, environment variable and
configuration-based values into canonical \emph{settings}, setting up routes
and handling graceful shutdown.

\n{3}{Internal package}

The \texttt{internal} package was not used as of writing, but the author plans
to eventually migrate the \emph{internal} logic of the program into the
internal package to prevent accidental imports.

\n{2}{Logging}

The program uses dependency injection to share a single logger instance, and a
similar approach applies to the database client. These are passed around as
pointers, so the underlying data stays the same. As a rule of thumb, every
larger \texttt{struct} that needs to be passed around is passed around as a
pointer.

\n{2}{Authentication}

The authentication logic is relatively simple and the author attempted to
isolate it into a custom \emph{middleware}. User passwords are hashed using a
secure KDF before being sent to the database. The KDF of choice is
\texttt{bcrypt} (with a sane \emph{Cost} of 10), which automatically includes
a \emph{salt} for the password and provides constant-time hash comparisons.
The author plans to add support for the more modern \texttt{scrypt} and the
state-of-the-art PHC (Password Hashing Competition) winner algorithm
\texttt{Argon2} (\url{https://github.com/P-H-C/phc-winner-argon2}) for
flexibility.

\n{2}{SQLi prevention}

No raw SQL queries are directly used to access the database, thus decreasing
the likelihood of SQL injection attacks. Instead, parametric queries are
constructed in code using a graph-like API of the \texttt{ent} library, which
is attended to in depth in Section~\ref{sec:dbschema}.
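As a minimal sketch of the hashing step described in the Authentication
subsection above (the password value is a placeholder and the surrounding
application code is omitted), the \texttt{golang.org/x/crypto/bcrypt} package
is used roughly as follows:

\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Sketch of hashing and verifying a
password with bcrypt}, label=bcryptsketch]
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	password := []byte("correct horse battery staple") // placeholder

	// GenerateFromPassword salts the password internally; a cost of 10
	// matches the application's chosen default.
	hash, err := bcrypt.GenerateFromPassword(password, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(hash)) // safe to store in the database

	// On login, the stored hash is compared against the supplied
	// password; a nil error means the password matches.
	if err := bcrypt.CompareHashAndPassword(hash, password); err == nil {
		fmt.Println("password OK")
	}
}
\end{lstlisting}
\vspace*{-\baselineskip}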
\n{2}{Configurability}

Virtually any important value in the program has been made into a
configuration value, so that the operator can customise the experience as
needed. A choice of sane configuration defaults was attempted, which resulted
in the configuration file essentially only needing to contain secrets, unless
there is a need to override the defaults. It is not entirely a
\emph{zero-config} situation, rather a \emph{minimal-config} one. An example
can be seen in Section~\ref{sec:configuration}.

\n{2}{Embedded assets}

An important thing to mention is embedded assets and templates. Go has
multiple mechanisms to natively embed arbitrary files directly into the binary
during the regular build process. The built-in \texttt{embed} package was used
to bundle all template files and web assets, such as images, logos and
stylesheets, at the package level, and these are then also passed around the
application as needed.

There is also a toggle in the application configuration, which can instruct
the program at start-up to either rely entirely on the embedded assets or pull
live files from the filesystem. The former option makes the application more
portable, while the latter allows for flexibility, and not only during
development.

\n{2}{Composability}

Templates used for rendering of the web pages were created in a composable
manner, split into smaller, reusable parts, such as \texttt{footer.tmpl} and
\texttt{head.tmpl}. Specific functionality is then conditionally executed
based on the determined level of access of the user. The core templating
functionality was provided by the native Go \texttt{html/template} package. An
Echo renderer interface has been implemented, so that the rendering can be
performed ergonomically and directly using Echo's built-in facilities. A
popular HTML sanitiser, \emph{bluemonday}, has been employed to aid with
battling XSS.

\n{2}{Server-side rendering}

The application constructs the web pages entirely server-side and it runs
without a single line of JavaScript, of which the author is especially proud.
It improves load times, decreases the attack surface, increases
maintainability and reduces the cognitive load that is required when dealing
with JavaScript. Of course, that requires extensive usage of non-semantic
\texttt{POST}s in forms even for updates (where \texttt{PUT}s should be used)
and the accompanying frequent full-page refreshes, but that still is not
enough to warrant the use of JavaScript.

\n{2}{Frontend}

Frontend-side, the application was styled using TailwindCSS, which promotes
the use of flexible \emph{utility-first} classes directly in the markup (HTML)
instead of separating the specific styles out into all-encompassing classes.
The author understands this is somewhat of a preference issue and does not
hold hard opinions in either direction; Tailwind simply looked nice,
especially with its built-in support for dark/light mode. The templates
containing the CSS classes need to be parsed by Tailwind in order to construct
its final stylesheet, and there is an official CLI tool for that called
\texttt{tailwindcss}. Overall, simple and accessible layouts had preference
over convoluted ones.

\n{3}{Frontend experiments}

As an aside, the author has briefly experimented with WebAssembly for this
project, but has ultimately scrapped the functionality in favour of the
entirely server-side rendered one. It is possible that it would get revisited
if client-side dynamic functionality became necessary and performance
mattered.
\newpage
\n{2}{User isolation}
Users are allowed into certain parts of the application based on the role they currently possess. For the moment, two basic roles are envisioned, although this list might be amended in the future, should the need arise:
\begin{itemize}
\item Administrator
\item User
\end{itemize}
\obr{Application use case diagram}{fig:usecasediagram}{.9}{graphics/pcmt-use-case.pdf}
It is paramount that the program protects itself from insider threats as well, and therefore each role is only able to perform the actions it is explicitly assigned. While there certainly is some overlap between the capabilities of the two outlined roles, each also possesses unique features that the other one does not. For example, the administrator role is not able to perform searches on the breach data directly using their administrator account; for that, a separate user account has to be created. Similarly, the regular user is not able to manage breach lists and other users, because that is a privileged operation. In-application administrators are not able to view sensitive (in fact, any) user data and should therefore only be able to perform the following actions:
\begin{itemize}
\item Create user accounts
\item View list of users
\item View user email
\item Change user email
\item Toggle whether user is an administrator
\item Delete user accounts
\end{itemize}
Let us consider the case of a user managing their own account: while demoting oneself from administrator to a regular user is permitted, promoting oneself to administrator would constitute a \emph{privilege escalation} and would likely be a precursor to at least a \emph{denial of service} of sorts.
\n{2}{Zero trust principle}
\textit{Confidentiality, i.e.\ not trusting the provider}
There is no way for the application (and consequently, the in-application administrator) to read the user's data. This is achieved by encrypting the pertinent data with a state-of-the-art \emph{age} key~\cite{age} (backed by X25519~\cite{x25519rfc7748}) before saving it in the database, while the \emph{age} key itself is in turn stored encrypted with a passphrase that only the user controls. Of course, the user-supplied password is first run through a password-based key derivation function (PBKDF: a key derivation function with a tunable computational cost) before it is used to encrypt the \emph{age} key.
The \emph{age} key is only generated when the user changes their password for the first time. This prevents scenarios such as an in-application administrator with access to the physical database being able to both \textbf{recover} the key from the database and \textbf{decrypt} it, given that they already know the user's password (because they set it), which would subsequently give them unbounded access to any future encrypted data for as long as they maintained their database access. This is why the \emph{age} key generation and protection are bound to the first password change. Of course, an evil administrator could just perform the change themselves; however, the user would at least be able to find those changes in the activity logs and know not to use the application. And given the scenario of a total database compromise, the author finds that all hope is already lost at that point. At least when the database is dumped, it only contains non-sensitive, functional information in plain text; everything else should be encrypted.
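To illustrate this key-wrapping idea in code, the following is a hedged sketch using the reference Go implementation of \emph{age} (\texttt{filippo.io/age}); the function and variable names are invented for the example, and the application's actual code may differ, for instance in exactly which PBKDF it employs (\emph{age}'s built-in passphrase recipient uses scrypt).
\begin{lstlisting}[language=Go, caption={Illustrative sketch of wrapping an age key with a passphrase}, label=agewrapsketch, basicstyle=\linespread{0.9}\small\ttfamily]
// A sketch only: names are illustrative, not the application's code.
package example

import (
    "bytes"

    "filippo.io/age"
)

// wrapNewKey generates a fresh X25519 age identity and returns it
// encrypted ("wrapped") with a passphrase-derived key, so that only
// the holder of the passphrase can ever recover it.
func wrapNewKey(passphrase string) ([]byte, age.Recipient, error) {
    identity, err := age.GenerateX25519Identity()
    if err != nil {
        return nil, nil, err
    }

    // The scrypt recipient performs the password-based key derivation
    // before encrypting the payload.
    scryptRecipient, err := age.NewScryptRecipient(passphrase)
    if err != nil {
        return nil, nil, err
    }

    var buf bytes.Buffer

    w, err := age.Encrypt(&buf, scryptRecipient)
    if err != nil {
        return nil, nil, err
    }

    if _, err := w.Write([]byte(identity.String())); err != nil {
        return nil, nil, err
    }

    if err := w.Close(); err != nil {
        return nil, nil, err
    }

    // Sensitive user data is then encrypted to identity.Recipient(),
    // while only the wrapped secret key ever reaches the database.
    return buf.Bytes(), identity.Recipient(), nil
}
\end{lstlisting}
Decryption mirrors these steps with \texttt{age.NewScryptIdentity} and \texttt{age.Decrypt}, which is only possible for whoever knows the passphrase; a database dump alone therefore reveals nothing about the protected records.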
Consequently, both the application operators and the in-application administrators should never be able to learn the details of what the user is tracking, and the same applies even to potential attackers with direct access to the database. Thus the author maintains that every scenario that could potentially lead to a data breach (apart from a compromised user machine and the like) would have to entail some form of operating memory acquisition, for instance using \texttt{LiME}~\cite{lime}, or perhaps targeting the \emph{hypervisor} directly, if considering virtualised (``cloud'') environments.
\n{1}{Implementation}
\n{2}{Dhall Configuration Schema}\label{sec:configuration}
The configuration schema was initially developed as part of the main project's repository, before it was determined that both the development and overall clarity would benefit if the schema lived in its own repository (see Section~\ref{sec:repos} for details). This now enables the schema to be developed and versioned independently, and only pulled into the main application whenever the application is determined to be ready for it.
% \vspace{\parskip}
\smallskip
% \vspace{\baselineskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration schema version 0.0.1-rc.2}, label=dhallschema, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
let Schema =
      { Type =
          { Host : Text
          , Port : Natural
          , HTTP :
              { Domain : Text
              , Secure : Bool
              , AutoTLS : Bool
              , TLSKeyPath : Text
              , TLSCertKeyPath : Text
              , HSTSMaxAge : Natural
              , ContentSecurityPolicy : Text
              , RateLimit : Natural
              , Gzip : Natural
              , Timeout : Natural
              }
          , Mailer :
              { Enabled : Bool
              , Protocol : Text
              , SMTPAddr : Text
              , SMTPPort : Natural
              , ForceTrustServerCert : Bool
              , EnableHELO : Bool
              , HELOHostname : Text
              , Auth : Text
              , From : Text
              , User : Text
              , Password : Text
              , SubjectPrefix : Text
              , SendPlainText : Bool
              }
          , LiveMode : Bool
          , DevelMode : Bool
          , AppPath : Text
          , Session :
              { CookieName : Text
              , CookieAuthSecret : Text
              , CookieEncrSecret : Text
              , MaxAge : Natural
              }
          , Logger : { JSON : Bool, Fmt : Optional Text }
          , Init : { CreateAdmin : Bool, AdminPassword : Text }
          , Registration : { Allowed : Bool }
          }
      }
\end{lstlisting}
\vspace*{-\baselineskip}
The full schema with type annotations can be seen in Listing~\ref{dhallschema}. The \texttt{let} statement declares a variable called \texttt{Schema} and assigns to it the result of the expression on the right side of the equals sign, which has for practical reasons been trimmed and is displayed without the \emph{default} block. The default block is instead shown in its own Listing~\ref{dhallschemadefaults}.
The main configuration comprises both raw attributes and child records, which allow for grouping of related functionality. For instance, configuration settings pertaining to mail server setup are grouped in a record named \textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool}, which was deemed appropriate for an on-off, switch-like functionality, with the only permissible values being either \emph{True} or \emph{False}. Do note that in Dhall \texttt{true}~$\neq$~\texttt{True}, since \textbf{True} is internally a Bool constant built into Dhall (check out ``The Prelude''~\cite{dhallprelude}), while \textbf{true} is evaluated as an \emph{unbound} variable, that is, a variable \emph{not} defined in the current \emph{scope}.
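Before moving on, it is worth sketching how such a schema is consumed on the Go side. The snippet below is a hedged illustration only: the struct mirrors a small fragment of Listing~\ref{dhallschema}, and the application's actual loading code may differ, but the general idea of unmarshalling a Dhall expression into a plain Go struct with the \texttt{dhall-golang} library (see Table~\ref{tab:depsversionmx}) holds.
\begin{lstlisting}[language=Go, caption={Illustrative sketch of loading a Dhall configuration from Go}, label=dhallloadsketch, basicstyle=\linespread{0.9}\small\ttfamily]
// A sketch only: the struct mirrors just a fragment of the schema.
package example

import (
    "os"

    "github.com/philandstuff/dhall-golang/v6"
)

// Config shows how nested Dhall records map onto nested Go structs;
// Natural becomes uint and Text becomes string.
type Config struct {
    Host string
    Port uint
    HTTP struct {
        Domain  string
        Secure  bool
        AutoTLS bool
    }
    Registration struct {
        Allowed bool
    }
}

// loadConfig evaluates the Dhall expression stored at path and
// unmarshals the result into a Config value.
func loadConfig(path string) (*Config, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }

    var cfg Config
    if err := dhall.Unmarshal(raw, &cfg); err != nil {
        return nil, err
    }

    return &cfg, nil
}
\end{lstlisting}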
Another of Dhall's specialties is that the \texttt{==} and \texttt{!=} (in)equality operators \textbf{only} work on values of type \texttt{Bool}, which for example means that variables of type \texttt{Natural} (\texttt{uint}) or \texttt{Text} (\texttt{string}) cannot be compared directly, as they could be in other languages. This either leaves the work to a higher-level language (such as Go) or, from the perspective of the Dhall authors, promotes \emph{enums} whenever the value matters.
\newpage
% \vspace{\parskip}
\begin{lstlisting}[language=Haskell, caption={Dhall configuration defaults for schema version 0.0.1-rc.2}, label=dhallschemadefaults, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
      , default =
          -- | have sane defaults.
          { Host = ""
          , Port = 3000
          , HTTP =
              { Domain = ""
              , Secure = False
              , AutoTLS = False
              , TLSKeyPath = ""
              , TLSCertKeyPath = ""
              , HSTSMaxAge = 0
              , ContentSecurityPolicy = ""
              , RateLimit = 0
              , Gzip = 0
              , Timeout = 0
              }
          , Mailer =
              { Enabled = False
              , Protocol = "smtps"
              , SMTPAddr = ""
              , SMTPPort = 465
              , ForceTrustServerCert = False
              , EnableHELO = False
              , HELOHostname = ""
              , Auth = ""
              , From = ""
              , User = ""
              , Password = ""
              , SubjectPrefix = "pcmt - "
              , SendPlainText = True
              }
          , LiveMode =
              -- | LiveMode controls whether the application looks for
              -- | directories "assets" and "templates" on the filesystem or
              -- | in its bundled Embed.FS.
              False
          , DevelMode = False
          , AppPath =
              -- | AppPath specifies where the program looks for "assets" and
              -- | "templates" in case LiveMode is True.
              "."
          , Session =
              { CookieName = "pcmt_session"
              , CookieAuthSecret = ""
              , CookieEncrSecret = ""
              , MaxAge = 3600
              }
          , Logger = { JSON = True, Fmt = None Text }
          , Init =
              { CreateAdmin =
                  -- | if this is True, attempt to create a user with admin
                  -- | privileges with the password specified below
                  False
              , AdminPassword =
                  -- | used for the first admin, forced change on first login.
                  "50ce50fd0e4f5894d74c4caecb450b00c594681d9397de98ffc0c76af5cff5953eb795f7"
              }
          , Registration.Allowed = True
          }
      }

in Schema
\end{lstlisting}
\vspace*{-\baselineskip}
\vspace*{-\baselineskip}
\vspace*{-\baselineskip}
\n{2}{Data integrity and authenticity}
The user can interact with the application via a web client, such as a browser, and is required to authenticate for all sensitive operations. To not only know \emph{who} the user is but also make sure they are \emph{permitted} to perform the action they are attempting, the program employs an \emph{authorisation} mechanism in the form of sessions. On the client side, these are represented by cryptographically signed and encrypted (using 256-bit AES) HTTP cookies. That lays the foundation for a few things: the data saved into the cookies can be regarded as private, because, short of future \emph{quantum computers}, only the program itself can decrypt and access it, and the data can be trusted, since it is both signed using a key that only the program controls and \emph{encrypted} with \emph{another} key that equally only the program holds. The cookie data is only ever written \emph{or} read on the server side, solidifying the author's decision to have it encrypted, as there is no point in leaving it unencrypted for some perceived client-side simplification.
Users navigating the website send their session cookie (if it exists) with \textbf{every request} to the server, which subsequently verifies the integrity of the data and, if it is valid, determines whether and how much user privilege should be granted.
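The signed-and-encrypted cookie principle can be illustrated with a short, hedged sketch. The library actually used by the application is not discussed here; the \texttt{gorilla/securecookie} package below is purely a stand-in for demonstration, and in a real deployment the two secrets would come from the configuration (\texttt{CookieAuthSecret} and \texttt{CookieEncrSecret}) rather than being generated at start-up.
\begin{lstlisting}[language=Go, caption={Illustrative sketch of signed and encrypted session cookies}, label=cookiesketch, basicstyle=\linespread{0.9}\small\ttfamily]
// A sketch only: the session library and key handling are stand-ins.
package example

import (
    "net/http"

    "github.com/gorilla/securecookie"
)

// The hash key authenticates (signs) the value; the 32-byte block key
// enables AES-256 encryption of the cookie contents.
var s = securecookie.New(
    securecookie.GenerateRandomKey(64), // hash (signing) key
    securecookie.GenerateRandomKey(32), // block (encryption) key
)

// setSession stores an opaque, encrypted session value on the client.
func setSession(w http.ResponseWriter, userID string) error {
    encoded, err := s.Encode("pcmt_session", map[string]string{"uid": userID})
    if err != nil {
        return err
    }

    http.SetCookie(w, &http.Cookie{
        Name:     "pcmt_session",
        Value:    encoded,
        Path:     "/",
        HttpOnly: true,
        Secure:   true,
    })

    return nil
}

// readSession verifies and decrypts the cookie sent back by the client.
func readSession(r *http.Request) (map[string]string, error) {
    c, err := r.Cookie("pcmt_session")
    if err != nil {
        return nil, err
    }

    value := map[string]string{}
    if err := s.Decode("pcmt_session", c.Value, &value); err != nil {
        return nil, err
    }

    return value, nil
}
\end{lstlisting}
With a 32-byte block key the values are encrypted with AES-256, matching the 256-bit AES mentioned above, while the hash key provides the integrity (signing) part.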
Public endpoints do not mandate the presence of a valid session by definition, while at protected endpoints the user is authenticated on every request. When a session expires or there is no session to begin with, the user is either shown a \emph{Not found} error message, shown an \emph{Unauthorised} error message, or redirected to \texttt{/signin}; this behaviour is not uniform and depends on the resource and/or the endpoint.
Another aspect that contributes to data integrity, from a different point of view, is the use of database \emph{transactions} for bundling together multiple database operations that collectively change the \emph{state}. In transactional jargon, the data is only \emph{committed} if each individual change was successful; in case of any errors, the database is instructed to perform an atomic \emph{rollback}, which brings it back to the state before the changes were ever attempted.
The author has additionally considered utilising an embedded immutable database such as immudb (\url{https://immudb.io}) for record keeping (verifiably storing data change history) and additional data integrity checks, e.g.\ for tamper protection purposes and similar; however, that work is yet to materialise.
\n{2}{Database schema}\label{sec:dbschema}
The database schema is not created manually in the database. Instead, an Object-Relational Mapping (ORM) tool named \texttt{ent} is used, which allows defining the table schema and relations entirely in Go. The application does not need the database schema to be pre-created when it starts; it only requires a connection string providing access to the database for a reasonably privileged user. The best part about \texttt{ent} is that there is no need to define supplemental methods on the models, as with \texttt{ent} these are meant to be \emph{code generated} (in the older sense of the word, not with Large Language Models) into existence. Code generation creates files with actual Go models based on the types of the attributes in the database schema model, and the respective relations are transformed into methods on the receiver or functions taking object attributes as arguments. For instance, if the model has a string attribute \texttt{Email}, ent can be used to generate code that contains methods on the user object like the following:
\begin{itemize}
\item \texttt{EmailIn(pattern string)}
\item \texttt{EmailEQ(email string)}
\item \texttt{EmailNEQ(email string)}
\item \texttt{EmailHasSuffix(suffix string)}
\end{itemize}
These methods can then be imported into other packages, which makes working with the database a breeze.
\n{1}{Deployment}
\textbf{TODO}: mention how \texttt{systemd} aids in running the pod.
A deployment setup as suggested in Section~\ref{sec:deploymentRecommendations} is already partially covered by the multi-stage \texttt{Containerfile} that is available in the main sources. Once built, the resulting container image only contains a handful of things it absolutely needs:
\begin{itemize}
\item a statically linked copy of the program
\item a default configuration file and the corresponding Dhall expressions cached at build time
\item a recent CA certificate bundle
\end{itemize}
Since the program also needs a database to function properly, an example scenario includes the application container being run in a Podman \textbf{pod} together with the database.
That results in not having to expose the database to the entire host or outside of the pod at all; it is only available over the pod's \texttt{localhost}. It goes without saying that the default values of any configuration secrets should be substituted by the application operator with new, securely generated ones.
\n{2}{Rootless Podman}
Assuming rootless Podman is set up and the \texttt{just} tool is installed on the host, the application could be deployed by following a series of relatively simple steps:
\begin{itemize}
\item build (or pull) the application container image
\item create a pod with user namespacing, exposing the application port
\item run the database container inside the pod
\item run the application inside the pod
\end{itemize}
In concrete terms, it would resemble something along the lines of Listing~\ref{podmanDeployment}. Do note that all the commands are executed under the unprivileged \texttt{user@containerHost} that is running rootless Podman, i.e.\ it has \texttt{UID}/\texttt{GID} mapping entries in the \texttt{/etc/subuid} and \texttt{/etc/subgid} files \textbf{prior} to running any Podman commands.
% \newpage
\begin{lstlisting}[language=bash, caption={Example application deployment using rootless Podman}, label=podmanDeployment, basicstyle=\linespread{0.9}\small\ttfamily]
# From inside the project folder, build the image locally using kaniko.
just kaniko
# Create a pod.
podman pod create --userns=keep-id -p3005:3000 --name pcmt
# Run the database in the pod.
podman run --pod pcmt --replace -d --name "pcmt-pg" --rm \
  -e POSTGRES_INITDB_ARGS="--auth-host=scram-sha-256 \
  --auth-local=scram-sha-256" \
  -e POSTGRES_PASSWORD=postgres -v $PWD/tmp/db:/var/lib/postgresql/data \
  docker.io/library/postgres:15.2-alpine3.17
# Run the application in the pod.
podman run --pod pcmt --replace --name pcmt-og -d --rm \
  -e PCMT_LIVE=False \
  -e PCMT_DBTYPE="postgres" \
  -e PCMT_CONNSTRING="host=pcmt-pg port=5432 sslmode=disable \
  user=postgres dbname=postgres password=postgres" -v $PWD/config.dhall:/config.dhall:ro \
  docker.io/immawanderer/pcmt:testbuild -config /config.dhall
\end{lstlisting}
To summarise Listing~\ref{podmanDeployment}: first, the application container is built from inside the project folder using \texttt{kaniko}. Alternatively, the container image could be pulled from the container repository, but it makes more sense to show the image being built from sources, since the listing depicts a \texttt{:testbuild} tag being used. Next, a \emph{pod} is created and given a name, setting the port binding for the application. Then, the database container is started inside the pod. As a final step, the application container itself is run inside the pod. The application configuration named \texttt{config.dhall}, located in \texttt{\$PWD}, is mounted as a volume into the container's \texttt{/config.dhall}, providing the application with a default configuration. The container image does contain a default configuration for reference; however, running the container as is, without additional configuration, would fail, as it does not contain the necessary secrets.
\n{3}{Sanity checks}
Do also note that the application connects to the database using its \emph{container} name, i.e.\ not an IP address. That is possible thanks to Podman setting up DNS inside the pod in such a way that all containers in the pod can reach each other using their (container) names.
Interestingly, connecting via \texttt{localhost} would also work, as any container in the pod can reach any other container in the same pod via the pod's \texttt{localhost}. In fact, \emph{pinging} the database or application containers from an ad-hoc \texttt{alpine} container added to the pod yields:
\vspace{\parskip}
\begin{lstlisting}[language=bash, caption={Pinging pod containers using their names}, label=podmanPing, basicstyle=\linespread{0.9}\small\ttfamily]
user@containerHost % podman run --rm -it --user=0 --pod=pcmt \
  docker.io/library/alpine:3.18
/ # ping -c2 pcmt-og
PING pcmt-og (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.072 ms
64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.118 ms

--- pcmt-og ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.072/0.095/0.118 ms
/ # ping -c2 pcmt-pg
PING pcmt-pg (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: seq=0 ttl=42 time=0.045 ms
64 bytes from 127.0.0.1: seq=1 ttl=42 time=0.077 ms

--- pcmt-pg ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.045/0.061/0.077 ms
/ #
\end{lstlisting}
\vspace*{-\baselineskip}
The pod created in Listing~\ref{podmanDeployment} only set up the binding for the port used by the application (\texttt{5005/tcp}). The Postgres default port \texttt{5432/tcp} is not among the pod's port bindings, as can be seen in the pod creation command. This can also easily be verified using the command in Listing~\ref{podmanPortBindings}:
\begin{lstlisting}[language=bash, caption={Podman pod port bindings}, label=podmanPortBindings, basicstyle=\linespread{0.9}\small\ttfamily]
user@containerHost % podman pod inspect pcmt \
  --format="Port bindings: {{.InfraConfig.PortBindings}}\n\
Host network: {{.InfraConfig.HostNetwork}}"
Port bindings: map[3000/tcp:[{ 5005}]]
Host network: false
\end{lstlisting}
\vspace*{-\baselineskip}
To be absolutely sure, trying to connect to the database from outside of the pod (i.e.\ from the container host) should \emph{fail}, unless, of course, there is another process listening on that port:
\begin{lstlisting}[language=bash, caption={In-pod database is unreachable from the host}, breaklines=true, label=podDbUnreachable, basicstyle=\linespread{0.9}\small\ttfamily]
user@containerHost % curl localhost:5432
--> curl: (7) Failed to connect to localhost port 5432 after 0 ms: Couldn't connect to server
\end{lstlisting}
\vspace*{-\baselineskip}
The error in Listing~\ref{podDbUnreachable} is expected, as it is the result of the database port not being exposed from the pod.
\n{1}{Validation}
\n{2}{Unit tests}
Unit testing is a hot topic for many people and the author does not count himself a staunch supporter of either extreme. The ``no unit tests'' stance seems to discount any benefit there is to unit testing, while a ``TDD-only''\footnotemark{} approach can be a little too much for some people's taste. The author tends to prefer a \emph{middle ground} approach in this particular case, i.e.\ writing enough tests where meaningful, but not necessarily testing everything or writing tests prior to the business logic code. Arguably, following the practice of TDD should result in better-designed code, particularly because there needs to be prior thought about the shape and function of the code, as it is tested before it is even written, but it adds a slight inconvenience to what is otherwise a straightforward process.
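To ground the discussion before turning to specifics, the following is a small, hypothetical example of a plain Go unit test; the \texttt{normaliseEmail} function is invented purely for illustration and is not part of the application. The conventions it relies on, file naming and the \texttt{Test} prefix, are described right below.
\begin{lstlisting}[language=Go, caption={Hypothetical Go unit test}, label=unittestsketch, basicstyle=\linespread{0.9}\small\ttfamily]
// A hypothetical example; normaliseEmail merely stands in for any
// small, pure function worth unit testing.
package example

import (
    "strings"
    "testing"
)

func normaliseEmail(email string) string {
    return strings.ToLower(strings.TrimSpace(email))
}

func TestNormaliseEmail(t *testing.T) {
    // Table-driven tests are the idiomatic way of covering several
    // cases with a single test function.
    cases := []struct {
        name string
        in   string
        want string
    }{
        {"lowercases", "Dude@B.CC", "dude@b.cc"},
        {"trims whitespace", "  dude@b.cc ", "dude@b.cc"},
        {"leaves clean input alone", "dude@b.cc", "dude@b.cc"},
    }

    for _, c := range cases {
        c := c
        t.Run(c.name, func(t *testing.T) {
            if got := normaliseEmail(c.in); got != c.want {
                t.Errorf("normaliseEmail(%q) = %q, want %q", c.in, got, c.want)
            }
        })
    }
}
\end{lstlisting}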
Thanks to Go's built-in support for testing via its \texttt{testing} package and the tooling of the \texttt{go} tool, writing tests is relatively simple. Go looks for files whose names end in \texttt{\_test.go} in the present working directory, but it can be instructed to look for test files recursively in packages found on any path using the ellipsis, like so: \texttt{go test ./path/to/package/\ldots}, which then \emph{runs} all the tests found and reports some statistics, such as the time it took to run the tests or whether they succeeded or failed. To be precise, the test files also need to contain test functions, which are functions with the signature \texttt{func TestWhatever(t *testing.T)\{\}}, and the function name prefix ``Test'' is just as important as the signature. Without it, the function is not considered a testing function despite having the required signature, and is therefore \emph{not} executed during testing.
This test lookup behaviour, however, also has a neat side effect: all the test files can be kept side by side with their regular source counterparts; there is no need to segregate them into a specially blessed \texttt{tests} folder or similar, which in the author's opinion improves readability. As a failsafe, in case no actual tests are found, the current behaviour of the tool is to print a note informing the developer that no tests were found, which is handy to know if that was not intended or expected. When compiling regular source code, Go files with \texttt{\_test} in their name are simply ignored by the build tool.
\footnotetext{TDD, or Test Driven Development, is a development methodology whereby tests are written \emph{first}, then a complementary piece of code that is supposed to be tested is added, just enough to get past the compile errors and to see the test \emph{fail}, and only then is the code refactored to make the test \emph{pass}. The code can then be fearlessly extended, because the test is the safety net that catches the programmer when the mind slips and alters the originally intended behaviour of the code.}
\n{2}{Integration tests}
Integration with external software, namely the database in the case of this program, is designed to utilise the same mechanism that was mentioned in the previous section: Go's \texttt{testing} package. These tests verify that, after a change, the code can still perform the same actions with the external software as before the change; they are run locally before every commit and then again in the CI after pushing to the remote.
\n{3}{func TestUserExists(t *testing.T)}
The example integration test shown in Listing~\ref{integrationtest} declares a helper function \texttt{getCtx() context.Context}, which takes no arguments and returns a new \texttt{context.Context} initialised with the global logger, which is how the logger gets injected into the user module functions. The function \texttt{TestUserExists(t *testing.T)} first declares a database connection string and attempts to open a connection to the database. The database in use here is SQLite3 running in in-memory mode, meaning no file is actually written to disk during the process. Since the testing data is not needed after the test, this is deemed good enough. Next, a \texttt{defer} statement calling the \texttt{Close()} method on the database object is made, which is the idiomatic Go way of closing files and network connections (which are also an abstraction over files on UNIX-like operating systems such as GNU/Linux).
The \emph{defer} statement is executed after all the other statements in the surrounding function, which makes sure no file descriptors (FDs) are leaked and the file is properly closed when the function returns.
In the next step, database schema creation is attempted, with the potential error handled in an idiomatic Go way: the return value of the function is assigned to a variable declared in the \texttt{if} statement and checked for whether the \texttt{err} is \texttt{nil}. If the \texttt{err} is not \texttt{nil}, i.e.\ \emph{there was an error in the callee function}, the condition evaluates to \texttt{true} and the inner block is entered. Inside the inner block, the error is announced to the user (likely a developer running the test in this case) and the testing object's \texttt{FailNow()} method is called, which marks the test function as having failed and thus stops its execution. In this case, that is the desired outcome, since if the database schema creation call fails, there really is no point in continuing to test user creation.
Conversely, if the schema does get created without an error, the code continues by declaring a few variables: \texttt{username}, \texttt{email} and \texttt{ctx}, where the context injected with the logger is saved. Some of them are subsequently passed into the \texttt{UsernameExists} function: \texttt{ctx} as the first argument, followed by the database pointer and the username, while the \texttt{email} variable is only used at a later stage but was declared here to keep the related declarations together. The error value returned from this function is again checked and, if everything goes well, the value of the \texttt{usernameFound} boolean is checked next.
\smallskip
\smallskip
\begin{lstlisting}[language=Go, caption={Example integration test.}, label=integrationtest,basicstyle=\linespread{0.9}\scriptsize\ttfamily]
// modules/user/user_test.go
package user

import (
    "context"
    "testing"

    "git.dotya.ml/mirre-mt/pcmt/ent/enttest"
    "git.dotya.ml/mirre-mt/pcmt/slogging"
    _ "github.com/xiaoqidun/entps"
)

func getCtx() context.Context {
    l := slogging.Init(false)
    ctx := context.WithValue(context.Background(), CtxKey{}, l)

    return ctx
}

func TestUserExists(t *testing.T) {
    connstr := "file:ent_tests?mode=memory&_fk=1"
    db := enttest.Open(t, "sqlite3", connstr)
    defer db.Close()

    if err := db.Schema.Create(context.Background()); err != nil {
        t.Errorf("failed to create schema resources: %v", err)
        t.FailNow()
    }

    username := "dude"
    email := "dude@b.cc"
    ctx := getCtx()

    usernameFound, err := UsernameExists(ctx, db, username)
    if err != nil {
        t.Errorf("error checking for username {%s} existence: %q",
            username, err,
        )
    }

    if usernameFound {
        t.Errorf("unexpected: user{%s} should not have been found",
            username,
        )
    }

    if _, err := EmailExists(ctx, db, email); err != nil {
        t.Errorf("unexpected: user email '%s' should not have been found",
            email,
        )
    }

    usr, err := CreateUser(ctx, db, email, username, "so strong")
    if err != nil {
        t.Errorf("failed to create user, error: %q", err)
        t.FailNow()
    } else if usr == nil {
        t.Error("got nil usr back")
        t.FailNow()
    }

    if usr.Username != username {
        t.Errorf("got back wrong username, want: %s, got: %s",
            username, usr.Username,
        )
    }

    // ...more checks...
}
\end{lstlisting}
Since the database has just been created, there should be no users, which is checked in the next \texttt{if} statement. The same check is then performed for the earlier-declared user email, which is likewise not expected to be found.
The final statements of the described test attempt a user creation call, which is again checked for both an error and \emph{nilability}. The test continues with more similar checks, but it has been cut short for brevity.
A neat thing about error handling in Go is that it allows for very easy checking of all paths, not just the \emph{happy path} where there are no issues.
\n{2}{Testing environment}
The application has been deployed in a testing environment on the author's modest Virtual Private Server (VPS) at \texttt{https://testpcmt.dotya.ml}, protected by a short-lived ECDSA (\texttt{secp384r1}) TLS certificate issued by \emph{Let's Encrypt}, and configured with a strict CSP. It is a testing instance; therefore, limits to prevent abuse might be imposed.
\n{3}{Deployment validation}
\textbf{TODO}: show the results of testing the app in prod using \url{https://testssl.sh/}.
% =========================================================================== %
\nn{Conclusion}
The objective of the thesis has been to create the Password Compromise Monitoring Tool, aimed at security-conscious users who wish to validate their assumptions about the security of their credentials. The thesis opened by diving into cryptography topics such as encryption and briefly mentioned TLS. Additionally, security mechanisms such as Site Isolation and Content Security Policy, commonly employed by today's mainstream browsers, were introduced, and the reader learnt how Content Security Policy can be easily and dynamically configured. An extensive body of the thesis then revolved around the practical part, describing everything from the tooling used, through a high-level view of the application's architecture, to the implementation of specific parts of the application across the stack. Finally, the practical part concluded by broadly depicting the validation methods used to verify that the application works correctly.
The author would like to recognise that there are certain aspects of the thesis in need of further development. It is necessary to admit that not everything could have realistically been realised within the limited timespan and the scope imposed on the project to prevent it from diverging. The concerns mentioned above constitute clear candidates for the author's future work, as he intends to improve on the existing state, for example accessibility-wise. The author's unfamiliarity with accessibility tooling sometimes compromised the quality of this segment of the application, but it is a known deficiency. Furthermore, the list of future tasks may also include adding \emph{fuzzing} tests for the program, producing a Software Bill of Materials, utilising an additional immutable database, or unifying the frontend design language across the pages. The program does have a very solid core that, for instance, listens for OS signals, handles graceful shutdown and supports structured logging, but it still has room for improvement, even though its creation has been a best-effort endeavour. Due to a number of reasons mentioned earlier, it should not be called an utterly \emph{finished} project yet, but it can already serve a clear purpose.
% =========================================================================== %