% =========================================================================== %
|
|
% Encoding: UTF-8 (žluťoučký kůň úpěl ďábelšké ódy)
|
|
% =========================================================================== %
|
|
|
|
% =========================================================================== %
|
|
\nn{Introduction}
|
|
|
|
Passwords. Everybody reading this text most assuredly recalls at least
|
|
\emph{some} of their own. The security-minded person perhaps even dozens. They
|
|
are complex and at least twelve characters long. They are only ever used in the
|
|
one place they were created for. And they are definitely getting rotated at
|
|
least once a year. Or are they?
|
|
|
|
The password is a token so ubiquitous that it becomes tiring for a human being
to keep track of all the places where it is required in some form or another.
At some point, it almost feels easier to stop caring and use the password
intended for \emph{the other site} for this one, too. What harm could that
possibly do? The answer is that the harm can be unimaginable, depending on the
service in question, its relevance to the person being discussed, and also on
\emph{how many other} services share this password. A service requires a
registration? No problem, the password will be the name of the cat plus the
current year, so as to make it more secure. It is the password rotation day
again this month, and a handful of logins will be disabled if their passwords
are not changed in the next couple of hours. No worries, it is already covered
by a combination of the current month and the name of the specific service for
each of them. A neat system. But just in case they get forgotten in the
fragments of this hectic lifestyle, they need to be written down on a sticky
note. Not to worry, nobody knows, it is hidden under the keyboard, it is
practically invisible.
|
|
|
|
These are all examples of poor password practices on the user's side. Some are
encouraged by circumstances, such as overly frequent forced password rotation;
others can be ascribed to users not being sufficiently well-versed in the
intricacies of password hygiene.

Inevitably, these passwords are going to get misused, be it by a nosy
colleague who finds the sticky note or by an attacker targeting the user
account, in which case the password's \emph{only} role, to protect the access,
will likely not stand much of a chance.
|
|
|
|
This thesis tangentially covers user-related issues like the ones described
above, but rather than attempting to go for prevention, it mainly focuses on
dealing with the acute consequence of such behaviour: a password breach. The
thesis consists of two parts. The theoretical one offers an overview of
password-related topics and frames both passwords and security in the web
context in order to provide the necessary background for the second part of
the thesis. Cryptography topics such as hashing, encryption and entropy are
mentioned, and within the browser context a special spotlight is given to the
protocols powering the web: HTTP and TLS.
|
|
|
|
The practical part discusses the architecture, decision making, implementation
details and validation methods used when building a web application that
enables users to monitor the breach status of their credentials by querying an
online API service and local data imported into the program by the operators of
the tool. The program does not have many dependencies and is relatively
lightweight, which means that anybody with even a little experience should be
able to run their own private instance, if they so choose.
|
|
|
|
The purpose of the program is to allow users to learn if their credentials were
breached, while the reason for the breach might even be considered secondary in
importance. Breach data is not a publicly traded commodity and is relatively
hard to make sense of, given that there are literally \emph{terabytes} of it
available online to anyone with even the slightest interest in finding it.
Breaches happen and can, of course, inform the decision to stay with or leave a
service, but there is not always a choice involved, or only a limited one.
Either way, knowledge is light and as such precedes informed decision-making.
Abstracting away the ugly parts and offering users an understandable interface
would likely improve their security posture.
|
|
|
|
The author has been striving to utilise modern tooling and development
|
|
practices in an effort to build a maintainable and long-lasting piece of
|
|
software that serves its users well. When deployed, it could provide real
|
|
value.
|
|
|
|
|
|
% =========================================================================== %
|
|
\part{Theoretical part}
|
|
|
|
\n{1}{Terminology}
|
|
|
|
\n{2}{Linux}
|
|
|
|
The term \emph{Linux} is exclusively used in the meaning of the
|
|
Linux kernel~\cite{linux}.
|
|
|
|
|
|
\n{2}{GNU/Linux}
|
|
|
|
As far as a Linux-based operating system is concerned, the term ``GNU/Linux''
as defined by the Free Software Foundation~\cite{fsfgnulinux} is used. While it
is longer and arguably a little cumbersome, the author aligns with the opinion
that this term more correctly describes its actual target. Being aware that
many people conflate the complete operating system with its (albeit core)
component, the kernel, the author takes care to distinguish the two, although,
writing from experience, in colloquial use this probably brings more confusion
and a lengthy explanation is usually required.
|
|
|
|
|
|
\n{2}{Containers}
|
|
|
|
When the concept of \emph{containerisation} and \emph{containers} is mentioned
throughout this work, the author has OCI containers~\cite{ocicontainers} in
mind, which is broadly a superset of \emph{Linux Containers}. Here, a set of
processes is presented with a view of kernel resources (there are multiple
kinds of resources, such as IPC queues; network devices, stacks, ports; mount
points, process IDs, user and group IDs, cgroups and others) that differs for
each different set of processes. This is similar in thought to FreeBSD
\emph{jails}~\cite{freebsdjails}, with the distinction being that containers
are, of course, facilitated by the Linux kernel namespace
functionality~\cite{linuxnamespaces}, which is in turn regarded to be
\emph{inspired} by Plan 9's namespaces~\cite{plan9namespaces}, Plan 9 being a
Bell Labs successor to Unix 8th Edition, discontinued in 2015.
While there without a doubt \emph{is} specificity bound to using each of the
tools that enable building container images (Podman vs.\ Buildah vs.\ Docker
BuildX) or running containers (containerd vs.\ runC vs.\ crun), when
describing an action that gets performed with or onto a container, the process
should generally be explained in such a way that it is repeatable using any
spec-conforming tool that is available and \emph{intended for the job}.
|
|
|
|
\vspace*{-\baselineskip}
|
|
\n{2}{The program}
|
|
|
|
By \emph{the program} or \emph{the application} without any additional context
|
|
the author usually means the Password Compromise Monitoring Tool program.
|
|
|
|
|
|
\n{1}{Cryptography primer}\label{sec:cryptographyprimer}
|
|
|
|
\n{2}{Encryption}
|
|
|
|
Encryption is the process of transforming certain data, called a
|
|
\emph{message}, using, as Aumasson writes in Serious Cryptography, ``an
|
|
algorithm called a \emph{cipher} and a secret value called the
|
|
key''~\cite{seriouscryptography}. Its purpose is to protect the said message so
|
|
that only its intended recipients that know/hold the key are able to
|
|
\emph{decipher} and read it.
|
|
|
|
\n{3}{Symmetric encryption}
|
|
|
|
Symmetric encryption is simply the case in which the same \emph{key} is used
for both the encryption and the decryption operation.
|
|
|
|
\n{3}{Asymmetric encryption}
|
|
|
|
Asymmetric encryption is different from symmetric encryption in that there are
now two keys in use, which together form a key \emph{pair}. One part is used
solely for encryption, while the other part's only purpose is to decrypt. This
notion of two keys is generally transposed to a domain called \emph{public key
cryptography}, whereby the decryption component is declared \emph{private} and
the encryption component is called \emph{public}, hence the name. The rationale
is that everybody can encrypt messages \emph{for} the recipient but only the
recipient is able to \emph{decrypt} them, which is a feature allowed by the
mathematical complementarity of the two components, and also explains why the
private key should be kept \emph{private}. Compared to symmetric encryption,
this variant is generally slower.
|
|
|
|
\n{3}{The key exchange problem}
|
|
|
|
Suppose a communication scheme that is protected by a pre-shared secret.
In order to establish secure communications, this secret needs to be
distributed to the other party via untrusted channels. In 1976, Whitfield
Diffie and Martin Hellman published a paper in which they devised a
\emph{public-key distribution scheme}, which allows the two parties to arrive
at a shared secret by exchanging information via insecure channels, even in
the presence of an eavesdropper. This scheme (or its variations) is in use to
this day.
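
To make the idea more tangible, the following is a toy sketch of such an
exchange in Python. The modulus \texttt{P}, the base \texttt{G} and the helper
names \texttt{make\_keypair} and \texttt{shared\_secret} are assumptions chosen
purely for illustration; the parameters are far too small to offer any
security, and real deployments rely on vetted libraries and standardised
groups.

```python
import secrets

# Toy public parameters (assumptions for illustration, NOT secure sizes).
P = 0xFFFFFFFB  # small prime modulus, 2**32 - 5
G = 5           # public base


def make_keypair():
    """Return (private exponent, public value) for one party."""
    private = secrets.randbelow(P - 3) + 2  # secret exponent in [2, P - 2]
    public = pow(G, private, P)             # this value may travel in the open
    return private, public


def shared_secret(own_private, peer_public):
    """Combine the own private exponent with the peer's public value."""
    return pow(peer_public, own_private, P)


alice_private, alice_public = make_keypair()
bob_private, bob_public = make_keypair()

# Only the public values are exchanged, yet both sides derive the same secret.
assert shared_secret(alice_private, bob_public) == shared_secret(bob_private, alice_public)
```

An eavesdropper sees only \texttt{P}, \texttt{G} and the two public values;
recovering the secret from those is what is believed to be hard for properly
sized parameters.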
|
|
|
|
\n{2}{Hash functions}
|
|
|
|
Hash functions are cryptographic algorithms used to help with a number of
things: integrity verification, password protection, digital signatures,
public-key encryption and others. Hashes are used in forensic analysis to prove
the authenticity of digital artifacts, in revision-based source code management
systems such as Git or Mercurial to uniquely identify a change-set, in
anti-virus programs to detect known-malicious software and in advanced
filesystems to verify block integrity and enable repairs. They also appear in
many other applications that each person using a modern computing device has
come across, such as when connecting to a website protected by the famed HTTPS.
|
|
|
|
The popularity stems from a common use case: the need to identify a chunk of
data. Of course, two chunks of data, two files, frames or packets could always
be compared bit by bit, but that can get prohibitive from both a cost and an
energy point of view relatively quickly. That is where hash functions come in,
since they are able to take a long input and produce a short, fixed-length
output, named a digest or a hash value. It does not work the other way around:
a file cannot be reconstructed from the hash digest, as it is a one-way
function.
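
As a small illustration of the ``long input, short output'' property, the
sketch below uses SHA-256 from the Python standard library. The choice of
SHA-256 and the helper name \texttt{digest} are assumptions made for this
example, not something prescribed by the text above.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the input as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


small = digest(b"abc")
large = digest(b"a" * 1_000_000)   # a megabyte of input
changed = digest(b"abd")           # input differing in a single character

# The digest length does not depend on the input length ...
assert len(small) == len(large) == 64
# ... and even a tiny change of the input yields a different digest.
assert small != changed
```

Comparing two 64-character digests is much cheaper than comparing two large
files bit by bit, which is what makes hashes practical as identifiers.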
|
|
|
|
\n{3}{Rainbow tables}
|
|
|
|
As passwords are in more responsible scenarios stored not directly but as
hashes, attackers interested in recovering the passwords cannot simply read
them. Short of finding a critical vulnerability in the hash function, they
have to guess candidate passwords and compare the resulting hashes, and a
popular shortcut for that is a rainbow table. Rainbow tables are pre-computed
data structures that map hashes back to the passwords that were used to create
them. When attackers gain access to a password breach that contains hashes,
all it takes is to find a match within the rainbow table and resolve it back
to the known message: the password.
|
|
|
|
One of the popular counter-measures to pre-computed tables is adding a
\emph{salt} to the user-provided password before passing it to the KDF (Key
Derivation Function) or the hash function. Of course, the salt should be random
\textbf{per-user} and not reused, as with a shared salt two users with the
same password would still end up with the same hash, and the salt should also
be adequately long to be effective. As the salt is supposed to be
\emph{random}, it would be a good idea to use an actual CSPRNG
(cryptographically secure pseudorandom number generator), such as
\textbf{Fortuna}~\cite{fortuna}, as a source of randomness. In FreeBSD,
Fortuna is in fact the one serving \texttt{/dev/random}.
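
A minimal sketch of per-user salting could look as follows. It assumes
PBKDF2-HMAC-SHA256 as the KDF, a 16-byte salt and an iteration count of
100\,000; all three, as well as the helper names \texttt{hash\_password} and
\texttt{verify\_password}, are illustrative assumptions rather than
recommendations, and the operating system's CSPRNG is reached through the
\texttt{secrets} module.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor, chosen for illustration only
SALT_BYTES = 16       # assumed salt length


def hash_password(password, salt=None):
    """Derive a key from the password using a fresh random salt by default."""
    if salt is None:
        salt = secrets.token_bytes(SALT_BYTES)  # drawn from the OS CSPRNG
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, derived


def verify_password(password, salt, expected):
    """Recompute the derived key with the stored salt and compare safely."""
    _, derived = hash_password(password, salt)
    return hmac.compare_digest(derived, expected)


# Two users choosing the same password end up with different stored values.
salt_a, stored_a = hash_password("hunter2")
salt_b, stored_b = hash_password("hunter2")
assert stored_a != stored_b
assert verify_password("hunter2", salt_a, stored_a)
```

Because the salt is stored next to the derived key, a pre-computed table would
have to be rebuilt for every single user, which removes its main advantage.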
|
|
|
|
|
|
\n{3}{TLS}\label{sec:tls}
|
|
|
|
The Transport Layer Security protocol (or TLS) serves as an encryption and
\emph{authentication} protocol to secure internet communications. An important
part of the protocol is the \emph{handshake}, during which the two communicating
parties exchange messages that acknowledge each other's presence, verify each
other, choose what cryptographic algorithms will be used and decide session
keys. As there are multiple versions of the protocol in active use at the
moment, the server and the client need to agree upon the version they are
going to use (these days it should be 1.2 or 1.3) and pick a cipher suite
(TLSv1.3 dramatically reduced the number of available suites). The client then
verifies the server's public key (and the signature of the certificate
authority that issued it) and both parties generate session keys for use after
handshake completion.
|
|
|
|
The handshake consists of multiple stages (again, depending on the version), for
|
|
TLSv1.3 that would be:
|
|
|
|
\begin{itemize}
|
|
\item \textbf{Client hello}: client sends a client hello message containing
|
|
the protocol version, a list of cipher suites and the client random value.
|
|
The client in this step also includes the ephemeral Diffie-Hellman (EDH)
|
|
parameters, which are later used for calculating the pre-master key.
|
|
\item \textbf{Server generating a master secret}: the server has got the
|
|
cipher suites, the client's parameters and client random and already has
|
|
the server random, which means it can create the master secret.
|
|
\item \textbf{Server hello and ``Finished''}: the server includes in the
|
|
hello its certificate, digital signature, server random, the chosen
|
|
cipher suite, and sends a ``Finished'' (meaning \emph{ready}) message.
|
|
\item \textbf{Signature and certificate verification}: the client at this
|
|
step verifies server's certificate and signature, generates the master
|
|
secret and is ready (sends the ``Finished'' message).
|
|
\end{itemize}
|
|
|
|
At the end of the process, the connection is protected by symmetric encryption
using the session key that both parties have arrived at.
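
From the point of view of an application, most of the handshake is delegated
to a TLS library and only the policy is configured. The sketch below uses the
\texttt{ssl} module from the Python standard library to build a client-side
configuration that refuses protocol versions older than TLS 1.2 and keeps
certificate and host name verification enabled; the function name
\texttt{make\_client\_context} is a hypothetical helper introduced for this
example.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client configuration that verifies the server certificate."""
    # The default context loads the system CA store and enables
    # certificate verification as well as host name checking.
    context = ssl.create_default_context()
    # Refuse anything older than TLS 1.2 during version negotiation.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname
```

Such a context would then typically be used to wrap a TCP socket via its
\texttt{wrap\_socket} method together with the expected server host name, at
which point the handshake described above takes place.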
|
|
|
|
|
|
\n{1}{Passwords}\label{sec:passwords}
|
|
|
|
Passwords have been in use since ancient times. Apparently, Roman sentries
already used passwords or \textit{watchwords} to discern who was allowed
to enter an area. The Roman army had a special system of distributing passwords
among the encampment members on a wooden tablet. Fast forward a couple of
thousand years to the Prohibition Era in the United States, when it was the
secret ``speakeasies'' that protected their illegitimate alcohol-serving
business using passwords~\cite{speakeasy}~\cite{nisthistory}. During World
War II, the US paratroopers' use of passwords evolved to even include a
counter-password.
|
|
|
|
According to McMillan, the first \textit{computer} passwords date back to
|
|
mid-1960s' Massachusetts Institute of Technology (MIT), when researchers at the
|
|
university built a massive time-sharing computer called CTSS. Apparently,
|
|
\textit{even then} the passwords did not protect the users as well as they were
|
|
expected to~\cite{mcmillan}.
|
|
|
|
Traditionally, passwords were expected to be memorised, but the large number of
password-protected \emph{services} these days can make this impractical. To
list a few common examples, access to a bank account, an electronic mailbox or
the encrypted disk of a personal computer is protected by some form of a
password.
|
|
|
|
A password still often consists of a \textit{string} of characters typed into a
prompt and its function remains the same: as per NIST, it enables the
\textit{verifier} to infer the \textit{claimant}'s identity via a secret the
claimant holds.

There are always some arbitrary requirements applied to what the password can
be, only some turn out to be smarter than others.
|
|
|
|
Despite the impression given by the word ``password'', it does not need to be
an actual word; a non-word (in the dictionary sense) may indeed be harder
to guess, which is a desirable property of passwords. A memorized secret
|
|
consisting of a sequence of words or other text separated by spaces is
|
|
sometimes called a passphrase. A passphrase is similar to a password in usage,
|
|
but the former is generally longer for added security.
|
|
|
|
\n{2}{Program-imposed constraints}
|
|
|
|
Some of the following examples might be a bit anecdotal and more of an
exception than a rule; nevertheless, when presented by a large enough program
creator or service provider, such decisions reach a sufficient share of the
population for the author to call them influential. They shape how users think
when creating a password and affect what users expect from other services they
happen to visit and use from that point on, as well.
|
|
|
|
\n{3}{Short arbitrary length}
|
|
|
|
It has been observed that a requirement for a ``strong'' password generally
means that the password:

\begin{itemize}
\item is longer than 7 characters,
\item is shorter than 11 characters,
\item begins with a letter and ends with a number OR
\item begins with a number and ends with a letter.
\end{itemize}
|
|
|
|
\obr{Short arbitrary password length
|
|
limit~\cite{larsklint}}{fig:arbitrarypasswdlengthlimit}{.8}{graphics/arbitrarypasswdlengthlimit.jpg}
|
|
|
|
This is wrong for multiple reasons, and it is a classic example of a short
arbitrary length requirement. It essentially prevents users from using
passphrases, makes using a password manager impractical and all of that has
apparently been done ``because of security''~\cite{etihad}. Moreover, this
might be indicative of the fact that instead of storing passwords hashed (as
it should be), the service might be storing them in \textbf{plain text}.
Otherwise, what reason could exist for the limit to be 10 characters?
The recommendation of the US National Institute of Standards and Technology
(NIST) in this regard is that verifiers require a minimum of 8 characters and
permit passwords at least 64 characters long, which should be sufficient for
most users' needs.
|
|
|
|
\n{3}{Restricting special characters}
|
|
|
|
Service providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords for as long as passwords have been
used to protect privileged access. The ways of achieving this may vary but the
intent stays the same: preventing users from inputting characters that the
system cannot comfortably handle, for ``reasons'', which are
usually something dubious along the lines of ``an apostrophe may be used in SQL
injection attacks'' or ``angle brackets may be used in XSS attacks''. What
such a restriction really announces are serious shortcomings in the password
handling of the site in question, as passwords should never be
re-displayed in a context that is prone to Cross Site Scripting (XSS), and the
passwords should always be hashed before being sent to the database anyway,
leaving only an encoded digest (typically alphanumeric characters) and
rendering the SQLi fears baseless.
|
|
|
|
\obr{Forbidden special characters in
|
|
passwords~\cite{forbiddencharacters}}{fig:forbiddencharacters}{.8}{graphics/forbiddencharacters.jpg}
|
|
|
|
Note that ``Passw0rd!'' would have been a perfectly acceptable password for the
|
|
validator displayed in Figure~\ref{fig:forbiddencharacters}.
|
|
NIST's recommendations on this are that all printing ASCII~\cite{asciirfc20}
|
|
characters as well as the space character SHOULD be acceptable in memorized
|
|
secrets and Unicode~\cite{iso10646} characters SHOULD be accepted as well.
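
A permissive acceptance check in the spirit of the recommendations above might
be sketched as follows. The bounds \texttt{MIN\_LENGTH} and
\texttt{MAX\_LENGTH}, the use of NFKC normalisation and the function name
\texttt{accept\_password} are assumptions made for this illustration; the point
is merely that no character class is rejected and only the length is checked.

```python
import unicodedata

MIN_LENGTH = 8    # assumed lower bound, for illustration
MAX_LENGTH = 256  # assumed generous upper bound, for illustration


def accept_password(candidate: str) -> bool:
    """Accept any characters; only the normalised length is constrained."""
    # Normalise so that visually identical Unicode input compares equal
    # regardless of how the user's keyboard or OS composed it.
    normalised = unicodedata.normalize("NFKC", candidate)
    return MIN_LENGTH <= len(normalised) <= MAX_LENGTH


# Spaces, apostrophes, angle brackets and non-ASCII letters are all fine.
assert accept_password("correct horse battery staple")
assert accept_password("p@ss'word<script>")
assert accept_password("žluťoučký kůň")
assert not accept_password("short")
```

Since the value is hashed before it is stored and is never echoed back into a
page, there is no technical reason for the check to inspect individual
characters.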
|
|
|
|
\n{3}{Character composition requirements}
|
|
|
|
There is a tendency to come up with bad passwords when there are character
composition requirements in place, too. The reality is that instead of
creating strong passwords directly, most users first try a basic version and
then keep tweaking characters until the password ends up fulfilling the minimum
requirement.
The \emph{problem} with that is that people have been shown to use similar
patterns, e.g.\ starting with capital letters, putting a symbol last and a
number in the last two positions. This is also known to cyber criminals
cracking passwords, who run their dictionary attacks using the common
substitutions, such as ``\$'' for ``s'', ``3'' for ``E'', ``1'' for ``l'',
``@'' for ``a'' etc.
The password created in this manner will almost certainly be bad so all that is
achieved is frustrating the user in order to still arrive at a bad password.
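
The following deliberately naive sketch shows why such tweaks add little: once
trailing digits and symbols are stripped and the usual substitutions undone,
the ``complex'' password collapses back into a dictionary word. The word list
\texttt{COMMON\_WORDS}, the substitution table and the function name
\texttt{looks\_like\_common\_word} are hypothetical and kept tiny on purpose.

```python
# Hypothetical, tiny dictionary; real word lists contain millions of entries.
COMMON_WORDS = {"password", "welcome", "letmein"}

# Undo the common character substitutions mentioned above.
UNDO_SUBSTITUTIONS = str.maketrans(
    {"$": "s", "3": "e", "1": "l", "@": "a", "0": "o"}
)

# Characters users tend to append to satisfy composition rules.
TRAILING_DECORATION = "0123456789!?."


def looks_like_common_word(password: str) -> bool:
    """Return True if the password is a decorated dictionary word."""
    core = password.lower().rstrip(TRAILING_DECORATION)
    return core.translate(UNDO_SUBSTITUTIONS) in COMMON_WORDS


assert looks_like_common_word("Passw0rd!")
assert looks_like_common_word("P@$$w0rd2023")
assert not looks_like_common_word("correct horse battery staple")
```

A service could run a check of this kind against known-bad passwords instead
of enforcing composition rules, which is also the direction the thesis takes
with breach data.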
|
|
|
|
\n{3}{Other common issues}
|
|
|
|
Some services do not allow users to paste into password fields (disabling
pasting using JavaScript), thereby essentially breaking the password manager
functionality, which is an issue because it encourages bad password practices
such as weak passwords and, likewise, password reuse.

Another frequent issue is forced frequent password rotation. Making frequent
password rotations mandatory contributes to users developing a password
creation pattern and is a modern-day security anti-pattern: according to the
British NCSC, the practice ``carries no real benefits as stolen
passwords are generally exploited immediately''~\cite{ncsc}.
|
|
|
|
|
|
\n{1}{Web security}\label{sec:websecurity}
|
|
|
|
The internet, being the vast space of intertwined concepts and ideas, is a
superset of the Web, since not everything that is available on the internet can
be described as web \emph{resources}. The Web is precisely the part of the
internet that is discussed in the next sections, which cover what browsers are,
what they do and how they relate to web security.
|
|
|
|
|
|
\n{2}{Browsers}\label{sec:browsers}
|
|
|
|
Browsers, sometimes used together with the word that can serve as a real tell
for their specialisation (\emph{web} browsers), are programs intended for
\emph{browsing} of \emph{the web}. In more technical terms, browsers are
programs that facilitate (directly or via intermediary tools) domain name
lookups, connecting to web servers, optionally establishing a secure
connection, requesting the web page in question, determining its \emph{security
policy}, resolving what accompanying resources the web page specifies and,
depending on the applicable security policy, requesting those from their
respective origins, applying stylesheets and running scripts. Constructing a
program that can speak many protocols and securely run untrusted code from the
internet is no easy task.
|
|
|
|
\n{3}{Complexity}
|
|
|
|
Browsers these days are also quite ubiquitous programs running on
|
|
\emph{billions} of consumer grade mobile devices (which are also notorious for
|
|
bad update hygiene) or desktop devices all over the world. Regular users
|
|
usually expect them to work flawlessly with a multitude of network conditions,
|
|
network scenarios (the proverbial café WiFi, cellular data in a remote
|
|
location, home broadband that is DNS-poisoned by the ISP), differently tuned
|
|
(or commonly misconfigured) web servers, a combination of modern and
|
|
\emph{legacy} encryption schemes and different levels of conformance to web
|
|
standards from both web server and website developers. Of course, if a website
|
|
is broken, it is the browser's fault. Browsers are expected to detect if
|
|
\emph{captive portals} (a type of access control that usually tries to force
|
|
the user through a webpage with terms of use) are active and offer redirects.
|
|
All of this amounts to immense complexity, and the combination of ubiquity and
great exposure that this type of software gets is, in the author's opinion, the
cause behind the staggering number of vulnerabilities found, reported and fixed
in browsers every year.
|
|
|
|
\n{3}{Standardisation}
|
|
|
|
Over the years, a consortium of parties interested in promoting and developing
|
|
the web (also due to its potential as a digital marketplace, i.e.\ financial
|
|
incentives) and browser vendors (of which the most neutral participant is
|
|
perhaps \emph{Mozilla}, with Chrome being run by Google, Edge by Microsoft and
|
|
Safari/Webkit by Apple) has evolved a great volume of web standards, which are
|
|
also relatively frequently getting updated or deprecated and replaced by
|
|
revised or new ones, rendering the browser maintenance task into essentially a
|
|
cat-and-mouse game.
|
|
|
|
It is the web's extensibility that enabled this build-up and ironically has
been proclaimed by some to be its greatest asset. It has also ostensibly been
criticised~\cite{ddvweb} in the past and the frustration with the status
|
|
quo of web standards has relatively recently prompted a group of people to even
|
|
create ``\textit{a new application-level internet protocol for the distribution
|
|
of arbitrary files, with some special consideration for serving a lightweight
|
|
hypertext format which facilitates linking between files}'':
|
|
Gemini~\cite{gemini}\cite{geminispec} that in the words of its authors can be
|
|
thought of as ``\textit{the web, stripped right back to its essence}'' or as
|
|
``\textit{Gopher, souped up and modernised just a little}'', depending upon the
|
|
reader's perspective, noting that the latter view is probably more accurate.
|
|
|
|
\n{3}{HTTP}
|
|
|
|
Originally, HTTP was also designed just for fetching hypertext
|
|
\emph{resources}, but it has evolved since then, particularly due to its
|
|
extensibility, to allow for fetching of all sorts of web resources a modern
|
|
website of today provides, such as scripts or images, or even to \emph{post}
|
|
content back to servers.
|
|
|
|
HTTP relies on TCP (Transmission Control Protocol), which is one of the
|
|
\emph{reliable} (mandated by HTTP) protocols used to send data across
|
|
contemporary IP (Internet Protocol) networks, to deliver the data it requests
|
|
or sends. When Tim Berners-Lee invented the World Wide Web (WWW) in 1989 while
|
|
working at CERN (The European Organization for Nuclear Research) with a rather
|
|
noble intent as a ``\emph{wide-area hypermedia information retrieval initiative
|
|
to give universal access to a large universe of documents}''~\cite{wwwf}, he
|
|
also invented the HyperText Markup Language (HTML) to serve as a formatting
|
|
method for these new hypermedia documents. The first website was written
|
|
roughly the same way as today's websites are, using HTML, although the markup
|
|
language has changed since, with the current version being HTML5.
|
|
|
|
It has been mentioned that the client \textbf{requests} a \textbf{resource} and
|
|
receives a \textbf{response}, so those terms should probably be defined.
|
|
|
|
A request is what the client sends to the server. A resource is what it
|
|
requests and a response is the answer provided by the server.
|
|
|
|
HTTP follows a classic client-server model whereby it is \textbf{always} the
|
|
client that initiates the request.
|
|
|
|
A web page is, to be blunt, a chunk of \emph{hypertext}. To display a web page,
a browser first needs to send a request to fetch the HTML representing the
page, which is then parsed and additional requests for sub-resources are made.
If a page defines layout information in the form of CSS, that is parsed as
well.
|
|
|
|
A web page needs to be present on the local computer \emph{before} it can
be parsed by the browser, and since websites are usually still served by
programs called \emph{web servers} as in the \emph{early days}, that presents
the problem of how to tell the browser where the resource should be fetched
from. In today's browsers, the issue is sorted (short of the CLI) by the
\emph{address bar}, a place into which the user types what they wish the
browser to fetch for them.

The formal name of what is typed there is a \emph{Uniform Resource Locator},
or URL, and it contains the scheme (or the protocol, such as
\texttt{http://}), the host address or a domain name, an optional (TCP) port
number and the path to the resource.
|
|
|
|
Since a TCP connection needs to be established first, to connect to a server
whose URL only contains a domain name, the browser needs to perform a domain
name \emph{lookup} using system facilities. For a couple of notorious Chromium
versions, the browser also sent some additional and unrelated queries, which
(given the numbers of Chromium-based derivatives) ended up placing unnecessary
load directly on the root DNS servers~\cite{chromiumrootdns}.
|
|
|
|
If a raw IP address+port combination is used, the browser attempts to connect
to it directly and requests the user-requested page by default using the
\texttt{GET} \emph{method}. The \emph{well-known} HTTP port 80 is assumed
unless another port is explicitly specified, so the port can be omitted
regardless of whether the host is a domain name or an IP address.
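
How a URL decomposes into these parts, including the fallback to a well-known
port, can be sketched with \texttt{urllib.parse} from the Python standard
library. The helper name \texttt{describe}, the \texttt{DEFAULT\_PORTS} table
and the \texttt{example.com} URLs are assumptions made for the sake of the
example.

```python
from urllib.parse import urlsplit

# Well-known ports assumed when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe(url: str):
    """Split a URL into (scheme, host, effective port, path)."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path or "/"


# Explicit port and path.
assert describe("http://example.com:8080/index.html") == (
    "http", "example.com", 8080, "/index.html"
)
# Port omitted: the well-known HTTP port is assumed.
assert describe("http://192.0.2.1/") == ("http", "192.0.2.1", 80, "/")
```

The browser performs an equivalent decomposition before it resolves the host
and opens the TCP connection.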
|
|
|
|
The method is a way for the user-agent to define what operation it wants to
|
|
perform. \texttt{GET} is used for fetching resources while \texttt{POST} is
|
|
used to send data to the server, such as to post the values of an HTML form.
|
|
|
|
A server response consists of a \textbf{status code}, a status message,
HTTP \textbf{headers} and an optional \textbf{body} containing the content. The
status code indicates if the original request was successful or not and the
browser is generally there to interpret these status codes to the user. There
are enough status codes to be confused by the sheer numbers but luckily, there
is a method to the madness and they can be divided into groups/classes:
|
|
|
|
\begin{itemize}
|
|
\item 1xx: Informational responses
|
|
\item 2xx: Successful responses
|
|
\item 3xx: Redirection responses
|
|
\item 4xx: Client error responses
|
|
\item 5xx: Server error responses
|
|
\end{itemize}
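
Since the class is encoded in the first digit, mapping a status code to its
group is a one-line computation, as the following sketch shows. The function
name \texttt{status\_class} is a hypothetical helper.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the class of an HTTP status code based on its first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


assert status_class(200) == "successful"
assert status_class(404) == "client error"
assert status_class(503) == "server error"
```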
|
|
|
|
In case the \emph{user agent} (a web \emph{client}) such as a browser receives
|
|
a response with content, it has to parse it.
|
|
|
|
A header is additional information sent by both the server and the client that
can guide or alter the behaviour of software reading it. For instance, a
\texttt{Cache-Control} header with a \texttt{max-age} duration value can be
used by the server to signify that the client can store certain resources for
some time before needing to re-fetch them, as long as they are not
\emph{expired}.
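
As an illustration of how a client might read such a header, the sketch below
extracts the \texttt{max-age} directive from a \texttt{Cache-Control} value.
It is a simplified parser written for this example (the function name
\texttt{max\_age\_seconds} is hypothetical) and ignores most of the directives
a real cache has to honour.

```python
def max_age_seconds(header_value: str):
    """Return the max-age of a Cache-Control value in seconds, or None."""
    for directive in header_value.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


# The resource may be reused for an hour before it has to be re-fetched.
assert max_age_seconds("public, max-age=3600") == 3600
# No duration given: the client cannot assume any freshness lifetime.
assert max_age_seconds("no-store") is None
```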
|
|
|
|
\n{3}{Site Isolation}
|
|
|
|
Modern browsers such as Firefox or Chromium come with a security focus in mind.
|
|
Their developers are acutely aware of the dangers that parsing untrusted code
|
|
from the internet, which is the websites, the stylesheets and the scripts,
|
|
poses. This is reflected in the way these programs are architected: instead of
|
|
the main browser process running everything, the main privileged process spawns
|
|
de-privileged child processes for each website. This extra line of defence
|
|
makes it ``harder for untrustworthy websites to access or steal information''
|
|
from user accounts or other websites~\cite{siteisolation}, as even if the
|
|
misbehaving website manages to ``break some rules'' withing its own process, it
|
|
should find it more difficult to steal data from other sites. Firefox's version
|
|
of Site Isolation is called Project Fission
|
|
(\url{https://hacks.mozilla.org/2021/05/introducing-firefox-new-site-isolation-security-architecture/})
|
|
but the two are very similar in both architecture and what they try to achieve.
|
|
Some Chromium users have complained in the past about its high memory usage,
which is partially due to the enabled Site Isolation features that protect the
user.
|
|
|
|
|
|
\n{2}{Cross-site scripting}\label{sec:xss}
|
|
|
|
As per the OWASP Top Ten list~\cite{owasptop10}, injection is the third most
|
|
observed issue across millions of websites. Cross-site scripting is a type of
|
|
attack in which scripts are injected into a website that would otherwise be
|
|
trusted. It occurs when a malicious user misuses the website to send malicious
|
|
code, such as a script, to another user. Since a misconfiguration or a flaw
in the application allowed this, the browser of the victim that trusts the
|
|
website simply executes the code provided by the attacker. This code thus gains
|
|
access to session tokens and any cookies associated with the website's origin,
|
|
apart from being able to rewrite the HTML content. The results of XSS can
|
|
range from account compromise to identity theft.
|
|
|
|
\n{2}{Content Security Policy}\label{sec:csp}
|
|
|
|
Content Security Policy (CSP) has been an important addition to the arsenal of
|
|
website operators, even though not everybody has necessarily been utilising it
|
|
properly or even taken notice. To understand what guarantees it provides and
|
|
what kind of protections it employs, it is first necessary to grok how websites
|
|
are parsed and displayed, which has been discussed in depth in previous
|
|
sections.
|
|
|
|
As per Weichselbaum et al.\ CSP is a mechanism designed to mitigate
|
|
XSS~\cite{longliveCSP}, a long-lived king of the vulnerability
|
|
lists~\cite{reininginCSP}. It is a declarative policy mechanism that allows the
|
|
website operator to decide what client-side resources can load and what origins
|
|
are permitted. Scripts can be restricted to only load from a list of trusted
|
|
domains and inline scripts can be blocked completely. The scripts can also be
|
|
allowed based on a cryptographic hash of their content, which should be known
to website operators prior to or at the time the scripts are served.
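
To make the hash-based allow-listing more tangible, the following sketch
computes a CSP hash source for an inline script using the Go standard library.
The helper name \texttt{cspHashSource} and the sample script are illustrative
assumptions; the digest has to be computed over the exact text placed between
the \texttt{<script>} tags.

\vspace{\parskip}
\begin{lstlisting}[caption={Sketch: computing a CSP hash source for an inline script},
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
    "crypto/sha256"
    "encoding/base64"
    "fmt"
)

// cspHashSource returns a CSP source expression of the form
// 'sha256-<base64 digest>' for the exact text of an inline script.
func cspHashSource(script string) string {
    sum := sha256.Sum256([]byte(script))
    return "'sha256-" + base64.StdEncoding.EncodeToString(sum[:]) + "'"
}

func main() {
    inline := `document.title = "hello";`
    fmt.Printf("Content-Security-Policy: script-src %s\n",
        cspHashSource(inline))
}
\end{lstlisting}

Any change to the script, including whitespace, yields a different digest, so
the policy has to be regenerated whenever the script changes.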
|
|
|
|
CSP3, which is the current version, can be served either as a header or inside
|
|
the website's \texttt{<meta>} tag. Directives are generally named after a topic
|
|
keyword and suffixed `-src', such as \texttt{script-src}, \texttt{img-src} or
|
|
\texttt{style-src}, although some directives do not follow this pattern,
\texttt{form-action}, \texttt{upgrade-insecure-requests} and \texttt{sandbox}
representing this group nicely. Directives are delimited using semicolons and
each directive can only appear once. Special values exist for the origin itself
|
|
(\texttt{'self'}) and disallowing any source (\texttt{'none'}). A good CSP is
|
|
targeted and not overly broad. To give an example, a website that loads no
|
|
JavaScript at all does not need to allow a popular CDN (Content Delivery
|
|
Network) origin in its \texttt{script-src}; instead, it should be set to
\texttt{'none'}. CSP can also aid with clickjacking protection using its
\texttt{frame-ancestors} directive, which limits the origins that may embed
the website, preventing an attacker from embedding it at arbitrary places,
such as malicious websites that masquerade as legitimate ones.
|
|
|
|
Getting CSP right can be tricky at first but once grokked, it is relatively
|
|
straight-forward and can increase the security of the site greatly. While
|
|
testing, it is best to enable CSP in the report-only mode before turning it on
|
|
in production. There are many more directives and settings than mentioned in
this section; the author encourages anybody interested to read further, e.g.\
at \url{https://web.dev/csp/}.
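
The following sketch shows how a narrow policy for a JavaScript-free site might
be attached to every response, with a switch for the report-only mode. It
relies on the Go standard library only; the directive selection and the names
\texttt{policy} and \texttt{withCSP} are assumptions for illustration and do
not describe the configuration of \texttt{pcmt}.

\vspace{\parskip}
\begin{lstlisting}[caption={Sketch: serving a CSP header with a report-only switch},
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "strings"
)

// policy is a deliberately narrow CSP for a site that ships no JavaScript.
// The directive selection is illustrative only.
var policy = strings.Join([]string{
    "default-src 'none'",
    "script-src 'none'",
    "style-src 'self'",
    "img-src 'self'",
    "form-action 'self'",
    "frame-ancestors 'none'",
}, "; ")

// withCSP sets the policy on every response. While reportOnly is true,
// the browser reports violations instead of blocking the resources.
func withCSP(reportOnly bool, next http.Handler) http.Handler {
    name := "Content-Security-Policy"
    if reportOnly {
        name = "Content-Security-Policy-Report-Only"
    }
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set(name, policy)
        next.ServeHTTP(w, r)
    })
}

func main() {
    rec := httptest.NewRecorder()
    req := httptest.NewRequest(http.MethodGet, "/", nil)
    withCSP(true, http.NotFoundHandler()).ServeHTTP(rec, req)
    fmt.Println(rec.Result().Header.Get("Content-Security-Policy-Report-Only"))
}
\end{lstlisting}

Starting with \texttt{reportOnly} set and flipping it once no legitimate
resource triggers a violation keeps the rollout safe.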
|
|
|
|
\n{2}{Summary}
|
|
|
|
Passwords are in use everywhere and probably will be for the foreseeable
|
|
future. As long as passwords are going to be handled and stored by
|
|
service/application providers, they are going to get leaked, be it due to
|
|
provider carelessness or the attackers' resolve and wit. Of course, sifting
|
|
through all the available password breach data by hand is not a reasonable
|
|
option, and therefore tools should come in to provide assistance. The next part
|
|
of the thesis will explore that and offer a solution.
|
|
|
|
|
|
% =========================================================================== %
|
|
\part{Practical part}
|
|
|
|
\n{1}{Kudos}
|
|
|
|
\textbf{Disclaimer:} the author is not affiliated in any way with any of the
|
|
projects described on this page.
|
|
|
|
The \textit{Password Compromise Monitoring Tool} (\texttt{pcmt}) program has
been developed with the help of a great deal of free (as in Freedom) and
open-source software, used either directly or as an outstanding work tool, and
the author would like to take this opportunity to recognise that fact.
|
|
|
|
In particular, the author acknowledges that this work would not be the same
|
|
without:
|
|
|
|
\begin{itemize}
|
|
\item vim (\url{https://www.vim.org/})
|
|
\item Arch Linux (\url{https://archlinux.org/})
|
|
\item ZSH (\url{https://www.zsh.org/})
|
|
\item kitty (\url{https://sw.kovidgoyal.net/kitty/})
|
|
\item Nix (\url{https://nixos.org/explore.html})
|
|
\item pre-commit (\url{https://pre-commit.com/})
|
|
\item Podman (\url{https://podman.io/})
|
|
\item Go (\url{https://go.dev/})
|
|
\end{itemize}
|
|
|
|
All of the code written has been typed into VIM (\texttt{9.0}), the shell used
to run the commands was ZSH, both running in the author's terminal emulator of
choice, \texttt{kitty}, on a \raisebox{.8ex}{\texttildelow}8-month-old (at the
time of writing) installation of \textit{Arch Linux (by the way)} using a
|
|
\texttt{6.3.x-wanderer-zfs-xanmod1} variant of the Linux kernel.
|
|
|
|
|
|
\n{1}{Development}
|
|
|
|
The source code of the project has been versioned since the start, using the
|
|
popular and industry-standard git (\url{https://git-scm.com}) source code
|
|
management (SCM) tool. Commits were made frequently and, if at all possible,
|
|
for small and self-contained changes of code, trying to follow sane commit
|
|
message \emph{hygiene}, i.e.\ striving for meaningful and well-formatted commit
|
|
messages. The name of the default branch is \texttt{development}, since that is
|
|
what the author likes to choose for new projects that are not yet stable (it is
|
|
in fact the default in author's \texttt{.gitconfig}).
|
|
|
|
|
|
\n{2}{Commit signing}
|
|
|
|
Since git allows cryptographically \emph{signing} all commits, it would be
|
|
unwise not to take advantage of this. For the longest time, GPG was the only
|
|
method available for signing commits in git; however, that is no longer
|
|
applicable~\cite{agwagitssh}. These days, it is also possible to both sign and
|
|
verify one's git commits (and tags!) using SSH keys, namely those produced by
|
|
OpenSSH (the same ones that can be used to log in to remote systems). The
|
|
author has, of course, not reused the same key pair that is used to connect to
|
|
machines for signing commits. A different, \texttt{Ed25519} elliptic curve key
|
|
pair has been used specifically for signing. A public component of this key is
|
|
enclosed in this thesis as Appendix~\ref{appendix:signingkey} for future
|
|
reference.
|
|
|
|
The validity of a signature on a particular commit can be viewed with git using
|
|
the following commands (the \% sign denotes the shell prompt):
|
|
|
|
\vspace{\parskip}
|
|
\begin{lstlisting}[language=bash, caption={Verifying signature of a git commit},
|
|
label=gitverif, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
|
|
% cd <cloned project dir>
|
|
% git show --show-signature <commit>
|
|
% # alternatively:
|
|
% git verify-commit <commit>
|
|
\end{lstlisting}
|
|
|
|
There is one caveat, though: git first needs some additional configuration for
the code in Listing~\ref{gitverif} to work as one would expect. Namely, the
public key used to verify the signature needs to be stored in git's ``allowed
signers file'', git needs to be told where that file is using the
configuration value \texttt{gpg.ssh.allowedsignersfile}, and finally the
configuration value \texttt{gpg.format} needs to be set to \texttt{ssh}.
|
|
|
|
Because git allows the configuration values to be local to each repository,
|
|
all of the above can be arranged by running the following commands
|
|
from inside of the cloned repository:
|
|
|
|
\vspace{\parskip}
|
|
\begin{lstlisting}[language=bash, caption={Prepare allowed signers file and signature format for git},
|
|
label=gitsshprep, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
|
|
% # set the signature format for the local repository.
% git config --local gpg.format ssh
% # save the public key (the target directory has to exist).
% mkdir -p ./tmp
% cat >./tmp/.allowed_signers \
  <<<'leo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKwshTdBgLzwY4d8N7VainZCngH88OwvPGhZ6bm87rBO'
% # set the allowed signers file path for the local repository.
% git config --local gpg.ssh.allowedsignersfile ./tmp/.allowed_signers
|
|
\end{lstlisting}
|
|
|
|
After the code in Listing~\ref{gitsshprep} is run, everything from
Listing~\ref{gitverif} should remain applicable for the lifetime of the
repository or until git changes its implementation of signature verification.
The
|
|
git \texttt{user.name} that can be seen on the commits in the \textbf{Author}
|
|
field is named after the machine that was used to develop the program, since
|
|
the author uses different signing keys on each machine. That way the committer
|
|
machine can be determined post-hoc.
|
|
|
|
For future reference, git has been used in the version \texttt{git version
|
|
2.40.1}.
|
|
|
|
|
|
\n{2}{Continuous Integration}
|
|
|
|
To increase both the author's and public confidence in the atomic changes made
|
|
over time, it was attempted to thoroughly \emph{integrate} them using a
|
|
continuous integration (CI) service that was plugged into the main source code
|
|
repository since the early stages of development. This, of course, was again
|
|
self-hosted, including the workers. The tool of choice there was Drone
|
|
(\url{https://drone.io}) and the ``docker'' runner (in fact it runs any OCI
|
|
container) was used to run the builds.
|
|
|
|
The way this runner works is that it creates an ephemeral container for every
|
|
pipeline step and executes given \emph{commands} inside of it. At the end of
|
|
each step the container is discarded, while the repository, which is mounted
into each container at \texttt{/drone/src}, is persisted between steps,
allowing it to be cloned from \emph{origin} only once at the start of the
pipeline and
|
|
then shared for all of the following steps, saving bandwidth, time and disk
|
|
writes.
|
|
|
|
The entire configuration used to run the pipelines can be found in a file named
|
|
\texttt{.drone.yml} at the root of the main source code repository. The
|
|
workflow consists of four pipelines, which are run in parallel. Two main
|
|
pipelines are defined to build the frontend assets, the \texttt{pcmt} binary
|
|
and run tests on \texttt{x86\_64} GNU/Linux targets, one for each of Arch and
|
|
Alpine (version 3.17). These two pipelines are identical apart from OS-specific
|
|
bits such as installing a certain package, etc. For the record, other
|
|
OS-architecture combinations were not tested.
|
|
|
|
A third pipeline contains instructions to build a popular static analysis tool
called \texttt{golangci-lint} from sources and then analyse the project's
codebase using the freshly built binary. \texttt{golangci-lint} is a sort of
meta-linter, bundling a staggering number of linters (a linter is a tool that
performs static code analysis and can raise awareness of programming errors,
flag potentially buggy code constructs, or \emph{mere} stylistic errors). If
the result of this step is successful, a handful of code analysis services get
pinged in the next steps to take notice of the changes to the project's source
code and update their metrics. Details can be found in the main Drone
configuration file \texttt{.drone.yml}; the configuration for the
\texttt{golangci-lint} tool itself (such as what linters are enabled/disabled
and with what settings) can be found in the root of the repository in the file
named \texttt{.golangci.yml}.
|
|
|
|
The fourth pipeline focuses on linting the Containerfile and building the
|
|
container, although the latter action is only performed on feature branches,
|
|
\emph{pull requests} or \emph{tag} events.
|
|
|
|
The median build time as of writing was 1 minute, which includes running all
|
|
four pipelines, and that is acceptable. Build times might of course vary
depending on the hardware; for reference, these builds were run on a machine
equipped with a Zen 3 Ryzen 5 5600 CPU running at nominal clock speeds, DDR4
3200MHz RAM, a couple of PCIe Gen 4 NVMe drives in a mirrored setup (using ZFS)
and a 400Mbps downlink, software-wise running Arch with an author-flavoured
Xanmod kernel version 6.3.x.
|
|
|
|
\obr{Drone CI median build
|
|
time}{fig:drone-median-build}{.84}{graphics/drone-median-build}
|
|
|
|
|
|
\n{2}{Source code repositories}\label{sec:repos}
|
|
|
|
All of the pertaining source code was published in repositories on a publicly
|
|
available git server operated by the author, the reasoning \emph{pro}
|
|
self-hosting being that it is the preferred way of guaranteed autonomy over
|
|
one's source code, as opposed to large silos owned by big corporations having a
track record of arguably not always deciding with their users' best interest in
mind (although recourse has been observed~\cite{ytdl}). When these providers
act on impulse or under public pressure, they can, at least temporarily,
disrupt their users' operations, thus subjecting the users not only to their
lengthy \emph{terms of service} that \emph{can change at any time}, but also to
factors outside the users' control. Granted, decentralisation can take a toll on
|
|
discoverability of the project, but that is not a concern here.
|
|
|
|
The git repository containing source code of the \texttt{pcmt} project:\\
|
|
\url{https://git.dotya.ml/mirre-mt/pcmt.git}.
|
|
|
|
The git repository hosting the \texttt{pcmt} configuration schema:\\
|
|
\url{https://git.dotya.ml/mirre-mt/pcmt-config-schema.git}.
|
|
|
|
The repository containing the \LaTeX{} source code of this thesis:\\
|
|
\url{https://git.dotya.ml/mirre-mt/masters-thesis.git}.
|
|
|
|
|
|
\n{2}{Toolchain}
|
|
|
|
Throughout the creation of this work, the \emph{current} version of the Go
|
|
programming language was used, i.e.\ \texttt{go1.20}.
|
|
|
|
\tab{Tool/Library-Usage Matrix}{tab:toolchain}{1.0}{ll}{
|
|
\textbf{Name} & \textbf{Usage} \\
|
|
Go programming language & program core \\
|
|
Dhall configuration language & program configuration \\
|
|
Echo & HTTP handlers, controllers, web server \\
|
|
ent & ORM using graph-based modelling \\
|
|
bluemonday & HTML sanitising \\
|
|
TailwindCSS & stylesheets using a utility-first approach \\
|
|
PostgreSQL & storing data \\
|
|
}
|
|
|
|
\tab{Dependency-Version Matrix}{tab:depsversionmx}{1.0}{ll}{
|
|
\textbf{Name} & \textbf{Version} \\
|
|
\texttt{echo} (\url{https://echo.labstack.com/}) & 4.10.2 \\
|
|
\texttt{go-dhall} (\url{https://github.com/philandstuff/dhall-golang}) & 6.0.2\\
|
|
\texttt{ent} (\url{https://entgo.io/}) & 0.11.10 \\
|
|
\texttt{bluemonday} (\url{https://github.com/microcosm-cc/bluemonday}) & 1.0.23 \\
|
|
\texttt{tailwindcss} (\url{https://tailwindcss.com/}) & 3.3.0 \\
|
|
\texttt{PostgreSQL} (\url{https://www.postgresql.org/}) & 15.2 \\
|
|
}
|
|
|
|
\n{2}{A word about Go}
|
|
First, a question of \textit{`Why pick Go for building a web
|
|
application?'} might arise, so the following few lines will try to address
|
|
that.
|
|
|
|
Go~\cite{golang}, also called \emph{Golang} for SEO-friendliness and to
distinguish it from Go, the ancient game, is a strongly typed, high-level
\emph{garbage-collected}
|
|
language where functions are first-class citizens and errors are values.
|
|
|
|
The appeal for the author comes from a number of features of the language, such
|
|
as built-in support for concurrency and unit testing, sane \emph{zero} values,
|
|
lack of pointer arithmetic, inheritance and implicit type conversions,
|
|
easy-to-read syntax, producing a statically linked binary by default, etc. On
top of that, the language has got a cute mascot. Thanks to the foresight of the
Go Authors regarding \emph{the formatting question} (i.e.\ where to put the
braces, \textbf{tabs vs.\ spaces}, etc.), most of the discussions on this topic
have been foregone. Every \emph{gopher}\footnote{euph.\ a person writing in
the Go programming language} is expected to format their source code with the
official formatter (\texttt{gofmt}), which automatically ensures that the code
adheres to the one formatting standard. Then, there is \emph{The Promise} of
backwards compatibility for Go 1.x, which makes it a good choice for long-term
projects without the fear of being rug-pulled.
|
|
|
|
|
|
\n{2}{A word about Nix/devenv}
|
|
|
|
Nix (\url{https://builtwithnix.org/}) is a declarative package manager and a
functional programming language resembling Haskell, which has been used in this
project in the form of the \texttt{devenv} tool (\url{https://devenv.sh/}) to
create a \textbf{declarative} and \textbf{reproducible} development
environment. The author has previously used Nix directly with \emph{flakes}
and liked \texttt{devenv}, as it effectively exposes only a handful of
parameters for configuration and gets rid of the need to manage the full flake,
which is of course
still an option for people who choose so. See \texttt{devenv.nix} in the
|
|
repository root.
|
|
|
|
\n{1}{Application architecture}
|
|
|
|
The source code of the main module is further split into Go \emph{packages}
along a couple of domains: logging, core application, web routers,
configuration and settings, etc. In Go, packages are delimited by folder
structure: each folder can be a package.
|
|
|
|
Generally speaking, the program aggregates decision points into central places,
|
|
such as \texttt{run.go}, which imports child packages that facilitate each of
|
|
loading the configuration, connecting to the database and running migrations,
|
|
consolidating flag, environment variable and configuration-based values into
|
|
canonical \emph{settings}, setting up routes and handling graceful shutdown.
|
|
|
|
The program uses dependency injection to share a single logger instance;
the same applies to the database client. These are passed around as pointers,
|
|
so the underlying data stays the same. As a rule of thumb, every larger
|
|
\texttt{struct} that needs to be passed around is passed around as a pointer.
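
The idea of handing one shared instance to several components can be sketched
as follows. The types \texttt{App} and \texttt{Mailer} and their constructors
are hypothetical and do not mirror the actual \texttt{pcmt} packages; the
standard library logger stands in for whichever logger the program uses.

\vspace{\parskip}
\begin{lstlisting}[caption={Sketch: injecting a single shared logger by pointer},
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
    "bytes"
    "fmt"
    "log"
)

// App and Mailer are hypothetical components; both keep a pointer to the
// same logger instead of constructing their own.
type App struct{ logger *log.Logger }
type Mailer struct{ logger *log.Logger }

func NewApp(l *log.Logger) *App       { return &App{logger: l} }
func NewMailer(l *log.Logger) *Mailer { return &Mailer{logger: l} }

func (a *App) Start()   { a.logger.Println("app: starting") }
func (m *Mailer) Send() { m.logger.Println("mailer: sending") }

func main() {
    var buf bytes.Buffer
    shared := log.New(&buf, "", 0) // the single shared instance

    app, mailer := NewApp(shared), NewMailer(shared)
    app.Start()
    mailer.Send()

    // Both components wrote into the same buffer through the same logger.
    fmt.Print(buf.String())
    fmt.Println(app.logger == mailer.logger)
}
\end{lstlisting}

Passing the dependency in through the constructor also makes it trivial to
substitute a buffer-backed logger in tests, as done in \texttt{main} above.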
|
|
|
|
The \texttt{internal} package was not used as of writing, but the author plans
|
|
to eventually migrate \emph{internal} logic of the program into the internal
|
|
package to prevent accidental imports.
|
|
|
|
The authentication logic is relatively simple and the author would like to
|
|
isolate it into a custom \emph{middleware}. User passwords are hashed using a
|
|
secure KDF before being sent to the database. The KDF used is \texttt{bcrypt}
|
|
(with a sane \emph{Cost} of 10), which automatically includes \emph{salt} for
|
|
the password and provides ``length-constant'' time hash comparisons. The author
|
|
plans to add support for the more modern \texttt{scrypt} and the
|
|
state-of-the-art, P-H-C (Password Hashing Competition) winner algorithm
|
|
\texttt{Argon2} (\url{https://github.com/P-H-C/phc-winner-argon2}). Besides, no
|
|
raw queries are used to access the database, helping decrease the likelihood of
|
|
SQL injection attacks.
|
|
|
|
An important thing to mention is embedded assets and templates. Go has multiple
|
|
mechanisms to natively embed arbitrary files directly into the binary during
|
|
the regular build process. The built-in \texttt{embed} package was used to
|
|
bundle all template files and web assets, such as images, logos and
stylesheets, at the package level, and these are then passed around the
application as needed. There is also a toggle in the application
configuration, which can
|
|
instruct the program at start to either rely entirely on embedded assets or
|
|
pull live files from the filesystem. The former option makes the application
|
|
more portable, while the latter allows for flexibility not only during
|
|
development. Basically, any important value in the program has been made into a
|
|
configuration value, so that the operator can customise the experience as
|
|
needed. A choice of sane configuration defaults was attempted, which resulted
|
|
in the configuration file essentially only needing to contain secrets, unless
|
|
there is a need to override the defaults. It is not an entirely
\emph{zero-config} situation, but rather a \emph{minimal-config} one.
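
The choice between bundled and live files can be reduced to picking one of two
\texttt{fs.FS} values, as the sketch below suggests. The function
\texttt{assetFS} and its parameters are hypothetical, and an in-memory
\texttt{fstest.MapFS} stands in for a real \texttt{embed.FS}, which would need
actual files next to the source code.

\vspace{\parskip}
\begin{lstlisting}[caption={Sketch: switching between bundled and live assets},
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
    "fmt"
    "io/fs"
    "os"
    "testing/fstest"
)

// assetFS returns the filesystem that assets are read from: the live
// directory under appPath when liveMode is set, the bundled one otherwise.
func assetFS(liveMode bool, appPath string, bundled fs.FS) fs.FS {
    if liveMode {
        return os.DirFS(appPath)
    }
    return bundled
}

func main() {
    // Stand-in for an embed.FS populated via a //go:embed directive.
    bundled := fstest.MapFS{
        "assets/site.css": {Data: []byte("body { margin: 0 }")},
    }

    data, err := fs.ReadFile(assetFS(false, ".", bundled), "assets/site.css")
    if err != nil {
        fmt.Println("read failed:", err)
        return
    }
    fmt.Printf("%s\n", data)
}
\end{lstlisting}

Because both sources satisfy the same interface, the rest of the program does
not need to know which one is in use.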
|
|
|
|
Templates used for rendering of the web pages were created in a composable
|
|
manner, split into smaller, reusable parts, such as \texttt{footer.tmpl} and
|
|
\texttt{head.tmpl}. Specific functionality is then conditionally executed based
|
|
on the determined level of access of the user. The core templating
|
|
functionality was provided by the native Go \texttt{html/template} package. An
|
|
Echo renderer interface has been implemented, so that the rendering can be
|
|
performed ergonomically and directly using Echo's built-in facilities. A
|
|
popular HTML sanitiser \emph{bluemonday} has been employed to aid with battling
|
|
XSS.
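
A compressed illustration of composable templates with conditional parts is
given below. It uses \texttt{html/template} from the standard library; the
template names, the \texttt{pageData} type and the \texttt{/manage} link are
made up for the example, and the separate files are simulated with
\texttt{define} blocks in one string.

\vspace{\parskip}
\begin{lstlisting}[caption={Sketch: composing templates with contextual escaping},
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
    "fmt"
    "html/template"
    "strings"
)

// Each define block plays the role of a small reusable file such as
// head.tmpl or footer.tmpl; the page template composes them.
const parts = `
{{define "head"}}<head><title>{{.Title}}</title></head>{{end}}
{{define "footer"}}<footer>pcmt</footer>{{end}}
{{define "page"}}<html>{{template "head" .}}<body>
{{if .IsAdmin}}<a href="/manage">Manage</a>{{end}}
<p>{{.Greeting}}</p>
{{template "footer" .}}</body></html>{{end}}`

type pageData struct {
    Title    string
    Greeting string
    IsAdmin  bool
}

var tmpl = template.Must(template.New("site").Parse(parts))

// render executes the page template; html/template escapes the values
// according to the HTML context they are inserted into.
func render(data pageData) (string, error) {
    var sb strings.Builder
    err := tmpl.ExecuteTemplate(&sb, "page", data)
    return sb.String(), err
}

func main() {
    out, err := render(pageData{
        Title:    "Home",
        Greeting: "<script>alert(1)</script>",
    })
    if err != nil {
        fmt.Println("render failed:", err)
        return
    }
    fmt.Println(out)
}
\end{lstlisting}

The greeting is emitted with its angle brackets escaped, so the injected
script is displayed as text instead of being executed.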
|
|
|
|
The application constructs the web pages entirely server-side and it runs
|
|
without a single line of JavaScript, of which the author is especially proud.
|
|
It improves load times, decreases attack surface, increases maintainability and
|
|
reduces cognitive load that is required when dealing with JavaScript. Of course
|
|
that requires extensive usage of non-semantic \texttt{POST}s in forms even for
|
|
updates (where \texttt{PUT}s should be used) and the accompanying frequent
|
|
full-page refreshes, but that still is not enough to warrant the use of
|
|
JavaScript.
|
|
|
|
As an aside, the author has briefly experimented with WebAssembly for this
|
|
project, but has ultimately scrapped the functionality in favour of the
|
|
entirely server-side rendered one. It is possible that it would get revisited
|
|
if the client-side dynamic functionality was necessary and performance
|
|
mattered. Even from the short experiments it was obvious how much faster
|
|
WebAssembly was compared to JavaScript.
|
|
|
|
Frontend-side, the application was styled using TailwindCSS, which promotes
the use of flexible \emph{utility-first} classes in the markup (HTML) instead
of separating the specific styles out into all-encompassing classes. The
author understands this is somewhat of a preference issue and does not hold
strong opinions in either direction; Tailwind simply looked nice, especially
with its built-in support for dark/light mode. The templates containing the CSS
classes need to be parsed by Tailwind in order to construct its final
stylesheet and there is also an official CLI tool for that called
\texttt{tailwindcss}. Overall, simple and accessible layouts were preferred
over convoluted ones.
|
|
|
|
|
|
\n{1}{Implementation}
|
|
|
|
\n{2}{Configuration}
|
|
|
|
Every non-trivial program usually offers at least \emph{some} way to
|
|
tweak/manage its behaviour, and these changes are usually persisted
|
|
\emph{somewhere} on the filesystem of the host: in a local SQLite3 database, a
|
|
\emph{LocalStorage} key-value store in the browser, a binary or plain text
|
|
configuration file. These configuration files need to be read and checked at
|
|
least on program start-up and either stored into operating memory for the
|
|
duration of the runtime of the program, or loaded and parsed and the memory
|
|
subsequently \emph{freed} (initial configuration).
|
|
|
|
There is an abundance of configuration languages (or file formats used to craft
configuration files, whether they were intended for it or not) available: TOML,
INI, JSON and YAML, to name some of the popular ones (as of today).
|
|
|
|
Dhall stood out as a language that was designed with both security and the
|
|
needs of dynamic configuration scenarios in mind, borrowing a concept or two
|
|
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
|
|
few of its concepts from Haskell), and in its apparent core being very similar
|
|
to JSON, which adds to the familiar feel. In fact, in Dhall's authors' own words, it
|
|
is: ``a programmable configuration language that you can think of as: JSON +
|
|
functions + types + imports''~\cite{dhalllang}.
|
|
|
|
Among all of the listed features, the especially intriguing one to the author
|
|
was the promise of \emph{types}. There are multiple examples directly on the
|
|
project's documentation webpage demonstrating for instance the declaration and
|
|
usage of custom types (that are, of course merely combinations of the primitive
|
|
types that the language provides, such as \emph{Bool}, \emph{Natural} or
|
|
\emph{List}, to name just a few), so it was not exceedingly hard to start
|
|
designing a custom configuration \emph{schema} for the program.
|
|
Dhall not being a Turing-complete language also guarantees that evaluation
|
|
\emph{always} terminates eventually, which is a good attribute to possess as a
|
|
configuration language.
|
|
|
|
|
|
\n{3}{Dhall Schema}
|
|
|
|
The configuration schema was at first being developed as part of the main
|
|
project's repository, before it was determined that it would benefit both the
|
|
development and overall clarity if the schema lived in its own repository (see
|
|
Section~\ref{sec:repos} for details). This enabled it to be independently
|
|
developed and versioned, and only pulled into the main application whenever it
|
|
is determined the application is ready for it.
|
|
|
|
The full schema with type annotations can be seen in Listing~\ref{dhallschema}.
|
|
The \texttt{let} statement declares a variable called \texttt{Schema} and
|
|
assigns it the result of the expression on the right side of the equals sign,
|
|
which has for practical reasons been trimmed and is displayed without the
|
|
\emph{default} block, which is instead shown in its own
|
|
Listing~\ref{dhallschemadefaults}.
|
|
|
|
\vspace{\parskip}
|
|
\begin{lstlisting}[language=Haskell, caption={Dhall configuration schema version 0.0.1-rc.2},
|
|
label=dhallschema, basicstyle=\linespread{0.9}\footnotesize\ttfamily]
|
|
let Schema =
|
|
{ Type =
|
|
{ Host : Text
|
|
, Port : Natural
|
|
, HTTP :
|
|
{ Domain : Text
|
|
, Secure : Bool
|
|
, AutoTLS : Bool
|
|
, TLSKeyPath : Text
|
|
, TLSCertKeyPath : Text
|
|
, HSTSMaxAge : Natural
|
|
, ContentSecurityPolicy : Text
|
|
, RateLimit : Natural
|
|
, Gzip : Natural
|
|
, Timeout : Natural
|
|
}
|
|
, Mailer :
|
|
{ Enabled : Bool
|
|
, Protocol : Text
|
|
, SMTPAddr : Text
|
|
, SMTPPort : Natural
|
|
, ForceTrustServerCert : Bool
|
|
, EnableHELO : Bool
|
|
, HELOHostname : Text
|
|
, Auth : Text
|
|
, From : Text
|
|
, User : Text
|
|
, Password : Text
|
|
, SubjectPrefix : Text
|
|
, SendPlainText : Bool
|
|
}
|
|
, LiveMode : Bool
|
|
, DevelMode : Bool
|
|
, AppPath : Text
|
|
, Session :
|
|
{ CookieName : Text
|
|
, CookieAuthSecret : Text
|
|
, CookieEncrSecret : Text
|
|
, MaxAge : Natural
|
|
}
|
|
, Logger : { JSON : Bool, Fmt : Optional Text }
|
|
, Init : { CreateAdmin : Bool, AdminPassword : Text }
|
|
, Registration : { Allowed : Bool }
|
|
}
|
|
}
|
|
\end{lstlisting}
|
|
|
|
The main configuration comprises both raw attributes and child records,
which allow for grouping of related functionality. For instance, configuration
settings pertaining to the mail server setup are grouped in a record named
\textbf{Mailer}. Its attribute \textbf{Enabled} is annotated as \textbf{Bool},
which was deemed appropriate for an on-off switch-like functionality, with the
only permissible values being either \emph{True} or \emph{False}. Do note that
in Dhall $\texttt{true} \neq \texttt{True}$, since \textbf{True} is internally
a Bool constant, which is built into Dhall (check out ``The
Prelude''~\cite{dhallprelude}), while \textbf{true} is evaluated as an
\emph{unbound} variable, that is, a variable \emph{not} defined in the current
\emph{scope}.
|
|
|
|
\vspace{\parskip}
|
|
\begin{lstlisting}[language=Haskell, caption={Dhall configuration defaults for
|
|
schema version 0.0.1-rc.2},
|
|
label=dhallschemadefaults, basicstyle=\linespread{0.9}\scriptsize\ttfamily]
|
|
, default =
|
|
-- | have sane defaults.
|
|
{ Host = ""
|
|
, Port = 3000
|
|
, HTTP =
|
|
{ Domain = ""
|
|
, Secure = False
|
|
, AutoTLS = False
|
|
, TLSKeyPath = ""
|
|
, TLSCertKeyPath = ""
|
|
, HSTSMaxAge = 0
|
|
, ContentSecurityPolicy = ""
|
|
, RateLimit = 0
|
|
, Gzip = 0
|
|
, Timeout = 0
|
|
}
|
|
, Mailer =
|
|
{ Enabled = False
|
|
, Protocol = "smtps"
|
|
, SMTPAddr = ""
|
|
, SMTPPort = 465
|
|
, ForceTrustServerCert = False
|
|
, EnableHELO = False
|
|
, HELOHostname = ""
|
|
, Auth = ""
|
|
, From = ""
|
|
, User = ""
|
|
, Password = ""
|
|
, SubjectPrefix = "pcmt - "
|
|
, SendPlainText = True
|
|
}
|
|
, LiveMode =
|
|
-- | LiveMode controls whether the application looks for
|
|
-- | directories "assets" and "templates" on the filesystem or
|
|
-- | in its bundled Embed.FS.
|
|
False
|
|
, DevelMode = False
|
|
, AppPath =
|
|
-- | AppPath specifies where the program looks for "assets" and
|
|
-- | "templates" in case LiveMode is True.
|
|
"."
|
|
, Session =
|
|
{ CookieName = "pcmt_session"
|
|
, CookieAuthSecret = ""
|
|
, CookieEncrSecret = ""
|
|
, MaxAge = 3600
|
|
}
|
|
, Logger = { JSON = True, Fmt = None Text }
|
|
, Init =
|
|
{ CreateAdmin =
|
|
-- | if this is True, attempt to create a user with admin
|
|
-- | privileges with the password specified below (or better -
|
|
-- | overridden); it fails if users already exist in the DB.
|
|
False
|
|
, AdminPassword =
|
|
-- | used for the first admin, forced change on first login.
|
|
"50ce50fd0e4f5894d74c4caecb450b00c594681d9397de98ffc0c76af5cff5953eb795f7"
|
|
}
|
|
, Registration.Allowed = True
|
|
}
|
|
}
|
|
|
|
in Schema
|
|
\end{lstlisting}
|
|
|
|
Another speciality of Dhall is that the $==$ and $!=$ equality operators
only work on values of type \texttt{Bool}. For example, variables of type
\texttt{Natural} (\texttt{uint}) or \texttt{Text} (\texttt{string}) cannot be
compared directly as in other languages, which either leaves the comparison to
a higher-level language (such as Go), or, from the perspective of the Dhall
authors, promotes the use of \emph{enums} when the value matters.
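
Leaving value comparisons to the host language could look roughly like the
following Go sketch. The type \texttt{mailerConfig} mirrors only two fields of
the \textbf{Mailer} record, and the set of accepted protocols as well as the
port range check are assumptions, not the validation rules of \texttt{pcmt}.

\vspace{\parskip}
\begin{lstlisting}[caption={Sketch: validating decoded configuration values in Go},
basicstyle=\linespread{0.9}\footnotesize\ttfamily]
package main

import (
    "errors"
    "fmt"
)

// mailerConfig mirrors two fields of the Mailer record from the schema.
type mailerConfig struct {
    Protocol string
    SMTPPort uint
}

var errProtocol = errors.New("unsupported mailer protocol")

// validate performs the comparisons that Dhall leaves to the host language.
func (m mailerConfig) validate() error {
    switch m.Protocol {
    case "smtp", "smtps": // assumed set of accepted values
    default:
        return fmt.Errorf("%w: %q", errProtocol, m.Protocol)
    }
    if m.SMTPPort == 0 || m.SMTPPort > 65535 {
        return fmt.Errorf("SMTP port out of range: %d", m.SMTPPort)
    }
    return nil
}

func main() {
    ok := mailerConfig{Protocol: "smtps", SMTPPort: 465}
    bad := mailerConfig{Protocol: "imap", SMTPPort: 465}
    fmt.Println(ok.validate())
    fmt.Println(bad.validate())
}
\end{lstlisting}

Modelling \texttt{Protocol} as a Dhall union type instead would move this
check into the type checker, which is the approach the Dhall authors favour.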
|
|
|
|
|
|
\n{3}{Safety considerations}
|
|
|
|
Having a programmable configuration language that understands functions and
|
|
allows importing not only arbitrary text from random internet URLs, but also
|
|
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
|
|
is important that there are some safety mechanisms employed, which can be
|
|
relied on by the user. Dhall offers this in multiple features: enforcing a
|
|
same-origin policy and (optionally) pinning a cryptographic hash of the value
|
|
of the expression being imported.
|
|
|
|
|
|
\n{3}{Possible alternatives}
|
|
|
|
While developing the program, the author has also come across certain
shortcomings of Dhall, namely a long start-up with a \emph{cold cache}, which
can generally be observed when running the program in an environment that does
not allow writing the cache files (a read-only filesystem), or does not keep
the written cache files, such as a container that is not configured to mount a
persistent volume at the pertinent location.
|
|
|
|
To describe the way Dhall works when performing an evaluation, it resolves
|
|
every expression down to a combination of its most basic types (eliminating all
|
|
abstraction and indirection) in the process called
|
|
\textbf{normalisation}~\cite{dhallnorm} and then saves this result in the
|
|
host's cache. The \texttt{dhall-haskell} binary attempts to resolve the
|
|
variable \texttt{\$\{XDG\_CACHE\_HOME\}} (have a look at \emph{XDG Base
|
|
Directory Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the
|
|
results of the normalisation will be written for repeated use. Do note that
|
|
this behaviour has been observed on a GNU/Linux host and the author has not
|
|
verified this behaviour on a non-GNU/Linux host, such as FreeBSD.
|
|
|
|
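
The lookup described above can be approximated in Go. The sketch below follows
the XDG Base Directory rules (use \texttt{XDG\_CACHE\_HOME} if it is set and
non-empty, otherwise fall back to \texttt{\$HOME/.cache}) and derives the
\texttt{dhall} and \texttt{dhall-haskell} cache directories from it. The
function names are hypothetical and the code does not reproduce the actual
Dhall implementation; it only shows which paths would have to persist.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// xdgCacheHome returns the base cache directory: $XDG_CACHE_HOME if it is set
// and non-empty, $HOME/.cache otherwise.
func xdgCacheHome(getenv func(string) string) string {
	if dir := getenv("XDG_CACHE_HOME"); dir != "" {
		return dir
	}
	return filepath.Join(getenv("HOME"), ".cache")
}

// dhallCacheDirs lists the directories that need to survive a restart for the
// normalisation results to be reused.
func dhallCacheDirs(getenv func(string) string) []string {
	base := xdgCacheHome(getenv)
	return []string{
		filepath.Join(base, "dhall"),
		filepath.Join(base, "dhall-haskell"),
	}
}

func main() {
	for _, dir := range dhallCacheDirs(os.Getenv) {
		fmt.Println(dir)
	}
}
```

Passing the environment lookup in as a function keeps the helper testable
without modifying the process environment.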
If normalisation is performed inside an ephemeral container (as opposed to, for
instance, an interactive desktop session), the results effectively get lost on
each container restart. That is both wasteful and not great for user
experience, since the normalisation of just a handful of imports (which
internally branch widely) can take upwards of two minutes, during which the
user is left waiting for the hanging application with no reporting on the
progress or current status.

Workarounds for the above-mentioned problem can be devised relatively easily.
One is bind mounting persistent volumes inside the container in place of
\texttt{\$\{XDG\_CACHE\_HOME\}/dhall} and
\texttt{\$\{XDG\_CACHE\_HOME\}/dhall-haskell} to preserve the cache between
restarts. Another is letting the cache be pre-computed during the container
build, since the application is only really expected to run together with a
compatible version of the configuration schema and this version \emph{is} known
at container build time. Still, it would certainly feel better if there was no
need to work \emph{around} the configuration system of choice.

Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as a potentially almost drop-in replacement for Dhall feature-wise, while also
avoiding the costly \emph{cold cache} normalisation operations, which are in
the author's view Dhall's principal issue.


\n{2}{Data integrity and authenticity}

The user can interact with the application via a web client, such as a browser,
and is required to authenticate for all sensitive operations. To not only know
\emph{who} the user is but also make sure they are \emph{permitted} to perform
the action they are attempting, the program employs an \emph{authorisation}
mechanism in the form of sessions. On the client side, these are represented by
cryptographically signed and encrypted (using 256-bit AES) HTTP cookies. That
lays the foundation for two things. The data saved into the cookies can be
regarded as private, because short of future \emph{quantum computers} only the
program itself can decrypt and access it. The data can also be trusted, since
it is both signed using a key that only the program controls and
\emph{encrypted} with \emph{another} key that equally only the program holds.

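
To illustrate the sign-and-encrypt idea, here is a minimal sketch using only
the Go standard library. It is not the program's session implementation; the
function names \texttt{seal} and \texttt{open} and the exact construction
(AES-256 in CTR mode, followed by an HMAC-SHA-256 tag over the IV and the
ciphertext) are assumptions made for the sake of the example.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// seal encrypts value with encKey (32 bytes select AES-256) and appends an
// HMAC-SHA-256 tag computed with authKey over the IV and the ciphertext.
func seal(authKey, encKey, value []byte) ([]byte, error) {
	block, err := aes.NewCipher(encKey)
	if err != nil {
		return nil, err
	}
	out := make([]byte, aes.BlockSize+len(value))
	iv := out[:aes.BlockSize]
	if _, err := rand.Read(iv); err != nil {
		return nil, err
	}
	cipher.NewCTR(block, iv).XORKeyStream(out[aes.BlockSize:], value)

	mac := hmac.New(sha256.New, authKey)
	mac.Write(out)
	return mac.Sum(out), nil
}

// open verifies the tag first and only then decrypts the payload.
func open(authKey, encKey, sealed []byte) ([]byte, error) {
	if len(sealed) < aes.BlockSize+sha256.Size {
		return nil, errors.New("cookie too short")
	}
	split := len(sealed) - sha256.Size
	body, tag := sealed[:split], sealed[split:]

	mac := hmac.New(sha256.New, authKey)
	mac.Write(body)
	if !hmac.Equal(tag, mac.Sum(nil)) {
		return nil, errors.New("cookie signature mismatch")
	}

	block, err := aes.NewCipher(encKey)
	if err != nil {
		return nil, err
	}
	plain := make([]byte, len(body)-aes.BlockSize)
	cipher.NewCTR(block, body[:aes.BlockSize]).XORKeyStream(plain, body[aes.BlockSize:])
	return plain, nil
}

func main() {
	authKey, encKey := make([]byte, 32), make([]byte, 32)
	if _, err := rand.Read(authKey); err != nil {
		panic(err)
	}
	if _, err := rand.Read(encKey); err != nil {
		panic(err)
	}

	sealed, err := seal(authKey, encKey, []byte("user=dude"))
	if err != nil {
		panic(err)
	}
	plain, err := open(authKey, encKey, sealed)
	fmt.Println(string(plain), err)

	sealed[len(sealed)-1] ^= 1 // simulate tampering on the client side
	_, err = open(authKey, encKey, sealed)
	fmt.Println(err)
}
```

Verifying the signature before decrypting means that a modified cookie is
rejected without its contents ever being processed.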
The cookie data is only ever written \emph{or} read on the server side,
solidifying the author's decision to let it be encrypted, as there is no point
in not encrypting it for some perceived client-side simplification. Users
navigating the website send their session cookie (if it exists) with
\textbf{every request} to the server, which subsequently verifies the integrity
of the data and, in case it is valid, determines the existence and potential
amount of user privilege that should be granted. Public endpoints do not
mandate the presence of a valid session by definition, while at protected
endpoints the user is authenticated at every request. When a session expires or
if there is no session to begin with, the user is either shown a \emph{Not
found} error message, shown an \emph{Unauthorised} error message or redirected
to \texttt{/signin}. This behaviour is not uniform and depends on the endpoint
and/or the resource.

Another aspect that contributes to data integrity, from a different point of
view, is utilising database \emph{transactions} for bundling together multiple
database operations that collectively change the \emph{state}. Using the
transactional jargon, the data is only \emph{committed} if each individual
change was successful. In case of any errors, the database is instructed to
perform an atomic \emph{rollback}, which brings it back to the state before
the changes were ever attempted.

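
The commit-or-rollback pattern can be sketched in Go as follows. The
\texttt{tx} interface and the \texttt{withTx} helper are hypothetical; they
merely stand in for the transaction type of whichever database layer is in use,
and the \texttt{fakeTx} type exists only to make the example runnable without a
database.

```go
package main

import (
	"errors"
	"fmt"
)

// errFailed simulates a failing database operation in the example.
var errFailed = errors.New("operation failed")

// tx is the minimal (hypothetical) surface of a database transaction.
type tx interface {
	Commit() error
	Rollback() error
}

// withTx runs fn inside the transaction t. The changes are committed only if
// fn succeeds; any error triggers a rollback instead.
func withTx(t tx, fn func() error) error {
	if err := fn(); err != nil {
		if rerr := t.Rollback(); rerr != nil {
			return fmt.Errorf("%w (rollback also failed: %v)", err, rerr)
		}
		return err
	}
	return t.Commit()
}

// fakeTx records what happened to it.
type fakeTx struct{ committed, rolledBack bool }

func (f *fakeTx) Commit() error   { f.committed = true; return nil }
func (f *fakeTx) Rollback() error { f.rolledBack = true; return nil }

func main() {
	ok := &fakeTx{}
	fmt.Println(withTx(ok, func() error { return nil }), ok.committed)

	bad := &fakeTx{}
	fmt.Println(withTx(bad, func() error { return errFailed }), bad.rolledBack)
}
```

Funnelling every multi-step change through one helper makes it hard to forget
the rollback on an error path.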
The author has additionally considered utilising an embedded immutable
database like immudb (\url{https://immudb.io}) for record keeping (verifiably
storing data change history) and additional data integrity checks, e.g.\ for
tamper protection purposes; however, that work has yet to materialise.


\n{2}{User isolation}

Users are allowed into certain parts of the application based on the role they
currently possess. For the moment, two basic roles were envisioned, although
this list might get amended in the future if the need arises:

\begin{itemize}
  \item Administrator
  \item User
\end{itemize}

It is paramount that the program protects itself from insider threats as well,
and therefore each role is only able to perform actions that it is explicitly
assigned. While there definitely is a certain overlap between the capabilities
of the two outlined roles, each also possesses unique features that the other
does not.

For example, the administrator role is not able to perform searches on the
breach data directly using their administrator account; a separate user
account has to be created for that. Similarly, the regular user is not able to
manage breach lists and other users, because that is a privileged operation.

In-application administrators are not able to view sensitive (any) user data
and should therefore only be able to perform the following actions:

\begin{itemize}
  \item Create user accounts
  \item View list of users
  \item View user email
  \item Change user email
  \item Toggle whether user is an administrator
  \item Delete user accounts
\end{itemize}

Let us consider the case of a user managing their own account. While demoting
oneself from administrator to a regular user is permitted, promoting oneself
to an administrator would constitute a \emph{privilege escalation} and likely
be a precursor to at least a \emph{denial of service} of sorts.


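
A rule like this is straightforward to express in code. The sketch below is a
hypothetical illustration (the \texttt{account} type and the
\texttt{canSetAdmin} function are not taken from the program): an administrator
may toggle the flag for other users and may demote themselves, while nobody may
promote themselves.

```go
package main

import "fmt"

// account is a hypothetical, minimal view of a user record.
type account struct {
	ID      int
	IsAdmin bool
}

// canSetAdmin reports whether actor may set target's administrator flag to
// makeAdmin.
func canSetAdmin(actor, target account, makeAdmin bool) bool {
	if actor.ID == target.ID {
		// Managing self: demotion is fine, promotion is privilege escalation.
		return actor.IsAdmin && !makeAdmin
	}
	// Managing other users is reserved for administrators.
	return actor.IsAdmin
}

func main() {
	admin := account{ID: 1, IsAdmin: true}
	user := account{ID: 2}

	fmt.Println("user promotes self:", canSetAdmin(user, user, true))
	fmt.Println("admin demotes self:", canSetAdmin(admin, admin, false))
	fmt.Println("admin promotes user:", canSetAdmin(admin, user, true))
}
```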
\n{2}{Zero trust principle}

\textit{Data confidentiality, i.e.\ not trusting the provider}

There is no way for the application (and consequently, the in-application
administrator) to read the user's data. This is made possible by encrypting the
pertinent data before saving it in the database with a state-of-the-art
\emph{age} key~\cite{age} (backed by X25519~\cite{x25519rfc7748}), which in
turn is stored encrypted by a passphrase that only the user controls. Of
course, the user-supplied password is run through a password-based key
derivation function (PBKDF: a key derivation function with a sliding
computational cost) before it is used to encrypt the \emph{age} key.

The \emph{age} key is only generated when the user changes their password for
the first time. This prevents scenarios such as an in-application administrator
with access to the physical database being able to both \textbf{recover} the
key from the database and \textbf{decrypt} it, given that they already know the
user password (because they set it), which would subsequently give them
unbounded access to any future encrypted data for as long as they were able to
maintain their database access. This is why the \emph{age} key generation and
protection are bound to the first password change. Of course, an evil
administrator could just perform the change themselves; however, the user would
at least be able to find those changes in the activity logs and know not to use
the application. But given the scenario of a total database compromise, the
author finds all hope is already lost at that point. At least when the database
is dumped, it only contains non-sensitive, functional information in plain
text; everything else should be encrypted.

Consequently, both the application operators and the in-application
administrators should never be able to learn the details of what the user is
tracking, the same being applicable even to potential attackers with direct
access to the database. Thus the author maintains that every scenario that
could potentially lead to a data breach (apart from a compromised user machine
and the like) would have to entail some form of operating memory acquisition,
for instance using \texttt{LiME}~\cite{lime}, or perhaps directly via the
\emph{hypervisor}, if considering a virtualised (``cloud'') environment.


\n{2}{Compromise Monitoring}

\n{3}{Have I Been Pwned? Integration}

Troy Hunt's Have I Been Pwned? online service
(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
compromised data. The service offers private APIs that are protected by API
keys. The application's \texttt{hibp} module and database representation model
the values returned by this API, which allows searching in large breaches using
email addresses.

The architecture there is relatively simple: the application administrator
configures an API key for HIBP, the user enters the query parameters, and the
application constructs a query, calls the API and waits for a response. As the
API is rate-limited based on the key supplied, this can pose an issue that has
not been fully resolved in the UI. The application then parses the returned
data and binds it to the local model for validation. If that goes well, the
data is saved into the database as a cache and the search query is performed on
the saved data. If it returns anything, it is displayed to the user for
browsing.


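
A minimal sketch of the ``construct a query'' step follows. It assumes version
3 of the HIBP API, in which breaches for an account are requested from the
\texttt{breachedaccount} endpoint and the key is passed in the
\texttt{hibp-api-key} header. The function name, the user agent string and the
placeholder key are made up for the example, and the request is only built, not
sent.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// hibpBreachedAccountURL is the assumed base of the HIBP v3 account endpoint.
const hibpBreachedAccountURL = "https://haveibeenpwned.com/api/v3/breachedaccount/"

// newBreachRequest builds (but does not send) a request for the breaches that
// contain the given account.
func newBreachRequest(apiKey, account string) (*http.Request, error) {
	endpoint := hibpBreachedAccountURL + url.PathEscape(account)

	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("hibp-api-key", apiKey)
	req.Header.Set("User-Agent", "pcmt-example") // hypothetical value

	return req, nil
}

func main() {
	req, err := newBreachRequest("<api key>", "dude@b.cc")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```

A caller sending such a request would also want to handle the rate-limiting
response (HTTP status 429) explicitly, for instance by waiting before retrying,
which is the part noted above as not fully resolved in the UI.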
\n{3}{Local Dataset Plugin}

Breach data from locally available datasets can be imported into the
application by first making sure it adheres to the specified schema (have a
look at the \emph{breach data schema} in Listing~\ref{breachDataGoSchema}). If
it does not (which is very likely with random breach data), it needs to be
converted to a form that does before importing it into the application, e.g.\
using a Python script or similar. Attempting to import data that does not
follow the outlined schema would result in an error. Also, importing a dataset
that is over a reasonable size limit would by default be rejected by the
program as a precaution, since unmarshalling e.g.\ a 1~TiB document would
likely result in an out-of-memory (OOM) situation on the host, assuming regular
consumer hardware conditions, not HPC.

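
One way to enforce such a limit before any parsing takes place is sketched
below. The helpers \texttt{readLimited} and \texttt{readLimitedString} and the
limit used in the example are assumptions, not the program's actual code or
values. The idea is simply to read at most one byte more than allowed and to
reject the input if that extra byte shows up.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

var errTooLarge = errors.New("dataset exceeds the size limit")

// readLimited reads r fully, but never buffers more than limit+1 bytes, so an
// oversized document is rejected without being held in memory whole.
func readLimited(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// readLimitedString is a convenience wrapper used by the example.
func readLimitedString(doc string, limit int64) ([]byte, error) {
	return readLimited(strings.NewReader(doc), limit)
}

func main() {
	_, err := readLimitedString("name: Horrible breach", 8)
	fmt.Println(err)

	data, err := readLimitedString("name: x", 8)
	fmt.Println(string(data), err)
}
```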
\vspace{\parskip}
\begin{lstlisting}[language=Go, caption={Breach data schema represented as a Go
struct; imports from the standard library are assumed},
label=breachDataGoSchema]
type breachDataSchema struct {
	Name              string
	Time              time.Time
	IsVerified        bool
	ContainsPasswords bool
	ContainsHashes    bool
	HashType          string
	HashSalted        bool
	HashPeppered      bool
	ContainsUsernames bool
	ContainsEmails    bool
	Data              any
}
\end{lstlisting}

The Go representation shown in Listing~\ref{breachDataGoSchema} will in
actuality be written and supplied by the user of the program as a YAML
document. YAML was chosen for multiple reasons: relative ease of use (plain
text, readable, support for comments), its capability to store multiple
\emph{documents} inside a single file with most of the inputs implicitly typed
as strings, and machine readability, which it sports thanks to being a superset
of JSON. That should allow documents similar to the one in
Listing~\ref{breachDataYAMLSchema} to be ingested by the program and to be read
and written by humans and programs alike.

\vspace{\parskip}
\begin{lstlisting}[language=YAML, caption={Example Breach Data Schema supplied
to the program as a YAML file, optionally containing multiple documents},
label=breachDataYAMLSchema]
---
name: Horrible breach
time: 2022-04-23T00:00:00+02:00
isVerified: false
containsPasswords: false
containsHashes: true
containsEmails: true
hashType: md5
hashSalted: false
hashPeppered: false
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    -
    - email3
---
# document #2, describing another breach.
name: Horrible breach 2
...
\end{lstlisting}

Notice how the emails list in Listing~\ref{breachDataYAMLSchema} misses one
record, perhaps because it was not supplied or was mistakenly omitted. This is
a valid scenario (mistakes happen) and the application needs to be able to
handle it. The alternative would be to require the user to prepare the data in
such a way that the empty/partial records are dropped entirely.


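
Handling such gaps on the application side can be as simple as filtering the
list after it has been parsed. The sketch below assumes that a missing YAML
list item ends up as an empty string in a Go slice (which depends on the YAML
library and the target type); the \texttt{dropEmpty} helper is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// dropEmpty returns the records that are non-empty after trimming whitespace.
func dropEmpty(records []string) []string {
	kept := make([]string, 0, len(records))
	for _, r := range records {
		if strings.TrimSpace(r) != "" {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	emails := []string{"email1", "", "email3"}
	fmt.Println(dropEmpty(emails)) // prints [email1 email3]
}
```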
\n{2}{Database configuration}

The database schema is not created manually in the database. Instead, an
Object-Relational Mapping (ORM) tool named ent is used, which allows defining
the table schema and relations entirely in Go.

The best part about ent is that there is no need to define supplemental methods
on the models, since with ent these are meant to be \emph{code generated} (in
the older sense of the word, not with Large Language Models). The generator
creates files with models based on the types of the attributes in the database
model, and the respective relations are transformed into methods on the
receiver or functions taking object attributes as arguments.

For instance, if the model's attribute is a string value \texttt{Email}, ent
can be used to generate code that contains predicate functions for the user
model like the following:

\begin{itemize}
  \item \texttt{EmailIn(emails ...string)}
  \item \texttt{EmailEQ(email string)}
  \item \texttt{EmailNEQ(email string)}
  \item \texttt{EmailHasSuffix(suffix string)}
\end{itemize}

These functions can further be imported into other packages, which makes
working with the database a breeze.


\n{1}{Production}

It is, of course, recommended that the application runs in a secure
environment, although definitions of that almost certainly differ depending on
who you ask. General recommendations would be either to effectively reserve a
machine for a single use case (running this program), so as to dramatically
decrease the potential attack surface of the host, or to run the program
isolated in a container or a virtual machine. Further, if the host does not
need management access (it is a deployed-to-only machine that is configured
out-of-band, such as with a \emph{golden} image/container or declaratively with
Nix), then an SSH \emph{daemon} should not be running on it, since it is not
needed. In an ideal scenario, the host machine would have as little software
installed as possible besides what the application absolutely requires.

A demonstration of the above can be found in the multi-stage Containerfile that
is available in the main sources. The resulting container image only contains a
statically linked copy of the program, a default configuration file and the
corresponding Dhall expressions cached at build time, which only support the
main configuration file. Since the program also needs a database, an example
scenario could include the container being run in a Podman pod together with
the database, which would not have to be exposed from the pod and would
therefore only be available over \texttt{localhost}.

It goes without saying that the operator should replace any default
configuration secrets with new, securely generated ones.

System-wide cryptographic policies should target the highest feasible security
level, if at all available (such as by default on Fedora or RHEL), covering the
SSH, DNSSEC, IPsec, Kerberos and TLS protocols. Firewalls should be configured
and SELinux (a kernel-level mandatory access control and security policy
mechanism) should be running in \emph{enforcing} mode, if available.


\n{2}{Deployment recommendations}

\n{3}{Transport security}

Users connecting to the application should rightfully expect their data to be
protected \textit{in transit} (i.e.\ on the way between their browser and the
server), which is what the \emph{Transport Layer Security} family of
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
client authenticate the server (verify that it is who it claims to be) and
negotiate a symmetric key for encryption in a process named the \emph{TLS
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
which is establishing a secure communications channel. The operator should
either configure the program to utilise TLS directly or have it listen behind a
TLS-terminating \emph{reverse proxy}.


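
For the first option, a Go HTTP server can be restricted to modern TLS with a
few lines of standard library code. The following is a sketch under
assumptions: the listen address and the certificate and key paths are
placeholders, the helper name \texttt{newTLSServer} is made up, and nothing
here is taken from the program's actual configuration handling.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// newTLSServer returns an HTTP server that refuses anything older than
// TLS 1.3.
func newTLSServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 10 * time.Second,
		TLSConfig:         &tls.Config{MinVersion: tls.VersionTLS13},
	}
}

func main() {
	srv := newTLSServer(":8443", http.NotFoundHandler())
	fmt.Println(srv.Addr, srv.TLSConfig.MinVersion == tls.VersionTLS13)

	// To actually serve, the operator would call (placeholder paths):
	// srv.ListenAndServeTLS("/path/to/cert.pem", "/path/to/key.pem")
}
```

When a reverse proxy terminates TLS instead, the same minimum version would be
configured on the proxy and the application would listen on a local address
only.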
\n{3}{Containerisation}

Whether the pre-built or a custom container image is used to deploy the
application, it still needs access to secrets, such as the database connection
string (containing the database host, port, user, password/encrypted password,
authentication method and database name).

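
How such a connection string might be assembled is sketched below. The example
assumes the URI form of a Postgres connection string; the \texttt{dbParams}
type, the \texttt{connString} function and all values are hypothetical, and the
format the program actually expects may differ. Building the string with
\texttt{net/url} ensures that special characters in the password are escaped.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// dbParams holds hypothetical connection settings supplied by the operator.
type dbParams struct {
	Host, Port, User, Password, Name, SSLMode string
}

// connString renders p as a URI-style Postgres connection string.
func connString(p dbParams) string {
	q := url.Values{}
	q.Set("sslmode", p.SSLMode)

	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(p.User, p.Password),
		Host:     net.JoinHostPort(p.Host, p.Port),
		Path:     "/" + p.Name,
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(connString(dbParams{
		Host:     "db",
		Port:     "5432",
		User:     "pcmt",
		Password: "s3cret",
		Name:     "pcmt",
		SSLMode:  "verify-full",
	}))
}
```

In a container deployment, the individual values would typically come from
environment variables or mounted secret files rather than from the image.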
Currently, the application is able to handle \emph{peer}, \emph{scram-sha-256},
\emph{user name maps} and raw \emph{password} as Postgres authentication
methods~\cite{pgauthmethods}, although the \emph{password} option should not be
used in production \emph{unless} the connection to the database is protected
by TLS.\ In any case, using the \emph{scram-sha-256}~\cite{scramsha256rfc7677}
method is preferable. One way to verify in a development environment that
everything works as intended is the \emph{Password generator for PostgreSQL}
tool~\cite{goscramsha256}, which allows getting the encrypted string from raw
user input.

If the application running in a container is to use the \emph{peer}
authentication method, it is up to the operator to supply the Postgres socket
to the application (e.g.\ as a volume bind mount). This scenario was not
tested, however, and the author is also not entirely certain how \emph{user
namespaces} (on GNU/Linux) would influence the process (given that the
\emph{UIDs} of a user \textbf{outside} the container are mapped to a range of
\emph{UIDs} \textbf{inside} the container), for which the setup would likely
need to account.

Equally, if the application is running inside a container, the operator needs
to make sure that the database is either running in a network that is also
directly attached to the container or that there is a mechanism in place that
routes the requests for the database hostname to the destination.

One such mechanism is container name based routing inside \emph{pods}
(Podman/Kubernetes), where the resolution of container names is the
responsibility of a specially configured piece of software called Aardvark for
the former and CoreDNS for the latter.


\n{1}{Validation}

\n{2}{Unit tests}

Unit testing is a hot topic for many people and the author does not count
himself a staunch supporter of either extreme. The ``no unit tests'' stance
seems to discount any benefit there is to unit testing, while a ``TDD-only''
approach can be a little too much for some people's taste. (TDD, or Test Driven
Development, is a development methodology whereby tests are written first,
followed by just enough of the code under test to get past the compile errors
and to see the test fail. The code is then refactored to make the test pass,
after which it can be fearlessly extended, because the test is the safety net
that catches the developer when they slip and alter the originally intended
behaviour.) The author tends to sport a \emph{middle ground} approach here:
writing enough tests where meaningful, but not necessarily testing everything
or writing tests prior to code, although arguably that practice should result
in \emph{better} designed code, particularly because some prior thought has to
be given to it, as it needs to be tested \emph{first}.

Thanks to Go's built-in support for testing via its \texttt{testing} package
and the tooling in the \texttt{go} tool, writing tests is relatively simple. Go
looks for files in the form \texttt{<filename>\_test.go} in the present working
directory but can be instructed to look for test files in packages recursively
found on any path using the ellipsis, like so: \texttt{go test
./path/to/package/\ldots}. This \emph{runs} all the tests found and reports
some statistics, such as the time it took to run the tests or whether they
succeeded or failed. To be precise, the test files also need to contain test
functions, which are functions with the signature \texttt{func TestWhatever(t
*testing.T)\{\}}, where the function prefix ``Test'' is just as important as
the signature. Without it, the function is not considered to be a test function
despite having the required signature and is therefore \emph{not} executed
during testing.

This test lookup behaviour, however, also has a neat side-effect: all the test
files can be kept side-by-side with their regular source counterparts, so there
is no need to segregate them into a specially blessed \texttt{tests} folder or
similar, which in the author's opinion improves readability. As a failsafe, in
case no actual tests are found, the current behaviour of the tool is to print a
note informing the developer that no tests were found, which is handy to learn
if it was not intended/expected. When compiling regular source code, the Go
files with \texttt{\_test} in the name are simply ignored by the build tool.

\n{2}{Integration tests}

Integration testing with external software, namely the database in the case of
this program, utilises the same mechanism that was mentioned in the previous
section: Go's \texttt{testing} package. These tests verify that the changed
code can still perform the same actions with the external software that were
possible before the change. They are run locally before every commit and then
in the CI after pushing to the remote.

\n{3}{func TestUserExists(t *testing.T)}

The example integration test shown in Listing~\ref{integrationtest} declares a
helper function \texttt{getCtx() context.Context}, which takes no arguments and
returns a new \texttt{context.Context} initialised with a value of the global
logger, which is how the logger gets injected into the user module functions.
The function \texttt{TestUserExists(t *testing.T)} first declares a database
connection string and attempts to open a connection to the database. The
database in use here is SQLite3 running in memory mode, meaning no file is
actually written to disk during this process. Since the testing data is not
needed after the test, this is deemed good enough. Next, a \emph{defer}
statement calling the \texttt{Close()} method on the database object is made,
which is the idiomatic Go way of closing files and network connections (which
are also an abstraction over files on UNIX-like operating systems such as
GNU/Linux). The deferred call is executed when the surrounding function
returns, after all of its other statements, which makes sure no file
descriptors (FDs) are leaked and the file is properly closed.

In the next step, a database schema creation is attempted, handling the
potential error in an idiomatic Go way: the return value of the function is
assigned to a variable declared in the \texttt{if} statement, which then checks
whether \texttt{err} is \texttt{nil} or not. In case \texttt{err} is not
\texttt{nil}, i.e.\ \emph{there was an error in the callee function}, the
condition evaluates to \texttt{true} and the inner block is entered. Inside the
inner block, the error is announced to the user (likely a developer running the
test in this case) and the testing object's \texttt{FailNow()} method is
called, which marks the test function as having failed and stops its execution.
In this case, that is the desired outcome, since if the database schema
creation call fails, there really is no point in continuing testing of user
creation.

Conversely, if the schema does get created without an error, the code continues
to declare a few variables: \texttt{username}, \texttt{email} and \texttt{ctx},
where the context injected with the logger is saved. Some of them are
subsequently passed into the \texttt{UsernameExists} function: \texttt{ctx} as
the first argument, followed by the database pointer and the username. The
\texttt{email} variable is only used at a later stage, but was declared here to
give a sense of grouping. The error value returned from this function is again
checked and, if everything goes well, the value of the \texttt{usernameFound}
boolean is checked next.

\smallskip
\begin{lstlisting}[language=Go, caption={Example integration test.},
label=integrationtest,basicstyle=\linespread{0.8}\footnotesize\ttfamily]
// modules/user/user_test.go
package user

import (
	"context"
	"testing"

	"git.dotya.ml/mirre-mt/pcmt/ent/enttest"
	"git.dotya.ml/mirre-mt/pcmt/slogging"
	_ "github.com/xiaoqidun/entps"
)

func getCtx() context.Context {
	l := slogging.Init(false)
	ctx := context.WithValue(context.Background(), CtxKey{}, l)

	return ctx
}

func TestUserExists(t *testing.T) {
	connstr := "file:ent_tests?mode=memory&_fk=1"
	db := enttest.Open(t, "sqlite3", connstr)
	defer db.Close()

	if err := db.Schema.Create(context.Background()); err != nil {
		t.Errorf("failed to create schema resources: %v", err)
		t.FailNow()
	}

	username := "dude"
	email := "dude@b.cc"
	ctx := getCtx()

	usernameFound, err := UsernameExists(ctx, db, username)
	if err != nil {
		t.Errorf("error checking for username {%s} existence: %q",
			username,
			err,
		)
	}

	if usernameFound {
		t.Errorf("unexpected: user{%s} should not have been found",
			username,
		)
	}

	// the first return value is assumed to be a bool, as with UsernameExists.
	emailFound, err := EmailExists(ctx, db, email)
	if err != nil {
		t.Errorf("error checking for email {%s} existence: %q", email, err)
	}

	if emailFound {
		t.Errorf("unexpected: user email '%s' should not have been found",
			email,
		)
	}

	usr, err := CreateUser(ctx, db, email, username, "so strong")
	if err != nil {
		t.Errorf("failed to create user, error: %q", err)
		t.FailNow()
	} else if usr == nil {
		t.Error("got nil usr back")
		t.FailNow()
	}

	if usr.Username != username {
		t.Errorf("got back wrong username, want: %s, got: %s",
			username, usr.Username,
		)
	} // ...more checks...
}
\end{lstlisting}

Since the database has just been created, there should be no users, which is
checked in the next \texttt{if} statement. The same check is then performed for
the earlier-declared user email, which is also expected not to be found.

The final statements of the described test attempt a user creation call, the
result of which is again checked for both an error and \emph{nilability}. The
test continues with more similar checks but it has been cut short for brevity.

A neat thing about error handling in Go is that it allows for very easy
checking of all paths, not just the \emph{happy path} where there are no
issues.


\n{2}{Testing environment}

The application has been deployed in a testing environment on the author's
modest Virtual Private Server (VPS) at \texttt{https://testpcmt.dotya.ml},
protected by a \emph{Let's Encrypt}-issued, short-lived ECDSA
\texttt{secp384r1} curve TLS certificate, and configured with a strict CSP. It
is a testing instance; therefore, limits to prevent abuse might be imposed.


% =========================================================================== %
\nn{Conclusion}

% =========================================================================== %