% =========================================================================== %
\part{Theoretical part}

\n{1}{Vision}

The thesis consists of two main parts: a theoretical one, which provides
background on the concepts and processes used in the other, so-called
\emph{practical} part, which in turn describes what exactly has been done and
how.

The application developed as part of this thesis should enable users to quickly
and privately check their credentials' compromise status against configured
local and online data sources. Of course, compromise monitoring depends on
access to quality data, which is in the purview of the application
administrator.

An in-application administrative-level user is able to configure online and
local data sources and to initially set up user accounts, but does not have
access to users' search queries or credential entries; more broadly, such a
user is not able to read sensitive user information. This is a consequence of
the application's architecture, in which sensitive user data is encrypted so
that not even administrative-level users are able to read it.


\n{1}{Cryptography primer}\label{sec:cryptographyprimer}

\n{2}{Encryption}

\textbf{TODO:} add \emph{why} we care and how it's going to be used.

\n{2}{Hash functions}

Hash functions are algorithms that help with a number of tasks: integrity
verification, password protection, digital signatures, public-key encryption
and others. Hashes are used in forensic analysis to prove the authenticity of
digital artifacts, by revision-based source code management systems such as Git
or Mercurial to uniquely identify a change-set, by anti-virus programs to
detect known-malicious software, by advanced filesystems to verify block
integrity and enable repairs, and in many other applications that every user of
a modern computing device has come across, such as when connecting to a website
protected by HTTPS.

The popularity of hash functions stems from a common use case: the need to
reliably identify a chunk of data in a simple way. Of course, two chunks of
data (two files, frames or packets) could always be compared bit by bit, but
that quickly becomes prohibitive in terms of both cost and energy. That is
where hash functions come in, since they take an arbitrarily long input and
produce a short, fixed-length output called a digest or a hash value. The
function also only works one way.

A file, or any original input data for that matter, cannot be reconstructed
from the hash digest alone by somehow \emph{reversing} the hashing operation,
since at the heart of any hash function there is essentially a compression
function, which discards information.
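
The fixed-length, one-way behaviour described above can be illustrated with a
minimal sketch. It assumes Python's standard \texttt{hashlib} module and
SHA-256 purely as an example; the helper name is hypothetical and not taken
from the application.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Inputs of very different sizes (1 byte versus roughly 1 MB).
short_digest = sha256_hex(b"a")
long_digest = sha256_hex(b"a" * 1_000_000)
changed_digest = sha256_hex(b"b")

# Both digests are 256 bits long, i.e. 64 hexadecimal characters.
print(len(short_digest), len(long_digest))

# A different input yields a different digest, so comparing two short
# digests can stand in for comparing the inputs bit by bit.
print(short_digest != changed_digest)
```

Note that the sketch only ever goes from input to digest; there is no
corresponding function that would turn a digest back into the input.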

Most alluringly, hashes are frequently used with the intent of
\emph{protecting} passwords by making them unreadable, while still making it
possible to verify that the user knows the password and should therefore be
authorised.

As the hashing operation is irreversible, once the one-way function produces a
short digest, there is no way to reconstruct the original message from it.
That is, unless the input of the hash function can be guessed, in which case
all it takes is hashing the supposed input and comparing the resulting digest
with existing digests that are known to be digests of passwords.
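
The guess-and-compare approach can be sketched as follows. This is an
illustration only: it assumes an unsalted SHA-1 digest and a made-up list of
candidate passwords, and all names are hypothetical.

```python
import hashlib
from typing import Iterable, Optional


def unsalted_sha1(password: str) -> str:
    """Hash a password with plain SHA-1 (fast and unsalted: a poor choice)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def find_password(leaked_digest: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the candidate whose digest matches the leaked one, if any."""
    for candidate in candidates:
        if unsalted_sha1(candidate) == leaked_digest:
            return candidate
    return None


# A digest as it might appear in a leak, and a made-up list of guesses.
leaked = unsalted_sha1("letmein")
common_passwords = ["123456", "password", "letmein", "qwerty"]

print(find_password(leaked, common_passwords))  # letmein
```

The digest is never reversed; the password is found only because it was
among the guesses.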

\textbf{TODO:} add more on \emph{why} we care and what types of hashes should be
used (with refs) and why.


\n{3}{Types and use cases}

Hash functions can be loosely categorised by their intended use case into
\emph{password protection hashes}, \emph{integrity verification hashes},
\emph{message authentication codes} and \emph{cryptographic hashes}. Each of
these possesses unique characteristics, and using the wrong type of hash
function for a job can result in a security breach.

As an example, consider \texttt{MD5}, a popular hash function built on the
\emph{Merkle-Damgård} construction, and \texttt{BLAKE3}, which is built on a
Merkle tree instead. While the former produces 128-bit digests, the latter
outputs a 256-bit digest by default and can extend its output to an arbitrary
length.

There is a list of differences that could further be mentioned; however, they
both have one thing in common: they are \emph{designed} to be \emph{fast}. The
latter, as a cryptographic hash function, is conjectured to be \emph{random
oracle indifferentiable} and secure against length extension, yet it is in
fact faster than all of \texttt{MD5}, \texttt{SHA3-256}, \texttt{SHA-1} and
even the \texttt{BLAKE2} family of functions.

The use case of both is to (quickly) verify the integrity of a given chunk of
data (in the case of \texttt{BLAKE3} with pre-image and collision resistance in
mind), not to secure a password by hashing it first, which poses a big issue
when they are used to do exactly that.

Password hash functions such as \texttt{argon2} or \texttt{bcrypt} are good
choices for securely storing hashed passwords, namely because they place a
deliberate CPU and memory burden on the machine that is computing the digest.
They also forcefully limit potential parallelism, thus restricting the scale at
which an exhaustive search could be launched. Additionally, both functions
automatically \emph{salt} the passwords before hashing them, which means that
identical passwords of two different users will not end up hashing to the same
digest value, making it that much harder to recover the original, supposedly
weak user-provided password.
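
The effect of salting and of a deliberately costly function can be sketched
as below. Neither \texttt{argon2} nor \texttt{bcrypt} is part of Python's
standard library, so this sketch swaps in \texttt{hashlib.scrypt}, another
memory-hard function, as a stand-in; the cost parameters are illustrative
assumptions, not tuned recommendations.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Illustrative cost parameters (assumed, not tuned): roughly 16 MiB of memory.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Two users choosing the same password end up with different digests.
salt_a, digest_a = hash_password("hunter2")
salt_b, digest_b = hash_password("hunter2")
print(digest_a != digest_b)
print(verify_password("hunter2", salt_a, digest_a))
```

The salt is stored next to the digest; it is not secret, it only ensures
that identical passwords do not produce identical digests.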


\n{3}{Why are hashes interesting}

As already mentioned, hashes are often stored as a representation of the
password instead of the password itself, which is where the allure comes from:
services storing hashed user passwords occasionally happen to leak them
involuntarily. Should the wrong type of hash be used for password hashing, weak
parameters be set, or the hash function simply be used improperly, the leak
sparks even more interest.

Historically, there have also been enough instances of leaked raw passwords
that anyone with enough interest can additionally put together a neat list of
the hashes of the most commonly used passwords.

So while a service might not be storing passwords in \emph{plain text}, which
is good practice, using a hash function not designed to protect passwords does
not offer much additional protection in the case of weak passwords, which
happen to be the most commonly used ones.

It would seem logical that a service that is not using cryptographic primitives
like hash functions correctly is more likely to get hacked and have its users'
passwords or password hashes leaked. Those are often exposed publicly with no
restrictions on access, with the internet serving as the storage and
distribution medium.

That incidentally also means that anyone interested in monitoring their own
compromise has at least \emph{some} chance of learning about it, potentially a
long time before it can be used to cause greater harm.


\n{2}{TLS}\label{sec:tls}

The Transport Layer Security protocol (TLS) serves as an encryption and
\emph{authentication} protocol to secure internet communications. An important
part of the protocol is the \emph{handshake}, during which the two
communicating parties exchange messages that acknowledge each other's presence,
verify each other, choose what cryptographic algorithms will be used and agree
on session keys. As there are multiple versions of the protocol in active use,
the server and the client need to agree upon the version they are going to use
(it is recommended to use either v1.2 or v1.3 these days) and pick a cipher
suite; the client then verifies the server's certificate (its public key and
the signature of the certificate authority that issued it), and both parties
derive session keys for use after the handshake completes.

TLSv1.3 dramatically reduced the number of available cipher suites to only
include the ones deemed secure enough, which is why it is no longer necessary
to manually specify which cipher suite should be used (or to rely on the
client/server to choose wisely). While it may face compatibility issues with
legacy devices, the simplicity brought by enabling TLSv1.3 might be considered
a worthy trade-off.
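
As an illustration of the version recommendation above, the following minimal
sketch configures a TLS client that refuses anything older than v1.2. It
assumes Python's standard \texttt{ssl} module; the function name is
hypothetical and the sketch does not show how the application itself is
configured.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that only allows TLS v1.2 or newer."""
    # The default context verifies the server certificate and its host name.
    context = ssl.create_default_context()
    # Refuse to negotiate protocol versions older than TLS v1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
print(context.minimum_version.name)
```

Setting \texttt{minimum\_version} to \texttt{ssl.TLSVersion.TLSv1\_3} instead
would trade compatibility with legacy peers for the simpler cipher suite
selection described above.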


\n{1}{Passwords}\label{sec:passwords}

Passwords have been in use since ancient times: apparently, Roman sentries
already used passwords or \textit{watchwords} to discern who was allowed to
enter an area. The Roman army had a special system of distributing passwords
among the encampment members on a wooden tablet. Fast forward a couple of
thousand years to the Prohibition Era in the United States, when secret
``speakeasies'' protected their illicit alcohol-serving business using
passwords~\cite{speakeasy}\cite{nisthistory}. During World War~II, the US
paratroopers' use of passwords evolved to even include a counter-password.

According to McMillan, the first \textit{computer} passwords date back to the
mid-1960s at the Massachusetts Institute of Technology (MIT), when researchers
at the university built a massive time-sharing computer called CTSS.
Apparently, \textit{even then} the passwords did not protect the users as well
as they were expected to~\cite{mcmillan}.

Traditionally, passwords were expected to be memorised, but the large number of
password-protected \emph{services} these days can make this impractical. To
list a few common examples, access to a bank account, an electronic mailbox or
a personal computer's encrypted disk is protected by some form of a password.

A password still typically consists of a \textit{string} of characters typed
into a prompt, and its function remains the same: as per NIST, it enables the
\textit{verifier} to infer the \textit{claimant}'s identity via a secret the
claimant holds.

There are always some arbitrary requirements applied to what the password can
be; only some turn out to be smarter than others.

Despite the impression given by the word ``password'', it does not need to be
an actual word; a non-word (in the dictionary sense) may indeed be harder to
guess, which is a desirable property of passwords. A memorized secret
consisting of a sequence of words or other text separated by spaces is
sometimes called a passphrase. A passphrase is similar to a password in usage,
but it is generally longer for added security.

\n{2}{Program-imposed constraints}

Some of the following examples might be a bit anecdotal and more of an
exception than a rule; nevertheless, when presented by a large-enough program
creator or service provider, such decisions reach a sufficiently large part of
the population that the author will call them influential. They shape how users
think when creating passwords and affect what users expect from other services
they visit and use from that point on.

\n{3}{Short arbitrary length}

It has been observed that a requirement for a ``strong'' password generally
means that the password:

\begin{itemize}
\item is longer than 7 characters,
\item is shorter than 11 characters,
\item begins with a letter and ends with a number OR
\item begins with a number and ends with a letter.
\end{itemize}

\obr{Short arbitrary password length
limit}{fig:arbitrarypasswdlengthlimit}{.8}{graphics/arbitrarypasswdlengthlimit.jpg}

The error message above the password input field depicted in
Figure~\ref{fig:arbitrarypasswdlengthlimit} is wrong for multiple reasons, and
it is a classic example of a short arbitrary length requirement~\cite{larsklint}.
It essentially prevents users from using passphrases and makes using a password
manager impractical, all of which has apparently been done ``because of
security''~\cite{etihad}. Moreover, this might be indicative of the fact that
instead of storing passwords hashed (as they should be), the service might be
storing them in \textbf{plain text}.
Otherwise, what reason could exist for the limit to be 10 characters?
The recommendation of the US National Institute of Standards and Technology
(NIST) in this regard is a minimum length of 8 characters, while verifiers
should permit passwords of at least 64 characters.

\n{3}{Restricting special characters}

Service providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords for as long as passwords have been
used to protect privileged access. The ways of achieving this may vary but the
intent stays the same: preventing users from inputting characters that the
system cannot comfortably handle, for ``reasons'' that are usually something
dubious along the lines of ``an apostrophe may be used in SQL injection
attacks'' or ``angle brackets may be used in XSS attacks''. Instead, the
message such a restriction often unwittingly sends points right to serious
shortcomings in the password handling of the site in question: passwords should
never be re-displayed in a context that is prone to Cross-Site Scripting (XSS),
and they should always be hashed before being sent to the database anyway,
which leaves only a digest in a restricted, predictable alphabet and renders
the SQLi fears baseless.

\obr{Forbidden special characters in
passwords}{fig:forbiddencharacters}{.8}{graphics/forbiddencharacters.jpg}

Note that ``Passw0rd!'' would have been a perfectly acceptable password for the
validator displayed in
Figure~\ref{fig:forbiddencharacters}~\cite{forbiddencharacters}. NIST's
recommendations on this matter are that all printing ASCII~\cite{asciirfc20}
characters as well as the space character SHOULD be acceptable in memorized
secrets and that Unicode~\cite{iso10646} characters SHOULD be accepted as well.
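
A password check in the spirit of these recommendations could look like the
following minimal sketch. The limits, the function name and the tiny list of
compromised passwords are illustrative assumptions; the upper bound in
particular is a hypothetical value chosen for the example.

```python
from typing import List, Set

MIN_LENGTH = 8    # minimum length for user-chosen passwords
MAX_LENGTH = 256  # hypothetical generous upper bound, well above 64


def check_password(password: str, compromised: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the password is accepted."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append("too short")
    if len(password) > MAX_LENGTH:
        problems.append("too long")
    # Spaces, printing ASCII and Unicode characters are all accepted;
    # only non-printable characters (e.g. control characters) are rejected.
    if any(not ch.isprintable() for ch in password):
        problems.append("contains non-printable characters")
    # No composition rules; instead, reject passwords known from breaches.
    if password in compromised:
        problems.append("found in breach data")
    return problems


known_bad = {"Passw0rd!", "password", "12345678"}
print(check_password("correct horse battery staple", known_bad))
print(check_password("Passw0rd!", known_bad))
```

With this approach, a long passphrase containing spaces is accepted, while
``Passw0rd!'' is rejected because it appears in breach data, not because of
which characters it contains.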

\n{3}{Character composition requirements}

There is a tendency to come up with bad passwords when character composition
requirements are in place, too. The reality is that instead of creating strong
passwords directly, most users first try a basic version and then keep tweaking
characters until the password fulfils the minimum requirement.
The \emph{problem} with that is that people have been shown to use similar
patterns, i.e.\ starting with a capital letter, putting a symbol last and a
number in the last two positions. This is also known to the cyber criminals
cracking passwords, who run their dictionary attacks using the common
substitutions, such as ``\$'' for ``s'', ``3'' for ``e'', ``1'' for ``l'',
``@'' for ``a'' etc. A password created in this manner will almost certainly be
bad, so all that is achieved is frustrating the user only to still arrive at a
bad password.

\n{3}{Other common issues}

Some services do not allow users to paste into password fields (disabling
pasting using JavaScript), thereby essentially breaking password manager
functionality, which is an issue because it encourages bad password practices
such as weak passwords and password reuse.

Another frequent issue is forced password rotation. Making frequent password
rotations mandatory contributes to users developing a password creation
pattern and is a modern-day security anti-pattern; according to the British
NCSC, the practice ``carries no real benefits as stolen passwords are generally
exploited immediately''~\cite{ncsc}.


\n{1}{Web security}\label{sec:websecurity}

The internet is a vast space full of intertwined concepts and ideas. It is a
superset of the Web, even though the two terms often get conflated: not
everything that is available on the internet can be accessed using web
protocols and \emph{resources}. This section delves into the concepts of web
security.


\n{2}{Site Isolation}\label{sec:siteisolation}

While website operators can take steps to secure their sites, it is often the
browsers that hold the last line of defence when web servers are misconfigured
and an attacker is thereby allowed to exploit a vulnerability in various ways.

Most users consume web content using web browsers. Modern browsers such as
Firefox or Chromium are built with a focus on security. Their developers are
acutely aware of the dangers posed by parsing untrusted code from the internet,
which is precisely what websites, stylesheets and the accompanying scripts are.

This necessarily gets reflected in the way these programs are architected.
Instead of the main, privileged browser process running everything directly, it
spawns de-privileged child processes for each website. This extra line of
defence should make it \emph{harder for untrustworthy websites to access or
steal information} from user accounts or other websites. Even if the
misbehaving website does manage to ``break some rules'' within its own process,
it should find it more difficult to steal data from other
sites~\cite{siteisolation}.

Firefox calls their version of \emph{Site Isolation}-like functionality Project
Fission~\cite{projectfission}, but the two are very similar, both in internal
architecture and in what they try to achieve. Elements of the web page are
scanned to decide whether they are allowed according to \emph{same-site}
restrictions and allocated shared or isolated memory based on the result.

Some Chromium users have complained about its high memory usage in the past,
which might have been partially caused by the Site Isolation user-protection
features, unbeknownst to them.


\n{2}{Cross-site scripting}\label{sec:xss}

As per the OWASP Top Ten list~\cite{owasptop10}, injection is the third most
observed issue across millions of websites. Cross-site scripting is a type of
injection attack in which malicious code, such as a malicious script, is
injected into a website that would otherwise be trusted. Since a
misconfiguration or a flaw of the application allowed this, the browser of the
victim, which trusts the website, simply executes the code provided by the
attacker. This code thus gains access to session tokens and any cookies
associated with the website's origin, apart from being able to rewrite the HTML
content. The results of XSS can range from account compromise to identity
theft.

Solutions deployed against XSS vary. On the client side, it mainly comes down
to good browser patching hygiene, browser features such as Site Isolation (see
Section~\ref{sec:siteisolation}), and essentially browsers correctly honouring
website directives such as the \texttt{X-Frame-Options},
\texttt{X-Content-Type-Options}, \texttt{X-XSS-Protection} and
\texttt{Cross-Origin-Opener-Policy} HTTP headers. However, the last of these
measures falls flat if website operators do not correctly configure their
websites.

On the server side, though, these options (indicating to the browsers
\emph{how} the site should be handled) can be configured directly. They
should be fine-tuned to fit the needs of each specific website.
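
A minimal, framework-independent sketch of setting such headers on the server
side follows. The header values shown are restrictive example choices and the
helper name is hypothetical; a real site would tune both to its needs.

```python
from typing import Dict

# Restrictive example values (assumed for illustration, not site-specific).
SECURITY_HEADERS: Dict[str, str] = {
    "X-Frame-Options": "DENY",                     # never allow framing
    "X-Content-Type-Options": "nosniff",           # no MIME type sniffing
    "Cross-Origin-Opener-Policy": "same-origin",   # isolate browsing context
}


def with_security_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return response headers with the security defaults filled in.

    Headers already set by the application take precedence, which allows
    per-page fine-tuning.
    """
    merged = dict(SECURITY_HEADERS)
    merged.update(headers)
    return merged


response_headers = with_security_headers({"Content-Type": "text/html"})
print(response_headers["X-Content-Type-Options"])  # nosniff
```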

Further, more than 10 years ago, a new, powerful and comprehensive framework
for controlling the admissibility of content was devised: Content Security
Policy. Its capabilities supersede those of the previously mentioned options
and it is discussed in more depth in the following section.


\n{2}{Content Security Policy}\label{sec:csp}

Content Security Policy (CSP) has been an important addition to the arsenal of
website operators, even though not everybody has necessarily been utilising it
properly or has even taken notice. Once configured on the web server, it
provides protection against the most common attack vectors on websites exactly
where the websites are being parsed and displayed: in the (compliant) browser.

As per Weichselbaum et al., CSP is a mechanism designed to mitigate
XSS~\cite{longliveCSP}, a long-lived king of the vulnerability
lists~\cite{reininginCSP}. It is a declarative policy mechanism that allows the
website operator to decide what client-side resources can load on their website
and what origins are permitted \emph{sources} of content.

For example, scripts can be restricted to only load from a list of trusted
domains and inline scripts can be blocked completely, which is a huge win
against popular XSS techniques.

Further, scripts and stylesheets can also be allowed based on a cryptographic
(SHA-256, SHA-384 or SHA-512) hash of their content, which should be known to
legitimate website operators prior to or at the time the scripts are served,
making sure that no unauthorised script or stylesheet is ever run on the
user's computer (running a compliant browser).
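
The hash-based source expression can be computed as in the following sketch:
the content is hashed and the Base64-encoded digest is prefixed with the
algorithm name. The function name and the inline script are made up for the
example.

```python
import base64
import hashlib

ALLOWED_ALGORITHMS = ("sha256", "sha384", "sha512")


def csp_hash_source(content: str, algorithm: str = "sha256") -> str:
    """Return a CSP source expression such as 'sha256-<base64 digest>'."""
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, content.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    # The single quotes are part of the source expression syntax.
    return f"'{algorithm}-{encoded}'"


# Hypothetical inline script whose exact content is known to the operator.
inline_script = "console.log('hello');"
source = csp_hash_source(inline_script)
print(f"script-src {source}")
```

Any change to the script, even a single added space, changes the digest, so a
compliant browser would refuse to run the modified script under this policy.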

A policy of CSPv3, which is the current iteration of the concept, can be served
either as an HTTP header or inside the website's \texttt{<meta>} tag.
Configuration is either site-wide or specific to each page.

Directive names are generally derived from the \emph{sources} they cover and
are thus often suffixed `-src', as in \texttt{script-src}, \texttt{img-src} or
\texttt{style-src}, although some directives do not follow this pattern:
\texttt{form-action}, \texttt{upgrade-insecure-requests} and \texttt{sandbox}
represent this other group nicely.

Directives are separated by a semicolon, and each directive should only appear
once in the entire policy.

Special values exist for the origin website itself (\texttt{'self'}) and for
disallowing \emph{any} source (\texttt{'none'}).

A good policy is \emph{targeted} and not overly broad. To give an example, a
website that loads no JavaScript at all does not need to allow a popular CDN
(Content Delivery Network) origin in its \texttt{script-src}; instead, it
should be set to \texttt{'none'}. CSP can also aid with clickjacking protection
using its \texttt{frame-ancestors} directive, which can limit the origins that
have permission to embed the website. This prevents an attacker from embedding
the website in arbitrary places, for example in malicious websites that
masquerade as legitimate ones, e.g.\ by utilising a `Log in using xyz service'
frame that in actuality just harvests the credentials.

Getting CSP right can be tricky at first, but once understood, it is relatively
straightforward and can greatly increase the security of the site.

The recommended way to \emph{test} CSP is to enable it in the
\emph{report-only} mode before turning it on in production.
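
Putting the pieces together, a targeted policy for a site that loads no
JavaScript could be assembled and served as in the following sketch. The
directive values and the image host \texttt{images.example.com} are
assumptions made up for the example, as are the helper names.

```python
from typing import Dict, List, Tuple

ENFORCE_HEADER = "Content-Security-Policy"
REPORT_ONLY_HEADER = "Content-Security-Policy-Report-Only"


def build_csp(directives: Dict[str, List[str]]) -> str:
    """Join directives with semicolons; a dict cannot repeat a directive."""
    parts = []
    for name, sources in directives.items():
        parts.append(" ".join([name] + sources))
    return "; ".join(parts)


def csp_header(policy: str, report_only: bool = True) -> Tuple[str, str]:
    """Return the (header name, value) pair, in report-only mode by default."""
    name = REPORT_ONLY_HEADER if report_only else ENFORCE_HEADER
    return name, policy


policy = build_csp({
    "default-src": ["'self'"],
    "script-src": ["'none'"],         # the site loads no JavaScript at all
    "img-src": ["'self'", "https://images.example.com"],
    "frame-ancestors": ["'none'"],    # nobody may embed the site
})
print(csp_header(policy))
```

Once the report-only policy has been observed in testing without unexpected
violations, calling \texttt{csp\_header} with \texttt{report\_only=False}
switches the same policy to enforcement.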

There are many more directives and settings than mentioned in this section; the
author encourages anybody interested to read more, e.g.\ at
\url{https://web.dev/csp/}.

\textbf{TODO}: add more concrete examples.


\n{1}{Configuration}

Every non-trivial program usually offers at least \emph{some} way to tweak or
manage its behaviour, and these changes are usually persisted \emph{somewhere}
on the filesystem of the host: in a local SQLite3 database, a
\emph{LocalStorage} key-value store in the browser, or a binary or plain text
configuration file. Such configuration needs to be read and checked at least on
program start-up and either kept in memory for the duration of the runtime of
the program, or loaded and parsed with the memory subsequently \emph{freed}
(initial configuration).

There is an abundance of configuration languages (or file formats used to craft
configuration files, whether they were intended for it or not) available: TOML,
INI, JSON and YAML, to name some of the popular ones (as of today).

Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn}~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), while at its core appearing very similar to
JSON, which adds to the familiar feel. In fact, in Dhall's authors' own words
it is: ``a programmable configuration language that you can think of as: JSON +
functions + types + imports''~\cite{dhalllang}.

Among all of the listed features, the one especially intriguing to the author
was the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating, for instance, the declaration
and usage of custom types (which are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool},
\emph{Natural} or \emph{List}, to name just a few), so it was not exceedingly
hard to start designing a custom configuration \emph{schema} for the program.
Dhall not being a Turing-complete language also guarantees that evaluation
\emph{always} terminates eventually, which is a good attribute for a
configuration language to possess.

\n{3}{Safety considerations}

Given a programmable configuration language that understands functions and
allows importing not only arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are safety mechanisms in place that the user can rely
on. Dhall offers these in multiple features: enforcing a same-origin policy and
(optionally) pinning a cryptographic hash of the value of the expression being
imported.

\n{3}{Possible alternatives}

While developing the program, the author has also come across certain
shortcomings of Dhall, namely long start-up with a \emph{cold cache}, which can
generally be observed when running the program in an environment that does not
allow writing the cache files (a read-only filesystem), or does not keep the
written cache files, such as a container that is not configured to mount a
persistent volume at the pertinent location.

To describe the way Dhall works when performing an evaluation: it resolves
every expression down to a combination of its most basic types (eliminating all
abstraction and indirection) in a process called
\textbf{normalisation}~\cite{dhallnorm} and then saves the result in the host's
cache. The \texttt{dhall-haskell} binary attempts to resolve the variable
\texttt{\$\{XDG\_CACHE\_HOME\}} (see the \emph{XDG Base Directory
Spec}~\cite{xdgbasedirspec} for details) to decide \emph{where} the results of
the normalisation will be written for repeated use. Do note that this behaviour
has been observed on a GNU/Linux host and the author has not verified it on a
non-GNU/Linux host, such as FreeBSD.

If normalisation is performed inside an ephemeral container (as opposed to, for
instance, an interactive desktop session), the results effectively get lost on
each container restart. That is both wasteful and not great for user
experience, since the normalisation of just a handful of imports (which
internally branch widely) can take upwards of two minutes, during which the
user is left waiting for the hanging application with no reporting on the
progress or current status.

While workarounds for the above-mentioned problem can be devised relatively
easily (such as bind mounting \emph{persistent} volumes inside containers to
\texttt{\$\{XDG\_CACHE\_HOME\}/dhall} and
\texttt{\$\{XDG\_CACHE\_HOME\}/dhall-haskell} in order to preserve the cache
between restarts, or letting the cache be pre-computed during container build,
since the application is only really expected to run together with a compatible
version of the configuration schema and this version \emph{is} known at
container build time), it would certainly feel better if there was no need to
work \emph{around} the configuration system of choice.

Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as an almost drop-in replacement for Dhall feature-wise, while also avoiding
the costly \emph{cold cache} normalisation operations, which are in the
author's view Dhall's principal flaw. In a slightly contrasting approach,
another emerging project called \texttt{TySON}
(\url{https://github.com/jetpack-io/tyson}), which uses \emph{a subset} of
TypeScript to also create a programmable, strictly typed configuration
language, opted to build on a well-known language instead of reinventing the
wheel, while still being able to retain feature parity with Dhall.


\n{1}{Compromise Monitoring}

There are, of course, several ways one could approach monitoring of compromised
credentials, some more \emph{manual} in nature than others. When using a
service that is suspected of having been breached or expected to be breached in
the future, one can always create a unique username/password combination
specifically for that service and never use that combination anywhere else.
That way, if the credentials ever \emph{do} happen to appear in a data dump
online in the future, it is going to be a safe assumption as to where they came
from.

Unfortunately, the task of actually \emph{monitoring} the credentials can prove
to be a little more arduous than one might expect at first. There are a couple
of points that can pose a challenge in case the search is performed by hand,
namely:

\begin{itemize}
\item finding the breached data to look through
\item verifying the trustworthiness of the data
\item varying quality of the data
\item sifting through (possibly) unstructured data by hand
\end{itemize}

Of course, as this is a popular topic for a number of people, the
above-mentioned work has already been packaged into neat and practical online
offerings. In case one decides in favour of using those, an additional range of
issues (the previous ones still being applicable) arises:

\begin{itemize}
\item the need to trust the provider of the service with input credentials
\item relying on the goodwill of the provider to be able to access the data
\item hoping that the terms of service are kept as promised
\item dependence on the quality and extent of their data sources
\end{itemize}

Besides that, there is a plethora of breaches floating around the Internet
available simply as zip files, which makes the job of password compromise
monitoring even harder.

The overarching goal of this thesis is to devise and implement a system in
which the user can \emph{monitor} whether their credentials have been
\emph{compromised} (at least as far as the data can tell), and to allow them to
do so without needing to entrust their sensitive data to a provider.


\n{2}{Data Sources}\label{sec:dataSources}
|
|
|
|
In this context, a data source is considered to be anything that provides the
application with data that it understands.
|
|
|
|
Of course, the results of credential compromise verification/monitoring are only
|
|
going to be as good as the data underpinning it, which is why it is imperative
|
|
that high-quality data sources be used, if at all possible. While choosing such
sources takes great care and is ultimately up to the administrator, the
application must at least offer the means to utilise them.
|
|
|
|
The sources from which breached data can be loaded into an application can be
|
|
split into two basic categories: \textbf{online} or \textbf{local}, and it is
|
|
possible to further discern between them by whether the data they provide is
|
|
\emph{structured} or not.
|
|
|
|
An online source is generally a service that ideally exposes a programmatic
|
|
API, which an application can query and from which it can request the necessary
|
|
subsets of data.
|
|
These types of services often additionally front the data by a user-friendly
|
|
web interface for one-off searches, which is, however, not of use here.
|
|
|
|
Examples of such online services include:
|
|
|
|
\begin{itemize}
|
|
\item {Have I Been Pwned?} - \url{https://haveibeenpwned.com}
|
|
\item {DeHashed} - \url{https://dehashed.com}
|
|
\end{itemize}
|
|
|
|
Large lumps of unstructured data available on forums or shady web servers would
|
|
technically also count here, given that they provide data and are available
|
|
online. However, even though data is frequently found online precisely in this
form, it is not of direct use for the application without manual
\emph{preprocessing}, which is addressed in
Section~\ref{sec:localDatasetPlugin}.
|
|
|
|
Another source is then simply any locally supplied data, which, of course,
|
|
could have been obtained from a breach available online beforehand.
|
|
|
|
Locally supplied data is specific in that it needs to be formatted in such a
|
|
way that it can be understood by the application. That is, the data cannot be
|
|
in its raw form anymore but has to have been morphed into the precise shape the
|
|
application needs for further processing. Once imported, the application can
|
|
query the data at will, as it knows exactly the shape of it.
|
|
|
|
This supposes the existence of a \emph{format} for importing, the schema of which
|
|
is devised in Section~\ref{sec:localDatasetPlugin}.
|
|
|
|
|
|
\n{3}{Local Dataset Plugin}\label{sec:localDatasetPlugin}
|
|
|
|
Unstructured breach data from locally available datasets can be imported into
|
|
the application by first making sure it adheres to the specified schema (have a
|
|
look at the \emph{Breach Data Schema} in Listing~\ref{breachDataGoSchema}). If
|
|
it does not (which is very likely with random breach data, as already mentioned
|
|
in Section~\ref{sec:dataSources}), it needs to be converted to a form that
|
|
\emph{does} before importing it to the application, e.g.\ using a Python script
|
|
or a similar method.
|
|
|
|
Attempting to import data that does not follow the outlined schema should
|
|
result in an error. Equally so, importing a dataset which is over a reasonable
|
|
size limit should by default be rejected by the program as a precaution.
|
|
Unmarshaling, for instance, a 1 TiB document would most likely result in an
|
|
out-of-memory (OOM) situation on the host running the application, assuming
|
|
contemporary consumer hardware conditions (not HPC).
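
The size precaution described above can be illustrated with a minimal sketch.
It assumes that the dataset arrives as a stream and that the limit is a
configurable number of bytes; the names \texttt{readLimited} and
\texttt{errTooLarge} are hypothetical and not taken from the application
itself. The reader is capped one byte past the limit, so an oversized document
is rejected before it is ever unmarshaled in full.

\begin{lstlisting}[language=Go, caption={A sketch of rejecting oversized
  datasets before unmarshaling (hypothetical helper names)},
  backgroundcolor=\color{lstbg}
]
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errTooLarge is a hypothetical sentinel error for oversized imports.
var errTooLarge = errors.New("dataset exceeds the import size limit")

// readLimited reads at most limit bytes from r and fails if more follow.
func readLimited(r io.Reader, limit int64) ([]byte, error) {
	// Read one byte past the limit so an oversized input is detectable.
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

func main() {
	// The limit of 8 bytes is deliberately tiny for demonstration.
	_, err := readLimited(strings.NewReader("name: Horrible breach"), 8)
	fmt.Println(err)
}
\end{lstlisting}

Only data that passes this check would then be handed to the YAML decoder, at
which point schema violations can be reported as a separate kind of error.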
|
|
|
|
\vspace{\parskip}
|
|
\begin{lstlisting}[language=Go, caption={Breach Data Schema represented as a Go
|
|
struct with imports from the standard library assumed},
|
|
label=breachDataGoSchema,
|
|
backgroundcolor=\color{lstbg},
|
|
morekeywords={any}
|
|
]
|
|
type breachDataSchema struct {
	Name              string
	Time              time.Time
	IsVerified        bool
	ContainsPasswords bool
	ContainsHashes    bool
	HashType          string
	HashSalted        bool
	HashPeppered      bool
	ContainsUsernames bool
	ContainsEmails    bool
	Data              any
}
|
|
\end{lstlisting}
|
|
\vspace*{-\baselineskip}
|
|
|
|
The Go representation shown in Listing~\ref{breachDataGoSchema} will in
|
|
actuality translate to a YAML document written and supplied by an
|
|
administrative user of the program. The YAML format was chosen for several
|
|
reasons:
|
|
|
|
\begin{itemize}
|
|
\item relative ease of use (plain text, readability)
|
|
\item capability to store multiple \emph{documents} inside of a single file
|
|
\item most of the inputs being implicitly typed as strings
|
|
\item support for inclusion of comments
|
|
\item machine readability thanks to being a superset of JSON
|
|
\end{itemize}
|
|
|
|
The last point specifically should allow for documents similar to what can be
|
|
seen in Listing~\ref{breachDataYAMLSchema} to be ingested by the program, read
|
|
and written by humans and programs alike.
|
|
|
|
\smallskip
|
|
\begin{lstlisting}[style=yaml,
|
|
caption={Example Breach Data Schema supplied to the program as a YAML file,
|
|
optionally containing multiple documents},
|
|
label=breachDataYAMLSchema,
|
|
backgroundcolor=\color{lstbg},
|
|
]
|
|
---
|
|
name: Horrible breach
|
|
time: 2022-04-23T00:00:00+02:00
|
|
isVerified: false
|
|
containsPasswords: false
|
|
containsHashes: true
|
|
containsEmails: true
|
|
hashType: md5
|
|
hashSalted: false
|
|
hashPeppered: false
|
|
data:
  hashes:
    - hash1
    - hash2
    - hash3
  emails:
    - email1
    - ""
    - email3
|
|
---
|
|
# document #2, describing another breach.
|
|
name: Horrible breach 2
|
|
...
|
|
\end{lstlisting}
|
|
\vspace*{-\baselineskip}
|
|
|
|
Notice how the emails list (\texttt{.data/emails}) in
|
|
Listing~\ref{breachDataYAMLSchema} is missing one record, perhaps because it was
|
|
not supplied or mistakenly omitted. This is a valid scenario (mistakes happen)
|
|
and the application needs to be able to handle it. The alternative would be to
|
|
require the user to prepare the data in such a way that the empty/partial
|
|
records would be dropped entirely.
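
One way of tolerating such gaps is sketched below. The sketch assumes that the
hash and email lists correspond to each other by position, which the example
document suggests but the schema does not enforce; the \texttt{record} type and
the \texttt{pairRecords} function are hypothetical names used for illustration
only. Entries with a single empty field are kept as partial records, while
entries that are empty on both sides are skipped.

\begin{lstlisting}[language=Go, caption={A sketch of pairing hashes with emails
  while tolerating partial records (hypothetical names)},
  backgroundcolor=\color{lstbg}
]
package main

import (
	"errors"
	"fmt"
)

// record is a hypothetical pairing of one hash with one email address.
type record struct {
	Hash  string
	Email string
}

// pairRecords zips hashes and emails by position. Entries with one empty
// field are kept as partial records; fully empty entries are skipped.
func pairRecords(hashes, emails []string) ([]record, error) {
	if len(hashes) != len(emails) {
		return nil, errors.New("hashes and emails differ in length")
	}
	records := make([]record, 0, len(hashes))
	for i := range hashes {
		if hashes[i] == "" && emails[i] == "" {
			continue
		}
		records = append(records, record{Hash: hashes[i], Email: emails[i]})
	}
	return records, nil
}

func main() {
	recs, err := pairRecords(
		[]string{"hash1", "hash2", "hash3"},
		[]string{"email1", "", "email3"},
	)
	fmt.Println(len(recs), err)
}
\end{lstlisting}

Keeping the partial record means that a search by hash can still succeed even
though the corresponding email address is unknown.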
|
|
|
|
\n{3}{Have I Been Pwned? Integration}
|
|
|
|
Troy Hunt's \textbf{Have I Been Pwned?} online service
|
|
(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
|
|
compromised data. The service offers public APIs that are protected by API keys
to prevent abuse.
|
|
|
|
The application's \texttt{hibp} module and database representation model
|
|
the values returned by this API, which allows searching in large breaches using
|
|
email addresses.
|
|
|
|
The architecture there is relatively simple: the application administrator
|
|
configures an API key for the HIBP service, the user enters the query
|
|
parameters, the application constructs a query, calls the API and waits for a
|
|
response. As the API is rate-limited based on the key supplied, this can pose
|
|
an issue and it has not been fully resolved in the UI. The application then
|
|
parses the returned data and binds it to the local model for validation. If
|
|
that goes well, the data is saved into the database as a cache and the search
|
|
query is performed on the saved data. If it returns anything, it is displayed
|
|
to the user for browsing.
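
The query construction step can be sketched as follows. The sketch is not the
application's actual \texttt{hibp} module: the function names and the
user-agent string are hypothetical, and it relies on the publicly documented
version 3 of the HIBP API, which expects the key in the \texttt{hibp-api-key}
header and signals rate limiting with the HTTP status 429. The request is only
built, not sent, so the example runs without network access or a valid key.

\begin{lstlisting}[language=Go, caption={A sketch of building a breached-account
  lookup request for the HIBP API (hypothetical helper names)},
  backgroundcolor=\color{lstbg}
]
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
)

// hibpBaseURL is the HIBP v3 endpoint for breached-account lookups.
const hibpBaseURL = "https://haveibeenpwned.com/api/v3/breachedaccount/"

// newBreachRequest builds (but does not send) a lookup for account.
func newBreachRequest(
	ctx context.Context, apiKey, account string,
) (*http.Request, error) {
	endpoint := hibpBaseURL + url.PathEscape(account) +
		"?truncateResponse=false"
	req, err := http.NewRequestWithContext(
		ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("hibp-api-key", apiKey)
	// The user-agent value is a made-up client name.
	req.Header.Set("user-agent", "breach-monitor-example")
	return req, nil
}

// shouldRetry reports whether a response status signals rate limiting.
func shouldRetry(statusCode int) bool {
	return statusCode == http.StatusTooManyRequests
}

func main() {
	req, err := newBreachRequest(
		context.Background(), "example-key", "user@example.com")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Method, req.URL.Host)
}
\end{lstlisting}

A caller that receives a rate-limited response would typically wait for the
period indicated by the service before repeating the request, instead of
surfacing the failure directly to the user.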
|
|
|
|
|
|
\n{1}{Deployment recommendations}\label{sec:deploymentRecommendations}
|
|
|
|
It is, of course, recommended that the application runs in a secure environment,
|
|
\allowbreak although definitions of that almost certainly differ depending on
|
|
who you ask. General recommendations would be either to effectively reserve a
|
|
machine for a single use case - running this program - so as to dramatically
|
|
decrease the potential attack surface of the host, or run the program isolated
|
|
in a container or a virtual machine. Further, if the host does not need
|
|
management access (it is a deployed-to-only machine that is configured
|
|
out-of-band, such as with a \emph{golden} image/container or declaratively with
|
|
Nix), then an SSH \emph{daemon} should not be running in it, since it is not
|
|
needed. In an ideal scenario, the host machine would have as little software
|
|
installed as possible besides what the application absolutely requires.
|
|
|
|
System-wide cryptographic policies should target the highest feasible security
level, if at all available (such as by default on Fedora or RHEL), covering
SSH, DNSSEC, IPsec, Kerberos and TLS protocols. Firewalls should be configured
|
|
and SELinux (kernel-level mandatory access control and security policy
|
|
mechanism) running in \emph{enforcing} mode, if available.
|
|
|
|
\n{2}{Transport security}
|
|
|
|
Users connecting to the application should rightfully expect their data to
be protected \textit{in transit} (i.e.\ on the way between their browser and
the server), which is what the \emph{Transport Layer Security} family of
|
|
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
|
|
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
|
|
client authenticate the server (verify that it is who it claims it is) and
|
|
negotiate a symmetric key for encryption in the process named the \emph{TLS
|
|
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
|
|
which is establishing a secure communications connection. The operator should
|
|
configure the program to either directly utilise TLS using configuration or
|
|
have it listen behind a TLS-terminating \emph{reverse proxy}.
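
For the first option, a minimal sketch of a Go HTTP server restricted to
TLS~1.3 is shown below. It is an illustration under stated assumptions and not
the program's actual configuration code: the function name
\texttt{newTLSServer}, the listen address and the certificate file names are
all hypothetical.

\begin{lstlisting}[language=Go, caption={A sketch of an HTTP server that only
  accepts TLS 1.3 connections (hypothetical names and paths)},
  backgroundcolor=\color{lstbg}
]
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// newTLSServer returns a server that refuses anything below TLS 1.3.
func newTLSServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 10 * time.Second,
		TLSConfig:         &tls.Config{MinVersion: tls.VersionTLS13},
	}
}

func main() {
	srv := newTLSServer(":8443", http.NewServeMux())
	fmt.Println(srv.TLSConfig.MinVersion == tls.VersionTLS13)
	// To actually serve, the operator would supply a certificate and key:
	// err := srv.ListenAndServeTLS("cert.pem", "key.pem")
}
\end{lstlisting}

When a reverse proxy terminates TLS instead, the equivalent minimum protocol
version would be set in the proxy's configuration and the application would
listen on a local address only.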
|
|
|
|
|
|
\n{2}{Containerisation}
|
|
|
|
Whether the pre-built or a custom container image is used to deploy the
|
|
application, it still needs access to secrets, such as the database connection
|
|
string (containing database host, port, user, password/encrypted password,
|
|
authentication method and database name).
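
How such a connection string could be assembled from values injected into the
container is sketched below. The environment variable names and the
\texttt{buildDSN} function are hypothetical; the URL form of the connection
string and the \texttt{sslmode} parameter follow the PostgreSQL client
conventions. Building the string through \texttt{net/url} ensures that special
characters in the password are escaped correctly.

\begin{lstlisting}[language=Go, caption={A sketch of assembling a Postgres
  connection string from environment variables (hypothetical names)},
  backgroundcolor=\color{lstbg}
]
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
)

// buildDSN assembles a Postgres connection URL that requires a verified
// TLS connection to the database.
func buildDSN(host, port, user, password, dbName string) string {
	q := url.Values{}
	q.Set("sslmode", "verify-full")
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(user, password),
		Host:     net.JoinHostPort(host, port),
		Path:     "/" + dbName,
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	// The variable names below are assumptions made for this example.
	dsn := buildDSN(
		os.Getenv("DB_HOST"), os.Getenv("DB_PORT"),
		os.Getenv("DB_USER"), os.Getenv("DB_PASSWORD"),
		os.Getenv("DB_NAME"),
	)
	// The DSN contains a secret, so only its length is printed.
	fmt.Println("DSN length:", len(dsn))
}
\end{lstlisting}

Since the resulting string carries the password, it should never be written to
logs or error messages.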
|
|
|
|
The application should be able to handle the most common Postgres
|
|
authentication methods~\cite{pgauthmethods}, namely \emph{peer},
|
|
\emph{scram-sha-256}, \emph{user name maps} and raw \emph{password}, although
|
|
the \emph{password} option should not be used in production, \emph{unless} the
|
|
connection to the database is protected by TLS.\ In any case, using the
|
|
\emph{scram-sha-256}~\cite{scramsha256rfc7677} method is preferable. One of the
|
|
ways to verify in a development environment that everything works as intended is
|
|
the \emph{Password generator for PostgreSQL} tool~\cite{goscramsha256}, which
|
|
allows retrieving the encrypted string from a raw user input.
|
|
|
|
If the application running in a container wants to use the \emph{peer}
|
|
authentication method, it is up to the operator to supply the Postgres socket
|
|
to the application (e.g.\ as a volume bind mount). This scenario was not
tested, however, and the author is also not entirely certain how \emph{user
namespaces} (on GNU/Linux) would influence the process (as in when the
\emph{UIDs} of a user \textbf{outside} the container are mapped to a range of
|
|
\emph{UIDs} \textbf{inside} the container), for which the setup would likely
|
|
need to account.
|
|
|
|
Equally, if the application is running inside the container, the operator needs
|
|
to make sure that the database is either running in a network that is also
|
|
directly attached to the container or that there is a mechanism in place that
|
|
routes the requests for the database hostname to the destination.
|
|
|
|
One such mechanism is container name based routing inside \emph{pods}
|
|
(Podman/Kubernetes), where the resolution of container names is the
|
|
responsibility of a specially configured (often auto-configured) piece of
|
|
software called Aardvark for the former and CoreDNS for the latter.
|
|
|
|
|
|
\n{1}{Summary}
|
|
|
|
Passwords (and/or passphrases) are in use everywhere and quite probably will be
|
|
for the foreseeable future. If not as \textit{the} principal way to
|
|
authenticate, then at least as \textit{a} way to authenticate. As long as
|
|
passwords are going to be handled and stored by service/application providers,
|
|
they are going to get leaked, be it due to provider carelessness or the
|
|
attackers' resolve and wit. Of course, sifting through all the available
|
|
password breach data by hand is not a reasonable option, and therefore tools
|
|
providing assistance come in handy. The next part of this diploma thesis will
|
|
explore that issue and introduce a solution.
|
|
|
|
|
|
% =========================================================================== %
|