% =========================================================================== %
\part{Theoretical part}

\n{1}{Vision}

The thesis consists of two main parts: a theoretical one, which provides
background on the concepts and processes used in the other, so-called
\emph{practical} part, which in turn describes what exactly has been done and
how.

The application developed as part of this thesis should enable users to quickly
and privately check their credentials' compromise status against configured
local and online data sources. Of course, the compromise monitoring depends on
access to quality data, which is in the purview of the application
administrator.

An in-application administrative-level user is able to configure online and
local data sources and to initially set up user accounts, but does not have
access to users' search queries or credential entries or, more broadly, is not
able to read sensitive user information. This is enabled by the architectural
decisions made for the application, whereby sensitive user data is encrypted so
that not even administrative-level users are able to read it.


\n{1}{Cryptography primer}\label{sec:cryptographyprimer}

\n{2}{Encryption}

\textbf{TODO:} add \emph{why} we care and how it's going to be used.

\n{2}{Hash functions}

Hash functions are algorithms used to help with a number of things: integrity
verification, password protection, digital signatures, public-key encryption
and others. Hashes are used in forensic analysis to prove the authenticity of
digital artifacts, to uniquely identify a change-set within revision-based
source code management systems such as Git or Mercurial, by anti-virus programs
to detect known-malicious software, by advanced filesystems to verify block
integrity and enable repairs, and in many other applications that every person
using a modern computing device has come across, such as when connecting to a
website protected by HTTPS.

The popularity of hash functions stems from a common use case: the need to
reliably identify a chunk of data in a simple way. Of course, two chunks of
data, two files, frames or packets could always be compared bit by bit, but
that can quickly become prohibitive in terms of both cost and energy. That is
where hash functions come in, since they are able to take a long input and
produce a short, fixed-size output called a digest or a hash value. The
function also only works one way.

A file, or any original input data for that matter, cannot be reconstructed
from the hash digest alone by somehow \emph{reversing} the hashing operation,
since at the heart of any hash function there is essentially a compression
function, which discards information about its input.
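
The following minimal Python sketch illustrates the idea using SHA-256 from the
standard \texttt{hashlib} module. The helper name \texttt{digest\_of} and the
sample inputs are illustrative assumptions and not part of the application
described in this thesis. Two inputs are compared through their short digests
instead of bit by bit, and the digest size stays the same regardless of the
input length.

```python
import hashlib


def digest_of(chunks):
    """Return the hex SHA-256 digest of an iterable of byte chunks."""
    h = hashlib.sha256()
    for chunk in chunks:  # feed the input piece by piece, e.g. file blocks
        h.update(chunk)
    return h.hexdigest()


original = [b"first block of a file, ", b"second block of a file"]
copy = [b"first block of a file, second block of a file"]
tampered = [b"first block of a file, second block of a fil3"]

# Same content yields the same digest, however the input was chunked.
same = digest_of(original) == digest_of(copy)
# A one-character change yields a different digest.
changed = digest_of(original) != digest_of(tampered)
# A SHA-256 digest is always 32 bytes, i.e. 64 hexadecimal characters.
fixed_size = len(digest_of([b"x" * 1_000_000])) == 64
```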

Most alluringly, hashes are frequently used with the intent of
\emph{protecting} passwords by making those unreadable, while still making it
possible to verify that the user knows the password and should therefore be
authorised.

As the hashing operation is irreversible, once the one-way function produces a
short digest, there is no way to reconstruct the original message from it.
That is, unless the input of the hash function can be guessed, in which case
all it takes is hashing the supposed input and comparing the digest with
existing digests that are known to be digests of passwords.
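
The guessing approach can be sketched in a few lines of Python. Unsalted SHA-1
stands in here for a hash function that is unsuitable for passwords; the
function names, the word list and the ``leaked'' digests are all hypothetical
and serve only as an illustration.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Unsalted SHA-1 of a password, standing in for a 'wrong tool' hash."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def crack(leaked_digests, candidates):
    """Map each leaked digest to a candidate password that produces it."""
    lookup = {sha1_hex(word): word for word in candidates}  # precomputed table
    return {d: lookup[d] for d in leaked_digests if d in lookup}


# Hypothetical leak: digests only, no plain-text passwords.
leaked = [sha1_hex("123456"), sha1_hex("correct horse battery staple")]
wordlist = ["password", "123456", "qwerty"]

recovered = crack(leaked, wordlist)
```

Only the digest whose input appears in the word list is recovered; the long
passphrase stays unknown to this particular search.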


\n{3}{Types and use cases}

Hash functions can be loosely categorised by their intended use case into
\emph{password protection hashes}, \emph{integrity verification hashes},
\emph{message authentication codes} and \emph{cryptographic hashes}. Each of
these possesses unique characteristics, and using the wrong type of hash
function for a given job can result in a security breach.

As an example, consider \texttt{MD5}, a popular hash function built on the
\emph{Merkle-Damgård} construction, and \texttt{BLAKE3}, which is internally
built on a Merkle tree. The former produces 128-bit digests, while the latter
produces 256 bits of output by default and, thanks to the extensibility of its
Merkle tree design, can produce output of any length below $2^{64}$ bytes.
There is a list of differences that could further be mentioned; however, they
both have one thing in common: they are \emph{designed} to be \emph{fast}. The
latter, as a cryptographic hash function, is conjectured to be \emph{random
oracle indifferentiable} and secure against length extension, yet it is in fact
faster than all of \texttt{MD5}, \texttt{SHA3-256}, \texttt{SHA-1} and even the
\texttt{BLAKE2} family of functions~\cite{blake3}.

The use case of both is to (quickly) verify the integrity of a given chunk of
data, in the case of \texttt{BLAKE3} with pre-image and collision resistance in
mind, not to secure a password by hashing it first, which poses a big issue
when they are used to\ldots{} secure passwords by hashing them first.

Password hashing functions such as \texttt{argon2} or \texttt{bcrypt} are good
choices for \emph{securely} storing hashed passwords, namely because they place
a CPU and memory burden on the machine that is computing the digest. In the
case of the mentioned functions, \emph{hardness} is even configurable to suit a
wide array of scenarios. These functions also forcefully limit potential
parallelism, thereby restricting the scale at which exhaustive searches
performed using tools like \texttt{Hashcat} or \texttt{John the Ripper} could
be at all feasible, practically obviating old-school hash
cracking~\cite{hashcracking},~\cite{hashcracking2}. Additionally, both
functions can automatically add a random \emph{salt} to passwords, thereby
ensuring that copies of the same password provided by different users do not
end up hashing to the same digest value.
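
A minimal sketch of salted, tunable password hashing follows. Neither
\texttt{argon2} nor \texttt{bcrypt} ships with the Python standard library, so
the sketch swaps in \texttt{scrypt} (available as \texttt{hashlib.scrypt}),
which is likewise a memory-hard function with configurable cost. The cost
parameters and function names are illustrative assumptions, not a
recommendation.

```python
import hashlib
import hmac
import os

# Illustrative cost parameters (assumptions); real deployments tune these.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "maxmem": 64 * 1024 * 1024}


def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt,
                            dklen=32, **SCRYPT_PARAMS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Two users choosing the same password end up with different digests.
salt_a, digest_a = hash_password("hunter2")
salt_b, digest_b = hash_password("hunter2")
```

Raising \texttt{n} increases both the time and the memory needed per guess,
which is the configurable \emph{hardness} discussed above.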


\n{3}{Why are hashes interesting}

As already hinted, hashes are often used to store a \emph{logical proof of the
password}, rather than the password itself. Services storing hashed user
passwords, in particular, happen to leak them involuntarily. Using the wrong
type of hash for password hashing, weak hash function parameters, reusing a
\emph{salt} or inadvertently \emph{misusing} the hash function in some other
way is a sure way to spark a lot of
interest~\cite{megatron},~\cite{linkedin1},~\cite{linkedin2}.

Historically, plain-text passwords have also leaked enough times (or weak
hashes have been cracked) that anyone with enough interest has had more than
enough time to put together neat lists of hashes of the most commonly used
passwords~\cite{rockyou},~\cite{plaintextpasswds1},~\cite{plaintextpasswds2},~\cite{plaitextpasswds3}.

So while a service might not be storing passwords in \emph{plain text}, which
is a good practice, using a hashing function not designed to protect passwords
does not offer much additional protection in the case of weak passwords, which
happen to be the ones that are the most commonly used.

It would seem only logical that a service that is not using cryptographic
primitives like hash functions correctly is more likely to get hacked and have
its users' passwords or password hashes leaked. Those are then often exposed
publicly with no restrictions on access, with the internet serving as both the
storage and the distribution medium.

That incidentally also means that anyone interested in their own compromise
monitoring has at least \emph{some} chance of learning about their compromise,
potentially a long time before it can be used to cause greater harm.


\n{2}{TLS}\label{sec:tls}

The Transport Layer Security protocol (or TLS) serves as an encryption and
\emph{authentication} protocol to secure internet communications. An important
part of the protocol is the \emph{handshake}, during which the two
communicating parties exchange messages that acknowledge each other's presence,
verify each other, choose what cryptographic algorithms will be used and decide
session keys. As there are multiple versions of the protocol in active use, the
server and the client need to agree upon the version they are going to use (it
is recommended to use either v1.2 or v1.3 these days) and pick a cipher suite
(if applicable); the client then verifies the server's public key (and the
signature of the certificate authority that issued it), and both parties
generate session keys for use after handshake completion.

TLSv1.3 dramatically reduced the number of available suites to only include the
ones deemed secure enough, which is why it is no longer necessary to manually
specify what cipher suite should be used (or to rely on the client/server to
choose wisely). While it may face compatibility issues with legacy devices, the
simplicity brought by enabling TLSv1.3 might be considered a worthy
trade-off~\cite{tls13rfc8446}.
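
As a brief illustration, the sketch below uses Python's standard \texttt{ssl}
module to build a client-side context that refuses any protocol version older
than TLSv1.3. The function name is a hypothetical choice, and the sketch
assumes that the underlying TLS library supports TLSv1.3.

```python
import ssl


def tls13_only_client_context() -> ssl.SSLContext:
    """Client-side context that refuses anything older than TLSv1.3."""
    ctx = ssl.create_default_context()  # verifies certificates and hostnames
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


ctx = tls13_only_client_context()
```

No cipher suite is configured anywhere in the sketch, which mirrors the point
made above: with TLSv1.3 the choice can be left to the defaults.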


\n{1}{Passwords}\label{sec:passwords}

Passwords have been in use since ancient times; apparently, Roman sentries
already used passwords or \textit{watchwords} to discern who was allowed to
enter an area. The Roman army had a special system of distributing passwords
among the encampment members on a wooden tablet. Fast forward a couple of
thousand years to the Prohibition Era in the United States, when it was the
secret ``speakeasies'' that protected their illegitimate alcohol-serving
business using passwords~\cite{speakeasy},~\cite{nisthistory}. During World
War~II, the use of passwords by US paratroopers evolved to even include a
counter-password.

According to McMillan, the first \textit{computer} passwords date back to the
mid-1960s at the Massachusetts Institute of Technology (MIT), when researchers
at the university built a massive time-sharing computer called CTSS.
Apparently, \textit{even then} the passwords did not protect the users as well
as they were expected to~\cite{mcmillan}.

Traditionally, passwords were expected to be memorised, but the large number of
password-protected \emph{services} these days can make this impractical. To
list a few common examples, access to a bank account, an electronic mailbox or
the encrypted disk of a personal computer is protected by some form of a
password.

A password still often consists of a \textit{string} of characters typed into a
prompt, and its function remains the same: as per NIST, it enables the
\textit{verifier} to infer the \textit{claimant}'s identity via a secret the
claimant holds.

There are always some arbitrary requirements applied to what the password can
be; only some turn out to be smarter than others.

Despite the impression given by the word ``password'', it does not need to be
an actual word; a non-word (in the dictionary sense) may indeed be harder to
guess, which is a desirable property of passwords. A memorized secret
consisting of a sequence of words or other text separated by spaces is
sometimes called a passphrase. A passphrase is similar to a password in usage,
but the former is generally longer for added security.

\n{2}{Program-imposed constraints}

Some of the following examples might be a bit anecdotal and more of an
exception than a rule; nevertheless, when made by a large enough program
creator or service provider, such decisions reach a sufficient share of the
population that the author will call them influential. They shape how users
think when creating passwords and also affect what users expect from other
services they happen to visit and use from that point on.

\n{3}{Short arbitrary length}

It has been observed that a requirement for a ``strong'' password generally
means that the password:

\begin{itemize}
\item is longer than 7 characters,
\item is shorter than 11 characters,
\item begins with a letter and ends with a number, OR
\item begins with a number and ends with a letter.
\end{itemize}

\obr{Short arbitrary password length
limit}{fig:arbitrarypasswdlengthlimit}{.8}{graphics/arbitrarypasswdlengthlimit.jpg}

The error message above the password input field depicted in
Figure~\ref{fig:arbitrarypasswdlengthlimit} is wrong for multiple reasons, and
it is a classic example of a short arbitrary length
requirement~\cite{larsklint}. It essentially prevents users from using
passphrases and makes using a password manager impractical, and all of that has
apparently been done ``because of security''~\cite{etihad}. Moreover, this
might be indicative of the fact that instead of storing passwords hashed (as
they should be), the service might be storing them in \textbf{plain text}.
Otherwise, what reason could exist for the limit to be 10 characters?
The recommendation of the US National Institute of Standards and Technology
(NIST) in this regard is that the permitted maximum length be at least 64
characters, with an upper bound of 256 characters, which, as they put it,
\emph{should be sufficient for most users' needs}.

\n{3}{Restricting special characters}

For as long as passwords have been used to protect privileged access, service
providers have too often been found forbidding the use of so-called
\textit{special characters} in passwords. The ways of achieving this may vary,
but the intent stays the same: preventing users from inputting characters that
the system cannot comfortably handle, for ``reasons'' that are usually
something dubious along the lines of ``an apostrophe may be used in SQL
injection attacks'' or ``angle brackets may be used in XSS attacks''. What such
a restriction often unwittingly announces instead is a serious shortcoming in
the password handling of the site in question, as passwords should never be
re-displayed in a context that is prone to Cross-Site Scripting (XSS), and they
should always be hashed before being sent to the database anyway, which leaves
only the characters of the encoded digest (alphanumeric in the usual
hexadecimal encoding) and renders the SQL injection (SQLi) fears baseless.
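
The last point can be illustrated with a short Python sketch. SHA-256 is used
here only to keep the example free of third-party dependencies; it is not a
password hashing function, and the helper name \texttt{stored\_form} as well as
the sample inputs are hypothetical.

```python
import hashlib
import string

HEX_ALPHABET = set(string.hexdigits.lower())


def stored_form(password: str) -> str:
    """Hex digest that would reach the database instead of the password.

    SHA-256 is used only to keep the sketch dependency-free; it is not a
    password hashing function.
    """
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


awkward = ["it's", "<script>alert(1)</script>", "' OR 1=1 --", "pass phrase " * 20]
digests = [stored_form(p) for p in awkward]

# Every digest has the same length and contains only hexadecimal characters.
uniform = all(len(d) == 64 and set(d) <= HEX_ALPHABET for d in digests)
```

Neither the apostrophes, the angle brackets nor the length of the input
survive into the value that is stored.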

\obr{Forbidden special characters in
passwords}{fig:forbiddencharacters}{.8}{graphics/forbiddencharacters.jpg}

Note that ``Passw0rd!'' would have been a perfectly acceptable password for the
validator displayed in
Figure~\ref{fig:forbiddencharacters}~\cite{forbiddencharacters}. NIST's
recommendations on this matter are that all printing ASCII characters as well
as the space character SHOULD be acceptable in memorized secrets, and Unicode
characters SHOULD be accepted as well~\cite{asciirfc20},~\cite{iso10646}.
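
A validator in the spirit of these recommendations could look like the
following Python sketch. The length bounds and the function name are
assumptions made for illustration only; the sketch deliberately contains no
character composition rules and no list of forbidden characters.

```python
MIN_LENGTH = 8        # assumed lower bound
MAX_LENGTH = 256      # assumed generous upper bound, well above 64


def is_acceptable(password: str) -> bool:
    """Accept any characters; reject only on length or non-printable input."""
    if not MIN_LENGTH <= len(password) <= MAX_LENGTH:
        return False
    # str.isprintable() is true for printable ASCII, the space character and
    # printable Unicode; it rejects control characters such as newlines.
    return password.isprintable()
```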

\n{3}{Character composition requirements}

There is a tendency to come up with bad passwords when there are character
composition requirements in place, too. The reality is that instead of
creating strong passwords directly, most users first try a basic version and
then keep tweaking characters until the password ends up fulfilling the minimum
requirement.

The \emph{problem} with this is that people have been shown to use similar
patterns, e.g.\ starting with a capital letter, putting a symbol last and a
number in the last two positions. This is also known to people cracking
password hashes, who run their dictionary attacks using the common
substitutions, such as ``\$'' for ``s'', ``3'' for ``e'', ``1'' for ``l'',
``@'' for ``a'',
etc.~\cite{megatron},~\cite{hashcracking},~\cite{hashcracking2}. It is safe to
expect that a password created in this manner will almost certainly be bad, and
that the only achievement of the requirement was to frustrate the user on the
way to a bad password.
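
How cheaply such substitutions can be enumerated is shown by the following
Python sketch. The substitution table is a tiny illustrative subset and the
function name is hypothetical; the rule sets shipped with real cracking tools
are far more extensive.

```python
from itertools import product

# A few common substitutions; real rule sets are much larger.
SUBSTITUTIONS = {"s": "$", "e": "3", "l": "1", "a": "@"}


def variants(word: str):
    """Yield every spelling of `word` with any subset of substitutions applied."""
    options = [
        (ch, SUBSTITUTIONS[ch.lower()]) if ch.lower() in SUBSTITUTIONS else (ch,)
        for ch in word
    ]
    for combo in product(*options):
        yield "".join(combo)


candidates = set(variants("Password"))
```

The word ``Password'' contains three substitutable letters, so only eight
candidates need to be tried to cover every combination, including
``P@\$\$word''.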

\n{3}{Other common issues}

Some services do not allow users to paste into password fields (disabling
pasting using JavaScript), thereby essentially breaking password manager
functionality, which is an issue because it encourages bad password practices
such as weak passwords and password reuse.

Forced frequent password rotation is another common issue. Apparently, making
frequent password rotations mandatory contributes to users developing password
creation \emph{patterns}. Moreover, according to the British NCSC, the practice
``carries no real benefits as stolen passwords are generally exploited
immediately'', and the organisation calls it a modern-day security
anti-pattern~\cite{ncsc}.


\n{1}{Web security}\label{sec:websecurity}

The internet is a vast space full of intertwined concepts and ideas. It is a
superset of the Web, even though the two terms often get conflated: not
everything that is available on the internet can be accessed using web
protocols and \emph{resources}. This section delves into the concepts of web
security.


\n{2}{Site Isolation}\label{sec:siteisolation}

While website operators can take steps to secure their sites, it is often the
browsers that hold the last line of defence when web servers are misconfigured
in a way that allows an attacker to start exploiting a vulnerability.

Most users consume web content using web browsers. Modern browsers such as
Firefox or Chromium are built with a focus on security. Their developers are
acutely aware of the dangers posed by parsing untrusted code from the internet,
which is precisely what websites, stylesheets and the accompanying scripts are.

This necessarily gets reflected in the way these programs are architected.
Instead of running everything directly, the main, privileged browser process
spawns a de-privileged child process for each website. This extra line of
defence should make it \emph{harder for untrustworthy websites to access or
steal information} from user accounts or other websites. Even if the
misbehaving website does manage to ``break some rules'' within its own process,
it should find it more difficult to steal data from other
sites~\cite{siteisolation}.

Firefox calls its version of \emph{Site Isolation}-like functionality Project
Fission, but the two are very similar, both in internal architecture and in
what they try to achieve~\cite{projectfission}. Elements of the web page are
scanned to decide whether they are allowed according to \emph{same-site}
restrictions and are allocated shared or isolated memory based on the result.

Some Chromium users have complained about its high memory usage in the past,
which, unbeknownst to them, might have been partially caused by the
user-protecting Site Isolation features.


\n{2}{Cross-site scripting}\label{sec:xss}

As per the OWASP Top Ten list, injection, the category that cross-site
scripting (XSS) belongs to, is the third most observed issue across millions of
websites. Cross-site scripting is a type of attack in which malicious code,
such as infected scripts, is injected into a website that would otherwise be
trusted. Since a misconfiguration or a flaw of the application allowed this,
the browser of the victim that trusts the website simply executes the code
provided by the attacker. This code thus gains access to session tokens and any
cookies associated with the website's origin, apart from being able to rewrite
the HTML content. The results of XSS can range from account compromise to
identity theft~\cite{owasptop10}.

Solutions deployed against XSS vary. On the client side, it mainly comes down
to good browser patching hygiene, browser features such as Site Isolation (see
Section~\ref{sec:siteisolation}), and browsers correctly honouring website
directives such as the \texttt{X-Frame-Options},
\texttt{X-Content-Type-Options}, \texttt{X-XSS-Protection} and
\texttt{Cross-Origin-Opener-Policy} HTTP headers. However, the latter falls
flat if website operators do not configure their websites correctly.

On the server side though, these options (indicating to the browsers \emph{how}
the site should be parsed) can directly be manipulated and configured. They
should be fine-tuned to fit the needs of each specific website.
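
On the server side, this could be sketched as follows in Python. The default
values shown for three of the headers named above are illustrative assumptions
rather than a recommendation for any particular site, and the helper name
\texttt{with\_security\_headers} is hypothetical. Values already chosen by the
site take precedence over the defaults.

```python
# Illustrative defaults (assumptions); each site should tune its own values.
DEFAULT_SECURITY_HEADERS = {
    "X-Frame-Options": "DENY",                    # refuse to be framed
    "X-Content-Type-Options": "nosniff",          # no MIME type sniffing
    "Cross-Origin-Opener-Policy": "same-origin",  # isolate browsing context
}


def with_security_headers(headers: dict) -> dict:
    """Return a copy of `headers` with missing security headers filled in."""
    present = {name.lower() for name in headers}  # header names ignore case
    merged = dict(headers)
    for name, value in DEFAULT_SECURITY_HEADERS.items():
        if name.lower() not in present:
            merged[name] = value
    return merged


response_headers = with_security_headers(
    {"Content-Type": "text/html; charset=utf-8", "x-frame-options": "SAMEORIGIN"}
)
```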

Furthermore, a powerful and comprehensive framework for controlling the
admissibility of content was devised more than 10 years ago: Content Security
Policy. Its capabilities supersede those of the previously mentioned options,
and it is discussed in more depth in the following section.


\n{2}{Content Security Policy}\label{sec:csp}

Content Security Policy (CSP) has been an important addition to the arsenal of
website operators, even though not everybody has necessarily been utilising it
properly or even taken notice. Once configured on the web server, it provides
guarantees and employs protections against the most common attack vectors on
websites exactly where websites are parsed and displayed: in the (compliant)
browser.

As per Weichselbaum et al., CSP is a mechanism designed to mitigate
XSS~\cite{longliveCSP}, a long-lived king of the vulnerability
lists~\cite{reininginCSP}. It is a declarative policy mechanism that allows the
website operator to decide what client-side resources can load on their website
and what origins are permitted \emph{sources} of content.

For example, scripts can be restricted to only load from a list of trusted
domains, and inline scripts can be blocked entirely, which is a huge win
against popular XSS techniques.

Not only that, scripts and stylesheets can also be allowed based on a
cryptographic (SHA256, SHA384 or SHA512) hash of their content, which should be
known to legitimate website operators prior to or at the time the scripts are
served, making sure that no unauthorised script or stylesheet is ever run on
the user's computer (running a compliant browser).
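
Such a hash-based source expression consists of the algorithm name and the
Base64-encoded digest of the content, wrapped in single quotes. The Python
sketch below computes one; the function name \texttt{csp\_hash\_source} and the
sample inline script are hypothetical.

```python
import base64
import hashlib


def csp_hash_source(content: str, algorithm: str = "sha256") -> str:
    """Return a CSP source expression such as 'sha256-<base64 digest>'."""
    if algorithm not in ("sha256", "sha384", "sha512"):
        raise ValueError("CSP only allows sha256, sha384 or sha512")
    digest = hashlib.new(algorithm, content.encode("utf-8")).digest()
    return "'{}-{}'".format(algorithm, base64.b64encode(digest).decode("ascii"))


inline_script = "console.log('hello');"  # hypothetical inline script body
policy = "script-src " + csp_hash_source(inline_script)
```

The digest has to be computed over the exact script content, so any change to
the script, even in whitespace, requires the policy to be updated as well.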

A policy of CSPv3, which is the current iteration of the concept, can be served
either as an HTTP header or inside a \texttt{<meta>} tag of the website.
Configuration is either site-wide or specific to each page.

Directive names are generally derived from the \emph{sources} they are
covering, and are thus often suffixed `-src', as in \texttt{script-src},
\texttt{img-src} or \texttt{style-src}, although some directives do not follow
this pattern, \texttt{form-action}, \texttt{upgrade-insecure-requests} and
\texttt{sandbox} representing this other group nicely.

Individual directives are separated by semicolons, and each directive should
appear only once in the entire policy, since compliant browsers ignore repeated
occurrences of a directive.

Special values exist for the origin website itself (\texttt{'self'}) and for
disallowing \emph{any} source (\texttt{'none'}).

A good policy is \emph{targeted} and not overly broad. To give an example, a
website that loads no JavaScript at all does not need to allow a popular CDN
(Content Delivery Network) origin in its \texttt{script-src}; instead, the
directive should be set to \texttt{'none'}. CSP can also aid with clickjacking
protection using its \texttt{frame-ancestors} directive, which limits the
origins that are permitted to embed the website. This prevents an attacker from
embedding the website in arbitrary places, for example in malicious websites
that masquerade as legitimate ones, e.g.\ by utilising a `Log in using xyz
service' frame that in actuality just harvests the credentials.
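
Putting the pieces together, a policy string can be assembled as in the
following Python sketch. The helper \texttt{build\_policy}, the chosen
directives and the image host are illustrative assumptions; keeping the
directives in a dictionary naturally ensures that each of them appears only
once.

```python
def build_policy(directives: dict) -> str:
    """Serialise {directive: [sources]} into a CSP header value.

    Using a dict keeps every directive unique; directives are joined
    with semicolons and their sources with spaces.
    """
    parts = []
    for name, sources in directives.items():
        parts.append(" ".join([name, *sources]) if sources else name)
    return "; ".join(parts)


# Hypothetical policy for a site that serves no JavaScript at all.
policy = build_policy({
    "default-src": ["'self'"],
    "script-src": ["'none'"],
    "img-src": ["'self'", "https://images.example.com"],
    "frame-ancestors": ["'none'"],
    "upgrade-insecure-requests": [],
})
```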

Getting CSP right can be tricky at first, but once grokked, it is relatively
straightforward and can increase the security of the site greatly.

The recommended way to \emph{test} CSP is to enable it in the
\emph{report-only} mode, in which violations are only reported and not blocked,
before turning it on in production.
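
The two modes differ only in the name of the HTTP header that carries the
policy, as the following Python sketch shows. The function name and the
candidate policy are hypothetical.

```python
ENFORCING = "Content-Security-Policy"
REPORT_ONLY = "Content-Security-Policy-Report-Only"


def csp_header(policy: str, report_only: bool = True):
    """Return the (name, value) header pair for the chosen CSP mode."""
    return (REPORT_ONLY if report_only else ENFORCING, policy)


candidate = "default-src 'self'; script-src 'none'"
testing = csp_header(candidate)                         # violations are only reported
production = csp_header(candidate, report_only=False)   # violations are blocked
```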

There are many more directives and settings than mentioned in this section; the
author encourages anybody interested to read more about them, e.g.\ at
\url{https://web.dev/csp/}.


\n{1}{Configuration}

Nearly every non-trivial program offers at least \emph{some} way to tweak or
manage its behaviour, and these changes are usually persisted \emph{somewhere}
on the filesystem of the host: in a local SQLite3 database, a
\emph{LocalStorage} key-value store in the browser, or a binary or plain-text
configuration file. Such configuration files need to be read and checked at
least on program start-up and either kept in memory for the duration of the
runtime of the program, or loaded and parsed once, with the memory subsequently
\emph{freed} (initial configuration).

There is an abundance of configuration languages (or file formats used to craft
configuration files, whether they were intended for it or not) available: TOML,
INI, JSON and YAML, to name some of the popular ones (as of today).

Dhall stood out as a language that was designed with both security and the
needs of dynamic configuration scenarios in mind, borrowing a concept or two
from Nix~\cite{nixoslearn},~\cite{nixlang} (which in turn sources more than a
few of its concepts from Haskell), and in its apparent core being very similar
to JSON, which adds to a familiar feel. In fact, in Dhall's authors' own words
it is: ``a programmable configuration language that you can think of as: JSON +
functions + types + imports''~\cite{dhalllang}.

Of all the listed features, the one the author found especially intriguing was
the promise of \emph{types}. There are multiple examples directly on the
project's documentation webpage demonstrating, for instance, the declaration
and usage of custom types (that are, of course, merely combinations of the
primitive types that the language provides, such as \emph{Bool}, \emph{Natural}
or \emph{List}, to name just a few), so it was not exceedingly hard to start
designing a custom configuration \emph{schema} for the program. Dhall, not
being a Turing-complete language, also guarantees that evaluation \emph{always}
terminates eventually, which is a good attribute to possess for a configuration
language.

\n{2}{Safety considerations}

With a programmable configuration language that understands functions and
allows not only importing arbitrary text from random internet URLs, but also
importing and \emph{evaluating} (i.e.\ running) potentially untrusted code, it
is important that there are safety mechanisms in place that the user can rely
on. Dhall offers these in the form of multiple features: enforcing a
same-origin policy and (optionally) pinning a cryptographic hash of the value
of the expression being imported.
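
The idea behind hash pinning can be sketched in Python as follows. This is an
analogy rather than a description of how Dhall itself is implemented: the
sketch hashes raw bytes for simplicity, whereas the text above speaks of the
hash of the \emph{value} of the imported expression. All names and the sample
configuration fragment are hypothetical.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when imported content does not match its pinned digest."""


def checked_import(content: bytes, pinned_sha256: str) -> bytes:
    """Return `content` only if its SHA-256 digest equals the pinned value."""
    actual = hashlib.sha256(content).hexdigest()
    if actual != pinned_sha256:
        raise IntegrityError("expected {}, got {}".format(pinned_sha256, actual))
    return content


# Hypothetical remote configuration fragment and the digest pinned for it.
trusted = b"{ port = 8080 }"
pin = hashlib.sha256(trusted).hexdigest()

accepted = checked_import(trusted, pin)
try:
    checked_import(b"{ port = 31337 }", pin)
    rejected = False
except IntegrityError:
    rejected = True
```

Content that was changed after the hash had been pinned is refused before it
gets a chance to be evaluated.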

\n{2}{Possible alternatives}

While developing the program, the author has also come across certain
shortcomings of Dhall, namely the long start-up on \emph{cold cache}. It can
generally be observed when running the program in an environment that does not
allow persistently writing the cache files (a read-only filesystem), or does
not keep the written cache files, such as a container that is not configured to
mount persistent volumes to pertinent locations.

When performing an evaluation, Dhall resolves every expression down to a
combination of its most basic types (eliminating all abstraction and
indirection) in a process called \textbf{normalisation}~\cite{dhallnorm} and
then saves this result in the host's cache. The \texttt{dhall-haskell} binary
attempts to resolve the variable \texttt{\$\{XDG\_CACHE\_HOME\}} (have a look
at the \emph{XDG Base Directory Spec}~\cite{xdgbasedirspec} for details) to
decide \emph{where} the results of the normalisation will be written for
repeated use. Do note that this behaviour has been observed on a GNU/Linux host
and the author has not verified it on other platforms, such as FreeBSD.
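
The lookup order defined by the XDG Base Directory Specification can be
sketched in Python as shown below. The function name, the \texttt{app}
sub-directory and the fallback of \texttt{/} for a missing \texttt{HOME} are
assumptions made for the sake of the example; the sketch does not claim to
reproduce the exact logic of the \texttt{dhall-haskell} binary.

```python
import os
from pathlib import Path


def cache_dir(env=None, app="dhall") -> Path:
    """Resolve the cache directory the way the XDG spec describes.

    $XDG_CACHE_HOME is used when set to a non-empty value; otherwise the
    spec's default of $HOME/.cache applies.
    """
    env = os.environ if env is None else env
    # Falling back to "/" when HOME is missing is an assumption of this sketch.
    base = env.get("XDG_CACHE_HOME") or os.path.join(env.get("HOME", "/"), ".cache")
    return Path(base) / app


explicit = cache_dir({"XDG_CACHE_HOME": "/var/cache/app", "HOME": "/home/u"})
fallback = cache_dir({"HOME": "/home/u"})
```

In a container, pointing \texttt{XDG\_CACHE\_HOME} at a location that is
writable and persistent is therefore enough to change where the cache ends up.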

If normalisation is performed inside an ephemeral container (as opposed to, for
instance, an interactive desktop session), the results effectively get lost on
each container restart. That is both wasteful and not great for user
experience, since the normalisation of just a handful of imports (which
internally branch widely) can take upwards of two minutes, during which the
user is left waiting for a seemingly hanging application with no reporting on
progress or current status.

Workarounds for the above-mentioned problem can be devised relatively easily,
but it would certainly \emph{feel} better if there was no need to work
\emph{around} the configuration system of choice. For instance, bind mounting
\emph{persistent} volumes to pertinent locations inside the container
(\texttt{\$\{XDG\_CACHE\_HOME\}/\{dhall,dhall-haskell\}}) would preserve the
cache between restarts. Alternatively, the cache could be pre-computed on
container build (as the program is only expected to run with a compatible
schema version, and that version \emph{is} known at container build time for
the supplied configuration).

Alternatives such as CUE (\url{https://cuelang.org/}) offer themselves nicely
as an almost drop-in replacement for Dhall feature-wise, while also avoiding
the costly \emph{cold cache} normalisation operations, which are, in the
author's view, Dhall's principal flaw. In a slightly contrasting approach,
another emerging project called \texttt{TySON}
(\url{https://github.com/jetpack-io/tyson}), which uses \emph{a subset} of
TypeScript to also create a programmable, strictly typed configuration
language, opted to take a well-known language instead of reinventing the wheel,
while still being able to retain feature parity with Dhall.


\n{1}{Compromise Monitoring}

There are, of course, several ways one could approach the monitoring of
compromised credentials, some more \emph{manual} in nature than others. When
using a service that is suspected/expected to be breached in the future, one
can always create a unique username/password combination specifically for that
service and never use that combination anywhere else. That way, if the
credentials ever \emph{do} happen to appear in a data dump online in the
future, it is going to be safe to assume where they came from.

Unfortunately, the task of actually \emph{monitoring} the credentials can prove
to be a little more arduous than one could expect at first. There are a couple
of points that can pose a challenge when the search is performed by hand,
namely:

\begin{itemize}
\item finding the breached data to look through
\item verifying the trustworthiness of the data
\item varying quality of the data
\item sifting through (possibly) unstructured data by hand
\end{itemize}

Of course, as this is a popular topic for a number of people, the
above-mentioned work has already been packaged into neat and practical online
offerings. In case one decides in favour of using those, an additional range of
issues (with the previous ones still applicable) arises:

\begin{itemize}
\item the need to trust the provider of the service with input credentials
\item relying on the goodwill of the provider to be able to access the data
\item hoping that the terms of service are kept as promised
\item dependence on the quality and extent of their data sources
\end{itemize}

Besides that, there is a plethora of breaches floating around the Internet
available simply as zip files, which makes the job of password compromise
monitoring even harder.

The overarching goal of this thesis is to devise and implement a system in
which the user can \emph{monitor} whether their credentials have been
\emph{compromised} (at least as far as the data can tell), and which allows
them to do so without needing to entrust their sensitive data to a provider.
|
|
|
|
|
|
\n{2}{Data Sources}\label{sec:dataSources}
|
|
|
|
A data source is here considered to be anything that provides the
|
|
application with data that it understands.
|
|
|
|
Of course, the results of credential compromise verification/monitoring are only
|
|
going to be as good as the data underpinning it, which is why it is imperative
|
|
that high quality data sources be used, if at all possible. While great care
|
|
does have to be taken to only choose the highest quality data sources, the
|
|
application must offer a means to be able to utilise these.
|
|
|
|
The sources from which breached data can be loaded into an application can be
|
|
split into two basic categories: \textbf{online} or \textbf{local}, and it is
|
|
possible to further discern between them by whether the data they provide is
|
|
\emph{structured} or not.
|
|
|
|
An online source is generally a service that ideally exposes a programmatic
|
|
API, which an application can query and from which it can request the necessary
|
|
subsets of data.
|
|
These types of services often additionally front the data by a user-friendly
|
|
web interface for one-off searches, which is, however, not of use here.
|
|
|
|
Examples of such online services include:
|
|
|
|
\begin{itemize}
|
|
\item {Have I Been Pwned?} - \url{https://haveibeenpwned.com}
|
|
\item {DeHashed} - \url{https://dehashed.com}
|
|
\end{itemize}
|
|
|
|
Large lumps of unstructured data available on forums or shady web servers would
|
|
technically also count here, given that they provide data and are available
|
|
online. However, even though data is frequently found online precisely in this
|
|
form, it is also not of direct use for the application without manual
|
|
\emph{preprocessing}, which is addressed in
|
|
Section~\ref{sec:localDatasetPlugin}.
|
|
|
|
Another source is then simply any locally supplied data, which, of course,
|
|
could have been obtained from a breach available online beforehand.
|
|
|
|
Locally supplied data is specific in that it needs to be formatted in such a
|
|
way that it is understood by the application. That is, the data supplied for
|
|
importing cannot be in its original raw form anymore; instead, it has to have
|
|
been morphed into the precise shape the application needs for further
|
|
processing. Once imported, the application can query the data at will, as it
|
|
knows exactly the shape of it.
|
|
|
|
This supposes the existence of a \emph{format} for importing, the schema of
|
|
which is devised in Section~\ref{sec:localDatasetPlugin}.
|
|
|
|
|
|
\n{3}{Local Dataset Plugin}\label{sec:localDatasetPlugin}
|
|
|
|
Unstructured breach data from locally available datasets can be imported into
|
|
the application by first making sure it adheres to the specified schema (have a
|
|
look at the breach \texttt{ImportSchema} in Listing~\ref{breachImportSchema}).
|
|
If it does not (which is very likely with random breach data, as already
|
|
mentioned in Section~\ref{sec:dataSources}), it needs to be converted to a form
|
|
that \emph{does} before importing it to the application, e.g.\ using a Python
|
|
script or a similar method.
|
|
|
|
Attempting to import data that does not follow the outlined schema should
|
|
result in an error. Equally so, importing a dataset which is over a reasonable
|
|
size limit should by default be rejected by the program as a precaution.
|
|
Unmarshaling, for instance, a 1 TiB document would most likely result in an
|
|
out-of-memory (OOM) situation on the host running the application, assuming
|
|
contemporary consumer hardware conditions (not HPC).
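
The size precaution described above could be sketched as follows. This is only
a minimal illustration under stated assumptions, not the program's actual
import code: the function name \texttt{readLimited}, the error value
\texttt{ErrTooLarge} and the tiny limit used in \texttt{main} are hypothetical.
The sketch reads one byte past the limit using \texttt{io.LimitReader} from the
standard library, so that oversized input is detected without ever buffering
more than the limit allows.

\begin{lstlisting}[language=Go,
	backgroundcolor=\color{lstbg}
]
package main

import (
    "errors"
    "fmt"
    "io"
    "strings"
)

// ErrTooLarge (hypothetical) signals that the input exceeds the size limit.
var ErrTooLarge = errors.New("import: dataset exceeds size limit")

// readLimited reads at most max bytes from r and fails if more are available.
func readLimited(r io.Reader, max int64) ([]byte, error) {
    // Read one byte past the limit so that oversized input can be detected.
    data, err := io.ReadAll(io.LimitReader(r, max+1))
    if err != nil {
        return nil, err
    }
    if int64(len(data)) > max {
        return nil, ErrTooLarge
    }
    return data, nil
}

func main() {
    // An 8-byte limit is used purely for demonstration purposes.
    _, err := readLimited(strings.NewReader("name: Horrible breach"), 8)
    fmt.Println(err) // prints "import: dataset exceeds size limit"
}
\end{lstlisting}

A real limit would presumably be configurable by the administrator and orders
of magnitude larger; the point is merely that the check happens before any
unmarshaling is attempted.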
|
|
|
|
\vspace{\parskip}
|
|
\begin{lstlisting}[language=Go,
|
|
caption={Breach \texttt{ImportSchema} Go struct (imports from the standard
|
|
library assumed)},
|
|
label=breachImportSchema,
|
|
backgroundcolor=\color{lstbg},
|
|
morekeywords={any,time}
|
|
]
|
|
// ImportSchema is the model for importing locally available breach data.
|
|
type ImportSchema struct {
|
|
Name string
|
|
Description string
|
|
Date time.Time
|
|
IsVerified bool
|
|
ContainsPasswords bool
|
|
ContainsHashes bool
|
|
HashType string
|
|
HashSalted bool
|
|
HashPeppered bool
|
|
ContainsUsernames bool
|
|
ContainsEmails bool
|
|
Data *Data
|
|
}
|
|
\end{lstlisting}
|
|
\vspace*{-\baselineskip}
|
|
|
|
The Go \emph{struct} shown in Listing~\ref{breachImportSchema} will in
|
|
actuality translate to a YAML document written and supplied by an
|
|
administrative user of the program. While the author is personally not the
greatest supporter of YAML, the format was still chosen for several
|
|
reasons:
|
|
|
|
\begin{itemize}
|
|
\item relative ease of use (plain text, readability) for machines and people
|
|
alike
|
|
\item capability to store multiple \emph{documents} inside of a single file
|
|
\item most of the inputs being implicitly typed as strings
|
|
\item support for inclusion of comments
|
|
\item machine readability thanks to being a superset of JSON
|
|
\end{itemize}
|
|
|
|
The last point specifically should allow for documents similar to what can be
|
|
seen in Listing~\ref{breachDataYAMLSchema} to be ingested by the program, read
|
|
and written by humans and programs alike.
|
|
|
|
\smallskip
|
|
\begin{lstlisting}[style=yaml,
|
|
caption={A YAML file containing breach data formatted according to the
|
|
\texttt{ImportSchema},
|
|
optionally containing multiple documents},
|
|
label=breachDataYAMLSchema,
|
|
backgroundcolor=\color{lstbg},
|
|
]
|
|
---
|
|
name: Horrible breach
|
|
date: 2022-04-23T00:00:00+02:00
|
|
description: impacted X in 2022, it contains 10 000 unique emails...
|
|
isVerified: false
|
|
containsPasswds: false
|
|
containsHashes: true
|
|
containsEmails: true
|
|
hashType: md5
|
|
hashSalted: false
|
|
hashPeppered: false
|
|
data:
|
|
hashes:
|
|
- hash1
|
|
- hash2
|
|
- hash3
|
|
emails:
|
|
- email1
|
|
- ""
|
|
- email3
|
|
---
|
|
# document #2, describing another breach.
|
|
name: Horrible breach 2
|
|
...
|
|
\end{lstlisting}
|
|
\vspace*{-\baselineskip}
|
|
|
|
Notice how the emails list (\texttt{.data/emails}) in
|
|
Listing~\ref{breachDataYAMLSchema} is missing one record, perhaps because it
|
|
was mistakenly omitted due to either machine error or unfamiliarity with the
|
|
format. This is a valid scenario (mistakes do happen) and the application needs
|
|
to account for it. Alternatively, the program could start dropping
|
|
empty/partial records, but that behaviour could quickly lead to unhappy users.
|
|
The golden rule for the program is to \emph{always do the expected thing} (and
not to be overly smart about it, i.e.\ the simpler program flow is often
better).
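
One way of accounting for such records without silently dropping them is to
report them back to the person importing the data. The following is a minimal
sketch under stated assumptions: the \texttt{Data} type shown here is a
hypothetical shape mirroring the \texttt{data} key of the YAML document (the
actual type is not shown in this section), and the sketch further assumes that
hashes and emails are paired by their position in the lists.

\begin{lstlisting}[language=Go,
	backgroundcolor=\color{lstbg}
]
package main

import (
    "fmt"
    "strings"
)

// Data is a hypothetical payload shape mirroring the "data" key of the YAML
// document.
type Data struct {
    Hashes []string
    Emails []string
}

// findProblems reports suspicious records instead of silently dropping them.
// It assumes that hashes and emails are paired by index.
func findProblems(d Data) []string {
    var problems []string
    if len(d.Hashes) != len(d.Emails) {
        problems = append(problems, fmt.Sprintf(
            "length mismatch: %d hashes, %d emails",
            len(d.Hashes), len(d.Emails)))
    }
    for i, e := range d.Emails {
        if strings.TrimSpace(e) == "" {
            problems = append(problems, fmt.Sprintf("emails[%d] is empty", i))
        }
    }
    return problems
}

func main() {
    d := Data{
        Hashes: []string{"hash1", "hash2", "hash3"},
        Emails: []string{"email1", "", "email3"},
    }
    for _, p := range findProblems(d) {
        fmt.Println(p) // prints "emails[1] is empty"
    }
}
\end{lstlisting}

Whether such findings should abort the import or merely be displayed as
warnings is a design decision; surfacing them is arguably the \emph{expected}
behaviour in either case.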
|
|
|
|
\n{3}{Have I Been Pwned? Integration}
|
|
|
|
Troy Hunt's \textbf{Have I Been Pwned?} online service
|
|
(\url{https://haveibeenpwned.com/}) has been chosen as the online source of
|
|
compromised data. The service offers public APIs, which were originally (and it
|
|
was the intention of their author that they stay that way) provided free of
|
|
charge and with little-to-no rate-limiting. A major overhaul in this regard was
revealed in November of 2022, when, in addition to a new rate-limit
system, tiered symbolic fees were introduced for obtaining the API
keys. Apparently, the top consumers of the API seemed to utilise it
|
|
orders of magnitude more than the average person, which led Hunt to devising a
|
|
new, tiered API access system in which the \emph{little guys} would not be
|
|
subsidising the \emph{big guys}. Additionally, the symbolic fee of \$3.50 a
|
|
month for the entry-level 10 requests-per-minute API key was meant to serve as
|
|
a small barrier for (mis)users with nefarious purposes, but pose practically no
|
|
obstacle for \emph{legitimate} users, which is entirely
|
|
reasonable~\cite{hibpBillingChanges}.
|
|
|
|
The application's \texttt{hibp} module and database representation
(\texttt{schema.HIBPSchema}) attempt to model the values returned by this API
|
|
and declare actions to be performed upon the data, which is what facilitates
|
|
the breach search functionality in the program.
|
|
|
|
The architecture is relatively simple. Breach data, including title, date,
description and tags, is cached by the application on start-up, as this API is
|
|
not authenticated. In order for the authenticated API to be called, the
|
|
application administrator first needs to configure an API key for the HIBP
|
|
service via the management interface. The user can then enter the desired query
|
|
parameters and the application then constructs the API call that is sent to the
|
|
authenticated API, and awaits the response. As the API is rate-limited
|
|
(individually, based on the API key supplied), sending requests directly after
|
|
receiving them from the users would likely pose an issue at high utilisation
|
|
times, and would result in the application ending up unnecessarily throttled.
|
|
Request sending thus needs to be handled in the backend by a requests
|
|
scheduler, as well as appropriately in the UI.
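
The idea of such a scheduler can be illustrated with a minimal sketch. It is
not the application's actual implementation: the names \texttt{scheduler},
\texttt{submit} and \texttt{run} are hypothetical, and a real version would
additionally need to handle cancellation and server-side throttling responses.
Queued requests are released one at a time, with at least a fixed interval
between them (for the entry-level key of 10 requests per minute, that interval
would be six seconds).

\begin{lstlisting}[language=Go,
	backgroundcolor=\color{lstbg}
]
package main

import (
    "fmt"
    "time"
)

// scheduler (hypothetical) releases queued jobs no faster than one per
// interval.
type scheduler struct {
    queue    chan func()
    interval time.Duration
}

func newScheduler(interval time.Duration, size int) *scheduler {
    return &scheduler{queue: make(chan func(), size), interval: interval}
}

// submit enqueues a job; it blocks when the queue is full.
func (s *scheduler) submit(job func()) { s.queue <- job }

// run executes queued jobs until the queue is closed, sleeping as needed so
// that consecutive jobs start at least one interval apart.
func (s *scheduler) run() {
    var last time.Time
    for job := range s.queue {
        if wait := s.interval - time.Since(last); wait > 0 {
            time.Sleep(wait)
        }
        last = time.Now()
        job()
    }
}

func main() {
    // A short interval is used for demonstration only.
    s := newScheduler(50*time.Millisecond, 8)
    done := make(chan struct{})
    go func() { s.run(); close(done) }()

    start := time.Now()
    for i := 1; i <= 3; i++ {
        n := i
        s.submit(func() {
            fmt.Printf("request %d released after %v\n", n, time.Since(start))
        })
    }
    close(s.queue)
    <-done
}
\end{lstlisting}

Tracking the time of the last release, rather than relying on a free-running
ticker, avoids releasing two requests back-to-back after the queue has been
idle for a while.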
|
|
|
|
After a response from the API server arrives, the application attempts to
|
|
\emph{bind} the returned data to the pre-programmed \emph{model} for
|
|
validation, before finally parsing it. If the data can be successfully
|
|
validated, it is saved into the database as a cache and the search query is
|
|
performed on the saved data. The result is then displayed to the user for
|
|
browsing.
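
The bind-then-validate step might look roughly like the sketch below. The
\texttt{breach} type is a reduced, hypothetical stand-in for the real model
(\texttt{schema.HIBPSchema} is not reproduced here), its field names are an
assumption based on the publicly documented breach model of the API, and the
choice of which fields are \emph{required} is purely illustrative.

\begin{lstlisting}[language=Go,
	backgroundcolor=\color{lstbg}
]
package main

import (
    "encoding/json"
    "errors"
    "fmt"
)

// breach is a reduced, hypothetical model of a single breach record.
type breach struct {
    Name        string   `json:"Name"`
    Title       string   `json:"Title"`
    BreachDate  string   `json:"BreachDate"`
    Description string   `json:"Description"`
    DataClasses []string `json:"DataClasses"`
}

var errMissingField = errors.New("missing required field")

// bindBreaches decodes the response body into the model and rejects records
// that lack the fields this sketch treats as required.
func bindBreaches(body []byte) ([]breach, error) {
    var out []breach
    if err := json.Unmarshal(body, &out); err != nil {
        return nil, fmt.Errorf("binding response: %w", err)
    }
    for i, b := range out {
        if b.Name == "" || b.BreachDate == "" {
            return nil, fmt.Errorf("breach %d: %w", i, errMissingField)
        }
    }
    return out, nil
}

func main() {
    body := []byte(`[{"Name":"Example","BreachDate":"2022-04-23"}]`)
    breaches, err := bindBreaches(body)
    fmt.Println(len(breaches), err) // prints "1 <nil>"
}
\end{lstlisting}

Only data that passes this step would then be written to the database cache,
which keeps malformed responses from polluting later searches.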
|
|
|
|
|
|
\n{1}{Deployment recommendations}\label{sec:deploymentRecommendations}
|
|
|
|
It is, of course, recommended that the application run in a secure environment,
|
|
\allowbreak although definitions of that almost certainly differ depending on
|
|
who you ask. General recommendations would be either to effectively reserve a
|
|
machine for a single use case - running this program - so as to dramatically
|
|
decrease the potential attack surface of the host, or run the program isolated
|
|
in a container or a virtual machine. Furthermore, if the host does not need
|
|
management access (it is a deployed-to-only machine that is configured
|
|
out-of-band, such as with a \emph{golden} image/container or declaratively with
|
|
Nix), then an SSH \emph{daemon} should not be running in it, since it is not
|
|
needed. In an ideal scenario, the host machine would have as little software
|
|
installed as possible besides what the application absolutely requires.
|
|
|
|
System-wide cryptographic policies should target the highest feasible security
|
|
level, if at all available (as is the case by default on e.g.\ Fedora),
|
|
covering SSH, DNSSEC and TLS protocols. Firewalls should be configured and
|
|
SELinux (kernel-level mandatory access control and security policy mechanism)
|
|
running in \emph{enforcing} mode, if available.
|
|
|
|
\n{2}{Transport security}
|
|
|
|
Users connecting to the application should rightfully expect their data to
be protected \textit{in transit} (i.e.\ on the way between their browser and
the server), which is what the \emph{Transport Layer Security} family of
|
|
protocols~\cite{tls13rfc8446} was designed for, and which is the underpinning
|
|
of HTTPS. TLS utilises the primitives of asymmetric cryptography to let the
|
|
client authenticate the server (verify that it is who it claims it is) and
|
|
negotiate a symmetric key for encryption in the process named the \emph{TLS
|
|
handshake} (see Section~\ref{sec:tls} for more details), the final purpose of
|
|
which is establishing a secure communications connection. The operator should
|
|
configure the program to either utilise TLS directly via its configuration or
|
|
have it listen behind a TLS-terminating \emph{reverse proxy}.
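
For the first option, a minimal sketch of a Go HTTP server that refuses
anything older than TLS 1.3 is shown below. It only illustrates the relevant
standard library settings; the function name \texttt{newServer}, the listen
address and the certificate file names in the comment are hypothetical, and
whether to permit TLS 1.2 for older clients is a policy decision left to the
operator.

\begin{lstlisting}[language=Go,
	backgroundcolor=\color{lstbg}
]
package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "time"
)

// newServer (hypothetical) returns an HTTP server restricted to TLS 1.3.
func newServer(addr string, h http.Handler) *http.Server {
    return &http.Server{
        Addr:              addr,
        Handler:           h,
        TLSConfig:         &tls.Config{MinVersion: tls.VersionTLS13},
        ReadHeaderTimeout: 10 * time.Second,
    }
}

func main() {
    srv := newServer(":8443", http.NotFoundHandler())
    fmt.Println(srv.Addr, srv.TLSConfig.MinVersion == tls.VersionTLS13)
    // prints ":8443 true"

    // Serving would then be started with operator-supplied files, e.g.:
    //   srv.ListenAndServeTLS("cert.pem", "key.pem")
}
\end{lstlisting}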
|
|
|
|
|
|
\n{2}{Containerisation}
|
|
|
|
Whether containerised or not, the application needs runtime access to secrets
|
|
such as cookie encryption and authentication keys, or the database connection
|
|
string (containing database host, port, user, password/encrypted password,
|
|
authentication method and database name). It is a relatively common practice to
|
|
deliver secrets to programs in configuration files; however, environment
|
|
variables should be preferred. The program could go one step further and only
|
|
accept certain secrets as environment variables.
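
Accepting a secret exclusively from the environment could be sketched as
follows. The helper \texttt{requireEnv} and the variable name
\texttt{APP\_DB\_DSN} are hypothetical and serve for illustration only; the
sketch treats an unset and an empty variable alike as an error.

\begin{lstlisting}[language=Go,
	backgroundcolor=\color{lstbg}
]
package main

import (
    "fmt"
    "os"
)

// requireEnv returns the value of the named environment variable, or an
// error if the variable is unset or empty.
func requireEnv(name string) (string, error) {
    v, ok := os.LookupEnv(name)
    if !ok || v == "" {
        return "", fmt.Errorf("required environment variable %s is not set",
            name)
    }
    return v, nil
}

func main() {
    // APP_DB_DSN is a hypothetical variable name.
    dsn, err := requireEnv("APP_DB_DSN")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        return
    }
    // The secret itself is never printed, only the fact that it was found.
    fmt.Println("database connection string loaded, length:", len(dsn))
}
\end{lstlisting}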
|
|
|
|
While it is not impossible to run a process scheduler (such as systemd) inside
|
|
a container, containers are well suited for single-program workloads. The fact
|
|
that the application needs persistent storage also raises the question of
|
|
\emph{how to run the database in the container?}. Should data be stored inside
|
|
the ephemeral container, it could end up being very short-lived (wiped on
|
|
container restart), and barring container root volume snapshotting, it could
|
|
turn backing up of data into a chore, which are likely not the desired features
|
|
in this case. Moreover, it is the opinion of the author that multiprocess
|
|
scheduling would inordinately complicate the container set-up. Whereas
running a single program per container provides a good amount of
isolation if done properly, running multiple programs in one container would
likely do the opposite.
|
|
|
|
As per the above, a more \emph{sane} thing to do is to store data externally
|
|
using a proper persistent storage method, such as a database. With Postgres
|
|
being the safe bet among database engines, the program should be able to handle
|
|
Postgres' most common authentication methods, namely \emph{peer},
|
|
\emph{scram-sha-256} and raw \emph{password}, although the \emph{password}
|
|
option should not be used in production, \emph{unless} the database connection
|
|
is protected by TLS~\cite{pgauthmethods}. In any case, using the
|
|
\emph{scram-sha-256} method is preferable~\cite{scramsha256rfc7677}. One way to
|
|
verify during development that authentication works as intended is the
|
|
\emph{Password generator for PostgreSQL} tool, which generates an encrypted
|
|
string from a raw user input~\cite{goscramsha256}.
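
As an illustration of how the connection string components mentioned above fit
together, the sketch below assembles a URL-style Postgres connection string
that insists on a verified TLS connection. The function \texttt{buildDSN} and
all values passed to it are hypothetical; \texttt{sslmode=verify-full} is the
connection parameter as understood by libpq-compatible drivers, and the
authentication method itself is configured on the database server rather than
in the connection string.

\begin{lstlisting}[language=Go,
	backgroundcolor=\color{lstbg}
]
package main

import (
    "fmt"
    "net"
    "net/url"
)

// buildDSN (hypothetical) assembles a URL-style Postgres connection string.
// Special characters in the credentials are percent-encoded by net/url.
func buildDSN(user, password, host, port, db string) string {
    q := url.Values{}
    q.Set("sslmode", "verify-full")
    u := url.URL{
        Scheme:   "postgres",
        User:     url.UserPassword(user, password),
        Host:     net.JoinHostPort(host, port),
        Path:     "/" + db,
        RawQuery: q.Encode(),
    }
    return u.String()
}

func main() {
    // All values are placeholders; real ones would come from the environment.
    dsn := buildDSN("app", "s3cr@t", "db", "5432", "appdb")
    parsed, err := url.Parse(dsn)
    if err != nil {
        panic(err)
    }
    // Print only non-sensitive parts of the result.
    fmt.Println(parsed.Host, parsed.Path, parsed.Query().Get("sslmode"))
}
\end{lstlisting}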
|
|
|
|
If the application wants to use the \emph{peer} authentication method, it is up
|
|
to the operator to supply the Postgres socket to the container (e.g.\ as a
|
|
volume bind mount). Equally, the operator needs to make sure that the database
|
|
is either running in a network that is also directly attached to the container
|
|
or that there is a mechanism in place that routes the requests for the database
|
|
hostname to the destination, unless a static IP configuration is used, which is
|
|
also possible.
|
|
|
|
Practically every container runtime satisfies this use case with a container
|
|
\emph{name-based routing} mechanism, which inside \emph{pods} (in case of
|
|
Podman/Kubernetes) or common default networks (that are both NAT-ted \emph{and}
|
|
routed) enables resolution of container names. This abstraction is a
|
|
responsibility of specially configured (most often autoconfigured) pieces of
|
|
software, Aardvark in case of Podman, and CoreDNS for Kubernetes, and it makes
|
|
using short-lived containers in dynamic networks convenient.
|
|
|
|
|
|
\n{1}{Summary}
|
|
|
|
Passwords (and/or passphrases) are in use everywhere and will quite probably
|
|
continue to be for the foreseeable future. If not as \textit{the} principal way
|
|
to authenticate, then at least as \textit{a} way to authenticate. And for as
|
|
long as passwords are going to be handled and stored, they \emph{are} going to
|
|
get leaked, be it due to user or provider carelessness, or the attackers'
|
|
resolve and wit. Of course, sifting through the heaps of available password
|
|
breach data by hand is not a reasonable option, and therefore tools providing
|
|
assistance come in handy. The following part of this thesis will explore that
|
|
issue and suggest a solution.
|
|
|
|
|
|
% =========================================================================== %
|