diff --git a/tex/part-theoretical.tex b/tex/part-theoretical.tex
index c7b0e92..eb00f71 100644
--- a/tex/part-theoretical.tex
+++ b/tex/part-theoretical.tex
@@ -600,19 +600,19 @@ in its raw form anymore but has to have been morphed into the precise shape the
 application needs for further processing. Once imported, the application can
 query the data at will, as it knows exactly the shape of it.
 
-This supposes the existence of a \emph{format} for importing, schema of which
-is devised in Section~\ref{sec:localDatasetPlugin}.
+This supposes the existence of a \emph{format} for importing, the schema of
+which is devised in Section~\ref{sec:localDatasetPlugin}.
 
 
 \n{3}{Local Dataset Plugin}\label{sec:localDatasetPlugin}
 
 Unstructured breach data from locally available datasets can be imported into
 the application by first making sure it adheres to the specified schema (have a
-look at the \emph{Breach Data Schema} in Listing~\ref{breachDataGoSchema}). If
-it does not (which is very likely with random breach data, as already mentioned
-in Section~\ref{sec:dataSources}), it needs to be converted to a form that
-\emph{does} before importing it to the application, e.g.\ using a Python script
-or a similar method.
+look at the breach \texttt{ImportSchema} in Listing~\ref{breachImportSchema}).
+If it does not (which is very likely with random breach data, as already
+mentioned in Section~\ref{sec:dataSources}), it needs to be converted to a form
+that \emph{does} before importing it to the application, e.g.\ using a Python
+script or a similar method.
 
 Attempting to import data that does not follow the outlined schema should result
 in an error. Equally so, importing a dataset which is over a reasonable
@@ -622,15 +622,18 @@ out-of-memory (OOM) situation on the host running the application, assuming
 contemporary consumer hardware conditions (not HPC).
 
 \vspace{\parskip}
-\begin{lstlisting}[language=Go, caption={Breach Data Schema represented as a Go
-struct with imports from the standard library assumed},
-label=breachDataGoSchema,
+\begin{lstlisting}[language=Go,
+caption={Breach \texttt{ImportSchema} Go struct (imports from the standard
+library assumed)},
+label=breachImportSchema,
 backgroundcolor=\color{lstbg},
-morekeywords={any}
+morekeywords={any,time}
 ]
-  type breachDataSchema struct {
+  // ImportSchema is the model for importing locally available breach data.
+  type ImportSchema struct {
     Name              string
-    Time              time.Time
+    Description       string
+    Date              time.Time
     IsVerified        bool
     ContainsPasswords bool
     ContainsHashes    bool
@@ -639,18 +642,20 @@ morekeywords={any}
     HashPepperred     bool
     ContainsUsernames bool
     ContainsEmails    bool
-    Data              any
+    Data              *Data
   }
 \end{lstlisting}
 \vspace*{-\baselineskip}
 
-The Go representation shown in Listing~\ref{breachDataGoSchema} will in
+The Go \emph{struct} shown in Listing~\ref{breachImportSchema} will in
 actuality translate to a YAML document written and supplied by an
-administrative user of the program. The YAML format was chosen for several
+administrative user of the program. While the author is personally not the
+greatest supporter of YAML, the format was still chosen for several
 reasons:
 
 \begin{itemize}
-  \item relative ease of use (plain text, readability)
+  \item relative ease of use (plain text, readability) for machines and people
+    alike
   \item capability to store multiple \emph{documents} inside of a single file
   \item most of the inputs being implicitly typed as strings
   \item support for inclusion of comments
@@ -663,7 +668,8 @@ and written by humans and programs alike.
 
 \smallskip
 \begin{lstlisting}[style=yaml,
-caption={Example Breach Data Schema supplied to the program as a YAML file,
+caption={A YAML file containing breach data formatted according to the
+\texttt{ImportSchema},
 optionally containing multiple documents},
 label=breachDataYAMLSchema,
 backgroundcolor=\color{lstbg},
@@ -696,11 +702,14 @@ backgroundcolor=\color{lstbg},
 \vspace*{-\baselineskip}
 
 Notice how the emails list (\texttt{.data/emails}) in
-Listing~\ref{breachDataYAMLSchema} misses one record, perhaps because it was
-not supplied or mistakenly omitted. This is a valid scenario (mistakes happen)
-and the application needs to be able to handle it. The alternative would be to
-require the user to prepare the data in such a way that the empty/partial
-records would be dropped entirely.
+Listing~\ref{breachDataYAMLSchema} is missing one record, perhaps because it
+was mistakenly omitted due to either machine error or unfamiliarity with the
+format. This is a valid scenario (mistakes do happen) and the application needs
+to account for it. Alternatively, the program could start dropping
+empty/partial records, but that behaviour could quickly lead to unhappy users.
+The golden rule for the program is to \emph{always do the expected thing} (and
+to not be overly smart about it, i.e.\ the simpler program flow is often
+better).
 
 
 \n{3}{Have I Been Pwned? Integration}