1
0
Fork 0
mirror of https://github.com/BLAKE3-team/BLAKE3 synced 2024-05-07 09:46:16 +02:00
BLAKE3/b3sum/what_does_check_do.md

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

175 lines
6.8 KiB
Markdown
Raw Permalink Normal View History

2020-05-15 21:28:07 +02:00
# How does `b3sum --check` behave exactly?<br>or: Are filepaths...text?
2020-05-14 20:02:37 +02:00
Most of the time, `b3sum --check` is a drop-in replacement for `md5sum --check`
and other Coreutils hashing tools. It consumes a checkfile (the output of a
regular `b3sum` command), re-hashes all the files listed there, and returns
success if all of those hashes are still correct. What makes this more
complicated than it might seem, is that representing filepaths as text means we
need to consider many possible edge cases of unrepresentable filepaths. This
document describes all of these edge cases in detail.
## The simple case
Here's the result of running `b3sum a b c/d` in a directory that contains
those three files:
2020-05-15 21:28:07 +02:00
```bash
$ echo hi > a
$ echo lo > b
$ mkdir c
$ echo stuff > c/d
$ b3sum a b c/d
2020-05-14 20:02:37 +02:00
0b8b60248fad7ac6dfac221b7e01a8b91c772421a15b387dd1fb2d6a94aee438 a
6ae4a57bbba24f79c461d30bcb4db973b9427d9207877e34d2d74528daa84115 b
2d477356c962e54784f1c5dc5297718d92087006f6ee96b08aeaf7f3cd252377 c/d
```
If we pipe that output into `b3sum --check`, it will exit with status zero
(success) and print:
2020-05-15 21:28:07 +02:00
```bash
$ b3sum a b c/d | b3sum --check
2020-05-14 20:02:37 +02:00
a: OK
b: OK
c/d: OK
```
If we delete `b` and change the contents of `c/d`, and then use the same
checkfile as above, `b3sum --check` will exit with a non-zero status (failure)
and print:
2020-05-15 21:28:07 +02:00
```bash
$ b3sum a b c/d > checkfile
$ rm b
$ echo more stuff >> c/d
$ b3sum --check checkfile
2020-05-14 20:02:37 +02:00
a: OK
b: FAILED (No such file or directory (os error 2))
c/d: FAILED
```
In these typical cases, `b3sum` and `md5sum` have identical output for success
and very similar output for failure.
## Escaping newlines and backslashes
Since the checkfile format (the regular output format of `b3sum`) is
newline-separated text, we need to worry about what happens when a filepath
contains a newline, or worse. Suppose we create a file named `x[newline]x`
(3 characters). One way to create such a file is with a Python one-liner like
2020-05-14 20:02:37 +02:00
this:
2020-05-15 21:28:07 +02:00
```python
>>> open("x\nx", "w")
2020-05-14 20:02:37 +02:00
```
2020-05-15 22:05:18 +02:00
Here's what happens when we hash that file with `b3sum`:
2020-05-14 20:02:37 +02:00
2020-05-15 21:28:07 +02:00
```bash
$ b3sum x*
\af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 x\nx
2020-05-14 20:02:37 +02:00
```
2020-05-15 21:28:07 +02:00
Notice two things. First, `b3sum` puts a single `\` character at the front of
2020-05-14 20:02:37 +02:00
the line. This indicates that the filepath contains escape sequences that
2020-05-15 21:28:07 +02:00
`b3sum --check` will need to unescape. Then, `b3sum` replaces the newline
2020-05-14 20:02:37 +02:00
character in the filepath with the two-character escape sequence `\n`.
Similarly, if the filepath contained a backslash, `b3sum` would escape it as
`\\` in the output. So far, all of this behavior is still identical to
`md5sum`.
## Invalid Unicode
2021-11-13 15:24:15 +01:00
This is where `b3sum` and `md5sum` diverge. Apart from the newline and
backslash escapes described above, `md5sum` copies all other filepath bytes
verbatim to its output. That means its output encoding is "ASCII plus whatever
bytes we got from the command line". This creates two problems:
2020-05-14 20:02:37 +02:00
2020-05-15 21:28:07 +02:00
1. Printing something that isn't UTF-8 is kind of gross.
2020-05-14 20:02:37 +02:00
2. Windows support.
2020-05-15 21:28:07 +02:00
What's the deal with Windows? To start with, there's a fundamental difference
in how Unix and Windows represent filepaths. Unix filepaths are "usually UTF-8"
and Windows filepaths are "usually UTF-16". That means that a file named `abc`
is typically represented as the bytes `[97, 98, 99]` on Unix and as the bytes
`[97, 0, 98, 0, 99, 0]` on Windows. The `md5sum` approach won't work if we plan
on creating a checkfile on Unix and checking it on Windows, or vice versa.
A more portable approach is to convert platform-specific bytes into some
consistent Unicode encoding. (In practice this is going to be UTF-8, but in
theory it could be anything.) Then when `--check` needs to open a file, we
convert the Unicode representation back into platform-specific bytes. This
2020-05-23 21:13:19 +02:00
makes important common cases like `abc`, and in fact even `abc[newline]def`,
2020-05-15 21:28:07 +02:00
work as expected. Great!
But...what did we mean above when we said *usually* UTF-8 and *usually* UTF-16?
It turns out that not every possible sequence of bytes is valid UTF-8, and not
every possible sequence of 16-bit wide chars is valid UTF-16. For example, the
byte 0xFF (255) can never appear in any UTF-8 string. If we ask Python to
2020-05-15 21:28:07 +02:00
decode it, it yells at us:
```python
>>> b"\xFF".decode("UTF-8")
2020-05-15 21:28:07 +02:00
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```
However, tragically, we *can* create a file with that byte in its name (on
Linux at least, though not usually on macOS):
```python
>>> open(b"y\xFFy", "w")
2020-05-15 21:28:07 +02:00
```
So some filepaths aren't representable in Unicode at all. Our plan to "convert
platform-specific bytes into some consistent Unicode encoding" isn't going to
work for everything. What does `b3sum` do with the file above?
```bash
$ b3sum y*
af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 y<>y
2020-05-15 21:28:07 +02:00
```
That <20> in there is a "Unicode replacement character". When we run into
filepaths that we can't represent in Unicode, we replace the unrepresentable
parts with these characters. On the checking side, to avoid any possible
confusion between two different invalid filepaths, we automatically fail if we
see a replacement character. Together with a few more details covered in the
next section, this gives us an important set of properties:
1. Any file can be hashed locally.
2. Any file with a valid Unicode name not containing the <20> character can be
checked.
3. Checking ambiguous or unrepresentable filepaths always fails.
4. Checkfiles are always valid UTF-8.
5. Checkfiles are portable between Unix and Windows.
2020-05-15 21:28:07 +02:00
## Formal Rules
1. When hashing, filepaths are represented in a platform-specific encoding,
which can accommodate any filepath on the current platform. In Rust, this is
`OsStr`/`OsString`.
2020-05-15 21:28:07 +02:00
2. In output, filepaths are first converted to UTF-8. Any non-Unicode segments
are replaced with Unicode replacement characters (U+FFFD). In Rust, this is
`OsStr::to_string_lossy`.
3. Then, if a filepath contains any backslashes (U+005C) or newlines (U+000A),
2020-05-15 21:28:07 +02:00
these characters are escaped as `\\` and `\n` respectively.
4. Finally, any output line containing an escape sequence is prefixed with a
single backslash.
5. When checking, each line is parsed as UTF-8, separated by a newline
(U+000A). Invalid UTF-8 is an error.
6. Then, if a line begins with a backslash, the filepath component is
unescaped. Any escape sequence other than `\\` or `\n` is an error. If a
line does not begin with a backslash, unescaping is not performed, and any
backslashes in the filepath component are interpreted literally. (`b3sum`
output never contains unescaped backslashes, but they can occur in
checkfiles assembled by hand.)
7. Finally, if a filepath contains a Unicode replacement character (U+FFFD) or
a null character (U+0000), it is an error.
**Additionally, on Windows only:**
8. In output, all backslashes (U+005C) are replaced with forward slashes
(U+002F).
9. When checking, after unescaping, if a filepath contains a backslash, it is
an error.