6.8 KiB
How does b3sum --check
behave exactly?
or: Are filepaths...text?
Most of the time, b3sum --check
is a drop-in replacement for md5sum --check
and other Coreutils hashing tools. It consumes a checkfile (the output of a
regular b3sum
command), re-hashes all the files listed there, and returns
success if all of those hashes are still correct. What makes this more
complicated than it might seem, is that representing filepaths as text means we
need to consider many possible edge cases of unrepresentable filepaths. This
document describes all of these edge cases in detail.
The simple case
Here's the result of running b3sum a b c/d
in a directory that contains
those three files:
$ echo hi > a
$ echo lo > b
$ mkdir c
$ echo stuff > c/d
$ b3sum a b c/d
0b8b60248fad7ac6dfac221b7e01a8b91c772421a15b387dd1fb2d6a94aee438 a
6ae4a57bbba24f79c461d30bcb4db973b9427d9207877e34d2d74528daa84115 b
2d477356c962e54784f1c5dc5297718d92087006f6ee96b08aeaf7f3cd252377 c/d
If we pipe that output into b3sum --check
, it will exit with status zero
(success) and print:
$ b3sum a b c/d | b3sum --check
a: OK
b: OK
c/d: OK
If we delete b
and change the contents of c/d
, and then use the same
checkfile as above, b3sum --check
will exit with a non-zero status (failure)
and print:
$ b3sum a b c/d > checkfile
$ rm b
$ echo more stuff >> c/d
$ b3sum --check checkfile
a: OK
b: FAILED (No such file or directory (os error 2))
c/d: FAILED
In these typical cases, b3sum
and md5sum
have identical output for success
and very similar output for failure.
Escaping newlines and backslashes
Since the checkfile format (the regular output format of b3sum
) is
newline-separated text, we need to worry about what happens when a filepath
contains a newline, or worse. Suppose we create a file named x[newline]x
(3 characters). One way to create such a file is with a Python one-liner like
this:
>>> open("x\nx", "w")
Here's what happens when we hash that file with b3sum
:
$ b3sum x*
\af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 x\nx
Notice two things. First, b3sum
puts a single \
character at the front of
the line. This indicates that the filepath contains escape sequences that
b3sum --check
will need to unescape. Then, b3sum
replaces the newline
character in the filepath with the two-character escape sequence \n
.
Similarly, if the filepath contained a backslash, b3sum
would escape it as
\\
in the output. So far, all of this behavior is still identical to
md5sum
.
Invalid Unicode
This is where b3sum
and md5sum
diverge. Apart from the newline and
backslash escapes described above, md5sum
copies all other filepath bytes
verbatim to its output. That means its output encoding is "ASCII plus whatever
bytes we got from the command line". This creates two problems:
- Printing something that isn't UTF-8 is kind of gross.
- Windows support.
What's the deal with Windows? To start with, there's a fundamental difference
in how Unix and Windows represent filepaths. Unix filepaths are "usually UTF-8"
and Windows filepaths are "usually UTF-16". That means that a file named abc
is typically represented as the bytes [97, 98, 99]
on Unix and as the bytes
[97, 0, 98, 0, 99, 0]
on Windows. The md5sum
approach won't work if we plan
on creating a checkfile on Unix and checking it on Windows, or vice versa.
A more portable approach is to convert platform-specific bytes into some
consistent Unicode encoding. (In practice this is going to be UTF-8, but in
theory it could be anything.) Then when --check
needs to open a file, we
convert the Unicode representation back into platform-specific bytes. This
makes important common cases like abc
, and in fact even abc[newline]def
,
work as expected. Great!
But...what did we mean above when we said usually UTF-8 and usually UTF-16? It turns out that not every possible sequence of bytes is valid UTF-8, and not every possible sequence of 16-bit wide chars is valid UTF-16. For example, the byte 0xFF (255) can never appear in any UTF-8 string. If we ask Python to decode it, it yells at us:
>>> b"\xFF".decode("UTF-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
However, tragically, we can create a file with that byte in its name (on Linux at least, though not usually on macOS):
>>> open(b"y\xFFy", "w")
So some filepaths aren't representable in Unicode at all. Our plan to "convert
platform-specific bytes into some consistent Unicode encoding" isn't going to
work for everything. What does b3sum
do with the file above?
$ b3sum y*
af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 y<>y
That <20> in there is a "Unicode replacement character". When we run into filepaths that we can't represent in Unicode, we replace the unrepresentable parts with these characters. On the checking side, to avoid any possible confusion between two different invalid filepaths, we automatically fail if we see a replacement character. Together with a few more details covered in the next section, this gives us an important set of properties:
- Any file can be hashed locally.
- Any file with a valid Unicode name not containing the <20> character can be checked.
- Checking ambiguous or unrepresentable filepaths always fails.
- Checkfiles are always valid UTF-8.
- Checkfiles are portable between Unix and Windows.
Formal Rules
-
When hashing, filepaths are represented in a platform-specific encoding, which can accommodate any filepath on the current platform. In Rust, this is
OsStr
/OsString
. -
In output, filepaths are first converted to UTF-8. Any non-Unicode segments are replaced with Unicode replacement characters (U+FFFD). In Rust, this is
OsStr::to_string_lossy
. -
Then, if a filepath contains any backslashes (U+005C) or newlines (U+000A), these characters are escaped as
\\
and\n
respectively. -
Finally, any output line containing an escape sequence is prefixed with a single backslash.
-
When checking, each line is parsed as UTF-8, separated by a newline (U+000A). Invalid UTF-8 is an error.
-
Then, if a line begins with a backslash, the filepath component is unescaped. Any escape sequence other than
\\
or\n
is an error. If a line does not begin with a backslash, unescaping is not performed, and any backslashes in the filepath component are interpreted literally. (b3sum
output never contains unescaped backslashes, but they can occur in checkfiles assembled by hand.) -
Finally, if a filepath contains a Unicode replacement character (U+FFFD) or a null character (U+0000), it is an error.
Additionally, on Windows only:
-
In output, all backslashes (U+005C) are replaced with forward slashes (U+002F).
-
When checking, after unescaping, if a filepath contains a backslash, it is an error.