We use a counter value that's very close to wrapping the lower word,
when we're testing the hash_many chunks case. It turns out that this is
a useful thing to do with parents too, even though parents 1) are
teeechnically supposed to always use a counter of 0, and 2) aren't going
to increment the counter at all. We caught a bug in the assembly
implementations this way (where we accidentally did increment the
counter, but only the higher word), because the equivalent test in
rust_c_bindings uses this eccentric parents counter value.
https://github.com/rust-lang/rust/issues/68905 is currently causing
nightly builds to fail, unless `--no-default-features` is used. This
change means that the default build will succeed, and the failure will
only happen when the "c_avx512" is enabled. The `b3sum` crate will still
fail to build on nightly, because it enables that feature, but most
callers should start succeeding on nightly.
This is a new interface that allows the caller to provide a
multi-threading implementation. It's defined in terms of a new `Join`
trait, for which we provide two implementations, `SerialJoin` and
`RayonJoin`. This lets the caller control when multi-threading is used,
rather than the previous all-or-nothing design of the "rayon" feature.
Although existing callers should keep working, this is a compatibility
break, because callers who were relying on automatic multi-threading
before will now be single-threaded. Thus the next release of this crate
will need to be version 0.2.
See https://github.com/BLAKE3-team/BLAKE3/issues/25 and
https://github.com/BLAKE3-team/BLAKE3/issues/54.
One thing I like to test is that, if I hack simd_degree to be higher
than MAX_SIMD_DEGREE, assertions fire. This requires a test case long
enough to exceed that number of chunks.
Because compress_subtree_to_parent_node effectively cuts its input in
half, we can give it an input that's twice as big, without violating the
CV stack invariant.
For the Read and Write traits, this also allows the compiler to see that
the return value is always Ok, allowing it to remove the Err case from
the caller as dead code.
The generic constant_time_eq has several branches on the slice length,
which are not necessary when the slice length is known. However, the
optimizer is not allowed to look into the core of constant_time_eq, so
these branches cannot be elided.
Use instead a fixed-size variant of constant_time_eq, which has no
branches since the length is known.
The previous version of this API called for a key of exactly 256 bits.
That's good for optimal performance, but it would mean losing the
use-with-other-algorithms property for applications whose input keys are
a different size. There's no way for an abstraction over the previous
version to provide reliable domain separation for the "extract" step.
This is a performance improvement on modern x86 chips (Skylake and
later), and the LLVM optimizer can convert these to AVX-512 rotations
when those are enabled.
As part of this, get rid of the BLAKE3_FUZZ_ITERATIONS variable. I
wasn't using it anywhere, and it was leading to some compiler warnings
in --no-default-features mode.
Smaller chunk sizes are a big benefit for parallelism at shorter input
lengths, and recent benchmarks show that this reduction has a relative
small cost in terms of peak throughput. It's also a nice round number.
The portable implementation was getting slowed down by converting back
and forth between words and bytes.
I made the corresponding change on the C side first
(12a37be8b5),
and as part of this commit I'm re-vendoring the C code. I'm also
exposing a small FFI interface to C so that blake3_neon.c can link
against portable.rs rather than blake3_portable.c, see c_neon.rs.