The previous version of this API called for a key of exactly 256 bits.
That's good for optimal performance, but it would mean losing the
use-with-other-algorithms property for applications whose input keys are
a different size. There's no way for an abstraction over the previous
version to provide reliable domain separation for the "extract" step.
Also use fewer array references (the compiler doesn't care) be more
explicit with a `new_cv` mutable variable. This clarifies
push_chunk_chaining_value just a bit. Since that's the trickiest
function in the entire thing, it's good to clarify it. (It also gets
excerpted directly into the spec.)
This is a performance improvement on modern x86 chips (Skylake and
later), and the LLVM optimizer can convert these to AVX-512 rotations
when those are enabled.
Putting secret keys on the command line is bad practice, because command
line args are usually globally visible within the OS. Even if these
flags are mostly intended for testing and experimentation, we might as
well do the right thing. Plus this saves people the trouble of hex
encoding their keys.
As part of this, get rid of the BLAKE3_FUZZ_ITERATIONS variable. I
wasn't using it anywhere, and it was leading to some compiler warnings
in --no-default-features mode.
Smaller chunk sizes are a big benefit for parallelism at shorter input
lengths, and recent benchmarks show that this reduction has a relative
small cost in terms of peak throughput. It's also a nice round number.
The portable implementation was getting slowed down by converting back
and forth between words and bytes.
I made the corresponding change on the C side first
(12a37be8b5),
and as part of this commit I'm re-vendoring the C code. I'm also
exposing a small FFI interface to C so that blake3_neon.c can link
against portable.rs rather than blake3_portable.c, see c_neon.rs.