The previous version of this API called for a key of exactly 256 bits.
That's good for optimal performance, but it would mean losing the
use-with-other-algorithms property for applications whose input keys are
a different size. There's no way for an abstraction over the previous
version to provide reliable domain separation for the "extract" step.
Also use fewer array references (the compiler doesn't care) be more
explicit with a `new_cv` mutable variable. This clarifies
push_chunk_chaining_value just a bit. Since that's the trickiest
function in the entire thing, it's good to clarify it. (It also gets
excerpted directly into the spec.)
Smaller chunk sizes are a big benefit for parallelism at shorter input
lengths, and recent benchmarks show that this reduction has a relative
small cost in terms of peak throughput. It's also a nice round number.