The portable implementation was getting slowed down by converting back
and forth between words and bytes.
I made the corresponding change on the C side first (12a37be8b5), and
as part of this commit I'm re-vendoring the C code. I'm also exposing a
small FFI interface to C so that blake3_neon.c can link against
portable.rs rather than blake3_portable.c; see c_neon.rs.
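
For illustration only, here is a rough sketch of the two ideas above:
loading a block into words once instead of converting repeatedly, and
exporting a Rust function with the C ABI so a C file can link against
it. Every name and signature here is hypothetical and is not the
actual interface in portable.rs or c_neon.rs.

```rust
// Hypothetical sketch; not the real portable.rs / c_neon.rs code.

/// Load a 64-byte block as sixteen little-endian 32-bit words in one
/// pass, so later code can work on words without touching bytes again.
pub fn words_from_le_bytes_64(bytes: &[u8; 64]) -> [u32; 16] {
    let mut words = [0u32; 16];
    for (word, chunk) in words.iter_mut().zip(bytes.chunks_exact(4)) {
        *word = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    words
}

/// C-callable wrapper around the function above (assumed name).
///
/// Safety: `block` must point to 64 readable bytes and `out` must point
/// to space for 16 writable `u32` values.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn example_load_block_words(block: *const u8, out: *mut u32) {
    // Reinterpret the raw pointer as a fixed-size byte array reference.
    let bytes: &[u8; 64] = unsafe { &*(block as *const [u8; 64]) };
    let words = words_from_le_bytes_64(bytes);
    // Copy the sixteen words into the caller-provided buffer.
    unsafe { core::ptr::copy_nonoverlapping(words.as_ptr(), out, 16) };
}
```

On the C side, a matching declaration under the same assumptions would
be `void example_load_block_words(const uint8_t *block, uint32_t *out);`.
The `#[unsafe(no_mangle)]` spelling needs a reasonably recent Rust
toolchain; older ones spell it `#[no_mangle]`.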