Changes since 0.3.6:
- BUGFIX: The C implementation was incorrect on big endian systems for
inputs longer than 1024 bytes. This bug affected all previous versions
of the C implementation. Little endian platforms like x86 were
unaffected. The Rust implementation was also unaffected.
@jakub-zwolakowski and @pascal-cuoq from TrustInSoft reported this
bug: https://github.com/BLAKE3-team/BLAKE3/pull/118
- BUGFIX: The C build on x86-64 was producing binaries with an
executable stack. @tristanheaven reported this bug:
https://github.com/BLAKE3-team/BLAKE3/issues/109
- @mkrupcale added optimized implementations for SSE2. This improves
performance on older x86 processors that don't support SSE4.1.
- The C implementation now exposes the
`blake3_hasher_init_derive_key_raw` function, to make it easier to
implement language bindings. Added by @k0001.
This will let us add big endian testing to CI for our C code. (We were
already doing it for our Rust code.)
This is adapted from test_vectors/cross_test.sh. It works around the
limitation that the `cross` tool can't reach parent directories. It's an
unfortunate hack, but at least it's only for testing. It might've been
less hacky to use symlinks for this somehow, but I worry that would
break things on Windows, and I don't want to have to add workarounds for
my workarounds.
This is quite hard to trigger, because SSE2 has been guaranteed for a
long time. But you could trigger it this way:
rustup target add i686-unknown-linux-musl
RUSTFLAGS="-C target-cpu=i386" cargo build --target i686-unknown-linux-musl
Note a relevant gotcha though: The `cross` tool will not forward
environment variables like RUSTFLAGS to the container by default, so if
you're testing with `cross` you'll need to use the `rustc` command to
explicitly pass the flag, as I've done here in ci.yml. (Or you could
create a `Cross.toml` file, but I don't want to commit one of those if I
can avoid it.)
Use statically calculated ~mask. This reduces the number of moves and registers necessary at the expense of an extra memory load. This is probably a good trade-off since we are not bound by memory uops in this loop.
blake3_hasher_init_derive_key_len is an alternative version of
blake3_hasher_init_derive_key which takes the context and its
length as separate parameters, and not together as a C string.
The motivation for this addition is making it easier for
bindings to this C library to call this function without
having to first copy over the context bytes just to add
one 0x00 byte at the end.
Notice that contrary to blake3_hasher_init_derive_key,
blake3_hasher_init_derive_key_len allows the inclusion of a
0x00 byte in the context. Given the rules about context string
selection, this byte is unlikely to be used as part of a context
string. But if for some reason it is ever given, it will be
included in the context string and processed like any other
non-alphanumeric byte would. For compatibility with
blake3_hasher_init_derive_key, bindings should still check for
the absence of 0x00 bytes.
Use _mm_and_si128 and _mm_cmpeq_epi16 rather than expensive multiplication _mm_mullo_epi16 with _mm_srai_epi16 that compiler may not be able to optimize.
Use a simple shift for the rotation.
* c/blake3_sse2_x86-64_unix.S: emulate pshufb using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use two 16-bit shuffles: one for the low 64-bits and one for the high 64-bits.
* c/blake3_sse2_x86-64_unix.S: emulate pshufb using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use two pinsrw and a 16-bit shift to insert the 32-bit integer at the desired location.
* c/blake3_sse2_x86-64_unix.S: emulate pinsrd using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Blend according to (mask & b) | ((~mask) & a).
* c/blake3_sse2_x86-64_unix.S: emulate blendvps using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use a constant mask to blend according to (mask & b) | ((~mask) & a).
* c/blake3_sse2_x86-64_unix.S: emulate pblendw using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use a constant mask to blend according to (mask & b) | ((~mask) & a).
* src/rust_sse2.rs: emulate _mm_blend_epi16 using SSE2 intrinsics
* c/blake3_sse2.c: Likewise.
Wire up basic functions and features for SSE2 support using the SSE4.1 version
as a basis without implementing the SSE2 instructions yet.
* Cargo.toml: add no_sse2 feature
* benches/bench.rs: wire SSE2 benchmarks
* build.rs: add SSE2 rust intrinsics and assembly builds
* c/Makefile.testing: add SSE2 C and assembly targets
* c/README.md: add SSE2 to C build instructions
* c/blake3_c_rust_bindings/build.rs: add SSE2 C rust binding builds
* c/blake3_c_rust_bindings/src/lib.rs: add SSE2 C rust bindings
* c/blake3_dispatch.c: add SSE2 C dispatch
* c/blake3_impl.h: add SSE2 C function prototypes
* c/blake3_sse2.c: add SSE2 C intrinsic file starting with SSE4.1 version
* c/blake3_sse2_x86-64_{unix.S,windows_gnu.S,windows_msvc.asm}: add SSE2
assembly files starting with SSE4.1 version
* src/ffi_sse2.rs: add rust implementation using SSE2 C rust bindings
* src/lib.rs: add SSE2 rust intrinsics and SSE2 C rust binding rust SSE2 module
configurations
* src/platform.rs: add SSE2 rust platform detection and dispatch
* src/rust_sse2.rs: add SSE2 rust intrinsic file starting with SSE4.1 version
* tools/instruction_set_support/src/main.rs: add SSE2 feature detection
The default executable stack setting on Linux can be fixed in two different ways:
- By adding the `.section .note.GNU-stack,"",%progbits` special incantation
- By passing the `--noexecstack` flag to the assembler
This patch implements both, but only one of them is strictly necessary.
I've also added some additional hardening flags to the Makefile. May not be portable.