Use statically calculated ~mask. This reduces the number of moves and registers necessary at the expense of an extra memory load. This is probably a good trade-off since we are not bound by memory uops in this loop.
blake3_hasher_init_derive_key_len is an alternative version of
blake3_hasher_init_derive_key which takes the context and its
length as separate parameters, and not together as a C string.
The motivation for this addition is making it easier for
bindings to this C library to call this function without
having to first copy over the context bytes just to add
one 0x00 byte at the end.
Notice that contrary to blake3_hasher_init_derive_key,
blake3_hasher_init_derive_key_len allows the inclusion of a
0x00 byte in the context. Given the rules about context string
selection, this byte is unlikely to be used as part of a context
string. But if for some reason it is ever given, it will be
included in the context string and processed like any other
non-alphanumeric byte would. For compatibility with
blake3_hasher_init_derive_key, bindings should still check for
the absence of 0x00 bytes.
Use _mm_and_si128 and _mm_cmpeq_epi16 rather than expensive multiplication _mm_mullo_epi16 with _mm_srai_epi16 that compiler may not be able to optimize.
Use a simple shift for the rotation.
* c/blake3_sse2_x86-64_unix.S: emulate pshufb using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use two 16-bit shuffles: one for the low 64-bits and one for the high 64-bits.
* c/blake3_sse2_x86-64_unix.S: emulate pshufb using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use two pinsrw and a 16-bit shift to insert the 32-bit integer at the desired location.
* c/blake3_sse2_x86-64_unix.S: emulate pinsrd using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Blend according to (mask & b) | ((~mask) & a).
* c/blake3_sse2_x86-64_unix.S: emulate blendvps using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use a constant mask to blend according to (mask & b) | ((~mask) & a).
* c/blake3_sse2_x86-64_unix.S: emulate pblendw using SSE2 instructions for x86_64 unix
* c/blake3_sse2_x86-64_windows_gnu.S: Likewise for x86_64 Windows GNU.
* c/blake3_sse2_x86-64_windows_msvc.asm: Likewise for x86_64 Windows MSVC.
Use a constant mask to blend according to (mask & b) | ((~mask) & a).
* src/rust_sse2.rs: emulate _mm_blend_epi16 using SSE2 intrinsics
* c/blake3_sse2.c: Likewise.
Wire up basic functions and features for SSE2 support using the SSE4.1 version
as a basis without implementing the SSE2 instructions yet.
* Cargo.toml: add no_sse2 feature
* benches/bench.rs: wire SSE2 benchmarks
* build.rs: add SSE2 rust intrinsics and assembly builds
* c/Makefile.testing: add SSE2 C and assembly targets
* c/README.md: add SSE2 to C build instructions
* c/blake3_c_rust_bindings/build.rs: add SSE2 C rust binding builds
* c/blake3_c_rust_bindings/src/lib.rs: add SSE2 C rust bindings
* c/blake3_dispatch.c: add SSE2 C dispatch
* c/blake3_impl.h: add SSE2 C function prototypes
* c/blake3_sse2.c: add SSE2 C intrinsic file starting with SSE4.1 version
* c/blake3_sse2_x86-64_{unix.S,windows_gnu.S,windows_msvc.asm}: add SSE2
assembly files starting with SSE4.1 version
* src/ffi_sse2.rs: add rust implementation using SSE2 C rust bindings
* src/lib.rs: add SSE2 rust intrinsics and SSE2 C rust binding rust SSE2 module
configurations
* src/platform.rs: add SSE2 rust platform detection and dispatch
* src/rust_sse2.rs: add SSE2 rust intrinsic file starting with SSE4.1 version
* tools/instruction_set_support/src/main.rs: add SSE2 feature detection
The default executable stack setting on Linux can be fixed in two different ways:
- By adding the `.section .note.GNU-stack,"",%progbits` special incantation
- By passing the `--noexecstack` flag to the assembler
This patch implements both, but only one of them is strictly necessary.
I've also added some additional hardening flags to the Makefile. May not be portable.
It looks like I originally made this mistake when I was copying code
from the baokeshed prototype (a274a9b0fa),
and then it got replicated into the C implementation later.
Whereas vinserti64x4 is present on AVX512F, vinserti32x8 requires
AVX512DQ, which we do not test for. At this point there is not
much risk of incompatibility, since Skylake-X chips have all the
requires instruction sets, but let's be precise about this.
The biggest change here is that assembly implementations are enabled by
default.
Added features:
- "pure" (Pure Rust, with no C or assembly implementations.)
Removed features:
- "c" (Now basically the default.)
Renamed features;
- "c_prefer_intrinsics" -> "prefer_intrinsics"
- "c_neon" -> "neon"
Unchanged:
- "rayon"
- "std" (Still the only feature on by default.)
If the total number of chunks hashed so far is e.g. 1, and update() is
called with e.g. 8 more chunks, we can't compress all 8 together. We
have to break the input up, to make sure that that 1 lone chunk CV gets
merged with its proper sibling, and that in general the correct layout
of the tree is preserved. What we should do is hash 1-2-4-1 chunks of
input, using increasing powers of 2 (with some cleanup at the end). What
we were doing was 2-2-2-2 chunks. This was the result of a mistaken
optimization that got us stuck with an always-odd number of chunks so
far.
Fixes https://github.com/BLAKE3-team/BLAKE3/issues/69.
This gives the assembly files the same prefix as the intrinsics files which
simplifies building when the build system should pick between the assembly and
the intrinsics files.