1
0
Fork 0
mirror of https://github.com/BLAKE3-team/BLAKE3 synced 2024-05-29 02:16:04 +02:00
Commit Graph

245 Commits

Author SHA1 Message Date
Jack O'Connor 5ef22de9d0 link to the C implementation from the README 2020-01-27 13:02:00 -05:00
Jack O'Connor 71e605fd5d
typo 2020-01-26 16:12:10 -05:00
Jack O'Connor 1db856a3e5 expand the C README for public consumption 2020-01-26 16:07:51 -05:00
Samuel Neves 214c70d8f3
Merge pull request #40 from erijo/cpp
Add extern "C" to blake3.h
2020-01-24 00:42:41 +00:00
Erik Johansson 182aea4871 Add extern "C" to blake3.h
So that the header can be included in C++-programs without getting linker
errors.
2020-01-23 20:42:34 +01:00
Samuel Neves a830ab2661 streamline load_counters
avx2 before:

        mov     eax, esi
        neg     rax
        vmovq   xmm0, rax
        vpbroadcastq    ymm0, xmm0
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI1_0]
        vmovq   xmm2, rdi
        vpbroadcastq    ymm1, xmm2
        vpaddq  ymm1, ymm0, ymm1
        vmovdqa ymm0, ymmword ptr [rip + .LCPI1_1] # ymm0 = [0,2,4,6,4,6,6,7]
        vpermd  ymm3, ymm0, ymm1
        mov     r8d, eax
        and     r8d, 5
        add     r8, rdi
        mov     esi, eax
        and     esi, 6
        add     rsi, rdi
        and     eax, 7
        vpshufd xmm4, xmm3, 231         # xmm4 = xmm3[3,1,2,3]
        vpinsrd xmm4, xmm4, r8d, 1
        add     rax, rdi
        vpinsrd xmm4, xmm4, esi, 2
        vpinsrd xmm4, xmm4, eax, 3
        vpshufd xmm3, xmm3, 144         # xmm3 = xmm3[0,0,1,2]
        vpinsrd xmm3, xmm3, edi, 0
        vmovdqa xmmword ptr [rdx], xmm3
        vmovdqa xmmword ptr [rdx + 16], xmm4
        vpermq  ymm3, ymm1, 144         # ymm3 = ymm1[0,0,1,2]
        vpblendd        ymm2, ymm3, ymm2, 3 # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
        vpsrlq  ymm2, ymm2, 32
        vpermd  ymm2, ymm0, ymm2
        vextracti128    xmm1, ymm1, 1
        vmovq   xmm3, rax
        vmovq   xmm4, rsi
        vpunpcklqdq     xmm3, xmm4, xmm3 # xmm3 = xmm4[0],xmm3[0]
        vmovq   xmm4, r8
        vpalignr        xmm1, xmm4, xmm1, 8 # xmm1 = xmm1[8,9,10,11,12,13,14,15],xmm4[0,1,2,3,4,5,6,7]
        vinserti128     ymm1, ymm1, xmm3, 1
        vpsrlq  ymm1, ymm1, 32
        vpermd  ymm0, ymm0, ymm1

avx2 after:

        neg     esi
        vmovd   xmm0, esi
        vpbroadcastd    ymm0, xmm0
        vmovd   xmm1, edi
        vpbroadcastd    ymm1, xmm1
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        vpaddd  ymm1, ymm1, ymm0
        vpbroadcastd    ymm2, dword ptr [rip + .LCPI0_1] # ymm2 = [2147483648,2147483648,2147483648,2147483648,2147483648,2147483648,2147483648,2147483648]
        vpor    ymm0, ymm0, ymm2
        vpxor   ymm2, ymm1, ymm2
        vpcmpgtd        ymm0, ymm0, ymm2
        shr     rdi, 32
        vmovd   xmm2, edi
        vpbroadcastd    ymm2, xmm2
        vpsubd  ymm0, ymm2, ymm0
2020-01-23 12:17:43 +00:00
Samuel Neves de1458c565 name collision 2020-01-23 11:51:46 +00:00
Samuel Neves 37ea737c16 more robust bit-trickery functions 2020-01-23 10:58:45 +00:00
Jack O'Connor e17c45ddd5 version 0.1.3
Changes since 0.1.2:
- All x86 implementations include _mm_prefetch optimizations. These
  improve performance for very large inputs.
- The C implementation performs parallel parent hashing, matching the
  performance of the single-threaded Rust implementation.
- b3sum supports --no-mmap. Contributed by @cesarb.
2020-01-22 21:35:24 -05:00
Jack O'Connor 163f52245d port compress_subtree_to_parent_node from Rust to C
This recursive function performs parallel parent node hashing, which is
an important optimization.
2020-01-22 21:32:39 -05:00
Jack O'Connor de1cf0038e add the round_down_to_power_of_2 algoirthm
This could probably be sped up by detecting LZCNT support, but it's
unlikely to be a bottleneck.
2020-01-22 21:32:39 -05:00
Jack O'Connor 087d72e08f clang-format 2020-01-22 21:32:35 -05:00
Jack O'Connor 92d421dea1 add a larger test case
One thing I like to test is that, if I hack simd_degree to be higher
than MAX_SIMD_DEGREE, assertions fire. This requires a test case long
enough to exceed that number of chunks.
2020-01-22 21:19:47 -05:00
Jack O'Connor 78e858d050 expand comments about lazy merging 2020-01-21 12:09:42 -05:00
Jack O'Connor ccadbad244 stack size in the optimized impl should be MAX_DEPTH + 1 2020-01-21 11:41:20 -05:00
Jack O'Connor d0c8fc16b3 use a better popcnt fallback algorithm
This one loops once for every set bit, rather than once for each bit
position to the right of the highest set bit.

https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
2020-01-21 10:47:00 -05:00
Jack O'Connor 67262dff31 double the maximum incremental subtree size
Because compress_subtree_to_parent_node effectively cuts its input in
half, we can give it an input that's twice as big, without violating the
CV stack invariant.
2020-01-20 19:25:55 -05:00
Jack O'Connor 4a92e8eeb1 add the reference impl doc test to CI 2020-01-20 16:36:30 -05:00
Jack O'Connor 4021636022 test the BLAKE3_NO_* vars in CI 2020-01-20 16:19:16 -05:00
Jack O'Connor 40f4bdc22a switch from BLAKE3_USE_* to BLAKE3_NO_*
This means that compiling C sources includes all implementations by
default, which is what most callers are going to want.
2020-01-20 15:24:03 -05:00
Samuel Neves 66da5afb0c make things more modular 2020-01-20 12:03:31 -05:00
Jack O'Connor 491f799fd9 clarify the --no-mmap logic a bit 2020-01-20 12:03:31 -05:00
Cesar Eduardo Barros 273a679ddc b3sum: add no-mmap option
Using mmap is not always the best option. For instance, if the file is
truncated while being read, b3sum will receive a SIGBUS and abort.

Follow ripgrep's lead and add a --no-mmap option to disable mmap. This
can also help benchmark the mmap versus the read path, and help debug
performance issues potentially caused by mmap access patterns (like
issue #32).
2020-01-20 11:58:07 -05:00
Samuel Neves b8c33e11ef manually prefetch message blocks 2020-01-19 18:45:37 +00:00
Jack O'Connor a3147eb909 comment about parallelism 2020-01-18 14:32:52 -05:00
Jack O'Connor 14cd5c51c4 version 0.1.2
Changes since 0.1.1:
- b3sum no longer mmaps files smaller than 16 KiB. This improves
  performance for hashing many small files. Contributed by @xzfc.
- b3sum now supports --raw output. Contributed by @phayes.
2020-01-17 13:58:55 -05:00
Jack O'Connor 7ee89fe738 update b3sum help text in README.md 2020-01-17 13:54:58 -05:00
Jack O'Connor e2ce07601f edit the --raw help string 2020-01-17 13:36:09 -05:00
Jack O'Connor 2db9f2d2ea
Merge pull request #22 from phayes/raw_output
Adds support for raw output to b3sum
2020-01-17 13:29:39 -05:00
Albert Safin f26880e282 b3sum: do not mmap files smaller than 16 KiB 2020-01-17 12:58:32 -05:00
Jack O'Connor 28701d1585 add a README.md in c/blake3_c_rust_bindings 2020-01-16 18:29:20 -05:00
Jack O'Connor 84c26670bf add blake3_c_rust_bindings for testing and benchmarking 2020-01-16 16:09:42 -05:00
Jack O'Connor 33a9bee51f update the b3sum README 2020-01-15 10:46:47 -05:00
Jack O'Connor e60934a129 more consistent use of Self in the reference impl 2020-01-15 10:41:06 -05:00
phayes aec1d88e31
Using take() to limit the number of bytes copies 2020-01-14 14:35:18 -08:00
Jack O'Connor c8c442a99b add comments to the reference impl 2020-01-14 15:22:22 -05:00
phayes a02b4cb040
bailing early if we have both --raw and multiple files 2020-01-13 14:56:06 -08:00
phayes 0e8734b7f6
Making sure our raw multi-file test is testing what we think it is 2020-01-13 14:48:24 -08:00
phayes 5cb01ad696
Using stdout_capture for capturing stdout that is not a string 2020-01-13 14:43:09 -08:00
phayes 2bd7614d1e
Fixing stdout locking 2020-01-13 14:40:30 -08:00
phayes ec1233bca3
Locking stdout for writing in a tight loop. 2020-01-13 14:36:28 -08:00
phayes 8d251af29f
Adds support for raw output to b3sum 2020-01-13 13:12:47 -08:00
Jack O'Connor 02250a7b7c version 0.1.1
Changes since 0.1.0:
- Optimizations contributed by @cesarb.
- Fix the build on x86_64-pc-windows-gnu when c_avx512 is enabled.
- Add an explicit error message for compilers that don't support c_avx512.
2020-01-13 14:47:28 -05:00
Jack O'Connor caa6622afa explicitly check for -mavx512f or /arch:AVX512 support
If AVX-512 is enabled, and the local C compiler doesn't support it, the
build is going to fail. However, if we check for this explicitly, we can
give a better error message.

Fixes https://github.com/BLAKE3-team/BLAKE3/issues/6.
2020-01-13 14:34:27 -05:00
Jack O'Connor b9b1d48545 avoid using MSVC-style flags with the GNU toolchain on Windows 2020-01-13 13:34:06 -05:00
Jack O'Connor 5c7004c518 a bit of README formatting 2020-01-13 13:21:06 -05:00
Jack O'Connor de3ff01b14 mention the default output size in the README 2020-01-13 12:09:43 -05:00
Jack O'Connor 3ec157a9e9 add a CI badge 2020-01-13 12:09:43 -05:00
Cesar Eduardo Barros 9f509a8f1f Inline trivial functions
For the Read and Write traits, this also allows the compiler to see that
the return value is always Ok, allowing it to remove the Err case from
the caller as dead code.
2020-01-12 18:27:42 -05:00
Cesar Eduardo Barros 4690c5f14e Use fixed-size constant_time_eq
The generic constant_time_eq has several branches on the slice length,
which are not necessary when the slice length is known. However, the
optimizer is not allowed to look into the core of constant_time_eq, so
these branches cannot be elided.

Use instead a fixed-size variant of constant_time_eq, which has no
branches since the length is known.
2020-01-12 17:40:57 -05:00