The Bloom filter used for path limited history traversal was broken
on systems whose "char" is unsigned; update the implementation and
bump the format version to 2.
* tb/path-filter-fix:
bloom: introduce `deinit_bloom_filters()`
commit-graph: reuse existing Bloom filters where possible
object.h: fix mis-aligned flag bits table
commit-graph: new Bloom filter version that fixes murmur3
commit-graph: unconditionally load Bloom filters
bloom: prepare to discard incompatible Bloom filters
bloom: annotate filters with hash version
repo-settings: introduce commitgraph.changedPathsVersion
t4216: test changed path filters with high bit paths
t/helper/test-read-graph: implement `bloom-filters` mode
bloom.h: make `load_bloom_filter_from_graph()` public
t/helper/test-read-graph.c: extract `dump_graph_info()`
gitformat-commit-graph: describe version 2 of BDAT
commit-graph: ensure Bloom filters are read with consistent settings
revision.c: consult Bloom filters for root commits
t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version to
write (and read), a config variable is needed.
Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.
This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is a recognized value.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that multi-pack reuse is supported, enable it via the
feature.experimental configuration in addition to the classic
`pack.allowPackReuse`.
This will allow more users to experiment with the new behavior who might
not otherwise be aware of the existing `pack.allowPackReuse`
configuration option.
The enum with values NO_PACK_REUSE, SINGLE_PACK_REUSE, and
MULTI_PACK_REUSE is defined statically in builtin/pack-objects.c's
compilation unit. We could hoist that enum into a scope visible from the
repository_settings struct, and then use that enum value in
pack-objects. Instead, define a single int that indicates what
pack-objects's default value should be to avoid additional unnecessary
code movement.
Though `feature.experimental` implies `pack.allowPackReuse=multi`, this
can still be overridden by explicitly setting the latter configuration
to either "single" or "false". Tests covering all of these cases are
showin t5332.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'read_replace_refs' global specifies whether or not we should
respect the references of the form 'refs/replace/<oid>' to replace which
object we look up when asking for '<oid>'. This global has caused issues
when it is not initialized properly, such as in b6551feadfd (merge-tree:
load default git config, 2023-05-10).
To make this more robust, move its config-based initialization out of
git_default_config and into prepare_repo_settings(). This provides a
repository-scoped version of the 'read_replace_refs' global.
The global still has its purpose: it is disabled process-wide by the
GIT_NO_REPLACE_OBJECTS environment variable or by a call to
disable_replace_refs() in some specific Git commands.
Since we already encapsulated the use of the constant inside
replace_refs_enabled(), we can perform the initialization inside that
method, if necessary. This solves the problem of forgetting to check the
config, as we will check it before returning this value.
Due to this encapsulation, the global can move to be static within
replace-object.c.
There is an interesting behavior change possible here: we now have a
repository-scoped understanding of this config value. Thus, if there was
a command that recurses into submodules and might follow replace refs,
then it would now respect the core.useReplaceRefs config value in each
repository.
'git grep --recurse-submodules' is such a command that recurses into
submodules in-process. We can demonstrate the granularity of this config
value via a test in t7814.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When reachability bitmap coverage exists in a repository, Git will use a
different (and hopefully faster) traversal to compute revision walks.
Consider a set of positive and negative tips (which we'll refer to with
their standard bitmap parlance by "wants", and "haves"). In order to
figure out what objects exist between the tips, the existing traversal
in `prepare_bitmap_walk()` does something like:
1. Consider if we can even compute the set of objects with bitmaps,
and fall back to the usual traversal if we cannot. For example,
pathspec limiting traversals can't be computed using bitmaps (since
they don't know which objects are at which paths). The same is true
of certain kinds of non-trivial object filters.
2. If we can compute the traversal with bitmaps, partition the
(dereferenced) tips into two object lists, "haves", and "wants",
based on whether or not the objects have the UNINTERESTING flag,
respectively.
3. Fall back to the ordinary object traversal if either (a) there are
more than zero haves, none of which are in the bitmapped pack or
MIDX, or (b) there are no wants.
4. Construct a reachability bitmap for the "haves" side by walking
from the revision tips down to any existing bitmaps, OR-ing in any
bitmaps as they are found.
5. Then do the same for the "wants" side, stopping at any objects that
appear in the "haves" bitmap.
6. Filter the results if any object filter (that can be easily
computed with bitmaps alone) was given, and then return back to the
caller.
When there is good bitmap coverage relative to the traversal tips, this
walk is often significantly faster than an ordinary object traversal
because it can visit far fewer objects.
But in certain cases, it can be significantly *slower* than the usual
object traversal. Why? Because we need to compute complete bitmaps on
either side of the walk. If either one (or both) of the sides require
walking many (or all!) objects before they get to an existing bitmap,
the extra bitmap machinery is mostly or all overhead.
One of the benefits, however, is that even if the walk is slower, bitmap
traversals are guaranteed to provide an *exact* answer. Unlike the
traditional object traversal algorithm, which can over-count the results
by not opening trees for older commits, the bitmap walk builds an exact
reachability bitmap for either side, meaning the results are never
over-counted.
But producing non-exact results is OK for our traversal here (both in
the bitmap case and not), as long as the results are over-counted, not
under.
Relaxing the bitmap traversal to allow it to produce over-counted
results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the
*boundary* down to the nearest bitmap, instead of from each of the
UNINTERESTING tips.
The boundary-based approach still has degenerate cases, but we'll show
in a moment that it is often a significant improvement.
The new algorithm works as follows:
1. Build a (partial) bitmap of the haves side by first OR-ing any
bitmap(s) that already exist for UNINTERESTING commits between the
haves and the boundary.
2. For each commit along the boundary, add it as a fill-in traversal
tip (where the traversal terminates once an existing bitmap is
found), and perform fill-in traversal.
3. Build up a complete bitmap of the wants side as usual, stopping any
time we intersect the (partial) haves side.
4. Return the results.
And is more-or-less equivalent to using the *old* algorithm with this
invocation:
$ git rev-list --objects --use-bitmap-index $WANTS --not \
$(git rev-list --objects --boundary $WANTS --not $HAVES |
perl -lne 'print $1 if /^-(.*)/')
The new result performs significantly better in many cases, particularly
when the distance from the boundary commit(s) to an existing bitmap is
shorter than the distance from (all of) the have tips to the nearest
bitmapped commit.
Note that when using the old bitmap traversal algorithm, the results can
be *slower* than without bitmaps! Under the new algorithm, the result is
computed faster with bitmaps than without (at the cost of over-counting
the true number of objects in a similar fashion as the non-bitmap
traversal):
# (Computing the number of tagged objects not on any branches
# without bitmaps).
$ time git rev-list --count --objects --tags --not --branches
20
real 0m1.388s
user 0m1.092s
sys 0m0.296s
# (Computing the same query using the old bitmap traversal).
$ time git rev-list --count --objects --tags --not --branches --use-bitmap-index
19
real 0m22.709s
user 0m21.628s
sys 0m1.076s
# (this commit)
$ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index
19
real 0m1.518s
user 0m1.234s
sys 0m0.284s
The new algorithm is still slower than not using bitmaps at all, but it
is nearly a 15-fold improvement over the existing traversal.
In a more realistic setting (using my local copy of git.git), I can
observe a similar (if more modest) speed-up:
$ argv="--count --objects --branches --not --tags"
hyperfine \
-n 'no bitmaps' "git.compile rev-list $argv" \
-n 'existing traversal' "git.compile rev-list --use-bitmap-index $argv" \
-n 'boundary traversal' "git.compile -c pack.useBitmapBoundaryTraversal=true rev-list --use-bitmap-index $argv"
Benchmark 1: no bitmaps
Time (mean ± σ): 124.6 ms ± 2.1 ms [User: 103.7 ms, System: 20.8 ms]
Range (min … max): 122.6 ms … 133.1 ms 22 runs
Benchmark 2: existing traversal
Time (mean ± σ): 368.6 ms ± 3.0 ms [User: 325.3 ms, System: 43.1 ms]
Range (min … max): 365.1 ms … 374.8 ms 10 runs
Benchmark 3: boundary traversal
Time (mean ± σ): 167.6 ms ± 0.9 ms [User: 139.5 ms, System: 27.9 ms]
Range (min … max): 166.1 ms … 169.2 ms 17 runs
Summary
'no bitmaps' ran
1.34 ± 0.02 times faster than 'boundary traversal'
2.96 ± 0.05 times faster than 'existing traversal'
Here, the new algorithm is also still slower than not using bitmaps, but
represents a more than 2-fold improvement over the existing traversal in
a more modest example.
Since this algorithm was originally written (nearly a year and a half
ago, at the time of writing), the bitmap lookup table shipped, making
the new algorithm's result more competitive. A few other future
directions for improving bitmap traversal times beyond not using bitmaps
at all:
- Decrease the cost to decompress and OR together many bitmaps
together (particularly when enumerating the uninteresting side of
the walk). Here we could explore more efficient bitmap storage
techniques, like Roaring+Run and/or use SIMD instructions to speed
up ORing them together.
- Store pseudo-merge bitmaps, which could allow us to OR together
fewer "summary" bitmaps (which would also help with the above).
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When "gc" needs to retain unreachable objects, packing them into
cruft packs (instead of exploding them into loose object files) has
been offered as a more efficient option for some time. Now the use
of cruft packs has been made the default and no longer considered
an experimental feature.
* tb/enable-cruft-packs-by-default:
repository.h: drop unused `gc_cruft_packs`
builtin/gc.c: make `gc.cruftPacks` enabled by default
t/t9300-fast-import.sh: prepare for `gc --cruft` by default
t/t6500-gc.sh: add additional test cases
t/t6500-gc.sh: refactor cruft pack tests
t/t6501-freshen-objects.sh: prepare for `gc --cruft` by default
t/t5304-prune.sh: prepare for `gc --cruft` by default
builtin/gc.c: ignore cruft packs with `--keep-largest-pack`
builtin/repack.c: fix incorrect reference to '-C'
pack-write.c: plug a leak in stage_tmp_packfiles()
As of the previous commit, all callers that need to read the value of
`gc.cruftPacks` do so outside without using the `repo_settings` struct,
making its `gc_cruft_packs` unused. Drop it accordingly.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since 1615c567b8 (Documentation/config/pack.txt: advertise
'pack.writeReverseIndex', 2021-01-25), we have had the
`pack.writeReverseIndex` configuration option, which tells Git whether
or not it is allowed to write a ".rev" file when indexing a pack.
Introduce a complementary configuration knob, `pack.readReverseIndex` to
control whether or not Git will read any ".rev" file(s) that may be
available on disk.
This option is useful for debugging, as well as disabling the effect of
".rev" files in certain instances.
This is useful because of the trade-off[^1] between the time it takes to
generate a reverse index (slow from scratch, fast when reading an
existing ".rev" file), and the time it takes to access a record (the
opposite).
For example, even though it is faster to use the on-disk reverse index
when computing the on-disk size of a packed object, it is slower to
enumerate the same value for all objects.
Here are a couple of examples from linux.git. When computing the above
for a single object, using the on-disk reverse index is significantly
faster:
$ git rev-parse HEAD >in
$ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in'
Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in
Time (mean ± σ): 302.5 ms ± 12.5 ms [User: 258.7 ms, System: 43.6 ms]
Range (min … max): 291.1 ms … 328.1 ms 10 runs
Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in
Time (mean ± σ): 3.9 ms ± 0.3 ms [User: 1.6 ms, System: 2.4 ms]
Range (min … max): 2.0 ms … 4.4 ms 801 runs
Summary
'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran
77.29 ± 7.14 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in'
, but when instead trying to compute the on-disk object size for all
objects in the repository, using the ".rev" file is a disadvantage over
creating the reverse index from scratch:
$ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'
Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
Time (mean ± σ): 8.258 s ± 0.035 s [User: 7.949 s, System: 0.308 s]
Range (min … max): 8.199 s … 8.293 s 10 runs
Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
Time (mean ± σ): 16.976 s ± 0.107 s [User: 16.706 s, System: 0.268 s]
Range (min … max): 16.839 s … 17.105 s 10 runs
Summary
'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' ran
2.06 ± 0.02 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'
Luckily, the results when running `git cat-file` with `--unordered` are
closer together:
$ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'
Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
Time (mean ± σ): 5.066 s ± 0.105 s [User: 4.792 s, System: 0.274 s]
Range (min … max): 4.943 s … 5.220 s 10 runs
Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
Time (mean ± σ): 6.193 s ± 0.069 s [User: 5.937 s, System: 0.255 s]
Range (min … max): 6.145 s … 6.356 s 10 runs
Summary
'git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' ran
1.22 ± 0.03 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'
Because the equilibrium point between these two is highly machine- and
repository-dependent, allow users to configure whether or not they will
read any ".rev" file(s) with this configuration knob.
[^1]: Generating a reverse index in memory takes O(N) time (where N is
the number of objects in the repository), since we use a radix sort.
Reading an entry from an on-disk ".rev" file is slower since each
operation is bound by disk I/O instead of memory I/O.
In order to compute the on-disk size of a packed object, we need to
find the offset of our object, and the adjacent object (the on-disk
size difference of these two). Finding the first offset requires a
binary search. Finding the latter involves a single .rev lookup.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We had several C files include cache.h unnecessarily. Replace those
with an include of "git-compat-util.h" instead. Much like the previous
commit, these have all been verified via both ensuring that
gcc -E $SOURCE_FILE | grep '"cache.h"'
found no hits and that
make DEVELOPER=1 ${OBJECT_FILE_FOR_SOURCE_FILE}
successfully compiles without warnings.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The recent addition of the index.skipHash config option allows index
writes to speed up by skipping the hash computation for the trailing
checksum. This is particularly critical for repositories with many files
at HEAD, so add this config option to two cases where users in that
scenario may opt-in to such behavior:
1. The feature.manyFiles config option enables some options that are
helpful for repositories with many files at HEAD.
2. 'scalar register' and 'scalar reconfigure' set config options that
optimize for large repositories.
In both of these cases, set index.skipHash=true to gain this
speedup. Add tests that demonstrate the proper way that
index.skipHash=true can override feature.manyFiles=true.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Enable gc.cruftpacks by default for those who opt into
feature.experimental setting.
* es/mark-gc-cruft-as-experimental:
config: let feature.experimental imply gc.cruftPacks=true
gc: add tests for --cruft and friends
We are interested in exploring whether gc.cruftPacks=true should become
the default value.
To determine whether it is safe to do so, let's encourage more users to
try it out.
Users who have set feature.experimental=true have already volunteered to
try new and possibly-breaking config changes, so let's try this new
default with that set of users.
Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code that manages list-object-filter structure, used in partial
clones, leaked the instances, which has been plugged.
* jk/plug-list-object-filter-leaks:
prepare_repo_settings(): plug leak of config values
list_objects_filter_options: plug leak of filter_spec strings
transport: free filter options in disconnect_git()
transport: deep-copy object-filter struct for fetch-pack
list_objects_filter_copy(): deep-copy sparse_oid_name field
We call repo_config_get_string() for fetch.negotiationAlgorithm, which
allocates a copy of the string, but we never free it.
We could add a call to free(), but there's an even simpler solution: we
don't need the string to persist beyond a few strcasecmp() calls, so we
can instead use the "_tmp" variant which gives us a const pointer to the
cached value.
We need to switch the type of "strval" to "const char *" for this to
work, which affects a similar call that checks core.untrackedCache. But
it's in the same boat! It doesn't actually need the value to persist
beyond a maybe_bool() check (though it does remember to correctly free
the string afterwards). So we can simplify it at the same time.
Note that this core.untrackedCache check arguably should be using
repo_config_get_maybe_bool(), but there are some subtle behavior
changes. E.g., it doesn't currently allow a value-less "true". Arguably
it should, but let's avoid lumping further changes in what should be a
simple leak cleanup.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The parse_commit_graph() function takes a 'struct repository *' pointer,
but it only ever accesses config settings (either directly or through
the .settings field of the repo struct). Move all relevant config
settings into the repo_settings struct, and update parse_commit_graph()
and its existing callers so that it takes 'struct repo_settings *'
instead.
Callers of parse_commit_graph() will now need to call
prepare_repo_settings() themselves, or initialize a 'struct
repo_settings' directly.
Prior to ab14d0676c (commit-graph: pass a 'struct repository *' in more
places, 2020-09-09), parsing a commit-graph was a pure function
depending only on the contents of the commit-graph itself. Commit
ab14d0676c introduced a dependency on a `struct repository` pointer, and
later commits such as b66d84756f (commit-graph: respect
'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config
settings, which were accessed through the `settings` field of the
repository pointer. This field was initialized via a call to
`prepare_repo_settings()`.
Additionally, this fixes an issue in fuzz-commit-graph: In 44c7e62
(2021-12-06, repo-settings:prepare_repo_settings only in git repos),
prepare_repo_settings was changed to issue a BUG() if it is called by a
process whose CWD is not a Git repository.
The combination of commits mentioned above broke fuzz-commit-graph,
which attempts to parse arbitrary fuzzing-engine-provided bytes as a
commit graph file. Prior to this change, parse_commit_graph() called
prepare_repo_settings(), but since we run the fuzz tests without a valid
repository, we are hitting the BUG() from 44c7e62 for every test case.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Stub in empty filesystem listener backend for fsmonitor--daemon on Windows.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Interaction between fetch.negotiationAlgorithm and
feature.experimental configuration variables has been corrected.
* en/fetch-negotiation-default-fix:
repo-settings: rename the traditional default fetch.negotiationAlgorithm
repo-settings: fix error handling for unknown values
repo-settings: fix checking for fetch.negotiationAlgorithm=default
Give the traditional default fetch.negotiationAlgorithm the name
'consecutive'. Also allow a choice of 'default' to have Git decide
between the choices (currently, picking 'skipping' if
feature.experimental is true and 'consecutive' otherwise). Update the
documentation accordingly.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In commit af3a67de01 ("negotiator: unknown fetch.negotiationAlgorithm
should error out", 2018-08-01), error handling for an unknown
fetch.negotiationAlgorithm was added with the code die()ing. This was
also added to the documentation for the fetch.negotiationAlgorithm
option, to make it explicit that the code would die on unknown values.
This behavior was lost with commit aaf633c2ad ("repo-settings: create
feature.experimental setting", 2019-08-13). Restore it so that the
behavior again matches the documentation.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In commit 3050b6dfc75d (repo-settings.c: simplify the setup,
2021-09-21), the branch for handling fetch.negotiationAlgorithm=default
was deleted. Since this value is documented in
Documentation/config/fetch.txt, restore the check for this value.
Note that this change caused an observable bug: if someone sets
feature.experimental=true in config, and then passes "-c
fetch.negotiationAlgorithm=default" on the command line in an attempt to
override the config, then the override is ignored. Fix the bug by not
ignoring the value of "default".
Technically, before commit 3050b6dfc75d, repo-settings would treat any
fetch.negotiationAlgorithm value other than "skipping" or "noop" as a
request for "default", but I think it probably makes more sense to
ignore such broken requests and leave fetch.negotiationAlgorithm with
the default value rather than the value of "default". (If that sounds
confusing, note that "default" is usually the default value, but when
feature.experimental=true, "skipping" is the default value.)
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Check whether git directory exists before adding any repo settings. If it
does not exist, BUG with the message that one cannot add settings for an
uninitialized repository. If it does exist, proceed with adding repo
settings.
Signed-off-by: Lessley Dennington <lessleydennington@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change the bitfield added in 58300f47432 (sparse-index: add
index.sparse config option, 2021-03-30) and 3964fc2aae7 (sparse-index:
add guard to ensure full index, 2021-03-30) to just use an "int"
boolean instead.
It might be smart to optimize the space here in the future, but by
consistently using an "int" we can take its address and pass it to
repo_cfg_bool(), and therefore don't need to handle "sparse_index" as
a special-case when reading the "index.sparse" setting.
There's no corresponding config for "command_requires_full_index", but
let's change it too for consistency and to prevent future bugs
creeping in due to one of these being "unsigned".
Using "int" consistently also prevents subtle bugs or undesired
control flow creeping in here. Before the preceding commit the
initialization of "command_requires_full_index" in
prepare_repo_settings() did nothing, i.e. this:
r->settings.command_requires_full_index = 1
Was redundant to the earlier memset() to -1. Likewise for
"sparse_index" added in 58300f47432 (sparse-index: add index.sparse
config option, 2021-03-30) the code and comment added there was
misleading, we weren't initializing it to off, but re-initializing it
from "1" to "0", and then finally checking the config, and perhaps
setting it to "1" again. I.e. we could have applied this patch before
the preceding commit:
+ assert(r->settings.command_requires_full_index == 1);
r->settings.command_requires_full_index = 1;
/*
* Initialize this as off.
*/
+ assert(r->settings.sparse_index == 1);
r->settings.sparse_index = 0;
if (!repo_config_get_bool(r, "index.sparse", &value) && value)
r->settings.sparse_index = 1;
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Simplify the setup code in repo-settings.c in various ways, making the
code shorter, easier to read, and requiring fewer hacks to do the same
thing as it did before:
Since 7211b9e7534 (repo-settings: consolidate some config settings,
2019-08-13) we have memset() the whole "settings" structure to -1 in
prepare_repo_settings(), and subsequently relied on the -1 value.
Most of the fields did not need to be initialized to -1, and because
we were doing that we had the enum labels "UNTRACKED_CACHE_UNSET" and
"FETCH_NEGOTIATION_UNSET" purely to reflect the resulting state
created this memset() in prepare_repo_settings(). No other code used
or relied on them, more on that below.
For the rest most of the subsequent "are we -1, then read xyz" can
simply be removed by re-arranging what we read first. E.g. when
setting the "index.version" setting we should have first read
"feature.experimental", so that it (and "feature.manyfiles") can
provide a default for our "index.version".
Instead the code setting it, added when "feature.manyFiles"[1] was
created, was using the UPDATE_DEFAULT_BOOL() macro added in an earlier
commit[2]. That macro is now gone, since it was only needed for this
pattern of reading things in the wrong order.
This also fixes an (admittedly obscure) logic error where we'd
conflate an explicit "-1" value in the config with our own earlier
memset() -1.
We can also remove the UPDATE_DEFAULT_BOOL() wrapper added in
[3]. Using it is redundant to simply using the return value from
repo_config_get_bool(), which is non-zero if the provided key exists
in the config.
Details on edge cases relating to the memset() to -1, continued from
"more on that below" above:
* UNTRACKED_CACHE_KEEP:
In [4] the "unset" and "keep" handling for core.untrackedCache was
consolidated. But it while we understand the "keep" value, we don't
handle it differently than the case of any other unknown value.
So let's retain UNTRACKED_CACHE_KEEP and remove the
UNTRACKED_CACHE_UNSET setting (which was always implicitly
UNTRACKED_CACHE_KEEP before). We don't need to inform any code
after prepare_repo_settings() that the setting was "unset", as far
as anyone else is concerned it's core.untrackedCache=keep. if
"core.untrackedcache" isn't present in the config.
* FETCH_NEGOTIATION_UNSET & FETCH_NEGOTIATION_NONE:
Since these two two enum fields added in [5] don't rely on the
memzero() setting them to "-1" anymore we don't have to provide
them with explicit values.
1. c6cc4c5afd2 (repo-settings: create feature.manyFiles setting,
2019-08-13)
2. 31b1de6a09b (commit-graph: turn on commit-graph by default,
2019-08-13)
3. 31b1de6a09b (commit-graph: turn on commit-graph by default,
2019-08-13)
4. ad0fb659993 (repo-settings: parse core.untrackedCache,
2019-08-13)
5. aaf633c2ad1 (repo-settings: create feature.experimental setting,
2019-08-13)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead of the global ignore_untracked_cache_config variable added in
dae6c322fa1 (test-dump-untracked-cache: don't modify the untracked
cache, 2016-01-27) we can make use of the new facility to set config
via environment variables added in d8d77153eaf (config: allow
specifying config entries via envvar pairs, 2021-01-12).
It's arguably a bit hacky to use setenv() and getenv() to pass
messages between the same program, but since the test helpers are not
the main intended audience of repo-settings.c I think it's better than
hardcoding the test-only special-case in prepare_repo_settings().
This uses the xsetenv() wrapper added in the preceding commit, if we
don't set these in the environment we'll fail in
t7063-status-untracked-cache.sh, but let's fail earlier anyway if that
were to happen.
This breaks any parent process that's potentially using the
GIT_CONFIG_* and GIT_CONFIG_PARAMETERS mechanism to pass one-shot
config setting down to a git subprocess, but in this case we don't
care about the general case of such potential parents. This process
neither spawns other "git" processes, nor is it interested in other
configuration. We might want to pick up other test modes here, but
those will be passed via GIT_TEST_* environment variables.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When enabled, this config option signals that index writes should
attempt to use sparse-directory entries.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.
This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.
Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.
If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git commit-graph write" learned to limit the number of bloom
filters that are computed from scratch with the --max-new-filters
option.
* tb/bloom-improvements:
commit-graph: introduce 'commitGraph.maxNewFilters'
builtin/commit-graph.c: introduce '--max-new-filters=<n>'
commit-graph: rename 'split_commit_graph_opts'
bloom: encode out-of-bounds filters as non-empty
bloom/diff: properly short-circuit on max_changes
bloom: use provided 'struct bloom_filter_settings'
bloom: split 'get_bloom_filter()' in two
commit-graph.c: store maximum changed paths
commit-graph: respect 'commitGraph.readChangedPaths'
t/helper/test-read-graph.c: prepare repo settings
commit-graph: pass a 'struct repository *' in more places
t4216: use an '&&'-chain
commit-graph: introduce 'get_bloom_filter_settings()'
The core.multiPackIndex setting has been around since c4d25228ebb
(config: create core.multiPackIndex setting, 2018-07-12), but has been
disabled by default. If a user wishes to use the multi-pack-index
feature, then they must enable this config and run 'git multi-pack-index
write'.
The multi-pack-index feature is relatively stable now, so make the
config option true by default. For users that do not use a
multi-pack-index, the only extra cost will be a file lookup to see if a
multi-pack-index file exists (once per process, per object directory).
Also, this config option will be referenced by an upcoming
"incremental-repack" task in the maintenance builtin, so move the config
option into the repository settings struct. Note that if
GIT_TEST_MULTI_PACK_INDEX=1, then we want to ignore the config option
and treat core.multiPackIndex as enabled.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.
Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.
Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:
- 'core.commitGraph', to allow flexibility in using all parts of a
commit-graph _except_ for its Bloom filters.
- The '--changed-paths' option for 'git commit-graph write', to allow
reading and writing Bloom filters to be controlled independently.
When the variable is set, pretend as if no Bloom data was specified at
all. This avoids adding additional special-casing outside of the
commit-graph internals.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a noop fetch negotiator. This is introduced to allow partial clones
to skip the unneeded negotiation step when fetching missing objects
using a "git fetch" subprocess. (The implementation of spawning a "git
fetch" subprocess will be done in a subsequent patch.) But this can also
be useful for end users, e.g. as a blunt fix for object corruption.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The fetch.writeCommitGraph feature makes fetches write out a commit
graph file for the newly downloaded pack on fetch. This improves the
performance of various commands that would perform a revision walk and
eventually ought to be the default for everyone. To prepare for that
future, it's enabled by default for users that set
feature.experimental=true to experience such future defaults.
Alas, for --unshallow fetches from a shallow clone it runs into a
snag: by the time Git has fetched the new objects and is writing a
commit graph, it has performed a revision walk and r->parsed_objects
contains information about the shallow boundary from *before* the
fetch. The commit graph writing code is careful to avoid writing a
commit graph file in shallow repositories, but the new state is not
shallow, and the result is that from that point on, commands like "git
log" make use of a newly written commit graph file representing a
fictional history with the old shallow boundary.
We could fix this by making the commit graph writing code more careful
to avoid writing a commit graph that could have used any grafts or
shallow state, but it is possible that there are other pieces of
mutated state that fetch's commit graph writing code may be relying
on. So disable it in the feature.experimental configuration.
Google developers have been running in this configuration (by setting
fetch.writeCommitGraph=false in the system config) to work around this
bug since it was discovered in April. Once the fix lands, we'll
enable fetch.writeCommitGraph=true again to give it some early testing
before rolling out to a wider audience.
In other words:
- this patch only affects behavior with feature.experimental=true
- it makes feature.experimental match the configuration Google has
been using for the last few months, meaning it would leave users in
a better tested state than without it
- this should improve testing for other features guarded by
feature.experimental, by making feature.experimental safer to use
Reported-by: Jay Conrod <jayconrod@google.com>
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The pack.useSparse config option was introduced by 3d036eb0
(pack-objects: create pack.useSparse setting, 2019-01-19) and was
first available in v2.21.0. When enabled, the pack-objects process
during 'git push' will use a sparse tree walk when deciding which
trees and blobs to send to the remote. The algorithm was introduced
by d5d2e93 (revision: implement sparse algorithm, 2019-01-16) and
has been in production use by VFS for Git since around that time.
The features.experimental config option also enabled pack.useSparse,
so hopefully that has also increased exposure.
It is worth noting that pack.useSparse has a possibility of
sending more objects across a push, but requires a special
arrangement of exact _copies_ across directories. There is a test
in t5322-pack-objects-sparse.sh that demonstrates this possibility.
This test uses the --sparse option to "git pack-objects" but we
can make it implied by the config value to demonstrate that the
default value has changed.
While updating that test, I noticed that the documentation did not
include an option for --no-sparse, which is now more important than
it was before.
Since the downside is unlikely but the upside is significant, set
the default value of pack.useSparse to true. Remove it from the
set of options implied by features.experimental.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The codepath that reads the index.version configuration was broken
with a recent update, which has been corrected.
* ds/feature-macros:
repo-settings: read an int for index.version
Several config options were combined into a repo_settings struct in
ds/feature-macros, including a move of the "index.version" config
setting in 7211b9e (repo-settings: consolidate some config settings,
2019-08-13).
Unfortunately, that file looked like a lot of boilerplate and what is
clearly a factor of copy-paste overload, the config setting is parsed
with repo_config_ge_bool() instead of repo_config_get_int(). This means
that a setting "index.version=4" would not register correctly and would
revert to the default version of 3.
I caught this while incorporating v2.24.0-rc0 into the VFS for Git
codebase, where we really care that the index is in version 4.
This was not caught by the codebase because the version checks placed
in t1600-index.sh did not test the "basic" scenario enough. Here, we
modify the test to include these normal settings to not be overridden by
features.manyFiles or GIT_INDEX_VERSION. While the "default" version is
3, this is demoted to version 2 in do_write_index() when not necessary.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph feature is now on by default, and is being
written during 'git gc' by default. Typically, Git only writes
a commit-graph when a 'git gc --auto' command passes the gc.auto
setting to actualy do work. This means that a commit-graph will
typically fall behind the commits that are being used every day.
To stay updated with the latest commits, add a step to 'git
fetch' to write a commit-graph after fetching new objects. The
fetch.writeCommitGraph config setting enables writing a split
commit-graph, so on average the cost of writing this file is
very small. Occasionally, the commit-graph chain will collapse
to a single level, and this could be slow for very large repos.
For additional use, adjust the default to be true when
feature.experimental is enabled.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'feature.experimental' setting includes config options that are
not committed to become defaults, but could use additional testing.
Update the following config settings to take new defaults, and to
use the repo_settings struct if not already using them:
* 'pack.useSparse=true'
* 'fetch.negotiationAlgorithm=skipping'
In the case of fetch.negotiationAlgorithm, the existing logic
would load the config option only when about to use the setting,
so had a die() statement on an unknown string value. This is
removed as now the config is parsed under prepare_repo_settings().
In general, this die() is probably misplaced and not valuable.
A test was removed that checked this die() statement executed.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The feature.manyFiles setting is suitable for repos with many
files in the working directory. By setting index.version=4 and
core.untrackedCache=true, commands such as 'git status' should
improve.
While adding this setting, modify the index version precedence
tests to check how this setting overrides the default for
index.version is unset.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The core.untrackedCache config setting is slightly complicated,
so clarify its use and centralize its parsing into the repo
settings.
The default value is "keep" (returned as -1), which persists the
untracked cache if it exists.
If the value is set as "false" (returned as 0), then remove the
untracked cache if it exists.
If the value is set as "true" (returned as 1), then write the
untracked cache and persist it.
Instead of relying on magic values of -1, 0, and 1, split these
options into an enum. This allows the use of "-1" as a
default value. After parsing the config options, if the value is
unset we can initialize it to UNTRACKED_CACHE_KEEP.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph feature has seen a lot of activity in the past
year or so since it was introduced. The feature is a critical
performance enhancement for medium- to large-sized repos, and
does not significantly hurt small repos.
Change the defaults for core.commitGraph and gc.writeCommitGraph
to true so users benefit from this feature by default.
There are several places in the test suite where the environment
variable GIT_TEST_COMMIT_GRAPH is disabled to avoid reading a
commit-graph, if it exists. The config option overrides the
environment, so swap these. Some GIT_TEST_COMMIT_GRAPH assignments
remain, and those are to avoid writing a commit-graph when a new
commit is created.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are a few important config settings that are not loaded
during git_default_config. These are instead loaded on-demand.
Centralize these config options to a single scan, and store
all of the values in a repo_settings struct. The values for
each setting are initialized as negative to indicate "unset".
This centralization will be particularly important in a later
change to introduce "meta" config settings that change the
defaults for these config settings.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>