1
0
Fork 0
mirror of https://github.com/git/git.git synced 2024-05-18 19:26:07 +02:00
Commit Graph

891 Commits

Author SHA1 Message Date
Miklos Vajna 96697781e0 log: "--since-as-filter" option is a non-terminating "--since" variant
The "--since=<time>" option of "git log" limits the commits displayed by
the command by stopping the traversal once it sees a commit whose
timestamp is older than the given time and not digging further into its
parents.

This is OK in a history where a commit always has a newer timestamp than
any of its parents'.  Once you see a commit older than the given <time>,
all ancestor commits of it are even older than the time anyway.  It
poses, however, a problem when there is a commit with a wrong timestamp
that makes it appear older than its parents.  Stopping traversal at the
"incorrectly old" commit will hide its ancestors that are newer than
that wrong commit and are newer than the cut-off time given with the
--since option.  --max-age and --after being the synonyms to --since,
they share the same issue.

Add a new "--since-as-filter" option that is a variant of
"--since=<time>".  Instead of stopping the traversal to hide an old
enough commit and its all ancestors, exclude commits with an old
timestamp from the output but still keep digging the history.

Without other traversal stopping options, this will force the command in
"git log" family to dig down the history to the root.  It may be an
acceptable cost for a small project with short history and many commits
with screwy timestamps.

It is quite unlikely for us to add traversal stopper other than since,
so have this as a --since-as-filter option, rather than a separate
--as-filter, that would be probably more confusing.

Signed-off-by: Miklos Vajna <vmiklos@vmiklos.hu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-23 09:36:07 -07:00
Derrick Stolee cc91044256 list-objects-filter: remove CL_ARG__FILTER
We have established the command-line interface for the --[no-]filter
options for a while now, so we do not need a helper to make this
editable in the future.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-23 13:13:17 -07:00
Junio C Hamano 7391ecd338 Merge branch 'ds/partial-bundles'
Bundle file format gets extended to allow a partial bundle,
filtered by similar criteria you would give when making a
partial/lazy clone.

* ds/partial-bundles:
  clone: fail gracefully when cloning filtered bundle
  bundle: unbundle promisor packs
  bundle: create filtered bundles
  rev-list: move --filter parsing into revision.c
  bundle: parse filter capability
  list-objects: handle NULL function pointers
  MyFirstObjectWalk: update recommended usage
  list-objects: consolidate traverse_commit_list[_filtered]
  pack-bitmap: drop filter in prepare_bitmap_walk()
  pack-objects: use rev.filter when possible
  revision: put object filter into struct rev_info
  list-objects-filter-options: create copy helper
  index-pack: document and test the --promisor option
2022-03-21 15:14:24 -07:00
Derrick Stolee c4ea513f4a rev-list: move --filter parsing into revision.c
Now that 'struct rev_info' has a 'filter' member and most consumers of
object filtering are using that member instead of an external struct,
move the parsing of the '--filter' option out of builtin/rev-list.c and
into revision.c.

This use within handle_revision_pseudo_opt() allows us to find the
option within setup_revisions() if the arguments are passed directly. In
the case of a command such as 'git blame', the arguments are first
scanned and checked with parse_revision_opt(), which complains about the
option, so 'git blame --filter=blob:none <file>' does not become valid
with this change.

Some commands, such as 'git diff' gain this option without having it
make an effect. And 'git diff --objects' was already possible, but does
not actually make sense in that builtin.

The key addition that is coming is 'git bundle create --filter=<X>' so
we can create bundles containing promisor packs. More work is required
to make them fully functional, but that will follow.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-09 10:25:27 -08:00
Junio C Hamano 5b84280c65 Merge branch 'ab/grep-patterntype'
Some code clean-up in the "git grep" machinery.

* ab/grep-patterntype:
  grep: simplify config parsing and option parsing
  grep.c: do "if (bool && memchr())" not "if (memchr() && bool)"
  grep.h: make "grep_opt.pattern_type_option" use its enum
  grep API: call grep_config() after grep_init()
  grep.c: don't pass along NULL callback value
  built-ins: trust the "prefix" from run_builtin()
  grep tests: add missing "grep.patternType" config tests
  grep tests: create a helper function for "BRE" or "ERE"
  log tests: check if grep_config() is called by "log"-like cmds
  grep.h: remove unused "regex_t regexp" from grep_opt
2022-02-25 15:47:36 -08:00
Junio C Hamano 8813596531 Merge branch 'ah/log-no-graph'
"git log --graph --graph" used to leak a graph structure, and there
was no way to countermand "--graph" that appear earlier on the
command line.  A "--no-graph" option has been added and resource
leakage has been plugged.

* ah/log-no-graph:
  log: add a --no-graph option
  log: fix memory leak if --graph is passed multiple times
2022-02-23 16:58:03 -08:00
Ævar Arnfjörð Bjarmason 04bf052eef grep: simplify config parsing and option parsing
Simplify the parsing of "grep.patternType" and
"grep.extendedRegexp". This changes no behavior, but gets rid of
complex parsing logic that isn't needed anymore.

When "grep.patternType" was introduced in 84befcd0a4 (grep: add a
grep.patternType configuration setting, 2012-08-03) we promised that:

 1. You can set "grep.patternType", and "[setting it to] 'default'
    will return to the default matching behavior".

    In that context "the default" meant whatever the configuration
    system specified before that change, i.e. via grep.extendedRegexp.

 2. We'd support the existing "grep.extendedRegexp" option, but ignore
    it when the new "grep.patternType" option is set. We said we'd
    only ignore the older "grep.extendedRegexp" option "when the
    `grep.patternType` option is set to a value other than
    'default'".

In a preceding commit we changed grep_config() to be called after
grep_init(), which means that much of the complexity here can go
away.

As before both "grep.patternType" and "grep.extendedRegexp" are
last-one-wins variable, with "grep.extendedRegexp" yielding to
"grep.patternType", except when "grep.patternType=default".

Note that as the previously added tests indicate this cannot be done
on-the-fly as we see the config variables, without introducing more
state keeping. I.e. if we see:

    -c grep.extendedRegexp=false
    -c grep.patternType=default
    -c extendedRegexp=true

We need to select ERE, since grep.patternType=default unselects that
variable, which normally has higher precedence, but we also need to
select BRE in cases of:

    -c grep.extendedRegexp=true \
    -c grep.extendedRegexp=false

Which would not be the case for this, which select ERE:

    -c grep.patternType=extended \
    -c grep.extendedRegexp=false

Therefore we cannot do this on-the-fly in grep_config without also
introducing tracking variables for not only the pattern type, but what
the source of that pattern type was.

So we need to decide on the pattern after our config was fully
parsed. Let's do that by deferring the decision on the pattern type
until it's time to compile it in compile_regexp().

By that time we've not only parsed the config, but also handled the
command-line options. Those will set "opt.pattern_type_option" (*not*
"opt.extended_regexp_option"!).

At that point all we need to do is see if "grep.patternType" was
UNSPECIFIED in the end (including an explicit "=default"), if so we'll
use the "grep.extendedRegexp" configuration, if any.

See my 07a3d41173 (grep: remove regflags from the public grep_opt
API, 2017-06-29) for addition of the two comments being removed here,
i.e. the complexity noted in that commit is now going away.

1. https://lore.kernel.org/git/patch-v8-09.10-c211bb0c69d-20220118T155211Z-avarab@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-15 18:00:50 -08:00
Ævar Arnfjörð Bjarmason 9725c8dda2 built-ins: trust the "prefix" from run_builtin()
Change code in "builtin/grep.c" and "builtin/ls-tree.c" to trust the
"prefix" passed from "run_builtin()". The "prefix" we get from setup.c
is either going to be NULL or a string of length >0, never "".

So we can drop the "prefix && *prefix" checks added for
"builtin/grep.c" in 0d042fecf2 (git-grep: show pathnames relative to
the current directory, 2006-08-11), and for "builtin/ls-tree.c" in
a69dd585fc (ls-tree: chomp leading directories when run from a
subdirectory, 2005-12-23).

As seen in code in revision.c that was added in cd676a5136 (diff
--relative: output paths as relative to the current subdirectory,
2008-02-12) we already have existing code that does away with this
assertion.

This makes it easier to reason about a subsequent change to the
"prefix_length" code in grep.c in a subsequent commit, and since we're
going to the trouble of doing that let's leave behind an assert() to
promise this to any future callers.

For "builtin/grep.c" it would be painful to pass the "prefix" down the
callchain of:

    cmd_grep -> grep_tree -> grep_submodule -> grep_cache -> grep_oid ->
    grep_source_name

So for the code that needs it in grep_source_name() let's add a
"grep_prefix" variable similar to the existing "ls_tree_prefix".

While at it let's move the code in cmd_ls_tree() around so that we
assign to the "ls_tree_prefix" right after declaring the variables,
and stop assigning to "prefix". We only subsequently used that
variable later in the function after clobbering it. Let's just use our
own "grep_prefix" instead.

Let's also add an assert() in git.c, so that we'll make this promise
about the "prefix" to any current and future callers, as well as to
any readers of the code.

Code history:

 * The strlen() in "grep.c" hasn't been used since 493b7a08d8 (grep:
   accept relative paths outside current working directory, 2009-09-05).

   When that code was added in 0d042fecf2 (git-grep: show pathnames
   relative to the current directory, 2006-08-11) we used the length.

   But since 493b7a08d8 we haven't used it for anything except a
   boolean check that we could have done on the "prefix" member
   itself.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-15 18:00:50 -08:00
Alex Henrie 087c745833 log: add a --no-graph option
It's useful to be able to countermand a previous --graph option, for
example if `git log --graph` is run via an alias.

Signed-off-by: Alex Henrie <alexhenrie24@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-11 10:06:41 -08:00
Alex Henrie dccf6c16f1 log: fix memory leak if --graph is passed multiple times
Signed-off-by: Alex Henrie <alexhenrie24@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-11 10:06:40 -08:00
Jerry Zhang 9d505b7b49 git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.

In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.

   -A-----E-F-G--main
     \   / /
      B-C-D--topic

In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.

Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.

Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 11:08:42 -08:00
Junio C Hamano c17de5a505 Merge branch 'ja/i18n-similar-messages'
Similar message templates have been consolidated so that
translators need to work on fewer number of messages.

* ja/i18n-similar-messages:
  i18n: turn even more messages into "cannot be used together" ones
  i18n: ref-filter: factorize "%(foo) atom used without %(bar) atom"
  i18n: factorize "--foo outside a repository"
  i18n: refactor "unrecognized %(foo) argument" strings
  i18n: factorize "no directory given for --foo"
  i18n: factorize "--foo requires --bar" and the like
  i18n: tag.c factorize i18n strings
  i18n: standardize "cannot open" and "cannot read"
  i18n: turn "options are incompatible" into "cannot be used together"
  i18n: refactor "%s, %s and %s are mutually exclusive"
  i18n: refactor "foo and bar are mutually exclusive"
2022-01-10 11:52:56 -08:00
Junio C Hamano 2043ce828e Merge branch 'rs/log-invert-grep-with-headers'
"git log --invert-grep --author=<name>" used to exclude commits
written by the given author, but now "--invert-grep" only affects
the matches made by the "--grep=<pattern>" option.

* rs/log-invert-grep-with-headers:
  log: let --invert-grep only invert --grep
2022-01-05 14:01:30 -08:00
Jean-Noël Avila 6fa00ee843 i18n: factorize "--foo requires --bar" and the like
They are all replaced by "the option '%s' requires '%s'", which is a
new string but replaces 17 previous unique strings.

Signed-off-by: Jean-Noël Avila <jn.avila@free.fr>
Reviewed-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-05 13:31:00 -08:00
Jean-Noël Avila 12909b6b8a i18n: turn "options are incompatible" into "cannot be used together"
Signed-off-by: Jean-Noël Avila <jn.avila@free.fr>
Reviewed-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-05 13:29:23 -08:00
Junio C Hamano 5a4069a1d8 Merge branch 'jc/c99-var-decl-in-for-loop'
Weather balloon to find compilers that do not grok variable
declaration in the for() loop.

* jc/c99-var-decl-in-for-loop:
  revision: use C99 declaration of variable in for() loop
2021-12-21 15:03:15 -08:00
René Scharfe 794c000267 log: let --invert-grep only invert --grep
The option --invert-grep is documented to filter out commits whose
messages match the --grep filters.  However, it also affects the
header matches (--author, --committer), which is not intended.

Move the handling of that option to grep.c, as only the code there can
distinguish between matches in the header from those in the message
body.  If --invert-grep is given then enable extended expressions (not
the regex type, we just need git grep's --not to work), negate the body
patterns and check if any of them match by piggy-backing on the
collect_hits mechanism of grep_source_1().

Collecting the matches in struct grep_opt is a bit iffy, but with
"last_shown" we have a precedent for writing state information to that
struct.

Reported-by: Dotan Cohen <dotancohen@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-17 14:13:08 -08:00
Junio C Hamano 44ba10d671 revision: use C99 declaration of variable in for() loop
There are certain C99 features that might be nice to use in our code
base, but we've hesitated to do so in order to avoid breaking
compatibility with older compilers. But we don't actually know if
people are even using pre-C99 compilers these days.

One way to figure that out is to introduce a very small use of a
feature, and see if anybody complains, and we've done so to probe
the portability for a few features like "trailing comma in enum
declaration", "designated initializer for struct", and "designated
initializer for array".  A few years ago, we tried to use a handy

    for (int i = 0; i < n; i++)
	use(i);

to introduce a new variable valid only in the loop, but found that
some compilers we cared about didn't like it back then.  Two years
is a long-enough time, so let's try it again.

If this patch can survive a few releases without complaint, then we
can feel more confident that variable declaration in for() loop is
supported by the compilers our user base use.  And if we do get
complaints, then we'll have gained some data and we can easily
revert this patch.

Helped-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-03 10:16:00 -08:00
Junio C Hamano 8996d68ac7 Merge branch 'ps/connectivity-optim'
Regression fix.

* ps/connectivity-optim:
  Revert "connected: do not sort input revisions"
2021-11-12 15:29:24 -08:00
Junio C Hamano a7df4f52af Revert "connected: do not sort input revisions"
This reverts commit f45022dc2f,
as this is like breakage in the traversal more likely.  In a
history with 10 single strand of pearls,

   1-->2-->3--...->7-->8-->9-->10

asking "rev-list --unsorted-input 1 10 --not 9 8 7 6 5 4" fails to
paint the bottom 1 uninteresting as the traversal stops, without
completing the propagation of uninteresting bit starting at 4 down
through 3 and 2 to 1.
2021-11-11 12:34:41 -08:00
Junio C Hamano 404c4a5462 Merge branch 'ab/designated-initializers'
Code clean-up.

* ab/designated-initializers:
  cbtree.h: define cb_init() in terms of CBTREE_INIT
  *.h: move some *_INIT to designated initializers
  *.h _INIT macros: don't specify fields equal to 0
  *.[ch] *_INIT macros: use { 0 } for a "zero out" idiom
  submodule-config.h: remove unused SUBMODULE_INIT macro
2021-10-11 10:21:48 -07:00
Junio C Hamano f6c075ad71 Merge branch 'jk/ref-paranoia'
The ref iteration code used to optionally allow dangling refs to be
shown, which has been tightened up.

* jk/ref-paranoia:
  refs: drop "broken" flag from for_each_fullref_in()
  ref-filter: drop broken-ref code entirely
  ref-filter: stop setting FILTER_REFS_INCLUDE_BROKEN
  repack, prune: drop GIT_REF_PARANOIA settings
  refs: turn on GIT_REF_PARANOIA by default
  refs: omit dangling symrefs when using GIT_REF_PARANOIA
  refs: add DO_FOR_EACH_OMIT_DANGLING_SYMREFS flag
  refs-internal.h: reorganize DO_FOR_EACH_* flag documentation
  refs-internal.h: move DO_FOR_EACH_* flags next to each other
  t5312: be more assertive about command failure
  t5312: test non-destructive repack
  t5312: create bogus ref as necessary
  t5312: drop "verbose" helper
  t5600: provide detached HEAD for corruption failures
  t5516: don't use HEAD ref for invalid ref-deletion tests
  t7900: clean up some more broken refs
2021-10-11 10:21:47 -07:00
Junio C Hamano 921c795c25 Merge branch 'jt/add-submodule-odb-clean-up'
More code paths that use the hack to add submodule's object
database to the set of alternate object store have been cleaned up.

* jt/add-submodule-odb-clean-up:
  revision: remove "submodule" from opt struct
  repository: support unabsorbed in repo_submodule_init
  submodule: remove unnecessary unabsorbed fallback
2021-10-06 13:40:11 -07:00
Ævar Arnfjörð Bjarmason 9865b6e6a4 *.[ch] *_INIT macros: use { 0 } for a "zero out" idiom
In C it isn't required to specify that all members of a struct are
zero'd out to 0, NULL or '\0', just providing a "{ 0 }" will
accomplish that.

Let's also change code that provided N zero'd fields to just
provide one, and change e.g. "{ NULL }" to "{ 0 }" for
consistency. I.e. even if the first member is a pointer let's use "0"
instead of "NULL". The point of using "0" consistently is to pick one,
and to not have the reader wonder why we're not using the same pattern
everywhere.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-27 14:47:59 -07:00
Jeff King 67985e4e4a refs: drop "broken" flag from for_each_fullref_in()
No callers pass in anything but "0" here. Likewise to our sibling
functions. Note that some of them ferry along the flag, but none of
their callers pass anything but "0" either.

Nor is anybody likely to change that. Callers which really want to see
all of the raw refs use for_each_rawref(). And anybody interested in
iterating a subset of the refs will likely be happy to use the
now-default behavior of showing broken refs, but omitting dangling
symlinks.

So we can get rid of this whole feature.

Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-27 12:36:45 -07:00
Jonathan Tan 10a0d6ae64 revision: remove "submodule" from opt struct
Clean up a TODO in revision.h by removing the "submodule" field from
struct setup_revision_opt. This field is only used to specify the ref
store to use, so use rev_info->repo to determine the ref store instead.

The only users of this field are merge-ort.c and merge-recursive.c.
However, both these files specify the superproject as rev_info->repo and
the submodule as setup_revision_opt->submodule. In order to be able to
pass the submodule as rev_info->repo, all commits must be parsed with
the submodule explicitly specified; this patch does that as well. (An
incremental solution in which only some commits are parsed with explicit
submodule will not work, because if the same commit is parsed twice in
different repositories, there will be 2 heap-allocated object structs
corresponding to that commit, and any flag set by the revision walking
mechanism on one of them will not be reflected onto the other.)

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09 14:09:30 -07:00
Patrick Steinhardt f559d6d45e revision: avoid hitting packfiles when commits are in commit-graph
When queueing references in git-rev-list(1), we try to optimize parsing
of commits via the commit-graph. To do so, we first look up the object's
type, and if it is a commit we call `repo_parse_commit()` instead of
`parse_object()`. This is quite inefficient though given that we're
always uncompressing the object header in order to determine the type.
Instead, we can opportunistically search the commit-graph for the object
ID: in case it's found, we know it's a commit and can directly fill in
the commit object without having to uncompress the object header.

Expose a new function `lookup_commit_in_graph()`, which tries to find a
commit in the commit-graph by ID, and convert `get_reference()` to use
this function. This provides a big performance win in cases where we
load references in a repository with lots of references pointing to
commits. The following has been executed in a real-world repository with
about 2.2 million refs:

    Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.458 s ±  0.044 s    [User: 4.115 s, System: 0.342 s]
      Range (min … max):    4.409 s …  4.534 s    10 runs

    Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      3.089 s ±  0.015 s    [User: 2.768 s, System: 0.321 s]
      Range (min … max):    3.061 s …  3.105 s    10 runs

    Summary
      'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
        1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-09 09:51:12 -07:00
Patrick Steinhardt bf9c0cbddb revision: stop retrieving reference twice
When queueing up references for the revision walk, `handle_one_ref()`
will resolve the reference's object ID via `get_reference()` and then
queue the ID as pending object via `add_pending_oid()`. But given that
`add_pending_oid()` is only a thin wrapper around `add_pending_object()`
which fist calls `get_reference()`, we effectively resolve the reference
twice and thus duplicate some of the work.

Fix the issue by instead calling `add_pending_object()` directly, which
takes the already-resolved object as input. In a repository with lots of
refs, this translates into a near 10% speedup:

    Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      5.015 s ±  0.038 s    [User: 4.698 s, System: 0.316 s]
      Range (min … max):    4.970 s …  5.089 s    10 runs

    Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.606 s ±  0.029 s    [User: 4.260 s, System: 0.345 s]
      Range (min … max):    4.565 s …  4.657 s    10 runs

    Summary
      'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
        1.09 ± 0.01 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-09 09:51:12 -07:00
Patrick Steinhardt f45022dc2f connected: do not sort input revisions
In order to compute whether objects reachable from a set of tips are all
connected, we do a revision walk with these tips as positive references
and `--not --all`. `--not --all` will cause the revision walk to load
all preexisting references as uninteresting, which can be very expensive
in repositories with many references.

Benchmarking the git-rev-list(1) command highlights that by far the most
expensive single phase is initial sorting of the input revisions: after
all references have been loaded, we first sort commits by author date.
In a real-world repository with about 2.2 million references, it makes
up about 40% of the total runtime of git-rev-list(1).

Ultimately, the connectivity check shouldn't really bother about the
order of input revisions at all. We only care whether we can actually
walk all objects until we hit the cut-off point. So sorting the input is
a complete waste of time.

Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
cause it to not sort the commits and adjust the connectivity check to
always pass the flag. This results in the following speedups, executed
in a clone of gitlab-org/gitlab [1]:

    Benchmark #1: git rev-list  --objects --quiet --not --all --not $(cat newrev)
      Time (mean ± σ):      7.639 s ±  0.065 s    [User: 7.304 s, System: 0.335 s]
      Range (min … max):    7.543 s …  7.742 s    10 runs

    Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.995 s ±  0.044 s    [User: 4.657 s, System: 0.337 s]
      Range (min … max):    4.909 s …  5.048 s    10 runs

    Summary
      'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
        1.53 ± 0.02 times faster than 'git rev-list  --objects --quiet --not --all --not $newrev'

[1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs
     are visible to clients.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-09 09:51:12 -07:00
Patrick Steinhardt 29ef1f27fe revision: separate walk and unsorted flags
The `--no-walk` flag supports two modes: either it sorts the revisions
given as input input or it doesn't. This is reflected in a single
`no_walk` flag, which reflects one of the three states "walk", "don't
walk but without sorting" and "don't walk but with sorting".

Split up the flag into two separate bits, one indicating whether we
should walk or not and one indicating whether the input should be sorted
or not. This will allow us to more easily introduce a new flag
`--unsorted-input`, which only impacts the sorting bit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-05 09:37:28 -07:00
Jeff King 1d72b604ef add_pending_object_with_path(): work around "gcc -O3" complaint
When compiling with -O3, some gcc versions (10.2.1 here) complain about
an out-of-bounds subscript:

  revision.c: In function ‘do_add_index_objects_to_pending’:
  revision.c:321:22: error: array subscript [1, 2147483647] is outside array bounds of ‘char[1]’ [-Werror=array-bounds]
    321 |   if (0 < len && name[len] && buf.len)
        |                  ~~~~^~~~~

The "len" parameter here comes from calling interpret_branch_name(),
which intends to return the number of characters of "name" it parsed.

But the compiler doesn't realize this. It knows the size of the empty
string "name" passed in from do_add_index_objects_to_pending(), but it
has no clue that the "len" we get back will be constrained to "0" in
that case.

And I don't think the warning is telling us about some subtle or clever
bug. The implementation of interpret_branch_name() is in another file
entirely, and the compiler can't see it (you can even verify there is no
clever LTO going on by replacing it with "return 0" and still getting
the warning).

We can work around this by replacing our "did we hit the trailing NUL"
subscript dereference with a length check. We do not even have to pay
the cost for an extra strlen(), as we can pass our new length into
interpret_branch_name(), which was converting our "0" into a call to
strlen() anyway.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-11 12:45:37 +09:00
Junio C Hamano 936e58851a Merge branch 'ah/plugleaks'
Plug various leans reported by LSAN.

* ah/plugleaks:
  builtin/rm: avoid leaking pathspec and seen
  builtin/rebase: release git_format_patch_opt too
  builtin/for-each-ref: free filter and UNLEAK sorting.
  mailinfo: also free strbuf lists when clearing mailinfo
  builtin/checkout: clear pending objects after diffing
  builtin/check-ignore: clear_pathspec before returning
  builtin/bugreport: don't leak prefixed filename
  branch: FREE_AND_NULL instead of NULL'ing real_ref
  bloom: clear each bloom_key after use
  ls-files: free max_prefix when done
  wt-status: fix multiple small leaks
  revision: free remainder of old commit list in limit_list
2021-05-07 12:47:41 +09:00
Junio C Hamano 8585d6c04a Merge branch 'ps/rev-list-object-type-filter'
"git rev-list" learns the "--filter=object:type=<type>" option,
which can be used to exclude objects of the given kind from the
packfile generated by pack-objects.

* ps/rev-list-object-type-filter:
  rev-list: allow filtering of provided items
  pack-bitmap: implement combined filter
  pack-bitmap: implement object type filter
  list-objects: implement object type filter
  list-objects: support filtering by tag and commit
  list-objects: move tag processing into its own function
  revision: mark commit parents as NOT_USER_GIVEN
  uploadpack.txt: document implication of `uploadpackfilter.allow`
2021-05-07 12:47:41 +09:00
Junio C Hamano 8e97852919 Merge branch 'ds/sparse-index-protections'
Builds on top of the sparse-index infrastructure to mark operations
that are not ready to mark with the sparse index, causing them to
fall back on fully-populated index that they always have worked with.

* ds/sparse-index-protections: (47 commits)
  name-hash: use expand_to_path()
  sparse-index: expand_to_path()
  name-hash: don't add directories to name_hash
  revision: ensure full index
  resolve-undo: ensure full index
  read-cache: ensure full index
  pathspec: ensure full index
  merge-recursive: ensure full index
  entry: ensure full index
  dir: ensure full index
  update-index: ensure full index
  stash: ensure full index
  rm: ensure full index
  merge-index: ensure full index
  ls-files: ensure full index
  grep: ensure full index
  fsck: ensure full index
  difftool: ensure full index
  commit: ensure full index
  checkout: ensure full index
  ...
2021-04-30 13:50:26 +09:00
Andrzej Hunt db69bf608d revision: free remainder of old commit list in limit_list
limit_list() iterates over the original revs->commits list, and consumes
many of its entries via pop_commit. However we might stop iterating over
the list early (e.g. if we realise that the rest of the list is
uninteresting). If we do stop iterating early, list will be pointing to
the unconsumed portion of revs->commits - and we need to free this list
to avoid a leak. (revs->commits itself will be an invalid pointer: it
will have been free'd during the first pop_commit.)

However the list pointer is later reused to iterate over our new list,
but only for the limiting_can_increase_treesame() branch. We therefore
need to introduce a new variable for that branch - and while we're here
we can rename the original list to original_list as that makes its
purpose more obvious.

This leak was found while running t0090. It's not likely to be very
impactful, but it can happen quite early during some checkout
invocations, and hence seems to be worth fixing:

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x49a85d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3
    #1 0x9ac084 in do_xmalloc wrapper.c:41:8
    #2 0x9ac05a in xmalloc wrapper.c:62:9
    #3 0x7175d6 in commit_list_insert commit.c:540:33
    #4 0x71800f in commit_list_insert_by_date commit.c:604:9
    #5 0x8f8d2e in process_parents revision.c:1128:5
    #6 0x8f2f2c in limit_list revision.c:1418:7
    #7 0x8f210e in prepare_revision_walk revision.c:3577:7
    #8 0x514170 in orphaned_commit_warning builtin/checkout.c:1185:6
    #9 0x512f05 in switch_branches builtin/checkout.c:1250:3
    #10 0x50f8de in checkout_branch builtin/checkout.c:1646:9
    #11 0x50ba12 in checkout_main builtin/checkout.c:2003:9
    #12 0x5086c0 in cmd_checkout builtin/checkout.c:2055:8
    #13 0x4cd91d in run_builtin git.c:467:11
    #14 0x4cb5f3 in handle_builtin git.c:719:3
    #15 0x4ccf47 in run_argv git.c:808:4
    #16 0x4caf49 in cmd_main git.c:939:19
    #17 0x69dc0e in main common-main.c:52:11
    #18 0x7faaabd0e349 in __libc_start_main (/lib64/libc.so.6+0x24349)

Indirect leak of 48 byte(s) in 3 object(s) allocated from:
    #0 0x49a85d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3
    #1 0x9ac084 in do_xmalloc wrapper.c:41:8
    #2 0x9ac05a in xmalloc wrapper.c:62:9
    #3 0x717de6 in commit_list_append commit.c:1609:35
    #4 0x8f1f9b in prepare_revision_walk revision.c:3554:12
    #5 0x514170 in orphaned_commit_warning builtin/checkout.c:1185:6
    #6 0x512f05 in switch_branches builtin/checkout.c:1250:3
    #7 0x50f8de in checkout_branch builtin/checkout.c:1646:9
    #8 0x50ba12 in checkout_main builtin/checkout.c:2003:9
    #9 0x5086c0 in cmd_checkout builtin/checkout.c:2055:8
    #10 0x4cd91d in run_builtin git.c:467:11
    #11 0x4cb5f3 in handle_builtin git.c:719:3
    #12 0x4ccf47 in run_argv git.c:808:4
    #13 0x4caf49 in cmd_main git.c:939:19
    #14 0x69dc0e in main common-main.c:52:11
    #15 0x7faaabd0e349 in __libc_start_main (/lib64/libc.so.6+0x24349)

Signed-off-by: Andrzej Hunt <ajrhunt@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-28 09:25:44 +09:00
Derrick Stolee f5fed74fb2 revision: ensure full index
Before iterating over all index entries, ensure that a sparse index is
expanded to a full index to avoid unexpected behavior. This case could
be integrated later by ensuring that we walk the tree in the
sparse-directory entry, but the current behavior is only expecting
blobs. Save this integration for later when it can be properly tested.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-14 13:47:48 -07:00
Jeff King c1fa951d7e revision: avoid parsing with --exclude-promisor-objects
When --exclude-promisor-objects is given, before traversing any objects
we iterate over all of the objects in any promisor packs, marking them
as UNINTERESTING and SEEN. We turn the oid we get from iterating the
pack into an object with parse_object(), but this has two problems:

  - it's slow; we are zlib inflating (and reconstructing from deltas)
    every byte of every object in the packfile

  - it leaves the tree buffers attached to their structs, which means
    our heap usage will grow to store every uncompressed tree
    simultaneously. This can be gigabytes.

We can obviously fix the second by freeing the tree buffers after we've
parsed them. But we can observe that the function doesn't look at the
object contents at all! The only reason we call parse_object() is that
we need a "struct object" on which to set the flags. There are two
options here:

  - we can look up just the object type via oid_object_info(), and then
    call the appropriate lookup_foo() function

  - we can call lookup_unknown_object(), which gives us an OBJ_NONE
    struct (which will get auto-converted later by object_as_type() via
    calls to lookup_commit(), etc).

The first one is closer to the current code, but we do pay the price to
look up the type for each object. The latter should be more efficient in
CPU, though it wastes a little bit of memory (the "unknown" object
structs are a union of all object types, so some of the structs are
bigger than they need to be). It also runs the risk of triggering a
latent bug in code that calls lookup_object() directly but isn't ready
to handle OBJ_NONE (such code would already be buggy, but we use
lookup_unknown_object() infrequently enough that it might be hiding).

I went with the second option here. I don't think the risk is high (and
we'd want to find and fix any such bugs anyway), and it should be more
efficient overall.

The new tests in p5600 show off the improvement (this is on git.git):

  Test                                 HEAD^               HEAD
  -------------------------------------------------------------------------------
  5600.5: count commits                0.37(0.37+0.00)     0.38(0.38+0.00) +2.7%
  5600.6: count non-promisor commits   11.74(11.37+0.37)   0.04(0.03+0.00) -99.7%

The improvement is particularly big in this script because _every_
object in the newly-cloned partial repo is a promisor object. So after
marking them all, there's nothing left to traverse.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-13 13:22:37 -07:00
Patrick Steinhardt b2025da38b revision: mark commit parents as NOT_USER_GIVEN
The NOT_USER_GIVEN flag of an object marks whether a flag was explicitly
provided by the user or not. The most important use case for this is
when filtering objects: only objects that were not explicitly requested
will get filtered.

The flag is currently only set for blobs and trees, which has been fine
given that there are no filters for tags or commits currently. We're
about to extend filtering capabilities to add object type filter though,
which requires us to set up the NOT_USER_GIVEN flag correctly -- if it's
not set, the object wouldn't get filtered at all.

Mark unseen commit parents as NOT_USER_GIVEN when processing parents.
Like this, explicitly provided parents stay user-given and thus
unfiltered, while parents which get loaded as part of the graph walk
can be filtered.

This commit shouldn't have any user-visible impact yet as there is no
logic to filter commits yet.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-10 23:03:20 -07:00
Junio C Hamano 2744383cbd Merge branch 'tb/geometric-repack'
"git repack" so far has been only capable of repacking everything
under the sun into a single pack (or split by size).  A cleverer
strategy to reduce the cost of repacking a repository has been
introduced.

* tb/geometric-repack:
  builtin/pack-objects.c: ignore missing links with --stdin-packs
  builtin/repack.c: reword comment around pack-objects flags
  builtin/repack.c: be more conservative with unsigned overflows
  builtin/repack.c: assign pack split later
  t7703: test --geometric repack with loose objects
  builtin/repack.c: do not repack single packs with --geometric
  builtin/repack.c: add '--geometric' option
  packfile: add kept-pack cache for find_kept_pack_entry()
  builtin/pack-objects.c: rewrite honor-pack-keep logic
  p5303: measure time to repack with keep
  p5303: add missing &&-chains
  builtin/pack-objects.c: add '--stdin-packs' option
  revision: learn '--no-kept-objects'
  packfile: introduce 'find_kept_pack_entry()'
2021-03-24 14:36:27 -07:00
René Scharfe ca56dadb4b use CALLOC_ARRAY
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead.  It shortens the code and infers the
element size automatically.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 16:00:09 -08:00
Taylor Blau c9fff00016 revision: learn '--no-kept-objects'
A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwords.

Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
in-core and on-disk kept packs. That is: what kinds of kept pack should
stop the traversal?

Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (for e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).

Note that this option is not guaranteed to produce exactly the set of
objects that aren't in kept packs, since it's possible the traversal
order may end up in a situation where a non-kept ancestor was "cut off"
by a kept object (at which point we would stop traversing). But, we
don't care about absolute correctness here, since this will eventually
be used as a purely additive guide in an upcoming new repack mode.

Explicitly avoid documenting this new flag, since it is only used
internally. In theory we could avoid even adding it rev-list, but being
able to spell this option out on the command-line makes some special
cases easier to test without promising to keep it behaving consistently
forever. Those tricky cases are exercised in t6114.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-22 23:30:52 -08:00
Junio C Hamano 8b4701ae4f Merge branch 'ak/corrected-commit-date'
The commit-graph learned to use corrected commit dates instead of
the generation number to help topological revision traversal.

* ak/corrected-commit-date:
  doc: add corrected commit date info
  commit-reach: use corrected commit dates in paint_down_to_common()
  commit-graph: use generation v2 only if entire chain does
  commit-graph: implement generation data chunk
  commit-graph: implement corrected commit date
  commit-graph: return 64-bit generation number
  commit-graph: add a slab to store topological levels
  t6600-test-reach: generalize *_three_modes
  commit-graph: consolidate fill_commit_graph_info
  revision: parse parent in indegree_walk_step()
  commit-graph: fix regression when computing Bloom filters
2021-02-17 17:21:40 -08:00
Junio C Hamano c9f94ab4fa Merge branch 'ab/lose-grep-debug'
Lose the debugging aid that may have been useful in the past, but
no longer is, in the "grep" codepaths.

* ab/lose-grep-debug:
  grep/log: remove hidden --debug and --grep-debug options
2021-02-10 14:48:31 -08:00
Junio C Hamano aac006aa99 Merge branch 'so/log-diff-merge'
"git log" learned a new "--diff-merges=<how>" option.

* so/log-diff-merge: (32 commits)
  t4013: add tests for --diff-merges=first-parent
  doc/git-show: include --diff-merges description
  doc/rev-list-options: document --first-parent changes merges format
  doc/diff-generate-patch: mention new --diff-merges option
  doc/git-log: describe new --diff-merges options
  diff-merges: add '--diff-merges=1' as synonym for 'first-parent'
  diff-merges: add old mnemonic counterparts to --diff-merges
  diff-merges: let new options enable diff without -p
  diff-merges: do not imply -p for new options
  diff-merges: implement new values for --diff-merges
  diff-merges: make -m/-c/--cc explicitly mutually exclusive
  diff-merges: refactor opt settings into separate functions
  diff-merges: get rid of now empty diff_merges_init_revs()
  diff-merges: group diff-merge flags next to each other inside 'rev_info'
  diff-merges: split 'ignore_merges' field
  diff-merges: fix -m to properly override -c/--cc
  t4013: add tests for -m failing to override -c/--cc
  t4013: support test_expect_failure through ':failure' magic
  diff-merges: revise revs->diff flag handling
  diff-merges: handle imply -p on -c/--cc logic for log.c
  ...
2021-02-05 16:40:44 -08:00
Ævar Arnfjörð Bjarmason 15c9649730 grep/log: remove hidden --debug and --grep-debug options
Remove the hidden "grep --debug" and "log --grep-debug" options added
in 17bf35a3c7 (grep: teach --debug option to dump the parse tree,
2012-09-13).

At the time these options seem to have been intended to go along with
a documentation discussion and to help the author of relevant tests to
perform ad-hoc debugging on them[1].

Reasons to want this gone:

 1. They were never documented, and the only (rather trivial) use of
    them in our own codebase for testing is something I removed back
    in e01b4dab01 (grep: change non-ASCII -i test to stop using
    --debug, 2017-05-20).

 2. Googling around doesn't show any in-the-wild uses I could dig up,
    and on the Git ML the only mentions after the original discussion
    seem to have been when they came up in unrelated diff contexts, or
    that test commit of mine.

 3. An exception to that is c581e4a749 (grep: under --debug, show
    whether PCRE JIT is enabled, 2019-08-18) where we added the
    ability to dump out when PCREv2 has the JIT in effect.

    The combination of that and my earlier b65abcafc7 (grep: use PCRE
    v2 for optimized fixed-string search, 2019-07-01) means Git prints
    this out in its most common in-the-wild configuration:

        $ git log  --grep-debug --grep=foo --grep=bar --grep=baz --all-match
        pcre2_jit_on=1
        pcre2_jit_on=1
        pcre2_jit_on=1
        [all-match]
        (or
         pattern_body<body>foo
         (or
          pattern_body<body>bar
          pattern_body<body>baz
         )
        )

        $ git grep --debug \( -e foo --and -e bar \) --or -e baz
        pcre2_jit_on=1
        pcre2_jit_on=1
        pcre2_jit_on=1
        (or
         (and
          patternfoo
          patternbar
         )
         patternbaz
        )

I.e. for each pattern we're considering for the and/or/--all-match
etc. debugging we'll now diligently spew out another identical line
saying whether the PCREv2 JIT is on or not.

I think that nobody's complained about that rather glaringly obviously
bad output says something about how much this is used, i.e. it's
not.

The need for this debugging aid for the composed grep/log patterns
seems to have passed, and the desire to dump the JIT config seems to
have been another one-off around the time we had JIT-related issues on
the PCREv2 codepath. That the original author of this debugging
facility seemingly hasn't noticed the bad output since then[2] is
probably some indicator.

1. https://lore.kernel.org/git/cover.1347615361.git.git@drmicha.warpmail.net/
2. https://lore.kernel.org/git/xmqqk1b8x0ac.fsf@gitster-ct.c.googlers.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26 11:36:20 -08:00
Junio C Hamano b69bed22c5 Merge branch 'jk/log-cherry-pick-duplicate-patches'
When more than one commit with the same patch ID appears on one
side, "git log --cherry-pick A...B" did not exclude them all when a
commit with the same patch ID appears on the other side.  Now it
does.

* jk/log-cherry-pick-duplicate-patches:
  patch-ids: handle duplicate hashmap entries
2021-01-25 14:19:19 -08:00
Abhishek Kumar d7f92784c6 commit-graph: return 64-bit generation number
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar 2f9bbb6d91 revision: parse parent in indegree_walk_step()
In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Jeff King c9e3a4e76d patch-ids: handle duplicate hashmap entries
This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
hand-rolled hashmap implementation, 2016-07-29) in which

  git rev-list --cherry-pick A...B

will fail to suppress commits reachable from A even if a commit with
matching patch-id appears in B.

Around the time of that commit, the algorithm for "--cherry-pick" looked
something like this:

  0. Traverse all of the commits, marking them as being on the left or
     right side of the symmetric difference.

  1. Iterate over the left-hand commits, inserting a patch-id struct for
     each into a hashmap, and pointing commit->util to the patch-id
     struct.

  2. Iterate over the right-hand commits, checking which are present in
     the hashmap. If so, we exclude the commit from the output _and_ we
     mark the patch-id as "seen".

  3. Iterate again over the left-hand commits, checking whether
     commit->util->seen is set; if so, exclude them from the output.

At the end, we'll have eliminated commits from both sides that have a
matching patch-id on the other side. But there's a subtle assumption
here: for any given patch-id, we must have exactly one struct
representing it. If two commits from A both have the same patch-id and
we allow duplicates in the hashmap, then we run into a problem:

  a. In step 1, we insert two patch-id structs into the hashmap.

  b. In step 2, our lookups will find only one of these structs, so only
     one "seen" flag is marked.

  c. In step 3, one of the commits in A will have its commit->util->seen
     set, but the other will not. We'll erroneously output the latter.

Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
it used hashmap_add(), which explicitly does allow duplicates.

At that point, the solution would have been easy: when we are about to
add a duplicate, skip doing so and return the existing entry which
matches. But it gets more complicated.

In 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
when the right-hand side finds a matching patch_id from the left-hand
side, we can directly mark the left-hand patch_id->commit to be omitted.
Solving that would be easy, too; there's a one-to-many relationship of
patch-ids to commits, so we just need to keep a list.

But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
patch IDs, 2016-07-29) built on that by lazily computing the full
patch-ids. So we don't even know when adding to the hashmap whether two
commits truly have the same id. We'd have to tentatively assign them a
list, and then possibly split them apart (possibly into N new structs)
at the moment we compute the real patch-ids. This could work, but it's
complicated and error-prone.

Instead, let's accept that we may store duplicates, and teach the lookup
side to be more clever. Rather than asking for a single matching
patch-id, it will need to iterate over all matching patch-ids. This does
mean examining every entry in a single hash bucket, but the worst-case
for a hash lookup was already doing that.

We'll keep the hashmap details out of the caller by providing a simple
iteration interface. We can retain the simple has_commit_patch_id()
interface for the other callers, but we'll simplify its return value
into an integer, rather than returning the patch_id struct. That way
they won't be tempted to look at the "commit" field of the return value
without iterating.

Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-12 11:13:32 -08:00
Derrick Stolee 90b666da60 revision: trace topo-walk statistics
We trace statistics about the effectiveness of changed-path Bloom
filters since 42e50e78 (revision.c: add trace2 stats around Bloom
filter usage, 2020-04-06). Add similar tracing for the topo-walk
algorithm that uses generation numbers to limit the walk size.

This information can help investigate and describe benefits to
heuristics and other changes.

The information that is printed is in JSON format and can be formatted
nicely to present as follows:

    {
	"count_explort_walked":2603,
	"count_indegree_walked":2603,
	"count_topo_walked":473
    }

Each of these values count the number of commits are visited by each of
the three "stages" of the topo-walk as detailed in b4542418 (revision.c:
generation-based topo-order algorithm, 2018-11-01).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04 15:18:22 -08:00