1
0
Fork 0
mirror of https://github.com/git/git.git synced 2024-05-09 10:16:08 +02:00
git/repository.h
Taylor Blau dbcf611617 pack-revindex: introduce `pack.readReverseIndex`
Since 1615c567b8 (Documentation/config/pack.txt: advertise
'pack.writeReverseIndex', 2021-01-25), we have had the
`pack.writeReverseIndex` configuration option, which tells Git whether
or not it is allowed to write a ".rev" file when indexing a pack.

Introduce a complementary configuration knob, `pack.readReverseIndex` to
control whether or not Git will read any ".rev" file(s) that may be
available on disk.

This option is useful for debugging, as well as disabling the effect of
".rev" files in certain instances.

This is useful because of the trade-off[^1] between the time it takes to
generate a reverse index (slow from scratch, fast when reading an
existing ".rev" file), and the time it takes to access a record (the
opposite).

For example, even though it is faster to use the on-disk reverse index
when computing the on-disk size of a packed object, it is slower to
enumerate the same value for all objects.

Here are a couple of examples from linux.git. When computing the above
for a single object, using the on-disk reverse index is significantly
faster:

    $ git rev-parse HEAD >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):     302.5 ms ±  12.5 ms    [User: 258.7 ms, System: 43.6 ms]
      Range (min … max):   291.1 ms … 328.1 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):       3.9 ms ±   0.3 ms    [User: 1.6 ms, System: 2.4 ms]
      Range (min … max):     2.0 ms …   4.4 ms    801 runs

    Summary
      'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran
       77.29 ± 7.14 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in'

, but when instead trying to compute the on-disk object size for all
objects in the repository, using the ".rev" file is a disadvantage over
creating the reverse index from scratch:

    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):      8.258 s ±  0.035 s    [User: 7.949 s, System: 0.308 s]
      Range (min … max):    8.199 s …  8.293 s    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):     16.976 s ±  0.107 s    [User: 16.706 s, System: 0.268 s]
      Range (min … max):   16.839 s … 17.105 s    10 runs

    Summary
      'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' ran
	2.06 ± 0.02 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'

Luckily, the results when running `git cat-file` with `--unordered` are
closer together:

    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):      5.066 s ±  0.105 s    [User: 4.792 s, System: 0.274 s]
      Range (min … max):    4.943 s …  5.220 s    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):      6.193 s ±  0.069 s    [User: 5.937 s, System: 0.255 s]
      Range (min … max):    6.145 s …  6.356 s    10 runs

    Summary
      'git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' ran
        1.22 ± 0.03 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'

Because the equilibrium point between these two is highly machine- and
repository-dependent, allow users to configure whether or not they will
read any ".rev" file(s) with this configuration knob.

[^1]: Generating a reverse index in memory takes O(N) time (where N is
  the number of objects in the repository), since we use a radix sort.
  Reading an entry from an on-disk ".rev" file is slower since each
  operation is bound by disk I/O instead of memory I/O.

  In order to compute the on-disk size of a packed object, we need to
  find the offset of our object, and the adjacent object (the on-disk
  size difference of these two). Finding the first offset requires a
  binary search. Finding the latter involves a single .rev lookup.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-13 07:55:46 -07:00

243 lines
6.4 KiB
C

#ifndef REPOSITORY_H
#define REPOSITORY_H
#include "path.h"
struct config_set;
struct fsmonitor_settings;
struct git_hash_algo;
struct index_state;
struct lock_file;
struct pathspec;
struct raw_object_store;
struct submodule_cache;
struct promisor_remote_config;
struct remote_state;
enum untracked_cache_setting {
UNTRACKED_CACHE_KEEP,
UNTRACKED_CACHE_REMOVE,
UNTRACKED_CACHE_WRITE,
};
enum fetch_negotiation_setting {
FETCH_NEGOTIATION_CONSECUTIVE,
FETCH_NEGOTIATION_SKIPPING,
FETCH_NEGOTIATION_NOOP,
};
struct repo_settings {
int initialized;
int core_commit_graph;
int commit_graph_generation_version;
int commit_graph_read_changed_paths;
int gc_write_commit_graph;
int gc_cruft_packs;
int fetch_write_commit_graph;
int command_requires_full_index;
int sparse_index;
int pack_read_reverse_index;
struct fsmonitor_settings *fsmonitor; /* lazily loaded */
int index_version;
int index_skip_hash;
enum untracked_cache_setting core_untracked_cache;
int pack_use_sparse;
enum fetch_negotiation_setting fetch_negotiation_algorithm;
int core_multi_pack_index;
};
struct repo_path_cache {
char *squash_msg;
char *merge_msg;
char *merge_rr;
char *merge_mode;
char *merge_head;
char *merge_autostash;
char *auto_merge;
char *fetch_head;
char *shallow;
};
struct repository {
/* Environment */
/*
* Path to the git directory.
* Cannot be NULL after initialization.
*/
char *gitdir;
/*
* Path to the common git directory.
* Cannot be NULL after initialization.
*/
char *commondir;
/*
* Holds any information related to accessing the raw object content.
*/
struct raw_object_store *objects;
/*
* All objects in this repository that have been parsed. This structure
* owns all objects it references, so users of "struct object *"
* generally do not need to free them; instead, when a repository is no
* longer used, call parsed_object_pool_clear() on this structure, which
* is called by the repositories repo_clear on its desconstruction.
*/
struct parsed_object_pool *parsed_objects;
/*
* The store in which the refs are held. This should generally only be
* accessed via get_main_ref_store(), as that will lazily initialize
* the ref object.
*/
struct ref_store *refs_private;
/*
* Contains path to often used file names.
*/
struct repo_path_cache cached_paths;
/*
* Path to the repository's graft file.
* Cannot be NULL after initialization.
*/
char *graft_file;
/*
* Path to the current worktree's index file.
* Cannot be NULL after initialization.
*/
char *index_file;
/*
* Path to the working directory.
* A NULL value indicates that there is no working directory.
*/
char *worktree;
/*
* Path from the root of the top-level superproject down to this
* repository. This is only non-NULL if the repository is initialized
* as a submodule of another repository.
*/
char *submodule_prefix;
struct repo_settings settings;
/* Subsystems */
/*
* Repository's config which contains key-value pairs from the usual
* set of config files (i.e. repo specific .git/config, user wide
* ~/.gitconfig, XDG config file and the global /etc/gitconfig)
*/
struct config_set *config;
/* Repository's submodule config as defined by '.gitmodules' */
struct submodule_cache *submodule_cache;
/*
* Repository's in-memory index.
* 'repo_read_index()' can be used to populate 'index'.
*/
struct index_state *index;
/* Repository's remotes and associated structures. */
struct remote_state *remote_state;
/* Repository's current hash algorithm, as serialized on disk. */
const struct git_hash_algo *hash_algo;
/* A unique-id for tracing purposes. */
int trace2_repo_id;
/* True if commit-graph has been disabled within this process. */
int commit_graph_disabled;
/* Configurations related to promisor remotes. */
char *repository_format_partial_clone;
struct promisor_remote_config *promisor_remote_config;
/* Configurations */
/* Indicate if a repository has a different 'commondir' from 'gitdir' */
unsigned different_commondir:1;
};
extern struct repository *the_repository;
/*
* Define a custom repository layout. Any field can be NULL, which
* will default back to the path according to the default layout.
*/
struct set_gitdir_args {
const char *commondir;
const char *object_dir;
const char *graft_file;
const char *index_file;
const char *alternate_db;
int disable_ref_updates;
};
void repo_set_gitdir(struct repository *repo, const char *root,
const struct set_gitdir_args *extra_args);
void repo_set_worktree(struct repository *repo, const char *path);
void repo_set_hash_algo(struct repository *repo, int algo);
void initialize_the_repository(void);
RESULT_MUST_BE_USED
int repo_init(struct repository *r, const char *gitdir, const char *worktree);
/*
* Initialize the repository 'subrepo' as the submodule at the given path. If
* the submodule's gitdir cannot be found at <path>/.git, this function calls
* submodule_from_path() to try to find it. treeish_name is only used if
* submodule_from_path() needs to be called; see its documentation for more
* information.
* Return 0 upon success and a non-zero value upon failure.
*/
struct object_id;
RESULT_MUST_BE_USED
int repo_submodule_init(struct repository *subrepo,
struct repository *superproject,
const char *path,
const struct object_id *treeish_name);
void repo_clear(struct repository *repo);
/*
* Populates the repository's index from its index_file, an index struct will
* be allocated if needed.
*
* Return the number of index entries in the populated index or a value less
* than zero if an error occurred. If the repository's index has already been
* populated then the number of entries will simply be returned.
*/
int repo_read_index(struct repository *repo);
int repo_hold_locked_index(struct repository *repo,
struct lock_file *lf,
int flags);
int repo_read_index_preload(struct repository *,
const struct pathspec *pathspec,
unsigned refresh_flags);
int repo_read_index_unmerged(struct repository *);
/*
* Opportunistically update the index but do not complain if we can't.
* The lockfile is always committed or rolled back.
*/
void repo_update_index_if_able(struct repository *, struct lock_file *);
void prepare_repo_settings(struct repository *r);
/*
* Return 1 if upgrade repository format to target_version succeeded,
* 0 if no upgrade is necessary, and -1 when upgrade is not possible.
*/
int upgrade_repository_format(int target_version);
#endif /* REPOSITORY_H */