1
0
Fork 0
mirror of https://github.com/git/git.git synced 2024-05-20 09:56:10 +02:00
git/oidset.h
Eric Wong 4bfdf5800f treewide: switch to khashl for memory savings
khashl is an updated version of khash with less memory overhead
(one bit/bucket instead of two) than the original khash and
similar overall performance.  According to its author,
insertions are simpler (linear probing) but deletions may be
slightly slower[1].  Of course, the majority of hash tables in
git do not delete individual elements.

Overall memory usage did not decrease much, as the hash tables
and elements we store in them are big and currently dwarf the
overhead of the khash internals.  Only around 10 MB in
allocations (and a few dozen KB peak use out of ~6 GB) is saved
when doing a no-op `git gc' of a Linux kernel object store with
thousands of refs and islands.

A summary of differences I've found from khash to khashl:

* two 32-bit ints (instead of four) in the top-level struct

* 2 heap allocations (instead of 3) for maps
  (though I wonder locality suffers when probing is necessary)

* 1 bit of metadata per-bucket (no tombstones for deleted elements)

* 0.75 load factor.  Lowered slightly from 0.77, but no FP multiply
  and responsible for the aforementioned struct size reduction

* FNV-1A instead of x31 hash for strings

* Fibonacci hashing (__kh_h2b), probably good for FNV-1A, but
  I'm skeptical of its usefulness for our SHA-* using cases

* linear probing instead of quadratic

* Wang's integer hash functions (currently unused)

* optional hash value caching and ensemble APIs (currently unused)

* some API differences (see below), but not enough to easily
  use both khash and khashl in the same compilation unit

This patch was made with two additional goals to ease review:

1) minimize changes outside of khash*.h files

2) minimize and document all differences from upstream[2] khashl.h

Our khashl.h differences from upstream:

* favor portability constructs from our codebase:
  MAYBE_UNUSED over klib_unused, inline over kh_inline, and
  various integer types

* disable packed attribute to satisfy -Werror=address-of-packed-member,
  AFAIK it doesn't change any of the data structures we use

* port the following commits over from our old khash.h:
  9249ca26ac (khash: factor out kh_release_*, 2018-10-04)
  2756ca4347 (use REALLOC_ARRAY for changing the allocation size of arrays, 2014-09-16)
  5632e838f8 (khash: clarify that allocations never fail, 2021-07-03)

* use our memory allocation wrappers

* provide wrappers for compatibility with existing callers using the
  khash API.  The khashl function naming convention is: ${NOUN}_${VERB}
  while the khash convention is: kh_${VERB}_${NOUN}.  The kh_${NAME}_t
  typedef and naming convention are preserved via __KHASH_COMPAT macro
  to ease review (despite the `_t' suffix being reserved and typedefs
  being discouraged in the Linux kernel).

* copy relevant API docs over from khash.h for identically named macros

* preserve kh_begin, kh_foreach, kh_foreach_value from khash.h since
  khashl.h doesn't provide them

* flesh out KHASHL_{SET,MAP}_INIT wrappers with *_clear, *_resize,
  and *_release functions

* sparse fixes from Junio and Jeff

[1] https://attractivechaos.wordpress.com/2019/12/28/deletion-from-hash-tables-without-tombstones/
[2] git clone https://github.com/attractivechaos/klib.git
    2895a16cb55e (support an ensemble of hash tables, 2023-12-18)

khashl.h API differences from khash.h which affected this change:

* KHASHL_MAP_INIT and KHASHL_SET_INIT macros replace KHASH_INIT

* user-supplied hash and equality functions use different names

* object-store-ll.h avoided the kh_*_t convention (since I dislike
  typedef) and was the only place where I had to change a definition.

Signed-off-by: Eric Wong <e@80x24.org>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-03-28 08:52:43 -07:00

123 lines
3.3 KiB
C

#ifndef OIDSET_H
#define OIDSET_H
#include "khashl.h"
/**
* This API is similar to oid-array, in that it maintains a set of object ids
* in a memory-efficient way. The major differences are:
*
* 1. It uses a hash, so we can do online duplicate removal, rather than
* sort-and-uniq at the end. This can reduce memory footprint if you have
* a large list of oids with many duplicates.
*
* 2. The per-unique-oid memory footprint is slightly higher due to hash
* table overhead.
*/
/**
* A single oidset; should be zero-initialized (or use OIDSET_INIT).
*/
struct oidset {
kh_oid_set_t set;
};
#define OIDSET_INIT { { 0 } }
/**
* Initialize the oidset structure `set`.
*
* If `initial_size` is bigger than 0 then preallocate to allow inserting
* the specified number of elements without further allocations.
*/
void oidset_init(struct oidset *set, size_t initial_size);
/**
* Returns true iff `set` contains `oid`.
*/
int oidset_contains(const struct oidset *set, const struct object_id *oid);
/**
* Insert the oid into the set; a copy is made, so "oid" does not need
* to persist after this function is called.
*
* Returns 1 if the oid was already in the set, 0 otherwise. This can be used
* to perform an efficient check-and-add.
*/
int oidset_insert(struct oidset *set, const struct object_id *oid);
/**
* Insert all the oids that are in set 'src' into set 'dest'; a copy
* is made of each oid inserted into set 'dest'.
*/
void oidset_insert_from_set(struct oidset *dest, struct oidset *src);
/**
* Remove the oid from the set.
*
* Returns 1 if the oid was present in the set, 0 otherwise.
*/
int oidset_remove(struct oidset *set, const struct object_id *oid);
/**
* Returns the number of oids in the set.
*/
static inline int oidset_size(const struct oidset *set)
{
return kh_size(&set->set);
}
/**
* Remove all entries from the oidset, freeing any resources associated with
* it.
*/
void oidset_clear(struct oidset *set);
/**
* Add the contents of the file 'path' to an initialized oidset. Each line is
* an unabbreviated object name. Comments begin with '#', and trailing comments
* are allowed. Leading whitespace and empty or white-space only lines are
* ignored.
*/
void oidset_parse_file(struct oidset *set, const char *path);
/*
* Similar to the above, but with a callback which can (1) return non-zero to
* signal displeasure with the object and (2) replace object ID with something
* else (meant to be used to "peel").
*/
typedef int (*oidset_parse_tweak_fn)(struct object_id *, void *);
void oidset_parse_file_carefully(struct oidset *set, const char *path,
oidset_parse_tweak_fn fn, void *cbdata);
struct oidset_iter {
kh_oid_set_t *set;
khiter_t iter;
};
static inline void oidset_iter_init(struct oidset *set,
struct oidset_iter *iter)
{
iter->set = &set->set;
iter->iter = kh_begin(iter->set);
}
static inline struct object_id *oidset_iter_next(struct oidset_iter *iter)
{
for (; iter->iter != kh_end(iter->set); iter->iter++) {
if (kh_exist(iter->set, iter->iter))
return &kh_key(iter->set, iter->iter++);
}
return NULL;
}
static inline struct object_id *oidset_iter_first(struct oidset *set,
struct oidset_iter *iter)
{
oidset_iter_init(set, iter);
return oidset_iter_next(iter);
}
#endif /* OIDSET_H */