From 05d2c61c6744212cdef6085832a84b49da77591c Mon Sep 17 00:00:00 2001
From: Elijah Newren
Date: Thu, 15 Jul 2021 00:45:21 +0000
Subject: [PATCH 1/4] diff: correct warning message when renameLimit exceeded

The warning when quadratic rename detection was skipped referred to
"inexact rename detection". For years, the only linear portion of
rename detection was looking for exact renames, so "inexact rename
detection" was an accurate way to refer to the quadratic portion of
rename detection. However, that changed with commit bd24aa2f97a0
(diffcore-rename: guide inexact rename detection based on basenames,
2021-02-14). Let's instead use the term "exhaustive rename detection"
to refer to the quadratic portion.

Signed-off-by: Elijah Newren
Signed-off-by: Junio C Hamano
---
 diff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/diff.c b/diff.c
index 52c791574b7..2454e34cf6d 100644
--- a/diff.c
+++ b/diff.c
@@ -6284,7 +6284,7 @@ static int is_summary_empty(const struct diff_queue_struct *q)
 }
 
 static const char rename_limit_warning[] =
-N_("inexact rename detection was skipped due to too many files.");
+N_("exhaustive rename detection was skipped due to too many files.");
 
 static const char degrade_cc_to_c_warning[] =
 N_("only found copies from modified paths due to too many files.");

From 6623a528e00b73f5438724a355c43343d3de8652 Mon Sep 17 00:00:00 2001
From: Elijah Newren
Date: Thu, 15 Jul 2021 00:45:22 +0000
Subject: [PATCH 2/4] doc: clarify documentation for rename/copy limits

A few places in the docs implied that rename/copy detection is always
quadratic or that all (unpaired) files were involved in the quadratic
portion of rename/copy detection.
The following two commits each introduced an exception to this:

    9027f53cb505 (Do linear-time/space rename logic for exact renames,
                  2007-10-25)
    bd24aa2f97a0 (diffcore-rename: guide inexact rename detection based
                  on basenames, 2021-02-14)

(As a side note, for copy detection, the basename guided inexact rename
detection is turned off and the exact renames will only result in
sources (without the dests) being removed from the set of files used in
quadratic detection. So, for copy detection, the documentation was
closer to correct.)

Avoid implying that all files involved in rename/copy detection are
subject to the full quadratic algorithm. While at it, also note the
default values for all these settings.

Signed-off-by: Elijah Newren
Signed-off-by: Junio C Hamano
---
 Documentation/config/diff.txt  |  7 ++++---
 Documentation/config/merge.txt | 10 ++++++----
 Documentation/diff-options.txt | 15 ++++++++++-----
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/Documentation/config/diff.txt b/Documentation/config/diff.txt
index 2d3331f55c2..d1b5cfa3542 100644
--- a/Documentation/config/diff.txt
+++ b/Documentation/config/diff.txt
@@ -118,9 +118,10 @@ diff.orderFile::
 	relative to the top of the working tree.
 
 diff.renameLimit::
-	The number of files to consider when performing the copy/rename
-	detection; equivalent to the 'git diff' option `-l`. This setting
-	has no effect if rename detection is turned off.
+	The number of files to consider in the exhaustive portion of
+	copy/rename detection; equivalent to the 'git diff' option
+	`-l`. If not set, the default value is currently 400. This
+	setting has no effect if rename detection is turned off.
 
 diff.renames::
 	Whether and how Git detects renames. If set to "false",
diff --git a/Documentation/config/merge.txt b/Documentation/config/merge.txt
index 6b66c83eabe..7cd6d7883b6 100644
--- a/Documentation/config/merge.txt
+++ b/Documentation/config/merge.txt
@@ -33,10 +33,12 @@ merge.verifySignatures::
 include::fmt-merge-msg.txt[]
 
 merge.renameLimit::
-	The number of files to consider when performing rename detection
-	during a merge; if not specified, defaults to the value of
-	diff.renameLimit. This setting has no effect if rename detection
-	is turned off.
+	The number of files to consider in the exhaustive portion of
+	rename detection during a merge. If not specified, defaults
+	to the value of diff.renameLimit. If neither
+	merge.renameLimit nor diff.renameLimit are specified,
+	currently defaults to 1000. This setting has no effect if
+	rename detection is turned off.
 
 merge.renames::
 	Whether Git detects renames. If set to "false", rename detection
diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 32e6dee5ac3..58acfff9289 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -588,11 +588,16 @@ When used together with `-B`, omit also the preimage in the deletion part
 of a delete/create pair.
 
 -l<num>::
-	The `-M` and `-C` options require O(n^2) processing time where n
-	is the number of potential rename/copy targets. This
-	option prevents rename/copy detection from running if
-	the number of rename/copy targets exceeds the specified
-	number.
+	The `-M` and `-C` options involve some preliminary steps that
+	can detect subsets of renames/copies cheaply, followed by an
+	exhaustive fallback portion that compares all remaining
+	unpaired destinations to all relevant sources. (For renames,
+	only remaining unpaired sources are relevant; for copies, all
+	original sources are relevant.) For N sources and
+	destinations, this exhaustive check is O(N^2). This option
+	prevents the exhaustive portion of rename/copy detection from
+	running if the number of source/destination files involved
+	exceeds the specified number. Defaults to diff.renameLimit.
 
 ifndef::git-format-patch[]
 --diff-filter=[(A|C|D|M|R|T|U|X|B)...[*]]::

From 9dd29dbef01e39fe9df81ad9e5e193128d8c5ad5 Mon Sep 17 00:00:00 2001
From: Elijah Newren
Date: Thu, 15 Jul 2021 00:45:23 +0000
Subject: [PATCH 3/4] diffcore-rename: treat a rename_limit of 0 as unlimited

In commit 89973554b52c (diffcore-rename: make diff-tree -l0 mean
-l<large>, 2017-11-29), -l0 was given a special magical "large" value,
but one which was not large enough for some uses (as can be seen from
commit 9f7e4bfa3b6d (diff: remove silent clamp of renameLimit,
2017-11-13)). Make 0 (or a negative value) be treated as unlimited
instead and update the documentation to mention this.

Signed-off-by: Elijah Newren
Signed-off-by: Junio C Hamano
---
 Documentation/diff-options.txt | 1 +
 diffcore-rename.c              | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 58acfff9289..0aebe832057 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -598,6 +598,7 @@ of a delete/create pair.
 	prevents the exhaustive portion of rename/copy detection from
 	running if the number of source/destination files involved
 	exceeds the specified number. Defaults to diff.renameLimit.
+	Note that a value of 0 is treated as unlimited.
 
 ifndef::git-format-patch[]
 --diff-filter=[(A|C|D|M|R|T|U|X|B)...[*]]::
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3375e24659e..513ba7b05f1 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -1021,7 +1021,7 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 	 * memory for the matrix anyway.
 	 */
 	if (rename_limit <= 0)
-		rename_limit = 32767;
+		return 0; /* treat as unlimited */
 	if (st_mult(num_destinations, num_sources)
 	    <= st_mult(rename_limit, rename_limit))
 		return 0;

From 94b82d56866793018a7a9bcbe20c1c061fa41aa8 Mon Sep 17 00:00:00 2001
From: Elijah Newren
Date: Thu, 15 Jul 2021 00:45:24 +0000
Subject: [PATCH 4/4] rename: bump limit defaults yet again

These were last bumped in commit 92c57e5c1d29 (bump rename limit
defaults (again), 2011-02-19), and were bumped both because processors
had gotten faster, and because people were getting ugly merges that
caused problems and reporting it to the mailing list (suggesting that
folks were willing to spend more time waiting).

Since that time:
  * Linus has continued recommending kernel folks to set
    diff.renameLimit=0 (maps to 32767, currently)
  * Folks with repositories with lots of renames were happy to set
    merge.renameLimit above 32767, once the code supported that, to
    get correct cherry-picks
  * Processors have gotten faster
  * It has been discovered that the timing methodology used last time
    probably used too large example files.

The last point is probably worth explaining a bit more:

  * The "average" file size used appears to have been average blob size
    in the linux kernel history at the time (probably v2.6.25 or
    something close to it).
  * Since bigger files are modified more frequently, such a computation
    weights towards larger files.
  * Larger files may be more likely to be modified over time, but are
    not more likely to be renamed -- the mean and median blob size
    within a tree are a bit higher than the mean and median of blob
    sizes in the history leading up to that version for the linux
    kernel.
  * The mean blob size in v2.6.25 was half the average blob size in
    history leading to that point
  * The median blob size in v2.6.25 was about 40% of the mean blob
    size in v2.6.25.
  * Since the mean blob size is more than double the median blob size,
    any file as big as the mean will not be compared to any files of
    median size or less (because they'd be more than 50% dissimilar).
  * Since it is the number of files compared that provides the O(n^2)
    behavior, median-sized files should matter more than mean-sized
    ones.

The combined effect of the above is that the file size used in past
calculations was likely about 5x too large. Combine that with a CPU
performance improvement of ~30%, and we can increase the limits by a
factor of sqrt(5/(1-.3)) = 2.67, while keeping the original stated time
limits.

Keeping the same approximate time limit probably makes sense for
diff.renameLimit (there is no progress feedback in e.g. git log -p),
but the experience above suggests merge.renameLimit could be extended
significantly. In fact, it probably would make sense to have an
unlimited default setting for merge.renameLimit, but that would
likely need to be coupled with changes to how progress is displayed.
(See https://lore.kernel.org/git/YOx+Ok%2FEYvLqRMzJ@coredump.intra.peff.net/
for details in that area.) For now, let's just bump the approximate
time limit from 10s to 1m.

(Note: We do not want to use actual time limits, because getting results
that depend on how loaded your system is that day feels bad, and because
we don't discover that we won't get all the renames until after we've
put in a lot of work rather than just upfront telling the user there are
too many files involved.)

Using the original time limit of 2s for diff.renameLimit, and bumping
merge.renameLimit from 10s to 60s, I found the following timings using
the simple script at the end of this commit message (on an AWS c5.xlarge
which reports as "Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz"):

      N    Timing
   1300    1.995s
   7100   59.973s

So let's round down to nice even numbers and bump the limits from
400->1000, and from 1000->7000.

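As a rough cross-check of the figures above, here is a minimal sketch
that assumes the exhaustive phase costs time proportional to N^2 and
nothing else. The helper names limit_factor and scale_limit are
hypothetical and not part of Git; the inputs are the numbers quoted in
this message.

```shell
#!/bin/bash

# Hypothetical helper (not part of Git): factor by which a file-count
# limit can grow when the per-file cost was overestimated by size_ratio
# and the CPU is cpu_gain (a fraction, e.g. 0.3) faster, assuming the
# total cost is O(N^2).
limit_factor() {
  awk -v r="$1" -v g="$2" 'BEGIN { printf "%.2f\n", sqrt(r / (1 - g)) }'
}

# Hypothetical helper: given one measured (N, seconds) pair, estimate
# the N that fits a new time budget under the same O(N^2) assumption.
scale_limit() {
  awk -v n="$1" -v t="$2" -v budget="$3" \
    'BEGIN { printf "%d\n", n * sqrt(budget / t) }'
}

limit_factor 5 0.3          # the sqrt(5/(1-.3)) factor from the text
scale_limit 1300 1.995 60   # extrapolate the 2s measurement to 60s
```

Under that assumption, extrapolating the N=1300 measurement to a 60s
budget lands within about 1% of the measured N=7100, which suggests the
two timings are consistent with quadratic growth.
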
Here is the measure_rename_perf script (adapted from
https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/
in particular to avoid triggering the linear handling from
basename-guided rename detection):

    #!/bin/bash

    n=$1; shift

    rm -rf repo
    mkdir repo && cd repo
    git init -q -b main

    mkdata() {
      mkdir $1
      for i in `seq 1 $2`; do
        (sed "s/^/$i /" <../sample
         echo tag: $1
        ) >$1/$i
      done
    }

    mkdata initial $n
    git add .
    git commit -q -m initial

    mkdata new $n
    git add .
    cd new
    for i in *; do git mv $i $i.renamed; done
    cd ..
    git rm -q -rf initial
    git commit -q -m new

    time git diff-tree -M -l0 --summary HEAD^ HEAD

Signed-off-by: Elijah Newren
Signed-off-by: Junio C Hamano
---
 Documentation/config/diff.txt  | 2 +-
 Documentation/config/merge.txt | 2 +-
 diff.c                         | 2 +-
 merge-ort.c                    | 2 +-
 merge-recursive.c              | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/config/diff.txt b/Documentation/config/diff.txt
index d1b5cfa3542..32f84838ac1 100644
--- a/Documentation/config/diff.txt
+++ b/Documentation/config/diff.txt
@@ -120,7 +120,7 @@ diff.orderFile::
 diff.renameLimit::
 	The number of files to consider in the exhaustive portion of
 	copy/rename detection; equivalent to the 'git diff' option
-	`-l`. If not set, the default value is currently 400. This
+	`-l`. If not set, the default value is currently 1000. This
 	setting has no effect if rename detection is turned off.
 
 diff.renames::
diff --git a/Documentation/config/merge.txt b/Documentation/config/merge.txt
index 7cd6d7883b6..e27cc639447 100644
--- a/Documentation/config/merge.txt
+++ b/Documentation/config/merge.txt
@@ -37,7 +37,7 @@ merge.renameLimit::
 	rename detection during a merge. If not specified, defaults
 	to the value of diff.renameLimit. If neither
 	merge.renameLimit nor diff.renameLimit are specified,
-	currently defaults to 1000. This setting has no effect if
+	currently defaults to 7000. This setting has no effect if
 	rename detection is turned off.
 
 merge.renames::
diff --git a/diff.c b/diff.c
index 2454e34cf6d..0244a371d32 100644
--- a/diff.c
+++ b/diff.c
@@ -35,7 +35,7 @@
 
 static int diff_detect_rename_default;
 static int diff_indent_heuristic = 1;
-static int diff_rename_limit_default = 400;
+static int diff_rename_limit_default = 1000;
 static int diff_suppress_blank_empty;
 static int diff_use_color_default = -1;
 static int diff_color_moved_default;
diff --git a/merge-ort.c b/merge-ort.c
index b954f7184a5..8a84375e940 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2558,7 +2558,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_opts.detect_rename = DIFF_DETECT_RENAME;
 	diff_opts.rename_limit = opt->rename_limit;
 	if (opt->rename_limit <= 0)
-		diff_opts.rename_limit = 1000;
+		diff_opts.rename_limit = 7000;
 	diff_opts.rename_score = opt->rename_score;
 	diff_opts.show_rename_progress = opt->show_rename_progress;
 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
diff --git a/merge-recursive.c b/merge-recursive.c
index 4327e0cfa33..f19f8cc37bd 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1879,7 +1879,7 @@ static struct diff_queue_struct *get_diffpairs(struct merge_options *opt,
 	 */
 	if (opts.detect_rename > DIFF_DETECT_RENAME)
 		opts.detect_rename = DIFF_DETECT_RENAME;
-	opts.rename_limit = (opt->rename_limit >= 0) ? opt->rename_limit : 1000;
+	opts.rename_limit = (opt->rename_limit >= 0) ? opt->rename_limit : 7000;
 	opts.rename_score = opt->rename_score;
 	opts.show_rename_progress = opt->show_rename_progress;
 	opts.output_format = DIFF_FORMAT_NO_OUTPUT;
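
For readers skimming the series, the limit check as it stands after
patch 3/4 can be illustrated with a small sketch. This is not Git's
code: the function name exhaustive_phase is hypothetical, the "skipped"
branch is inferred from the warning text and the `-l` documentation
rather than from the hunk shown, and overflow handling (st_mult in the
C code) is ignored.

```shell
#!/bin/bash

# Hypothetical sketch (not Git's code). Arguments: number of
# destinations, number of sources, rename limit. Prints whether the
# exhaustive phase would run or be skipped.
exhaustive_phase() {
  if [ "$3" -le 0 ]; then
    echo run        # 0 or negative: treated as unlimited
  elif [ $(($1 * $2)) -le $(($3 * $3)) ]; then
    echo run        # destinations * sources fits within limit^2
  else
    echo skipped    # too many files for the configured limit
  fi
}

exhaustive_phase 900 900 1000     # within the new diff default
exhaustive_phase 1500 1500 1000   # over the limit
exhaustive_phase 2000 400 1000    # product matters, not each count
exhaustive_phase 50000 50000 0    # 0 disables the check
```

Note that the comparison is on the product of the two counts, so an
uneven split (many destinations, few sources) can still pass a limit
that either count alone exceeds.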