How to make 'git diff' ignore comments
I am trying to produce a list of the files that were changed in a specific commit. The problem is that every file has the version number in a comment at the top of the file, and since this commit introduces a new version, every file has changed.
I don't care about the changed comments, so I would like to have git diff ignore all lines that match ^\s*\*.*$, as these are all comments (part of /* */).
I cannot find any way to tell git diff to ignore specific lines.
I have already tried setting a textconv attribute so that Git passes the files through sed before diffing them, letting sed strip out the offending lines. The problem is that git diff --name-status does not actually diff the files; it just compares the hashes, and of course all the hashes have changed.
Is there a way to do this?
Solution 1:[1]
git diff -G <regex>
And specify a regular expression that does not match your version number line.
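As a rough sketch of how this could be scripted, the following Python wraps git diff --name-only -G<regex> to list only the files whose diff contains a changed line matching the regex. The function names and the sample regex are illustrative assumptions, not part of the answer; it assumes git is on the PATH and that the script runs inside a repository.

```python
import subprocess


def build_diff_command(regex, old, new):
    """Build the argv for listing files whose diff has a changed line matching regex."""
    # -G keeps only files with at least one added/removed line matching the regex;
    # --name-only prints file names instead of the patch text.
    return ["git", "diff", "--name-only", "-G" + regex, old, new]


def changed_files(regex, old, new):
    """Run git and return the matching file names (requires a git repository)."""
    result = subprocess.run(
        build_diff_command(regex, old, new),
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


if __name__ == "__main__":
    # Hypothetical regex: the first non-blank character of a changed line is not '*'.
    print(build_diff_command(r"^[[:space:]]*[^[:space:]*]", "HEAD~1", "HEAD"))
```

Because the regex selects lines that are not comment bodies, a file whose only changed lines start with optional whitespace and a * should drop out of the list.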
Solution 2:[2]
Here is a solution that is working well for me. I've written up the solution and some additional missing documentation on the git (log|diff) -G<regex> option.
It is basically using the same solution as in previous answers, but specifically for comments that start with a * or a #, and sometimes a space before the *... But it still needs to allow #ifdef, #include, etc. changes.
Look ahead and look behind do not seem to be supported by the -G option, nor does the ? in general, and I have had problems with using *, too. + seems to be working well, though.
(Note, tested on Git v2.7.0)
Multi-Line Comment Version
git diff -w -G'(^[^\*# /])|(^#\w)|(^\s+[^\*#/])'
- -w: ignore whitespace
- -G: only show files whose diff contains an added or removed line that matches the following regex
- (^[^\*# /]): any line that does not start with a star, a hash, a space or a slash
- (^#\w): any line that starts with # followed by a letter
- (^\s+[^\*#/]): any line that starts with some whitespace followed by a non-comment character
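To see which alternatives fire on which lines, the pattern can be tried against a few sample lines. This is only a sketch: it uses Python's re module as a stand-in for the regex engine Git uses, so edge-case behaviour may differ, and the sample lines are made up for illustration.

```python
import re

# Same pattern as passed to -G above (Python's re used as an approximation).
PATTERN = re.compile(r"(^[^\*# /])|(^#\w)|(^\s+[^\*#/])")

SAMPLES = [
    "int x = 1;",           # code at column 0
    "#include <stdio.h>",   # preprocessor directive
    "    return 0;",        # indented code
    " * Version 1.2.3",     # comment body with one leading space
    "/* header",            # comment opener
    "# shell comment",      # hash followed by a space
    "   * deeper comment",  # comment body with several leading spaces
]


def matches(line):
    """Return True if the line would count as a 'real' change under PATTERN."""
    return PATTERN.search(line) is not None


for line in SAMPLES:
    print("match   " if matches(line) else "no match", repr(line))
```

Note that the last sample matches, at least in Python's engine: with two or more leading whitespace characters, \s+ can stop one character early and the next whitespace character satisfies [^\*#/]. The pattern therefore only filters comment bodies indented by exactly one whitespace character.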
Basically, an SVN hook currently modifies every file on the way in and out, changing the multi-line comment blocks in every file. Now I can diff my changes against SVN without the FYI information that SVN drops in the comments.
Technically this will allow for Python and Bash comments like #TODO to be shown in the diff, and if a division operator started on a new line in C++ it could be ignored:
a = b
/ c;
Also the documentation on -G in Git seemed pretty lacking, so the information here should help:
git diff -G<regex>
-G<regex>
Look for differences whose patch text contains added/removed lines that match <regex>.
To illustrate the difference between -S<regex> --pickaxe-regex and -G<regex>, consider a commit with the following diff in the same file:
+    return !regexec(regexp, two->ptr, 1, &regmatch, 0);
...
-    hit = !regexec(regexp, mf2.ptr, 1, &regmatch, 0);
While git log -G"regexec\(regexp" will show this commit, git log -S"regexec\(regexp" --pickaxe-regex will not (because the number of occurrences of that string did not change).
See the pickaxe entry in gitdiffcore(7) for more information.
(Note, tested on Git v2.7.0)
- -G uses a basic regular expression.
- No support for ?, *, !, {, } regular expression syntax.
- Grouping with () and OR-ing groups works with |.
- Wild card characters such as \s, \W, etc. are supported.
- Look-ahead and look-behind are not supported.
- Beginning and ending line anchors ^$ work.
- Feature has been available since Git 1.7.4.
Excluded Files v Excluded Diffs
Note that the -G option filters the files that will be diffed.
But once a file gets "diffed", its whole diff is shown, including the changed lines that the regex was meant to exclude.
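The distinction can be illustrated with a small model. This is a simplified sketch of the selection rule described above, not Git's implementation; the function name and the per-file data are made up for the example.

```python
import re


def select_files(changes, regex):
    """Model of -G: keep a file if ANY of its changed lines matches the regex.

    `changes` maps a file name to the list of added/removed lines in its diff.
    The whole list is returned for a selected file, not just the matching lines.
    """
    pattern = re.compile(regex)
    return {
        name: lines
        for name, lines in changes.items()
        if any(pattern.search(line) for line in lines)
    }


# Made-up example data: changed lines per file.
changes = {
    "a.c": [" * Version 2", "int x = 2;"],
    "b.c": [" * Version 2"],
}

selected = select_files(changes, r"^[^ ]")
print(selected)
```

Here b.c is dropped because none of its changed lines match, while a.c is kept together with its comment change, which is why the regex works for listing files but not for hiding individual lines.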
Examples
Only show file differences with at least one line that mentions foo.
git diff -G'foo'
Show file differences for everything except lines that start with a #
git diff -G'^[^#]'
Show files that have differences mentioning FIXME or TODO
git diff -G'(FIXME)|(TODO)'
See also git log -G, git grep, git log -S, --pickaxe-regex, and --pickaxe-all
UPDATE: Which regular expression tool is in use by the -G option?
https://github.com/git/git/search?utf8=%E2%9C%93&q=regcomp&type=
https://github.com/git/git/blob/master/diffcore-pickaxe.c
if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
    int cflags = REG_EXTENDED | REG_NEWLINE;
    if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
        cflags |= REG_ICASE;
    regcomp_or_die(&regex, needle, cflags);
    regexp = &regex;
// and in the regcomp_or_die function
regcomp(regex, needle, cflags);
http://man7.org/linux/man-pages/man3/regexec.3.html
REG_EXTENDED
Use POSIX Extended Regular Expression syntax when interpreting
regex. If not set, POSIX Basic Regular Expression syntax is
used.
// ...
REG_NEWLINE
Match-any-character operators don't match a newline.
A nonmatching list ([^...]) not containing a newline does not
match a newline.
Match-beginning-of-line operator (^) matches the empty string
immediately after a newline, regardless of whether eflags, the
execution flags of regexec(), contains REG_NOTBOL.
Match-end-of-line operator ($) matches the empty string
immediately before a newline, regardless of whether eflags
contains REG_NOTEOL.
Solution 3:[3]
I found it easiest to use git difftool to launch an external diff tool:
git difftool -y -x "diff -I '<regex>'"
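GNU diff's -I option ignores a change when all of its inserted and deleted lines match the regex. A minimal Python sketch of that idea follows; it uses difflib rather than diff itself, so the way changes are grouped may differ from what diff computes, and the function name is an illustrative assumption.

```python
import difflib
import re


def has_real_changes(old_lines, new_lines, ignore_regex):
    """Return True if some change contains a line that does NOT match ignore_regex."""
    pattern = re.compile(ignore_regex)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A change is ignorable only if every deleted and inserted line matches.
        changed = old_lines[i1:i2] + new_lines[j1:j2]
        if not all(pattern.search(line) for line in changed):
            return True
    return False


old = [" * Version 1.0", "int x = 1;"]
comment_only = [" * Version 1.1", "int x = 1;"]
code_too = [" * Version 1.1", "int x = 2;"]

print(has_real_changes(old, comment_only, r"^\s*\*"))
print(has_real_changes(old, code_too, r"^\s*\*"))
```

The first call reports no real change because only the comment line differs; the second reports a change because a code line differs as well.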
Solution 4:[4]
I found a solution. I can use this command:
git diff --numstat --minimal <commit> <commit> | sed '/^[1-]\s\+[1-]\s\+.*/d'
This shows only the files that have more than one line changed between the commits, which eliminates files whose only change was the version number in the comments.
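The same filter can be sketched in Python, which makes it easier to see exactly which rows are dropped. The function name and the sample output are made up for illustration; the regex mirrors the sed expression above.

```python
import re

# Same test as the sed expression: both counts are exactly "1"
# (or "-", which --numstat prints for binary files).
TRIVIAL = re.compile(r"^[1-]\s+[1-]\s+")


def substantial_changes(numstat_output):
    """Return (added, deleted, path) for rows that the sed filter would keep."""
    rows = []
    for line in numstat_output.splitlines():
        if not line or TRIVIAL.match(line):
            continue
        added, deleted, path = line.split("\t", 2)
        rows.append((added, deleted, path))
    return rows


# Made-up `git diff --numstat` output: added<TAB>deleted<TAB>path
sample = "1\t1\tsrc/version_only.c\n12\t3\tsrc/real_change.c\n10\t1\tsrc/other.c\n"
for row in substantial_changes(sample):
    print(row)
```

One caveat of this approach is that a file whose only change is a single modified line of real code also shows up as one addition and one deletion, so it is filtered out too.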
Solution 5:[5]
Using 'grep' on the 'git diff' output,
git diff -w | grep -c -E "(^[+-]\s*(\/)?\*)|(^[+-]\s*\/\/)"
comment line changes alone can be calculated. (A)
Using 'git diff --stat' output,
git diff -w --stat
all line changes can be calculated. (B)
To get non comment source line changes (NCSL) count, subtract (A) from (B).
Explanation:
In the 'git diff' output (in which whitespace changes are ignored),
- Look out for a line which starts with either '+' or '-', which means a modified line.
- There can be optional white-space characters following this. '\s*'
- Then look for comment line pattern '/*' (or) just '*' (or) '//'.
- Since, '-c' option is given with grep, just print the count. Remove '-c' option to see the comments alone in the diffs.
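The steps above can be sketched in Python, computing both counts from the same diff text instead of combining grep with --stat. This is a simplified illustration with a made-up sample diff; the header check is an assumption of the sketch and would also skip a removed line whose content begins with "--".

```python
import re

# Same alternatives as the grep pattern: "/*", "*" or "//" after optional whitespace.
COMMENT = re.compile(r"^[+-]\s*(/?\*|//)")


def count_changes(diff_text):
    """Return (all_changed, comment_changed) line counts for a unified diff."""
    total = comments = 0
    for line in diff_text.splitlines():
        # Skip the "+++ b/file" and "--- a/file" headers.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            total += 1
            if COMMENT.match(line):
                comments += 1
    return total, comments


sample = """\
--- a/abc.h
+++ b/abc.h
@@ -1,4 +1,4 @@
 /*
- * Version 1
+ * Version 2
  */
-#define FOO 7
+#define FOO 105
"""

total, comments = count_changes(sample)
print("non-comment changed lines:", total - comments)
```

As with the grep version, the result is a ballpark figure: the pattern only looks at the start of each changed line.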
NOTE: There can be minor errors in the comment line count due to following assumptions, and the result should be taken as a ballpark figure.
1.) Source files are based on the C language. Makefile and shell script files have a different convention, '#', to denote the comment lines and if they are part of diffset, their comment lines won't be counted.
2.) The Git convention of line change: If a line is modified, Git sees it as that particular line is deleted and a new line is inserted there and it may look like two lines are changed whereas in reality one line is modified.
In the below example, the new definition of 'FOO' looks like a two-line change.
$ git diff --stat -w abc.h
...
-#define FOO 7
+#define FOO 105
...
1 files changed, 1 insertions(+), 1 deletions(-)
$
3.) Valid comment lines not matching the pattern (or) valid source code lines matching the pattern can cause errors in the calculation.
In the below example, the "+ blah blah" line which doesn't start with '*' won't be detected as a comment line.
+ /*
+ blah blah
+ *
+ */
In the below example, the "+ *ptr" line will be counted as a comment line as it starts with *, though it is a valid source code line.
+ printf("\n %p",
+ *ptr);
Solution 6:[6]
For most languages, to do it correctly, you have to parse the original source file/ast, and exclude comments that way.
One reason is that the start of multi-line comments might not be covered by the diff. Another reason is that language-parsing isn't trivial, and there are often things that can trip up a naive parser.
I was going to do that for Python, but string-hacking was good enough for my needs.
For Python, you can ignore comments and attempt to ignore docstrings using a custom filter, such as this:
#!/usr/bin/env python
import sys
import re
import configparser
from fnmatch import fnmatch

from unidiff import PatchSet

EXTS = ["py"]


class Opts:  # pylint: disable=too-few-public-methods
    debug = False
    exclude = []


def filtered_hunks(fil):
    path_re = ".*[.](%s)$" % "|".join(EXTS)
    for patch in PatchSet(fil):
        if not re.match(path_re, patch.path):
            continue
        excluded = False
        if Opts.exclude:
            if Opts.debug:
                print(">", patch.path, "=~", Opts.exclude)
            for ex in Opts.exclude:
                if fnmatch(patch.path, ex):
                    excluded = True
        if excluded:
            continue
        for hunk in patch:
            yield hunk


class Typ:  # pylint: disable=too-few-public-methods
    LINE = "."
    COMMENT = "#"
    DOCSTRING = "d"
    WHITE = "w"


def classify_lines(fil):
    for hunk in filtered_hunks(fil):
        yield from classify_hunk(hunk)


def classify_line(lval):
    """Classify a single python line, noting comments, best efforts at docstring start/stop and pure-whitespace."""
    lval = lval.rstrip("\n\r")
    remaining_lval = lval
    typ = Typ.LINE
    if re.match(r"^ *$", lval):
        return Typ.WHITE, None, ""

    if re.match(r"^ *#", lval):
        typ = Typ.COMMENT
        remaining_lval = ""
    else:
        slug = re.match(r"^ *(\"\"\"|''')(.*)", lval)
        if slug:
            remaining_lval = slug[2]
            slug = slug[1]
            return Typ.DOCSTRING, slug, remaining_lval
    return typ, None, remaining_lval


def classify_hunk(hunk):
    """Classify lines of a python diff-hunk, attempting to note comments and docstrings.

    Ignores context lines.
    Docstring detection is not guaranteed (changes in the middle of large docstrings won't have starts.)
    Using ast would fix, but seems like overkill, and cannot be done on a diff-only.
    """
    p = ""
    prev_typ = 0
    pslug = None
    for line in hunk:
        lval = line.value
        lval = lval.rstrip("\n\r")
        typ = Typ.LINE
        naive_typ, slug, remaining_lval = classify_line(lval)
        if p and p[-1] == "\\":
            typ = prev_typ
        else:
            if prev_typ != Typ.DOCSTRING and naive_typ == Typ.COMMENT:
                typ = naive_typ
            elif naive_typ == Typ.DOCSTRING:
                if prev_typ == Typ.DOCSTRING and pslug == slug:
                    # remainder of line could have stuff on it
                    typ, _, _ = classify_line(remaining_lval)
                else:
                    typ = Typ.DOCSTRING
                    pslug = slug
            elif prev_typ == Typ.DOCSTRING:
                # continue docstring found in this context/hunk
                typ = Typ.DOCSTRING

        p = lval
        prev_typ = typ

        if typ == Typ.DOCSTRING:
            if re.match(r"(%s) *$" % pslug, remaining_lval):
                prev_typ = Typ.LINE

        if line.is_context:
            continue

        yield typ, lval


def count_lines(fil):
    """Totals changed lines of python code, attempting to strip comments and docstrings.

    Deletes/adds are counted equally.
    Could miss some things, don't rely on exact counts.
    """
    count = 0
    for (typ, line) in classify_lines(fil):
        if Opts.debug:
            print(typ, line)
        if typ == Typ.LINE:
            count += 1
    return count


def main():
    Opts.debug = "--debug" in sys.argv
    Opts.exclude = []

    use_covrc = "--covrc" in sys.argv

    if use_covrc:
        config = configparser.ConfigParser()
        config.read(".coveragerc")
        cfg = {s: dict(config.items(s)) for s in config.sections()}
        # Default to an empty string so .split() works when "omit" is not set.
        exclude = cfg.get("report", {}).get("omit", "")
        Opts.exclude = [f.strip() for f in exclude.split("\n") if f.strip()]

    for i in range(len(sys.argv)):
        if sys.argv[i] == "--exclude":
            Opts.exclude.append(sys.argv[i + 1])

    if Opts.debug and Opts.exclude:
        print("--exclude", Opts.exclude)

    print(count_lines(sys.stdin))


example = '''
diff --git a/cryptvfs.py b/cryptvfs.py
index c68429cf6..ee90ecea8 100755
--- a/cryptvfs.py
+++ b/cryptvfs.py
@@ -2,5 +2,17 @@
 from src.main import proc_entry


-if __name__ == "__main__":
-    proc_entry()
+
+
+class Foo:
+    """some docstring
+    """
+    # some comment
+    pass
+
+class Bar:
+    """some docstring
+    """
+    # some comment
+    def method():
+        line1 + 1
'''


def strio(s):
    import io
    return io.StringIO(s)


def test_basic():
    assert count_lines(strio(example)) == 10


def test_main(capsys):
    sys.argv = []
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "10\n"


def test_debug(capsys):
    sys.argv = ["--debug"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert Typ.DOCSTRING + '     """some docstring' in cap.out


def test_exclude(capsys):
    sys.argv = ["--exclude", "cryptvfs.py"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "0\n"


def test_covrc(capsys):
    sys.argv = ["--covrc"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "10\n"


if __name__ == "__main__":
    main()
That code can be trivially modified to produce filenames, rather than counts.
But it can, of course, mistakenly count parts of docstrings as "code" (which it isn't for things like coverage, etc.).
Solution 7:[7]
Perhaps a Bash script like this:
#!/bin/bash
git diff --name-only "$@" | while read FPATH ; do
    LINES_COUNT=`git diff --textconv "$FPATH" "$@" | sed '/^[1-]\s\+[1-]\s\+.*/d' | wc -l`
    if [ $LINES_COUNT -gt 0 ] ; then
        echo -e "$LINES_COUNT\t$FPATH"
    fi
done | sort -n
Solution 8:[8]
I use meld as the tool to ignore comments by setting its options, then use meld as difftool:
git difftool --tool=meld -y
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
