Get large string without catastrophic backtracking regex
I want to use a regex to extract the diff for a specific file (e.g. package-lock.json) from a git diff. I'm fetching the whole diff via the GitHub API (using Octokit.js), so, as far as I'm aware, I can't just run git diff on that one file. The diff for a file like package-lock.json is obviously very large. What I've noticed is that when I try to pull this content out with a regular expression, it fails due to catastrophic backtracking.
Essentially the file structure looks like this:
diff --git a/package-lock.json b/package-lock.json
lots of content
diff --git a/next-file b/next-file
Therefore my idea was to get everything between the two diff --git strings.
I figured I could just use this: /(?<=diff --git )(.+?)(?=diff)/gs. It works fine if the lookahead match is not too far ahead, but when the next header is a long way through the file it stops working due to catastrophic backtracking.
I get why this is happening but just don't get how to get around it. Perhaps I should be sorting this some other way and just using Regex for more specific details?
Any help would be appreciated.
Solution 1:[1]
You're working with lines of data, and regexes don't work well like that, as you've found out. Use a tool like awk that can find ranges of lines.
Given this file foo.txt:
Here is stuff I don't care about
diff --git a/package-lock.json b/package-lock.json
lots of content
diff --git a/next-file b/next-file
Don't care about this either.
Use awk to specify the range of lines you want to print:
$ awk '/^diff --git a\/package-lock/,/^diff --git a\/next-file/' foo.txt
diff --git a/package-lock.json b/package-lock.json
lots of content
diff --git a/next-file b/next-file
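If the diff is already in memory as a string in JavaScript (as it would be after fetching it from the GitHub API), the same range idea can be sketched without awk and without a backtracking regex, using plain string searches. This is only a sketch under some assumptions: extractFileDiff is a hypothetical helper name, the diff uses "\n" line endings, and the file was not renamed, so its header reads "diff --git a/<path> b/<path>". Unlike the awk range above, it stops just before the next "diff --git" header instead of including it.

```javascript
// Hypothetical helper: return the section of a git diff that belongs to
// filePath, or null if that file does not appear in the diff.
// Only indexOf/slice are used, so there is no regex backtracking at all.
function extractFileDiff(diffText, filePath) {
  // Assumption: no rename, so the a/ and b/ paths are identical.
  const header = `diff --git a/${filePath} b/${filePath}`;

  // The header must sit at the start of a line: either at offset 0
  // or immediately after a newline.
  let start;
  if (diffText.startsWith(header)) {
    start = 0;
  } else {
    const at = diffText.indexOf("\n" + header);
    if (at === -1) return null;
    start = at + 1; // skip the newline we matched on
  }

  // The section ends where the next file's header line begins.
  // Content lines in a diff start with " ", "+", "-", "@" or "\",
  // so a line starting with "diff --git " is always a new header.
  const next = diffText.indexOf("\ndiff --git ", start);
  return next === -1
    ? diffText.slice(start)
    : diffText.slice(start, next + 1); // keep the trailing newline
}

// Usage with the sample from foo.txt above.
const sample = [
  "Here is stuff I don't care about",
  "diff --git a/package-lock.json b/package-lock.json",
  "lots of content",
  "diff --git a/next-file b/next-file",
  "Don't care about this either.",
  "",
].join("\n");

console.log(extractFileDiff(sample, "package-lock.json"));
```

Each indexOf call scans the string once, so the cost grows linearly with the size of the diff no matter how far apart the two headers are.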
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Andy Lester |
