Search with regex but replace only a portion of the string with sed
I'm trying to replace any occurrence of a cwe.mitre.org.*.html (regex) URL and remove the .html extension and not change any other type of URL.
Example:
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
Expectation:
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Is there a way to do this in sed or another tool?
I've tried sed -Ei 's/cwe.mitre.org.*.html/<REPLACEMENT>/g' file.txt, but that won't work. Is there a way for the <REPLACEMENT> to be a regular expression? The sed manual doesn't seem to suggest that?
EDIT: I was wrong about the sed manual. It does mention it, see "5.7 Back-references and Subexpressions" section of https://www.gnu.org/software/sed/manual/sed.html.
Solution 1:[1]
$ sed 's/\(cwe\.mitre\.org.*\)\.html/\1/' file
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
For more background, search for "sed capture groups".
Solution 2:[2]
Use
sed -Ei 's/(cwe\.mitre\.org.*)\.html/\1/' file
EXPLANATION
NODE                     EXPLANATION
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
  cwe                    'cwe'
--------------------------------------------------------------------------------
  \.                     '.'
--------------------------------------------------------------------------------
  mitre                  'mitre'
--------------------------------------------------------------------------------
  \.                     '.'
--------------------------------------------------------------------------------
  org                    'org'
--------------------------------------------------------------------------------
  .*                     any character except \n (0 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
)                        end of \1
--------------------------------------------------------------------------------
\.                       '.'
--------------------------------------------------------------------------------
html                     'html'
The \1 backreference refers to the part of the string captured by the parenthesized piece of the pattern. When you want a piece of a match to stay in the result, use a backreference.
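Because .* is greedy, the command above effectively assumes one URL per line. The following is only a sketch of one way to relax that, assuming a URL ends at the first whitespace character: .* is swapped for [^[:space:]]* and the g flag is added. The function name strip_cwe_html is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: strip a ".html" suffix from cwe.mitre.org URLs only.
# Assumption: a URL ends at the first whitespace character.
strip_cwe_html() {
    # [^[:space:]]* keeps the match inside one URL; g handles several per line.
    sed -E 's/(cwe\.mitre\.org[^[:space:]]*)\.html/\1/g' "$@"
}

# Demo input: the two lines from the question plus a line with two URLs.
printf '%s\n' \
    'https://cwe.mitre.org/data/definitions/377.html' \
    'http://google.com/404.html' \
    'see https://cwe.mitre.org/a.html and http://google.com/b.html' |
    strip_cwe_html
```

With no file arguments the helper reads standard input; pass file names to process files instead, and add -i only once the output looks right.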
Solution 3:[3]
GNU AWK solution, let file.txt content be
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
then
awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}' file.txt
gives output
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Explanation: if a line matches the provided regex, replace .html followed by end of line ($) with an empty string. Then print every line, changed or not.
(tested in GNU Awk 5.0.1)
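The command above only strips .html at the very end of a line. If the URL may appear mid-line, one possible variation (a sketch, assuming URLs are whitespace-delimited) is to test each field separately. The function name strip_cwe_html_awk is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: strip ".html" from whitespace-delimited fields that
# look like cwe.mitre.org URLs, wherever they sit on the line.
strip_cwe_html_awk() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /cwe\.mitre\.org.*\.html$/)
                sub(/\.html$/, "", $i)   # edit only the matching field
        print                            # print every line, changed or not
    }' "$@"
}

printf '%s\n' \
    'see https://cwe.mitre.org/data/definitions/377.html for details' \
    'http://google.com/404.html' |
    strip_cwe_html_awk
```

One trade-off: when a field is modified, awk rebuilds the line with single spaces between fields, so runs of whitespace on changed lines are collapsed. Unchanged lines pass through untouched.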
Solution 4:[4]
Another possibility is
% sed '/cwe\.mitre\.org/s/\.html//' try.txt
https://cwe.mitre.org/data/definitions/377
Nothing
hello.html
http://google.com/404.html
This isn't unequivocally better than the accepted answer: it would get confused by foo.html text http://cwe.mitre.org/bar.html, for example (though the other answers may also be assuming there's only one relevant URL on a line). I mention it as a supplement to that one, however, since it usefully illustrates that sed commands can be prefixed by ‘addresses’, which can include regexps. This script deletes the first .html on any line which includes cwe.mitre.org.
This feature is often forgotten, and is only occasionally useful, but when it's appropriate, it can avoid an otherwise complicated regexp in the s ‘pattern’ slot, and back-references.
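To make the caveat concrete, here is a small sketch that runs the addressed command on the problematic line, next to a variant that anchors the substitution with $. The variant assumes the CWE URL is the last thing on the line; both function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers contrasting two substitutions behind the same address.

# As in the answer: remove the first ".html" on lines mentioning cwe.mitre.org.
strip_first_html() {
    sed '/cwe\.mitre\.org/s/\.html//' "$@"
}

# Assumption: the CWE URL ends the line, so anchor the substitution with $.
strip_trailing_html() {
    sed '/cwe\.mitre\.org/s/\.html$//' "$@"
}

line='foo.html text http://cwe.mitre.org/bar.html'
printf '%s\n' "$line" | strip_first_html      # foo text http://cwe.mitre.org/bar.html
printf '%s\n' "$line" | strip_trailing_html   # foo.html text http://cwe.mitre.org/bar
```

In both cases the address only selects which lines the s command runs on; what gets replaced within a selected line is still decided by the s pattern itself.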
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Ryszard Czech |
| Solution 3 | |
| Solution 4 | Norman Gray |
