Search with regex but replace only a portion of the string with sed

I'm trying to find every URL matching the regex cwe.mitre.org.*.html and remove its .html extension, without changing any other kind of URL.

Example:

https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html

Expectation:

https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

Is there a way to do this in sed or another tool?

I've tried sed -Ei 's/cwe.mitre.org.*.html/<REPLACEMENT>/g' file.txt, but that won't work. Is there a way for the <REPLACEMENT> to be a regular expression? The sed manual doesn't seem to suggest that.

EDIT: I was wrong about the sed manual. It does mention it; see the "5.7 Back-references and Subexpressions" section of https://www.gnu.org/software/sed/manual/sed.html.



Solution 1:[1]

$ sed 's/\(cwe\.mitre\.org.*\)\.html/\1/' file
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

Search for "sed capture groups" for more background.

Solution 2:[2]

Use

sed -Ei 's/(cwe\.mitre\.org.*)\.html/\1/' file

EXPLANATION

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    cwe                      'cwe'
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    mitre                    'mitre'
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    org                      'org'
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
  html                     'html'

The \1 is a backreference to the part of the string captured by the parenthesized part of the pattern. Use a backreference when you want part of the match to stay in the result.
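
As a sketch of how the capture group could be tightened, the variant below swaps .* for [^[:space:]]* so the match cannot run past the end of one URL, and adds the g flag so several URLs on one line are handled. It assumes URLs never contain whitespace; the function name strip_cwe_html is hypothetical and used only for illustration.

```shell
# Hypothetical helper: strip ".html" from cwe.mitre.org URLs read on stdin.
# Assumption: a URL never contains whitespace.
strip_cwe_html() {
  sed -E 's/(cwe\.mitre\.org[^[:space:]]*)\.html/\1/g'
}

printf '%s\n' \
  'https://cwe.mitre.org/data/definitions/377.html' \
  'http://google.com/404.html' \
  'foo.html text http://cwe.mitre.org/bar.html' | strip_cwe_html
# https://cwe.mitre.org/data/definitions/377
# http://google.com/404.html
# foo.html text http://cwe.mitre.org/bar
```

Without -i the result goes to standard output, which makes it easy to check before editing a file in place.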

Solution 3:[3]

A GNU AWK solution. Let the content of file.txt be

https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html

then

awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}' file.txt

gives output

https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

Explanation: if a line matches the given regex, replace the .html at the end of the line ($) with an empty string. Every line, changed or not, is printed.

(tested in GNU Awk 5.0.1)
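
Because sub(/\.html$/,"") only removes a trailing .html, the command above relies on the URL being the last thing on the line. A possible per-field variant is sketched below, assuming URLs are separated from other text by whitespace. Note that assigning to a field makes awk rebuild the line with single spaces between fields. The function name strip_cwe_html_awk is hypothetical.

```shell
# Hypothetical helper: check each whitespace-separated field on its own.
# Assumption: every URL is a separate field (no embedded whitespace).
strip_cwe_html_awk() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /cwe\.mitre\.org/)
        sub(/\.html$/, "", $i)   # edit only this field
    print
  }'
}

printf '%s\n' \
  'see https://cwe.mitre.org/data/definitions/377.html and http://google.com/404.html' \
  | strip_cwe_html_awk
# see https://cwe.mitre.org/data/definitions/377 and http://google.com/404.html
```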

Solution 4:[4]

Another possibility is

% sed '/cwe\.mitre\.org/s/\.html//' try.txt 
https://cwe.mitre.org/data/definitions/377
Nothing
hello.html
http://google.com/404.html

This isn't unequivocally better than the accepted answer: it would get confused by foo.html text http://cwe.mitre.org/bar.html, for example (though the other answers may also be assuming there's only one relevant URL on a line). I mention it as a supplement to that one, however, since it usefully illustrates that sed commands can be prefixed by ‘addresses’, which can include regexps. This script deletes the first .html on any line which includes cwe.mitre.org.

This feature is often forgotten, and is only occasionally useful, but when it's appropriate, it can avoid both an otherwise complicated regexp in the s ‘pattern’ slot and the need for back-references.
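
To make the address form concrete, the sketch below runs the same command on the problem line mentioned above, then on a variant that anchors the substitution to the end of the line. The anchored variant is an assumption for illustration and only fits input where the URL ends the line.

```shell
line='foo.html text http://cwe.mitre.org/bar.html'

# The address selects the line; s then removes the FIRST ".html",
# which here belongs to foo.html rather than to the URL.
printf '%s\n' "$line" | sed '/cwe\.mitre\.org/s/\.html//'
# foo text http://cwe.mitre.org/bar.html

# Anchoring with $ removes only a ".html" at the very end of the line.
printf '%s\n' "$line" | sed '/cwe\.mitre\.org/s/\.html$//'
# foo.html text http://cwe.mitre.org/bar
```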

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Ryszard Czech
Solution 3
Solution 4 Norman Gray