Sed: matching a pattern that spans multiple lines

I have the following text:

[...]
<p class="title">ABC</p>
<p class="text">
<a href="https://site" target="_blank">
TEXT HERE   </a>
</p>
[...]

[...]
<p class="title">ABC</p>
<p class="text">
TEXT HERE  </p>
[...]

From the given text, I need to get:

TEXT HERE<no space>
TEXT HERE<no space>

If the text were on one line, i.e.

<p class="title">ABC</p><p class="text"><a href="https://site" target="_blank">TEXT HERE   </a></p>
<p class="title">ABC</p><p class="text">TEXT HERE </p>

I would solve the problem like this:

sed -n "s/.*title\">ABC<\/p>.*\">\([^<]*\).*/\1/p" ./file.txt

But my pattern spans multiple lines, and I don't know how to handle that case. Can somebody point me in the right direction?

sed


Solution 1:[1]

This might work for you (GNU sed):

sed -nE '/"title">ABC<\/p>/{:a;s/<\/p>/&/2;tb;N;ba;:b;s/\n//g;s/ABC//;s/<[^>]*>//g;s/\s*$//;p}' file

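Focus on line(s) containing "title">ABC</p>, then keep appending the following lines (if needed) until the pattern space contains a second </p>. The substitution s/<\/p>/&/2 changes nothing, but it only succeeds when a second </p> is present, so the t command can use it as the loop's exit test.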

Remove newlines if present.

Remove the text ABC.

Remove all tags.

Remove any trailing white space and print the result.
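
The steps above can be sketched as a commented, multi-line version of the same script. This is only an illustration, assuming GNU sed (for \s and -E) and a temporary sample file built from the question's two cases; the variable names sample and result are made up for the example.

```shell
#!/bin/sh
# Hypothetical sample input mirroring the two cases from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[...]
<p class="title">ABC</p>
<p class="text">
<a href="https://site" target="_blank">
TEXT HERE   </a>
</p>
[...]
<p class="title">ABC</p>
<p class="text">
TEXT HERE  </p>
[...]
EOF

# Same commands as the one-liner, one per line (GNU sed assumed).
result=$(sed -nE '
/"title">ABC<\/p>/{
  # Loop start.
  :a
  # Succeeds only if the pattern space holds a second closing p tag.
  s/<\/p>/&/2
  # On success, leave the loop.
  tb
  # Otherwise append the next input line and try again.
  N
  ba
  :b
  # Join the collected lines.
  s/\n//g
  # Drop the title text.
  s/ABC//
  # Drop all tags.
  s/<[^>]*>//g
  # Drop trailing white space, then print.
  s/\s*$//
  p
}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With the sample input shown, this should print TEXT HERE twice, once per block, with no trailing spaces.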

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 potong