How to remove all HTML tags except img?

I have some HTML text that contains all kinds of HTML tags, such as <table>, <a>, <img>, and so on.

Now I want to use a regular expression to remove all the HTML tags except <img ...> and </img> (and the upper-case forms <IMG ...> and </IMG>).

How can I do this?

UPDATE:

My task is very simple: it just prints the text content (including images) of an HTML page as a summary on the front page, so I think a regular expression is good and simple enough.

UPDATE AGAIN

Maybe a sample will make my question easier to understand :)

Here is some HTML text:

<html>
  <head></head>
  <body>
     Hello, everyone. Here is my photo: <img src="xxx.jpg" />. 
     And, <a href="xxx">know more</a> about me!
  </body>
</html>

I want to keep <img>, and remove the other tags. The following is what I want:

Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!

My current code looks like this:

html.replaceAll("<.*?>", "")

But this removes everything between < and >, including the img tags. I want to keep <img xxx> and </img>, and remove only the other tags.

Thanks, everyone!

regex html-parsing

Solution 1:^[1]

After a lot of trial and error, this regular expression seems to work for me:

(?i)<(?!img|/img).*?>

My code is:

html = html.replaceAll("(?i)<(?!img|/img).*?>", "");
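
For comparison, the same negative-lookahead idea can be sketched in JavaScript. JavaScript regex literals do not accept the inline `(?i)` flag, so the `i` flag is used instead. The function name `stripExceptImg`, the `\b` word boundary (so that a tag such as `<imgx>` is not kept by accident), and the use of `[^>]*` in place of `.*?` are assumptions of this sketch, not part of the answer above.

```javascript
// Sketch: remove every tag except <img ...> and </img>, case-insensitively.
// `stripExceptImg` is a hypothetical helper name used only for illustration.
function stripExceptImg(html) {
  // (?!\/?img\b) skips tags that start with "img" or "/img" followed by a
  // word boundary. [^>]* is used instead of .*? so that tags spanning
  // several lines are matched too.
  return html.replace(/<(?!\/?img\b)[^>]*>/gi, '');
}

const sample =
  'Here is my photo: <IMG src="xxx.jpg" />. And, <a href="xxx">know more</a> about me!';
console.log(stripExceptImg(sample));
```

Like the original pattern, this treats a tag as everything up to the next `>`, so it can still be fooled by a `>` inside an attribute value.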

Solution 2:^[2]

Do not use a RegEx to parse HTML. See here for a compelling demonstration of why.

Use an HTML parser for your language/platform.

For Java, there is HTML Parser
For .NET, the HTML Agility Pack is recommended
For Ruby, there is Nokogiri, though I am not a Ruby dev, so I don't know how good it is
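
One concrete case where a plain `<.*?>` pattern goes wrong is a `>` inside a quoted attribute value, which a real parser handles correctly. The sketch below is not a real HTML parser and is no substitute for the libraries listed above. It is a small hand-written scanner, written under the assumption that the input is reasonably well-formed, that only shows what "being aware of quotes" means. The name `stripTagsQuoteAware` is hypothetical.

```javascript
// Sketch only: a tiny quote-aware tag scanner, NOT a full HTML parser.
// It ignores comments, <script>/<style> bodies, and malformed markup.
// `stripTagsQuoteAware` is a hypothetical name used for illustration.
function stripTagsQuoteAware(html) {
  let out = '';
  let i = 0;
  while (i < html.length) {
    if (html[i] !== '<') {
      out += html[i++]; // plain text is copied through unchanged
      continue;
    }
    // Scan to the closing '>' while skipping over quoted attribute values.
    let j = i + 1;
    let quote = null;
    while (j < html.length && (quote !== null || html[j] !== '>')) {
      const c = html[j];
      if (quote !== null) {
        if (c === quote) quote = null; // closing quote
      } else if (c === '"' || c === "'") {
        quote = c; // opening quote
      }
      j++;
    }
    const tag = html.slice(i, j + 1);
    // Keep only opening and closing img tags, in any letter case.
    if (/^<\/?img\b/i.test(tag)) out += tag;
    i = j + 1;
  }
  return out;
}

const tricky = '<a title="1 > 0" href="xxx">know more</a> <img src="xxx.jpg" />';
console.log(stripTagsQuoteAware(tricky));
// For contrast, the naive pattern cuts the <a> tag at the '>' inside title:
console.log(tricky.replace(/<.*?>/g, ''));
```

For anything beyond a quick summary, a real parser remains the safer choice, since it also deals with comments, scripts, and broken markup.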

Solution 3:^[3]

A simple answer to why you should not use a RegEx is:

A regexp can't parse a recursive grammar such as:

S -> (S)
S -> Empty

because recognizing this kind of grammar requires an unbounded number of states.

Since HTML has a recursive grammar, you can't simply use a regexp:

SPAN -> <span>SPAN</span>
SPAN -> text

But in your case, removing tags does not require matching nested pairs, so you can write a regular expression that is not recursive.
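
To illustrate the distinction: pairing each `<span>` with its own `</span>` is the recursive problem, whereas deleting tags only needs to recognize one tag at a time, which is a flat pattern. A small sketch, with `stripTags` as a hypothetical helper name:

```javascript
// Each tag is recognized on its own; nesting depth never has to be tracked.
// `stripTags` is a hypothetical helper name used only for illustration.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

const nested = '<span><span><span>text</span></span></span>';
console.log(stripTags(nested)); // prints "text"

// By contrast, a pattern that tries to pair tags cannot follow the nesting:
// the lazy match pairs the FIRST <span> with the FIRST </span>.
const firstPair = nested.match(/<span>(.*?)<\/span>/);
console.log(firstPair[1]); // prints "<span><span>text"
```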

Solution 4:^[4]

<(img|IMG)[^>]*>[^<]*</(img|IMG)>

Solution 5:^[5]

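Here is a simple approach using a regex:

// Remove every tag except <img ...> and </img> (in any letter case).
function stripTagsExceptImg(html) {
  return html.replace(/<.*?>/g, function (tag) {
    // Keep opening and closing img tags; drop everything else.
    return /^<\/?img\b/i.test(tag) ? tag : '';
  });
}

const html = "<html>...</html>";
console.log(stripTagsExceptImg(html));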

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	Community
Solution 3	mathk
Solution 4	mathk
Solution 5	Quang Tuyen Nguyen
