How to remove all HTML tags except img?
I have some HTML text that contains all kinds of tags, such as <table>, <a>, <img>, and so on.
Now I want to use a regular expression to remove all the HTML tags except <img ...> and </img> (and uppercase <IMG></IMG>).
How can I do this?
UPDATE:
My task is very simple: it just prints the text content (including images) of an HTML page as a summary on the front page, so I think a regular expression is good and simple enough.
UPDATE AGAIN
Maybe a sample will make my question easier to understand :)
Here is some HTML text:
<html>
<head></head>
<body>
Hello, everyone. Here is my photo: <img src="xxx.jpg" />.
And, <a href="xxx">know more</a> about me!
</body>
</html>
I want to keep <img> and remove the other tags. The following is what I want:
Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!
Right now my code looks like this:
html.replaceAll("<.*?>", "")
But this removes everything between < and >. I want to keep <img xxx> and </img> and remove only the other tags.
Thanks, everyone!
Solution 1:[1]
I tried a lot, and this regular expression seems to work for me:
(?i)<(?!img|/img).*?>
My code is:
html.replaceAll("(?i)<(?!img|/img).*?>", "");
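To illustrate how the negative lookahead behaves, here is a minimal JavaScript sketch of the same idea applied to the sample from the question. It is not the answer's exact pattern: it assumes two small changes, a `\b` after `img` so that unrelated names that merely start with "img" are still removed, and `[^>]*` instead of `.*?` so that a tag may span several lines. The helper name `stripExceptImg` and the whitespace collapsing are assumptions added for illustration.

```javascript
// Remove every tag except <img ...> and </img>, case-insensitively.
// (?!\/?img\b) rejects "img" and "/img" right after "<"; the \b keeps
// names that merely start with "img" from being preserved.
// [^>]* is used instead of .*? so a tag may span several lines.
const NON_IMG_TAG = /<(?!\/?img\b)[^>]*>/gi;

// stripExceptImg is a hypothetical helper name used only for this sketch.
function stripExceptImg(html) {
  return html
    .replace(NON_IMG_TAG, "") // drop all tags other than img
    .replace(/\s+/g, " ")     // collapse line breaks and runs of spaces
    .trim();
}

// The sample HTML from the question.
const sample = [
  "<html>",
  "<head></head>",
  "<body>",
  'Hello, everyone. Here is my photo: <img src="xxx.jpg" />.',
  'And, <a href="xxx">know more</a> about me!',
  "</body>",
  "</html>",
].join("\n");

console.log(stripExceptImg(sample));
```

Like any regex-based approach, this sketch will misbehave on input where a ">" appears inside an attribute value or a stray "<" appears in the text.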
Solution 2:[2]
Do not use a RegEx to parse HTML. See here for a compelling demonstration of why.
Use an HTML parser for your language/platform.
- Here is a Java one (HTML parser)
- For .NET, the HTML Agility Pack is recommended
- For Ruby, there is Nokogiri, though I am not a Ruby dev, so I don't know how good it is
Solution 3:[3]
A simple answer to why you should not use a regex:
A regexp can't parse a recursive grammar such as:
S -> (S)
S -> Empty
This is because recognizing this kind of grammar requires an unbounded number of states, while a regular expression corresponds to a finite-state machine.
HTML has a recursive grammar, so you can't simply use a regexp:
SPAN -> <span>SPAN</span>
SPAN -> text
But in your case the task can be expressed as a regular expression, because it is not recursive.
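The counting argument can be made concrete with a small JavaScript sketch. The function `isNested` and the pattern `UP_TO_DEPTH_2` are hypothetical names introduced here. The recognizer keeps an explicit depth counter, which is exactly the unbounded state a plain regular expression lacks, while the regex beside it only handles nesting up to a depth fixed in advance.

```javascript
// Recognizes the language of S -> (S) | Empty:
// n opening parentheses followed by exactly n closing ones.
function isNested(input) {
  let depth = 0;
  let i = 0;
  // Count the run of opening parentheses.
  while (i < input.length && input[i] === "(") {
    depth++;
    i++;
  }
  // Each closing parenthesis must cancel one opening parenthesis.
  while (i < input.length && input[i] === ")") {
    depth--;
    i++;
  }
  // Accept only if the whole input was consumed and the counts match.
  return i === input.length && depth === 0;
}

// A regular expression can only hard-code a maximum depth (here: 2).
const UP_TO_DEPTH_2 = /^(\((\(\))?\))?$/;

console.log(isNested("((()))"), UP_TO_DEPTH_2.test("((()))"));
```

Removing tags one at a time needs no such counter, which is why the task in the question stays within reach of a regular expression.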
Solution 4:[4]
<(img|IMG)*>*</(img|IMG)>
Solution 5:[5]
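Here is a simple approach using a regex:
const html = "<html>...</html>";
// Visit every tag and keep it only if it is an opening or closing img tag.
const summary = html.replace(/<.*?>/g, function (tag) {
  // \/? also keeps </img>; the i flag also keeps <IMG ...>.
  if (/^<\/?img\b/i.test(tag)) {
    return tag;
  } else {
    return '';
  }
});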
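The callback form makes it easy to keep more than one kind of tag. The following is a rough sketch under stated assumptions, not part of the original answer: `stripTagsExcept` and its `allowedTags` parameter are hypothetical names, and tags are matched with `[^>]*`, so a ">" inside an attribute value would still confuse it.

```javascript
// stripTagsExcept and allowedTags are hypothetical names for this sketch.
function stripTagsExcept(html, allowedTags) {
  // Normalize the allow-list so that lookups are case-insensitive.
  const allowed = new Set(allowedTags.map((name) => name.toLowerCase()));
  return html.replace(/<[^>]*>/g, (tag) => {
    // Extract the tag name after "<" or "</"; comments and doctypes have none.
    const match = /^<\/?([a-z][a-z0-9]*)/i.exec(tag);
    const name = match ? match[1].toLowerCase() : "";
    // Keep the tag only when its name is on the allow-list.
    return allowed.has(name) ? tag : "";
  });
}

console.log(stripTagsExcept('<p>Hi <IMG src="a.png"> <b>there</b><br/></p>', ["img", "br"]));
```

Calling it with `["img"]` gives the behavior asked for in the question.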
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Community |
| Solution 3 | mathk |
| Solution 4 | mathk |
| Solution 5 | Quang Tuyen Nguyen |
