'Better way to remove HTML tags in Oracle SQL

I have a comments column and the comments added to release are stored as rich text in comments column. Now i'm trying to process this data and get an human readable output. I'm providing 2 sample comment data i have in my oracle SQL db which i'm trying to process.

Example 1 :

    <html>
<body>
<div align="left"><font face="Arial Unicode MS"><span style="font-size:8pt">Display the frulog on the count values</span></font></div>
</body>
</html>

Example 2: <not implemented in this release>

i used the below query to process the html characters

Select (REGEXP_REPLACE(comments),'<.+?>') from test_table;

Note: consider values provided in Example 1 and Example 2 are passed in as column comments in the above SQL command.

the query result for Example 1 was Display the frulog on the count values which is what i'm expecting. result for Example 2 was ''. Value in Example 2 was not an html tag but it still replaced the tags. How can i make the replace statement smart.

Feel free to drop your suggestions .



Solution 1:[1]

SQL Fiddle

Oracle 11g R2 Schema Setup:

CREATE TABLE comments ( value ) AS
SELECT '<html>
<body>
<div align="left">
<font face="Arial Unicode MS">
<span style="font-size:8pt">
  Display the frulog on the count values
</span>
</font>
</div>
</body>
</html>' FROM DUAL UNION ALL
SELECT '<not implemented in this release>' FROM DUAL UNION ALL
SELECT '<test a="1"
  b=''2''
  c = 3
  d
  e = ">" >test</test>' FROM DUAL;

Query 1:

SELECT value,
       REGEXP_REPLACE(
         value,
         '\s*</?\w+((\s+\w+(\s*=\s*(".*?"|''.*?''|[^''">\s]+))?)+\s*|\s*)/?>\s*',
         NULL,
         1,
         0,
         'im'
       ) AS replaced
FROM   comments

Results:

|                                    VALUE |                               REPLACED |
|------------------------------------------|----------------------------------------|
| <html>                                   | Display the frulog on the count values |
| <body>                                   |                                        |
| <div align="left">                       |                                        |
| <font face="Arial Unicode MS">           |                                        |
| <span style="font-size:8pt">             |                                        |
| Display the frulog on the count values   |                                        |
| </span>                                  |                                        |
| </font>                                  |                                        |
| </div>                                   |                                        |
| </body>                                  |                                        |
| </html>                                  |                                        |
|------------------------------------------|----------------------------------------|
| <not implemented in this release>        |                                 (null) |
|------------------------------------------|----------------------------------------|
| <test a="1"                              |                                   test | 
|   b='2'                                  |                                        |
|   c = 3                                  |                                        |
|   d                                      |                                        |
|   e = ">" >test</test>                   |                                        |

Note: <not implemented in this release> is a valid HTML custom element with tag name not and attributes implemented, in, this and release.

If you only want to replace specific HTML elements then list them at the start of the regular expression:

\s*</?(a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|bgsound|big|blink|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|command|content|data|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|element|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|head|header|hgroup|hr|html|i|iframe|image|img|input|ins|isindex|kbd|keygen|label|legend|li|link|listing|main|map|mark|marquee|menu|menuitem|meta|meter|multicol|nav|nextid|nobr|noembed|noframes|noscript|object|ol|optgroup|option|output|p|param|picture|plaintext|pre|progress|q|rp|rt|rtc|ruby|s|samp|script|section|select|shadow|slot|small|source|spacer|span|strike|strong|style|sub|summary|sup|table|tbody|td|template|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr|xmp)((\s+\w+(\s*=\s*(".*?"|''.*?''|[^''">\s]+))?)+\s*|\s*)/?>\s*

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1