Removing HTML tags in pandas

I am using the pandas library with Python 3.5.1. How can I remove HTML tags from field values? My input and expected output:

[image: input and expected output]

My code returned an error:

import pandas as pd

code=[1,2,3]
overview =['<p>Environments subject.</p>',
          '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
          '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
# '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df= pd.DataFrame(overview,code)

df.columns = ['overview']
df['overview_copy'] = df['overview']

# print(df)

tags_list = ['<p>' ,'</p>' , '<p*>',
             '<ul>','</ul>',
             '<li>','</li>',
             '<br>',
             '<strong>','</strong>',
             '<span*>','</span>',
             '<a href*>','</a>',
             '<em>','</em>']

for tag in tags_list:
#     df['overview_copy'] = df['overview_copy'].str.replace(tag, '')
  df['overview_copy'].replace(to_replace=tag, value='', regex=True, inplace=True)
print(df)


Solution 1:[1]

Use a regular expression with re.sub, like so:

re.sub(r'<[^<]+?>', '', text)

You can find a detailed answer there.
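As a minimal sketch of how that one-liner could be applied to the sample strings from the question, the snippet below precompiles the pattern and wraps it in a helper. The names TAG_RE and strip_tags are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical helper names; the pattern is the one from the answer above.
TAG_RE = re.compile(r'<[^<]+?>')

def strip_tags(text):
    """Remove anything that looks like an HTML tag from a string."""
    return TAG_RE.sub('', text)

overview = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]

cleaned = [strip_tags(text) for text in overview]
print(cleaned)
# ['Environments subject.', ' property ;markets and exchange;', '']
```

The cleaned list can then be assigned to a DataFrame column, or the helper can be passed to Series.apply.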

Solution 2:[2]

The pandas way is to use Series.str.replace:

df['overview_copy'] = df['overview_copy'].str.replace(r'<[^<>]*>', '', regex=True)

Details:

  • < - a < char
  • [^<>]* - zero or more chars other than < and >, as many as possible
  • > - a > char.

See the regex demo.

Pandas output:

>>> df['overview_copy']
1               Environments subject.
2     property ;markets and exchange;
3                                    
Name: overview_copy, dtype: object
>>> 
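For context, here is a short sketch of why the glob-style entries in the question's tags_list (such as '<p*>') probably do not behave as intended once regex=True is set. In a regular expression, p* means "zero or more p characters", not "p followed by anything". The variable names below are illustrative only.

```python
import re

tag = '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">'

# As a regex, '<p*>' means '<', then zero or more 'p' characters, then '>'.
glob_like = re.compile(r'<p*>')
# The pattern from this answer: '<', any run of non-angle-bracket chars, '>'.
generic = re.compile(r'<[^<>]*>')

print(glob_like.search(tag))      # None: the attributes are not covered by 'p*'
print(glob_like.search('<p>'))    # a match, but only for the bare tag
print(repr(generic.sub('', tag))) # '' because the whole tag is removed
```

This is why a single generic tag pattern tends to be simpler than listing every tag name.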

Solution 3:[3]

Note that if the column of HTML-tagged data is available as a list, it is much faster to remove the tags before creating the DataFrame. (This will not always be possible when loading data from an external source.) Even for this small example, it is consistently about 10 times faster.

import re
import pandas as pd
from timeit import default_timer as timer

code = [1, 2, 3]
overview = ['<p>Environments subject.</p>',
          '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
          '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
# '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df = pd.DataFrame({'overview': overview, 'code': code})

start = timer()
overview = [re.sub(r'<[^<]+?>', '', text) for text in overview]
end = timer()
re_sub_time = end - start
print("re_sub time:", re_sub_time)

start = timer()
df['overview_copy'] = df['overview'].str.replace(r'<[^<>]*>', '', regex=True)
# df['overview_copy'] = df['overview'].str.replace(r'<[^<]+?>', '', regex=True)
end = timer()
str_replace_time = end - start
print("Pandas str.replace time:", str_replace_time)

print("Ratio:", str_replace_time / re_sub_time)

Note that the speed improvement is not due to the slight difference in regular expressions used in the other examples. I tested both regular expressions, and stripping tags is faster in the list with either regex.

Output:

re_sub time: 8.690000000000087e-05
Pandas str.replace time: 0.0010488999999999082
Ratio: 12.070195627156476
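A single timed run like the one above can be noisy. As a hedged variation, the sketch below uses the standard-library timeit.repeat with the same regex on both sides and takes the best of several repeats. The function names and the repeat counts are arbitrary choices for illustration, and the measured ratio will vary by machine and pandas version.

```python
import re
import timeit

import pandas as pd

overview = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]
df = pd.DataFrame({'overview': overview})
pattern = r'<[^<>]*>'  # same regex for both approaches, so only the call path differs

def strip_in_list():
    # Plain Python: strip tags before any DataFrame is involved.
    return [re.sub(pattern, '', text) for text in overview]

def strip_in_series():
    # pandas: strip tags on an existing column.
    return df['overview'].str.replace(pattern, '', regex=True)

# Best of 5 repeats, 100 calls each, to reduce noise from a single run.
list_time = min(timeit.repeat(strip_in_list, number=100, repeat=5))
series_time = min(timeit.repeat(strip_in_series, number=100, repeat=5))

print("list time:", list_time)
print("Series time:", series_time)
print("Ratio:", series_time / list_time)
```

Both functions should produce the same cleaned strings, so the comparison measures overhead rather than a difference in results.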

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Pobe
Solution 2    Wiktor Stribiżew
Solution 3    Bill the Lizard