How to access data frame columns as text during object instantiation in Python
I am trying to create a class to pre-process a text dataset. After creating an instance of my class, I want to call some of its methods on a column of the data frame, but it does not work. This is what I tried:
import re
from bs4 import BeautifulSoup

class Preprocessor:
    def __init__(self, dataset):
        self.dataset = dataset

    def strip_html(self, text):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    def remove_between_square_brackets(self, text):
        return re.sub(r'\[[^]]*\]', '', text)

    def denoise_text(self, text):
        text = self.strip_html(text)
        text = self.remove_between_square_brackets(text)
        return text
I try calling the methods here:
trial = Preprocessor(dataset['review'])
trial.strip_html(dataset['review'])
I get this error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-45-26f1c4298563> in <module>()
----> 1 trial.strip_html(dataset['review'])
6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
1536 def __nonzero__(self):
1537 raise ValueError(
-> 1538 f"The truth value of a {type(self).__name__} is ambiguous. "
1539 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1540 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Solution 1:[1]
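The BeautifulSoup constructor expects a single string of markup as input (get_text() is then called on the parsed result). Hence it cannot be used directly with a pandas Series: as the traceback shows, the Series ends up being evaluated for its truth value, which pandas refuses to do.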
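To see the error in isolation, here is a minimal sketch that leaves BeautifulSoup out entirely. The variable names and sample strings are made up for illustration; the only assumption is that the column holds plain strings.

```python
import pandas as pd

# Hypothetical sample data standing in for dataset['review'].
reviews = pd.Series(["<b>good</b>", "<i>excellent</i>"])

try:
    # Any boolean test on a whole Series raises the same ValueError,
    # because several values cannot be reduced to one True/False.
    if reviews:
        pass
except ValueError as exc:
    message = str(exc)
    print(message)

# A single element, by contrast, is an ordinary Python string.
first = reviews.iloc[0]
print(type(first).__name__)
```

So a function written for one string has to be applied to each element rather than to the Series as a whole.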
One way to achieve this would be to iterate over each element in the series and apply the method to it:
import pandas as pd
from bs4 import BeautifulSoup

class Preprocessor:
    def __init__(self, dataset):
        self.dataset = dataset

    @staticmethod
    def soup_and_strip(text):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    def strip_html(self):
        return self.dataset.apply(self.soup_and_strip)

if __name__ == '__main__':
    df = pd.DataFrame(
        {'review': ['<b>good</b>', '<i>excellent</i>', '<h1>splendid</h1>']})
    trial = Preprocessor(df['review'])
    print(trial.strip_html())
Remark: Your overall idea of a preprocessor is good, but the implementation is a bit unusual. You initialize the Preprocessor with the required data, but instead of using this data directly in its methods, you pass the data in again as an argument. You might want to look up some tutorials on class usage.
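As a rough sketch of what "using the stored data directly" could look like for the other methods in the question, the variant below keeps the per-string helpers static and lets denoise_text work on the Series passed to the constructor. Chaining two apply calls is just one possible arrangement, and the sample reviews are invented for the example.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup


class Preprocessor:
    def __init__(self, dataset):
        # dataset: a pandas Series whose elements are strings
        self.dataset = dataset

    @staticmethod
    def soup_and_strip(text):
        # text: one string containing HTML markup
        return BeautifulSoup(text, "html.parser").get_text()

    @staticmethod
    def remove_between_square_brackets(text):
        # text: one string; drops every [bracketed] span
        return re.sub(r"\[[^]]*\]", "", text)

    def denoise_text(self):
        # Run both per-string helpers over every element of the stored Series.
        stripped = self.dataset.apply(self.soup_and_strip)
        return stripped.apply(self.remove_between_square_brackets)


# Invented sample data for illustration only.
df = pd.DataFrame({"review": ["<b>good</b>[1]", "<i>excellent</i>[edit]"]})
cleaned = Preprocessor(df["review"]).denoise_text()
print(cleaned.tolist())
```

With this layout the caller never passes the column a second time; the instance already knows which data it works on.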
Another piece of advice: name your arguments appropriately. Calling an argument "text" but passing it a pandas Series is confusing (and, in this case, the source of the problem you describe).
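One way to make such a mismatch visible early is to annotate the argument and check its type. The isinstance guard and the error wording below are illustrative additions, not part of the original code.

```python
import re


def remove_between_square_brackets(text: str) -> str:
    """Remove every [bracketed] span from a single string."""
    if not isinstance(text, str):
        # Fail early with a readable message instead of a deep library traceback.
        raise TypeError(f"expected a str, got {type(text).__name__}")
    return re.sub(r"\[[^]]*\]", "", text)


print(remove_between_square_brackets("great[spoiler] film"))

try:
    # Passing a collection where one string is expected now fails clearly.
    remove_between_square_brackets(["not", "a", "string"])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

With a guard like this, passing a Series would report the wrong type by name instead of the ambiguous truth value error.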
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Christian Karcher |
