How to access data frame columns as text during object instantiation in Python

I am trying to create a class to pre-process a text dataset. After creating an instance of my class, I want to call some of its methods on a column of the data frame, but it does not work. This is what I tried:

import re
from bs4 import BeautifulSoup

class Preprocessor:
    def __init__(self, dataset):
        self.dataset = dataset

    def strip_html(self, text):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    def remove_between_square_brackets(self, text):
        return re.sub(r'\[[^]]*\]', '', text)

    def denoise_text(self, text):
        text = self.strip_html(text)
        text = self.remove_between_square_brackets(text)
        return text

I call the methods like this:

trial = Preprocessor(dataset['review'])
trial.strip_html(dataset['review'])

I get this error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-26f1c4298563> in <module>()
----> 1 trial.strip_html(dataset['review'])

6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
   1536     def __nonzero__(self):
   1537         raise ValueError(
-> 1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1540         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


Solution 1:[1]

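The BeautifulSoup constructor expects a single piece of markup (a string), so it cannot be used directly on a pandas Series. When it receives a Series, it ends up evaluating the Series in a boolean context, which pandas refuses to do; that is the ValueError at the bottom of your traceback.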
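The error itself can be reproduced without BeautifulSoup, which may make the cause easier to see. The following is a minimal sketch; the sample data and the helper name `truth_value_error` are made up for illustration.

```python
import pandas as pd

# Illustrative data standing in for dataset['review'].
reviews = pd.Series(["<b>good</b>", "<i>excellent</i>"])


def truth_value_error(series):
    """Return the message raised when a Series is used as a boolean, or None."""
    try:
        bool(series)  # what an `if series:` style check does under the hood
    except ValueError as exc:
        return str(exc)
    return None


message = truth_value_error(reviews)
print(message)

# Each individual element, on the other hand, is an ordinary string.
print(all(isinstance(item, str) for item in reviews))
```

Any library code that does a truth test on its argument will fail in the same way when handed a whole Series instead of one of its elements.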

One way around this is to apply the string-level function to each element of the Series, for example with Series.apply:

import pandas as pd
from bs4 import BeautifulSoup


class Preprocessor:
    def __init__(self, dataset):
        # dataset is expected to be a pandas Series of strings
        self.dataset = dataset

    @staticmethod
    def soup_and_strip(text):
        # Operates on a single string
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    def strip_html(self):
        # Operates on the stored Series, one element at a time
        return self.dataset.apply(self.soup_and_strip)


if __name__ == '__main__':
    df = pd.DataFrame(
        {'review': ['<b>good</b>', '<i>excellent</i>', '<h1>splendid</h1>']})
    trial = Preprocessor(df['review'])
    print(trial.strip_html())
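As an aside, the bracket-removal step from the question may not need a per-element Python function at all, since pandas' `.str` accessor can apply a regular expression to every element. This is only a sketch with made-up sample data; `regex=True` is passed explicitly because the default for that parameter differs between pandas versions.

```python
import pandas as pd

# Illustrative data; not taken from the original dataset.
reviews = pd.Series(["good[1]", "excellent[citation needed]", "splendid"])

# Same pattern as remove_between_square_brackets in the question,
# applied to every element through the .str accessor.
without_brackets = reviews.str.replace(r"\[[^]]*\]", "", regex=True)
print(without_brackets.tolist())
```

The HTML stripping still has to go through `apply`, because BeautifulSoup has no vectorized counterpart in pandas.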

Remark: Your overall idea of a preprocessor is good, but the implementation is a bit odd. You initialize the Preprocessor with the required data, but instead of using that data directly in its methods, you pass the data in again as an argument. You might want to look up some tutorials on class usage.

Another piece of advice: name your arguments appropriately. Calling an argument "text" but passing in a pandas Series is confusing (and, in this case, the source of the problem you describe).
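Putting both remarks together, the full denoise_text pipeline from the question could look roughly like the sketch below. The attribute name `reviews`, the method name `denoise`, the type hints, and the sample data are illustrative choices, not part of the original code: helpers that take `text` work on one string, and only `denoise` touches the stored Series.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup


class Preprocessor:
    """Cleans a pandas Series of review strings (names are illustrative)."""

    def __init__(self, reviews: pd.Series):
        self.reviews = reviews

    @staticmethod
    def strip_html(text: str) -> str:
        # Works on ONE string: parse it and keep only the visible text.
        return BeautifulSoup(text, "html.parser").get_text()

    @staticmethod
    def remove_between_square_brackets(text: str) -> str:
        # Works on ONE string: drop anything of the form [ ... ].
        return re.sub(r"\[[^]]*\]", "", text)

    def denoise_text(self, text: str) -> str:
        # Chain the per-string helpers, as in the question's denoise_text.
        text = self.strip_html(text)
        return self.remove_between_square_brackets(text)

    def denoise(self) -> pd.Series:
        # Works on the stored Series: run the per-string pipeline on each element.
        return self.reviews.apply(self.denoise_text)


df = pd.DataFrame(
    {"review": ["<b>good</b>[1]", "<i>excellent</i>[ref]", "<h1>splendid</h1>"]}
)
cleaned = Preprocessor(df["review"]).denoise()
print(cleaned.tolist())
```

With this layout the data is passed in once, at construction time, and a call such as `Preprocessor(df["review"]).denoise()` never needs the column handed to it a second time.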

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Christian Karcher