NLP stopword removal, stemming and lemmatization

def clean_text(text):
    # get English stopwords
    english_stopwords = set(stopwords.words('english'))

    # change to lower case and remove punctuation
    #text = text.lower().translate(str.maketrans('', '', string.punctuation))
    text = text.map(lambda x: x.lower().translate(str.maketrans('', '', string.punctuation)))

    # divide string into individual words
    def custom_tokenize(text):
        if not text:
            #print('The text to be tokenized is a None type. Defaulting to blank string.')
            text = ''
        return word_tokenize(text)

    token = df['transcription'].apply(custom_tokenize)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        tok = tok.strip("#")
        #tok = tok.strip() # remove space
        if tok not in english_stopwords:
            clean_tok = lemmatizer.lemmatize(tok)  # lemmatization
            clean_tok = stemmer.stem(clean_tok)    # stemming
            clean_tokens.append(clean_tok)
    return " ".join(clean_tokens)

     22     #tok = [[tok for tok in sent if tok not in stop] for sent in text]
     23     for tok in tokens:
---> 24         tok = tok.strip("#")
     25         #tok = tok.strip() # remove space
     26         if tok not in english_stopwords:

AttributeError: 'list' object has no attribute 'strip'

I have been getting this error: AttributeError: 'list' object has no attribute 'strip'.



Solution 1:[1]

Exactly what it says: you are trying to strip a list, and you can only strip strings. That is why Python throws the error.

Are you perhaps mixing up the variables 'token' and 'tokens'? You assign the tokenized column to token but loop over tokens, and each element of that column is itself a list of tokens, not a single string.
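
As a minimal sketch of one way to restructure this (assuming the NLTK imports from the question, a 'transcription' column of strings, and an illustrative sample row), the cleanup can be done one string at a time so that each tok really is a string:

# may require: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
import string
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

english_stopwords = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # work on ONE string: handle missing values, lower-case, strip punctuation
    if not isinstance(text, str):
        text = ''
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    clean_tokens = []
    for tok in word_tokenize(text):        # tok is a string here, so strip() works
        tok = tok.strip("#")
        if tok and tok not in english_stopwords:
            clean_tok = lemmatizer.lemmatize(tok)  # lemmatization
            clean_tok = stemmer.stem(clean_tok)    # stemming
            clean_tokens.append(clean_tok)
    return " ".join(clean_tokens)

# apply the cleaner row by row (illustrative data)
df = pd.DataFrame({'transcription': ['The patients were examined for #flu symptoms.']})
df['clean_transcription'] = df['transcription'].apply(clean_text)
print(df['clean_transcription'][0])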

Solution 2:[2]

Lemmatization already takes care of stemming, so you don't have to do both.

Stemming may change the meaning of a word. For example, 'pie' and 'pies' will both be reduced to 'pi', but lemmatization preserves the meaning and identifies the root word 'pie'.
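
A quick way to see the difference is to run both on a few sample words (an illustrative NLTK snippet; the lemmatizer needs the wordnet corpus, e.g. nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['pies', 'studies', 'meeting']:
    # the stemmer chops suffixes and can produce non-words,
    # while the lemmatizer maps each word to its WordNet dictionary form
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word))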

Assuming your data is in a pandas DataFrame, here's my solution for doing stop word removal and lemmatization in a more elegant way when preprocessing text data for an NLP problem:

import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.utils import lemmatize

nltk.download('stopwords') # comment out if already downloaded
nltk.download('punkt')     # comment out if already downloaded

df = pd.read_csv('/path/to/text_dataset.csv')

# convert to lower case
df = df.apply(lambda x: x.str.lower())

# replace special characters with a space (keeping only letters, digits and spaces)
df = df.apply(lambda x: [re.sub('[^a-z0-9]', ' ', i) for i in x])

# tokenize columns 
df = df.apply(lambda x:[word_tokenize(i) for i in x])

# remove stop words from token list in each column
df = df.apply(
    lambda x: [
               [ w for w in tokenlist if w not in stopwords.words('english')] 
               for tokenlist in x])

# lemmatize columns
# the lemmatize method may fail during the first 3 to 4 iterations, 
# so try running it several times
for attempt in range(1, 11):
  try:
    print(f'Lemmatize attempt: {attempt}')
    df = df.apply(
        lambda x: [ [  l.decode('utf-8').split('/', 1)[0]        
                    for word in tokenlist for l in lemmatize(word) ]
                  for tokenlist in x])
    print(f'Attempt {attempt} success!')
    break
  except:
    pass

gensim.utils requires the pattern package to run lemmatize(). If you don't already have it, install it with

pip install pattern

The Gensim lemmatizer outputs a list of byte strings, each with its POS (part-of-speech) tag attached. For example, 'finding' is converted to [b'find/VB']. I have added an extra loop to decode the byte string and remove the POS tag.
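
For instance, stripping the tag from one such entry works like this (using the value from the example above):

entry = b'find/VB'                                # what lemmatize('finding') yields
word = entry.decode('utf-8').split('/', 1)[0]     # decode the bytes and drop the POS tag
print(word)                                       # find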

If you have non-text data in certain columns, apply the transformations only to the text columns, like this:

textcols = ['column1', 'column2', 'column3']
df[textcols] = df[textcols].apply(lambda x: ... )
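
For example, restricting just the lower-casing step to those (hypothetical) text columns would be:

textcols = ['column1', 'column2', 'column3']                # hypothetical text columns
df[textcols] = df[textcols].apply(lambda x: x.str.lower())  # lower-case only these columns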

Note: If you're applying these steps to just one column, here's the modified version.

df['column'] = df['column'].apply(lambda x: x.lower())
df['column'] = df['column'].apply(lambda x: re.sub('[^a-z0-9]', ' ', x))
df['column'] = df['column'].apply(lambda x: word_tokenize(x))
df['column'] = df['column'].apply(
    lambda x: [ token for token in x 
               if token not in stopwords.words('english')] )
for attempt in range(1, 11):
  try:
    print(f'Lemmatize attempt: {attempt}')
    df['column'] = df['column'].apply(
        lambda x: [l.decode('utf-8').split('/', 1)[0] 
                  for word in x for l in lemmatize(word)])
    print(f'Attempt {attempt} success!')
    break
  except:
    pass

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Robert
Solution 2