NLP stopword removal, stemming and lemmatization
```python
def clean_text(text):
    # get English stopwords
    english_stopwords = set(stopwords.words('english'))

    # change to lower case and remove punctuation
    #text = text.lower().translate(str.maketrans('', '', string.punctuation))
    text = text.map(lambda x: x.lower().translate(str.maketrans('', '', string.punctuation)))

    # divide string into individual words
    def custom_tokenize(text):
        if not text:
            #print('The text to be tokenized is a None type. Defaulting to blank string.')
            text = ''
        return word_tokenize(text)

    token = df['transcription'].apply(custom_tokenize)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        tok = tok.strip("#")
        #tok = tok.strip() # remove space
        if tok not in english_stopwords:
            clean_tok = lemmatizer.lemmatize(tok)  # lemmatization
            clean_tok = stemmer.stem(clean_tok)    # stemming
            clean_tokens.append(clean_tok)
    return " ".join(clean_tokens)
```
```
     22     #tok = [[tok for tok in sent if tok not in stop] for sent in text]
     23     for tok in tokens:
---> 24         tok = tok.strip("#")
     25         #tok = tok.strip() # remove space
     26         if tok not in english_stopwords:

AttributeError: 'list' object has no attribute 'strip'
```
I have been getting this error: `AttributeError: 'list' object has no attribute 'strip'`.
Solution 1:[1]
Exactly what it says: you are trying to strip a list, and you can only strip strings. That is why Python throws you an error. Your tokenizer returns a *list* of words for each row, so when you loop over the collection of results, each `tok` is an entire token list rather than a single word.
Are you perhaps mixing up the variables `token` and `tokens`?
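As a minimal sketch of one way to fix the loop (keeping the stemmer, lemmatizer and stopword set from the question; the `tokens` sample data here is made up to stand in for `df['transcription'].apply(custom_tokenize)`):

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('stopwords'); nltk.download('wordnet')  # once

english_stopwords = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stand-in for the tokenized column: one token list per row
tokens = [['the', 'patient', 'reports', 'mild', 'headaches'],
          ['no', 'known', 'allergies']]

clean_tokens = []
for token_list in tokens:       # each element is a whole list of words...
    for tok in token_list:      # ...so loop once more to get single strings
        tok = tok.strip("#")    # tok is now a string, so .strip() works
        if tok not in english_stopwords:
            clean_tokens.append(stemmer.stem(lemmatizer.lemmatize(tok)))
print(clean_tokens)
```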
Solution 2:[2]
Lemmatization already takes care of stemming so you don't have to do both.
Stemming may change the meaning of a word: for example, 'pie' and 'pies' will both be reduced to 'pi', whereas lemmatization preserves the meaning and identifies the root word 'pie'.
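To see the difference concretely, here's a quick sketch with NLTK's stemmer and lemmatizer (the wordnet corpus must be downloaded once):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # needed once for WordNetLemmatizer

print(PorterStemmer().stem('pies'))           # -> 'pi'  (meaning lost)
print(WordNetLemmatizer().lemmatize('pies'))  # -> 'pie' (root word preserved)
```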
Assuming your data is in a pandas DataFrame, here's my solution for doing stop word removal and lemmatization in a more elegant way when preprocessing text data for an NLP problem:
```python
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.utils import lemmatize

nltk.download('stopwords')  # comment out if already downloaded
nltk.download('punkt')      # comment out if already downloaded

df = pd.read_csv('/path/to/text_dataset.csv')

# convert to lower case
df = df.apply(lambda x: x.str.lower())

# replace special characters (preserving only space)
df = df.apply(lambda x: [re.sub('[^a-z0-9]', ' ', i) for i in x])

# tokenize columns
df = df.apply(lambda x: [word_tokenize(i) for i in x])

# remove stop words from the token list in each column
df = df.apply(
    lambda x: [
        [w for w in tokenlist if w not in stopwords.words('english')]
        for tokenlist in x])

# lemmatize columns
# the lemmatize method may fail during the first 3 to 4 iterations,
# so try running it several times
for attempt in range(1, 11):
    try:
        print(f'Lemmatize attempt: {attempt}')
        df = df.apply(
            lambda x: [[l.decode('utf-8').split('/', 1)[0]
                        for word in tokenlist for l in lemmatize(word)]
                       for tokenlist in x])
        print(f'Attempt {attempt} success!')
        break
    except:
        pass
```
`gensim.utils` requires the `pattern` package to run `lemmatize()`. If you don't already have it, install it with:

```
pip install pattern
```
The Gensim lemmatizer gives a list of byte strings as output, each tagged with its POS (part-of-speech) tag: e.g. 'finding' will be converted to `[b'find/VB']`. I have added an extra loop to decode each byte string to a text string and remove the POS tag.
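Here's that conversion step in isolation, using the `[b'find/VB']` output from the example above:

```python
raw = [b'find/VB']  # what lemmatize('finding') returns

# decode each byte string and drop everything from the '/' onwards (the POS tag)
words = [l.decode('utf-8').split('/', 1)[0] for l in raw]
print(words)  # ['find']
```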
If you have non-text data in certain columns, apply the transformations only to the text columns, like this:

```python
textcols = ['column1', 'column2', 'column3']
df[textcols] = df[textcols].apply(lambda x: ... )
```
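For example, to restrict the lower-casing step from above to just those (hypothetical) text columns:

```python
# lower-case only the chosen text columns; other columns stay untouched
df[textcols] = df[textcols].apply(lambda x: x.str.lower())
```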
Note: if you're applying these steps to just one column, here's the modified version:
```python
df['column'] = df['column'].apply(lambda x: x.lower())
df['column'] = df['column'].apply(lambda x: re.sub('[^a-z0-9]', ' ', x))
df['column'] = df['column'].apply(lambda x: word_tokenize(x))
df['column'] = df['column'].apply(
    lambda x: [token for token in x
               if token not in stopwords.words('english')])

for attempt in range(1, 11):
    try:
        print(f'Lemmatize attempt: {attempt}')
        df['column'] = df['column'].apply(
            lambda x: [l.decode('utf-8').split('/', 1)[0]
                       for word in x for l in lemmatize(word)])
        print(f'Attempt {attempt} success!')
        break
    except:
        pass
```
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

| Solution | Source |
|---|---|
| Solution 1 | Robert |
| Solution 2 | |
