How to go through each row with pandas apply() and lambda to clean sentence tokens?
My goal is to create a cleaned column of the tokenized sentences within the existing dataframe. The dataset is a pandas dataframe that looks like this:
| Index | tokenized_sents |
|---|---|
| First | [Donald, Trump, just, couldn, t, wish, all, Am] |
| Second | [On, Friday, ,, it, was, revealed, that] |
dataset['cleaned_sents'] = dataset.apply(
    lambda row: [w for w in row["tokenized_sents"] if len(w) > 2 and w.lower() not in stop_words],
    axis=1
)
My current output is the dataframe without that extra column.
Current output:
tokenized_sents
0    [Donald, Trump, just, couldn, t, wish, all, Am...
Wanted output:
tokenized_sents
0    [Donald, Trump, just, couldn, wish, all...
Basically, I want to remove all the stop words and short words.
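For context, here is a minimal, self-contained version of this setup. The `stop_words` set is an assumption: the question never shows where it comes from, so NLTK's English list is used as a stand-in.

```python
import pandas as pd
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')

# Assumed stop-word set -- the question does not show how stop_words is built.
stop_words = set(stopwords.words('english'))

# Sample frame mirroring the question's data.
dataset = pd.DataFrame({
    'tokenized_sents': [
        ['Donald', 'Trump', 'just', 'couldn', 't', 'wish', 'all', 'Am'],
        ['On', 'Friday', ',', 'it', 'was', 'revealed', 'that'],
    ]
})
```

With this frame, the apply above runs and adds the cleaned_sents column, and both solutions below can be run against it as-is.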
Solution 1:[1]
Create a sentence index
dataset['gid'] = range(1, dataset.shape[0] + 1)
tokenized_sents gid
0 [This, is, a, test] 1
1 [and, this, too!] 2
Then explode the dataframe
clean_df = dataset.explode('tokenized_sents')
tokenized_sents gid
0 This 1
0 is 1
0 a 1
0 test 1
1 and 2
1 this 2
1 too! 2
Do all the cleaning on this dataframe and use the gid column to group the tokens back into sentences. Vectorized string operations on the exploded frame are typically much faster than a row-wise apply.
clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]
...
To get it back,
clean_dataset = clean_df.groupby('gid').agg(list)
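Putting Solution 1 together on the sample frame from above, as a sketch: the stop-word filter here stands in for the elided cleaning steps, which the answer leaves open.

```python
# Number each sentence, explode to one token per row, clean, then regroup.
dataset['gid'] = range(1, dataset.shape[0] + 1)

clean_df = dataset[['gid', 'tokenized_sents']].explode('tokenized_sents')
clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]                   # drop short tokens
clean_df = clean_df[~clean_df.tokenized_sents.str.lower().isin(stop_words)]   # drop stop words

clean_dataset = clean_df.groupby('gid').agg(list)  # one token list per sentence again
print(clean_dataset)
```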
Solution 2:[2]
Fix your code
dataset['new'] = dataset['tokenized_sents'].map(
    lambda x: [t for t in x if len(t) > 2 and t.lower() not in stop_words]
)
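(The answer as posted referenced an undefined `stop` list and compared tokens as-is; the version above uses the question's `stop_words` and lowercases each token to match the question's own lambda.) Mapping over the single column also sidesteps row indexing entirely, and it is typically faster than apply(..., axis=1), since pandas does not have to build a row object for every row; for element-wise work on one column, Series.map is the idiomatic choice.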
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Vishnudev |
| Solution 2 | BENY |
