Extract sentence embedding features with pandas and spaCy

I'm currently learning spaCy, and I have an exercise on word and sentence embeddings. The sentences are stored in a pandas DataFrame column, and we're asked to train a classifier based on the vectors of these sentences.

I have a dataframe that looks like this:

+---+---------------------------------------------------+
|   |                                          sentence |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+

Next, I apply an NLP function to these sentences:

import en_core_web_md
nlp = en_core_web_md.load()
df['tokenized'] = df['sentence'].apply(nlp)

Now, if I understand correctly, each item in df['tokenized'] has a vector attribute that returns the sentence vector as a 1D NumPy array.

print(type(df['tokenized'][0].vector))
print(df['tokenized'][0].vector.shape)

yields

<class 'numpy.ndarray'>
(300,)

How do I add the contents of this array (300 values) as columns to the df dataframe for the corresponding sentence, ignoring stop words?

Thanks!



Solution 1:[1]

Assume you have a list of sentences:

sents = ["'Whitey on the Moon' is a 1970 spoken word"
         , "St Anselm's Church is a Roman Catholic church"
         , "Nymphargus grandisonae (common name: giant)"]

that you put into a dataframe:

import pandas as pd

df = pd.DataFrame({"sentence": sents})
print(df)
                                        sentence
0     'Whitey on the Moon' is a 1970 spoken word
1  St Anselm's Church is a Roman Catholic church
2    Nymphargus grandisonae (common name: giant)

Then you may proceed as follows:

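import numpy as np

df['tokenized'] = df['sentence'].apply(nlp)
# axis=0 averages across tokens, leaving one 300-dimensional vector per sentence
df['sent_vectors'] = df['tokenized'].apply(
    lambda doc: np.mean([token.vector for token in doc if not token.is_stop], axis=0)
)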

The resulting sent_vectors column holds, for each sentence, the mean of the vector embeddings of all tokens that are not stop words (filtered with the token.is_stop attribute).
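To get one DataFrame column per vector dimension, as the question asks, the per-row arrays can be stacked into a matrix and attached to the original rows. The following is a minimal sketch, not part of the original answer: fake_vector is a hypothetical stand-in for the spaCy sentence vector (4 dimensions instead of 300) so that the example runs without a model, and the vec_0, vec_1, ... column names are arbitrary.

```python
import numpy as np
import pandas as pd


def fake_vector(text):
    # Hypothetical stand-in for the averaged spaCy vector of a sentence.
    return np.array([len(text), text.count(" "), text.count("e"), 1.0])


df = pd.DataFrame({"sentence": ["first text", "second text", "third text"]})
df["sent_vectors"] = df["sentence"].apply(fake_vector)

# Stack the per-row arrays into one (n_rows, n_dims) matrix.
matrix = np.vstack(df["sent_vectors"].tolist())

# One column per dimension, aligned on the original index.
vec_cols = pd.DataFrame(
    matrix,
    index=df.index,
    columns=[f"vec_{i}" for i in range(matrix.shape[1])],
)

# Attach the new feature columns to the original rows.
df_features = pd.concat([df, vec_cols], axis=1)
print(df_features.shape)
```

With real spaCy vectors the same steps would produce 300 feature columns, and the matrix itself could presumably be passed straight to a classifier.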

Note 1: Each item in the tokenized column is not a plain sentence but an instance of spaCy's Doc class.

Note 2: Though you may prefer to go through a pandas dataframe, the recommended way would be through a getter extension:

import numpy as np
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_md")

sents = ["'Whitey on the Moon' is a 1970 spoken word"
         , "St Anselm's Church is a Roman Catholic church"
         , "Nymphargus grandisonae (common name: giant)"]

# Getter: mean of the non-stop-word token vectors (axis=0 keeps all 300 dimensions)
vector_except_stopwords = lambda doc: np.mean([token.vector for token in doc if not token.is_stop], axis=0)
Doc.set_extension("vector_except_stopwords", getter=vector_except_stopwords)

vecs =[] # for demonstration purposes
for doc in nlp.pipe(sents):
    vecs.append(doc._.vector_except_stopwords)
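
One edge case worth keeping in mind when averaging token vectors: if a text consists only of stop words, the list of vectors is empty and np.mean returns nan with a warning. Also, np.mean over a list of vectors collapses everything to a single scalar unless axis=0 is given. The sketch below illustrates both points with plain NumPy arrays; mean_vector is a hypothetical helper, and falling back to a zero vector is an assumption made here, not something spaCy prescribes.

```python
import numpy as np


def mean_vector(vectors, dim):
    """Average a list of 1D vectors; return zeros if the list is empty."""
    if len(vectors) == 0:
        # Assumed fallback for texts with no non-stop-word tokens.
        return np.zeros(dim, dtype=np.float32)
    # axis=0 averages across vectors and keeps one value per dimension.
    return np.mean(vectors, axis=0)


vectors = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]

per_dimension = mean_vector(vectors, dim=3)  # one value per dimension
collapsed = np.mean(vectors)                 # no axis: a single scalar
fallback = mean_vector([], dim=3)            # empty input: zero vector

print(per_dimension.shape, np.ndim(collapsed), fallback.shape)
```

In the getter above, dim would be the width of the model's vectors (300 for the vectors shown in the question).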

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
