Looping through each row in array to calculate cosine similarity
I have a subset of a dataframe that looks like:
<OUT>
PageNumber english_only_tags
175 flower architecture people
162 hair red bobbles sweets flower
576 sweets chocolate shop people
I have transformed each row of the english_only_tags column into a vector via TF-IDF and am calculating cosine similarity to compare each vector with every other vector in the corpus. I am using scikit-learn for all of this.
The intended output is:
<OUT> (made-up numbers)
0 0 1 2 3 ...
1 0.45 0.34 0.76 0.21
2 0.32 0.65 0.71 0.31
3 0.44 0.34 0.72 0.65
...
*where 0-3 are the vectors and the numbers within the matrix are the cosine similarity values.*
I am trying to write a function to do this with the hope of attaching all outputs into an exportable CSV. I have this so far, but I am unsure how to loop through each row in the array and calculate cosine similarity for one vector to all other vectors:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

def cos_similarity(vector):
    '''
    Calculate cosine/semantic similarity between one vector/row with every other vector in the array. Append the
    values to a dataframe and export to csv.
    '''
    counter = 1
    for i in vector:
        cosine_similarities = linear_kernel(Vectors[:], Vectors).flatten()
        list = cosine_similarities.tolist() #convert array to list
        df = pd.DataFrame(list) #convert list to dataframe
        #return concatenated dataframe with all cosine similarity calculations for every vector
        matrix = pd.concat(pd.DataFrame(df), axis=1)
        counter += 1 #loop through function to next row in array
    return matrix.to_csv("Semantic_similarity_matrix.csv", mode='a', index = False, header=False)
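For reference, a minimal sketch of one way to get the whole matrix without an explicit loop. scikit-learn's pairwise functions compare every row with every other row in a single call, so the result is already the n x n matrix described above. The sample rows are copied from the question; the function name similarity_matrix and its parameters are hypothetical, and labelling rows and columns by PageNumber is an assumption about the desired layout. Because TfidfVectorizer L2-normalises rows by default, linear_kernel on the same TF-IDF matrix would give the same values as cosine_similarity here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample rows copied from the question; the real dataframe is assumed
# to have the same two columns.
df = pd.DataFrame({
    "PageNumber": [175, 162, 576],
    "english_only_tags": [
        "flower architecture people",
        "hair red bobbles sweets flower",
        "sweets chocolate shop people",
    ],
})


def similarity_matrix(frame, text_col="english_only_tags", label_col="PageNumber"):
    """Return a labelled DataFrame of pairwise cosine similarities.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # One TF-IDF row per document: sparse matrix of shape (n_rows, n_terms).
    tfidf = TfidfVectorizer().fit_transform(frame[text_col])
    # All pairs at once: dense array of shape (n_rows, n_rows).
    sims = cosine_similarity(tfidf)
    # Assumption: label both axes with the page numbers.
    labels = frame[label_col].to_numpy()
    return pd.DataFrame(sims, index=labels, columns=labels)


matrix = similarity_matrix(df)

# With no path, to_csv returns the CSV text; pass a filename such as
# "Semantic_similarity_matrix.csv" to write a file instead.
csv_text = matrix.to_csv()
print(matrix.round(2))
```

Each row of matrix holds the similarities of one page to all pages, so a single row (for example matrix.loc[175]) is the "one vector against all others" result the loop was trying to build.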
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow