Text similarity with Python and SkLearn

I want to find the entries in a dataset that are most similar (by cosine similarity score) to a user's input. I have started like this:

import pandas as pd
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


df = pd.read_csv('NewsCategorizer.csv')
df = df[['short_description']]
df = df.dropna()
df = df.drop_duplicates()
df = df.head(30)
print(df)
print(df.columns)


vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words= None, ngram_range=(1, 1))


def calculating_scores(text, vectorizer):
    docs_tfidf = vectorizer.fit_transform([text])
    return docs_tfidf

scores_list = []
for text in df.short_description:
    res = calculating_scores(text, vectorizer)
    scores_list.append(res)

df['scores'] = scores_list  
print(df)


                                    short_description                                             scores
0   Resting is part of training. I've confirmed wh...    (0, 15)\t0.1270001270001905\n  (0, 25)\t0.12...
1   Think of talking to yourself as a tool to coac...    (0, 6)\t0.12216944435630522\n  (0, 7)\t0.122...
2   The clock is ticking for the United States to ...    (0, 8)\t0.1643989873053573\n  (0, 7)\t0.1643...
3   If you want to be busy, keep trying to be perf...    (0, 2)\t0.16012815380508713\n  (0, 7)\t0.160...
4   First, the bad news: Soda bread, corned beef a...    (0, 21)\t0.2\n  (0, 11)\t0.2\n  (0, 18)\t0.2...
5   By Carey Moss for YouBeauty.com Love rom-coms,...    (0, 11)\t0.19245008972987526\n  (0, 19)\t0.1...
6   The nation in general scored a 66.2 in 2011 on...    (0, 18)\t0.17677669529663687\n  (0, 14)\t0.1...
7   It's also worth remembering that if the water ...    (0, 1)\t0.21320071635561041\n  (0, 9)\t0.213...
8   If you look at our culture's eating behavior, ...    (0, 22)\t0.10540925533894598\n  (0, 11)\t0.1...
9   François-Marie Arouet, 18th century French aut...    (0, 13)\t0.22941573387056174\n  (0, 14)\t0.2...
10  If the other person never opens to caring conf...    (0, 11)\t0.10153461651336192\n  (0, 13)\t0.1...

This is how my dataset looks now. Each entry in the new column is a <class 'scipy.sparse.csr.csr_matrix'>. I'm using that column to compare against the test docs.
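One thing worth checking at this point, sketched here with made-up toy strings rather than the real CSV: every call to fit_transform refits the vectorizer from scratch, so each stored matrix gets its own vocabulary and its own number of columns, and the vectorizer only remembers the vocabulary of the last document it saw. The names toy_docs, shapes and last_vocab are illustrative and not part of the original script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for df.short_description (assumed data, not taken from the CSV)
toy_docs = ["resting is part of training", "the clock is ticking", "whats up"]

vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words=None, ngram_range=(1, 1))

# Refit on each document separately, mirroring the per-row loop in the script
shapes = [vectorizer.fit_transform([doc]).shape for doc in toy_docs]
print(shapes)  # the column count differs from one document to the next

# After the loop, the vectorizer only holds the vocabulary of the last fit
last_vocab = sorted(vectorizer.vocabulary_)
print(last_vocab)
```

If the column counts differ like this, the stored row vectors do not live in a common feature space, which would be consistent with a dimension mismatch later on.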

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):


   """
   vectorizer: TfIdfVectorizer model
   docs_tfidf: tfidf vectors for all docs
   query: query doc

   return: cosine similarity between query and all docs
   """

   query_tfidf = vectorizer.transform([query])
   cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf)[0] # I'M GETTING THE ERROR ON THIS LINE

   return cosineSimilarities


testing_docs = ["If the other person never opens to caring conflict resolution, then you need to decide for yourself how to take loving care of yourself in the face of the unresolved conflict. This means accepting your helplessness over the other person and doing your own inner learning to discover what would be in your highest good.",
                "If the all other person never opens to caring conflict resolution, then you need to decide quickly for yourself how to take loving care of them and yourself in the face of the unresolved conflict. This means accepting your and other help over the others persons and doing your own inner learning to discover what would be in your highest goods!"]


for query in testing_docs:

   for text, docs_tfidf in zip(df.short_description, df.scores):

       print(type(docs_tfidf)) # <class 'scipy.sparse.csr.csr_matrix'>
       print(get_tf_idf_query_similarity(vectorizer, docs_tfidf, query))

The problem is that I'm getting this error when I run the script:

raise ValueError("Incompatible dimension for X and Y matrices: "
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 18 while Y.shape[1] == 36

What am I doing wrong? I thought it was because the strings have different lengths, but then I tried:

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf)[0]
    
    return cosineSimilarities

vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words= None, ngram_range=(1, 1))
t = "whats up with you"
docs_tfidf = vectorizer.fit_transform([t])

test = ["whats is up with you today", "whats up with you", "whats up"]
for t in test:    
    print(get_tf_idf_query_similarity(vectorizer, docs_tfidf, t))

So this is the same thing but without pandas, with strings of different lengths, and everything works fine.
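A likely difference between the two versions is that the small example fits the vectorizer exactly once, so the query and the document share one vocabulary, while the pandas version refits it for every row. Below is a minimal sketch of the pandas version with a single fit over the whole column. It assumes a tiny inline DataFrame in place of NewsCategorizer.csv, and the queries and the best_matches and top_scores lists are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed toy data standing in for the 'short_description' column of the CSV
df = pd.DataFrame({"short_description": [
    "Resting is part of training.",
    "The clock is ticking for the United States.",
    "If you want to be busy, keep trying to be perfect.",
]})

vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words=None, ngram_range=(1, 1))

# Fit ONCE on all documents: one shared vocabulary, one matrix with a row per document
docs_tfidf = vectorizer.fit_transform(df["short_description"])


def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    # transform (not fit_transform) keeps the query in the same feature space
    query_tfidf = vectorizer.transform([query])
    return cosine_similarity(query_tfidf, docs_tfidf)[0]


queries = ["resting is part of training", "keep trying to be perfect"]
best_matches, top_scores = [], []
for query in queries:
    scores = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query)
    best = int(scores.argmax())  # row index of the most similar document
    best_matches.append(best)
    top_scores.append(float(scores[best]))
    print(query, "->", df["short_description"].iloc[best], round(scores[best], 3))
```

With this layout there is no need to store one sparse matrix per row in a DataFrame column. The scores come back as one array per query, aligned with the DataFrame rows.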



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
