How can I improve the speed of my spaCy similarity calculation? [closed]
I currently have the following code, which calculates the similarity between a search query and a dictionary of candidates. It takes approximately 13 seconds to score 4000 candidates. From my research, it can be sped up by using nlp.pipe(), but I still don't understand how to do that. Please advise. My Python code is below.
import os
import sys
from flask import Flask, request, jsonify
import spacy

nlp = spacy.load("en_core_web_lg")
all_stopwords = nlp.Defaults.stop_words
app = Flask(__name__)

@app.route("/")
def index():
    return "Page does not exist"

@app.route('/calculate-matches', methods=['POST'])
def calculate_matches():
    data = request.get_json()
    candidates = data['candidates']
    cur_search = nlp('Looking for someone with experience in building vue frontend applications')
    tmp_search = ''
    for x in cur_search:
        if x.pos_ == "NOUN" or x.pos_ == "PROPN" or x.pos_=="PRON" or x.is_stop==False:
            tmp_search += str(x) + ' '
    cur_search = nlp(tmp_search)
    for member in candidates:
        member_bio = nlp(member['bio']+ ' ' + member['education']+ ' ' + member['experience'])
        #calculate similarity
        member['match_score'] = ( cur_search.similarity(member_bio) * 100 )
    #sort candidates' match_score from high to low
    results = sorted(candidates, key=lambda k: k['match_score'], reverse=True)
    return jsonify(results)

if __name__ == "__main__":
    currentdir = os.path.dirname(os.path.realpath(__file__))
    if currentdir not in sys.path:
        sys.path.insert(0, currentdir)
    app.run(host='0.0.0.0', port=5000)
Solution 1:[1]
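You can use linear algebra to compute the cosine similarity against all candidates at once, in a broadcasted manner:
import numpy as np

def cosine_similarity(v, A):
    # Cosine similarity of v against every row of A; returns the index of the best match.
    return np.argmax(np.dot(v, A.T) / (np.linalg.norm(v, ord=2) * np.linalg.norm(A, axis=1, ord=2)))

# member_bio is assumed here to be a list of Docs, one per candidate
A = np.stack([member.vector for member in member_bio])
v = cur_search.vector
closest_idx = cosine_similarity(v, A)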
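To connect this to the original loop, here is a minimal sketch that batches the candidate texts through nlp.pipe() and scores all of them in one matrix operation. The function name rank_candidates and the batch_size value are illustrative assumptions, not part of the question's code, and the sketch relies on Doc.similarity being, by default, the cosine similarity of the two doc.vector arrays.

```python
import numpy as np


def rank_candidates(nlp, query_doc, candidates, batch_size=64):
    """Score every candidate against query_doc and sort best-first.

    `nlp` is a loaded spaCy pipeline and `query_doc` a Doc it produced.
    The name `rank_candidates` and `batch_size=64` are illustrative choices.
    """
    if not candidates:
        return []

    # One text per candidate, built the same way as in the question.
    texts = [
        " ".join((m["bio"], m["education"], m["experience"]))
        for m in candidates
    ]

    # nlp.pipe processes the texts in batches instead of one nlp() call each.
    docs = nlp.pipe(texts, batch_size=batch_size)
    A = np.stack([doc.vector for doc in docs])  # shape: (n_candidates, dim)
    v = query_doc.vector                        # shape: (dim,)

    # Cosine similarity of v against every row of A in a single operation.
    # Rows with a zero norm (e.g. empty texts) get a score of 0.
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(v)
    scores = np.divide(A @ v, norms, out=np.zeros(len(A)), where=norms > 0)

    for member, score in zip(candidates, scores):
        # Convert to a plain float so the result stays JSON-serializable.
        member["match_score"] = float(score) * 100

    return sorted(candidates, key=lambda m: m["match_score"], reverse=True)
```

In the Flask handler this would replace the for loop and the sorted() call, e.g. results = rank_candidates(nlp, cur_search, candidates). Most of the 13 seconds is likely spent running the pipeline on each text rather than in the similarity itself, so it may also be worth passing disable=["parser", "ner"] to nlp.pipe() if only the vectors are needed.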
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | erip |
