Calculate cosine similarity for nested objects in Elasticsearch using Python

I am working on a use case with documents in the following format:

[{'Fruits': ['Mango', 'Apple']},
 {'Fruits': ['Banana', 'Guava', 'Mango']},
 {'Fruits': ['Grapes', 'Apple']}]

I store the data in an Elasticsearch cluster in the following format:

[{"Fruits": [{"value": dense_vector('Mango'), "text": "Mango"}, {"value": dense_vector('Apple'), "text": "Apple"}]},
 {"Fruits": [{"value": dense_vector('Banana'), "text": "Banana"}, {"value": dense_vector('Guava'), "text": "Guava"},
             {"value": dense_vector('Mango'), "text": "Mango"}]}, ...]
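For reference, a storage layout like the one above would usually need an index mapping along these lines. This is only a sketch: the field names come from the documents above, but the `dims` value of 768 (typical for BERT-base models) and the `keyword` type for `text` are assumptions, not something stated in the question.

```python
# Hypothetical mapping body for the index; pass it to whatever client call
# creates the index. "dims": 768 is an assumption and must match the
# length of the vectors actually produced by the embedding model.
fruits_mapping = {
    "mappings": {
        "properties": {
            "Fruits": {
                "type": "nested",  # each fruit becomes its own hidden nested document
                "properties": {
                    "value": {"type": "dense_vector", "dims": 768},
                    "text": {"type": "keyword"},
                },
            }
        }
    }
}
```

Because `Fruits` is `nested`, every fruit is scored separately by a nested query, and `score_mode` decides how those per-fruit scores are combined.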

I store dense vectors (BERT embeddings) so that synonyms can be handled by computing cosine similarity. My Elasticsearch query looks like this:

{
    "nested": {
        "path": "Fruits",
        "score_mode": "max",
        "query": {
            "function_score": {
                "script_score": {
                    "script": {
                        "source": "1+cosineSimilarity(params.query_vector0,'Fruits.value')+cosineSimilarity(params.query_vector1,'Fruits.value')",
                        "params": {"query_vector0": query_vector, "query_vector1": query_vector1}
                    }
                }
            }
        }
    }
}
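To make the scoring easier to reason about, here is a rough pure-Python model of what this query appears to compute: the script runs once per nested fruit, and `score_mode` then combines the per-fruit scores. The 2-D vectors are made-up toy values, not real BERT embeddings, and the helper names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nested_score(query_vectors, fruit_vectors, score_mode="max"):
    """Model of the nested query: one script score per fruit, then combine."""
    # Script score for a single nested object: 1 + cos(q0, v) + cos(q1, v)
    per_fruit = [1 + sum(cosine(q, v) for q in query_vectors) for v in fruit_vectors]
    if score_mode == "max":
        return max(per_fruit)
    if score_mode == "sum":
        return sum(per_fruit)
    if score_mode == "avg":
        return sum(per_fruit) / len(per_fruit)
    raise ValueError(f"unsupported score_mode: {score_mode}")

# Toy vectors (assumed for illustration only).
queries = [[1.0, 0.0], [0.0, 1.0]]          # "Grape", "Apples"
mango, apple, grapes = [0.6, -0.8], [0.0, 1.0], [1.0, 0.0]
doc1 = [mango, apple]                        # {'Fruits': ['Mango', 'Apple']}
doc3 = [grapes, apple]                       # {'Fruits': ['Grapes', 'Apple']}

for mode in ("max", "sum", "avg"):
    print(mode, nested_score(queries, doc1, mode), nested_score(queries, doc3, mode))
```

In this toy setup both documents tie under "max", because the single best fruit ("Apple") is the same in each and every fruit is compared against both query vectors at once. Under "sum" and "avg" the document containing both "Grapes" and "Apple" comes out ahead. Real BERT embeddings may behave differently, so treat this only as a way to inspect the arithmetic.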

where query_vector and query_vector1 have the following values:

query_vector = bert_embedding.encode(["Grape"])[0]
query_vector1 = bert_embedding.encode(["Apples"])[0]

When I pass the user queries "Grape" and "Apples", I get similar scores for these two documents:

[{'Fruits': ['Mango', 'Apple']},
 {'Fruits': ['Grapes', 'Apple']}]

I expected {'Fruits': ['Grapes', 'Apple']} to score higher than the first document. Also, the results change when I switch "score_mode" between avg, sum, and max. Which score_mode should I settle on, given my document format?

I cannot figure out why the cosine scores are similar for both documents. Can someone please help with that?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
