Encoding a pandas DataFrame column with SentenceTransformer raises KeyError: 6

I am using a BERT model through sentence_transformers to encode sentences, but I get KeyError: 6. Here df is a pandas DataFrame, and df['text'] contains all the text records.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-uncased')
model.max_seq_length = 512

sentence_embeddings = model.encode(df['text'])

This raises the KeyError. If I instead call sentence_embeddings = model.encode(df['text'].head()), it returns sentence_embeddings without any problem. I am not sure what causes this.

I searched online for KeyError: 6. Some posts explain that a KeyError is raised when you look up a key that is not in a dictionary. In my case, however, the input is a DataFrame column, not a dictionary.

The error message is like this:

---------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 6

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-46-b8efcebcf608> in <module>
----> 1 sentence_embeddings = model.encode(df['text'])

~/.pyenv/versions/3.6.9/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
    154         all_embeddings = []
    155         length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
--> 156         sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
    157 
    158         for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):

~/.pyenv/versions/3.6.9/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py in <listcomp>(.0)
    154         all_embeddings = []
    155         length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
--> 156         sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
    157 
    158         for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):

~/.local/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if is_hashable(key):

~/.local/lib/python3.6/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
    988 
    989         # Similar to Index.get_value, but we do not fall back to positional
--> 990         loc = self.index.get_loc(label)
    991         return self.index._get_values_for_loc(self, loc, label)
    992 

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 6


Solution 1:[1]

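I solved the issue by resetting the index of the pandas DataFrame. Before calling sentence_embeddings = model.encode(df['text']), I had run a cleaning step that removed some undesired rows. When I checked the df index, I found it was no longer continuous. As the traceback shows, encode looks sentences up by position (sentences[idx]), but a pandas Series with an integer index treats idx as a label, so a position whose label was removed (here 6) raises KeyError. That is presumably also why df['text'].head() worked: the labels for the first five positions were still present. A non-continuous index can cause similar problems with other libraries such as sklearn, so it is better to call reset_index on df before passing its columns on.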
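The failure can be reproduced without loading any model. The following is a minimal sketch, assuming made-up sample data and a hypothetical helper sort_by_length that only mirrors the two lines of encode shown in the traceback (it is not the library's actual code). It shows the KeyError on a Series with a gap in its index, and two ways around it: resetting the index, or passing a plain Python list.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for df['text'].
df = pd.DataFrame({"text": ["a", "bb", "ccc", "dddd", "eeeee",
                            "ffffff", "ggggggg", "hhhhhhhh"]})

# Simulate a cleaning step that drops a row; label 6 no longer exists.
df = df[df.index != 6]


def sort_by_length(sentences):
    """Hypothetical stand-in for the length-sorting step in the traceback."""
    order = np.argsort([-len(s) for s in sentences])
    # On a Series with an integer index, sentences[idx] is a label lookup,
    # not a positional one, so a missing label raises KeyError.
    return [sentences[idx] for idx in order]


try:
    sort_by_length(df["text"])
    failed = False
except KeyError:
    failed = True

# Fix 1: rebuild a continuous 0..n-1 index (drop=True discards the old labels).
fixed_series = df.reset_index(drop=True)["text"]
sorted_from_series = sort_by_length(fixed_series)

# Fix 2: pass a plain list, which is always indexed by position.
sorted_from_list = sort_by_length(df["text"].tolist())
```

With the real model, the equivalent calls would be model.encode(df.reset_index(drop=True)['text']) or model.encode(df['text'].tolist()); the list form avoids depending on the index at all.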

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 newleaf