Python pandas DataFrame column sentence encoding raises KeyError: 6
I am using a transformer BERT model as a sentence encoder but get KeyError: 6. Here df is a pandas DataFrame, and df['text'] contains all the text records.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-uncased')
model.max_seq_length = 512
sentence_embeddings = model.encode(df['text'])
This raises the KeyError.
If I use only sentence_embeddings = model.encode(df['text'].head()), sentence_embeddings is returned without any problem.
I am not sure what might cause this.
I searched online for KeyError: 6, and some posts explain that a KeyError is raised when you ask for a key that is not in a dictionary. In my case, though, the input is a DataFrame column, not a dictionary.
The error message is like this:
---------------------------------------
KeyError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 6
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-46-b8efcebcf608> in <module>
----> 1 sentence_embeddings = model.encode(df['text'])
~/.pyenv/versions/3.6.9/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
154 all_embeddings = []
155 length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
--> 156 sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
157
158 for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
~/.pyenv/versions/3.6.9/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py in <listcomp>(.0)
154 all_embeddings = []
155 length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
--> 156 sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
157
158 for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
~/.local/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if is_hashable(key):
~/.local/lib/python3.6/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
988
989 # Similar to Index.get_value, but we do not fall back to positional
--> 990 loc = self.index.get_loc(label)
991 return self.index._get_values_for_loc(self, loc, label)
992
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 6
Solution 1:[1]
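I solved the issue by resetting the pandas DataFrame index. Before calling sentence_embeddings = model.encode(df['text']), I had run some cleaning steps that removed undesired rows.
Checking the df index showed that it was no longer continuous. As the traceback shows, encode looks sentences up by position (sentences[idx]), and on a Series with an integer index that is a label lookup, so a missing label such as 6 raises KeyError. A non-continuous index could cause similar problems in other processing, such as sklearn, so it is better to call reset_index on df beforehand.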
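The following is a minimal sketch of that failure and two possible fixes, not the library's actual code. The toy DataFrame and the helper positional_lookup are hypothetical stand-ins: the helper only mimics the sentences[idx] list comprehension visible in the traceback, so the example runs without loading a model.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for df['text'] after a cleaning step dropped a row.
df = pd.DataFrame({"text": ["a", "bb", "ccc", "dddd"]})
df = df.drop(index=[1])  # index labels are now 0, 2, 3 (label 1 is missing)


def positional_lookup(sentences):
    """Hypothetical helper mimicking the traceback: sort by length, then index by position."""
    order = np.argsort([-len(s) for s in sentences])
    return [sentences[idx] for idx in order]


# On a Series with a gapped integer index, sentences[idx] is a label lookup and fails.
try:
    positional_lookup(df["text"])
    raised = False
except KeyError:
    raised = True

# Fix 1: rebuild a continuous 0..n-1 index, discarding the old labels.
fixed = df.reset_index(drop=True)
result_reset = positional_lookup(fixed["text"])

# Fix 2: pass a plain Python list, which is always indexed by position.
result_list = positional_lookup(df["text"].tolist())
```

With the real model, the equivalent calls would presumably be model.encode(df.reset_index(drop=True)['text']) or model.encode(df['text'].tolist()). This also explains why df['text'].head() worked in the question: its first five index labels were evidently still 0 through 4.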
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | newleaf |
