Map column lists to dictionary and create new column with padded strings
Given this dataframe and word_index dictionary:
import pandas as pd
df = pd.DataFrame(data={'text_ids': [
    [1, 2, 3, 2, 7, 2, 8, 2, 0],
    [1, 2, 4, 2, 7, 2, 8, 2, 0],
    [1, 2, 5, 2, 6, 2, 8, 2, 0],
    [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]
]})
word_index = {0: '<eos>', 1: '<sos>', 2: '/s', 3: 'he', 4: 'she', 5: 'they', 6: 'love', 7: 'loves', 8: 'cats', 9: 'we', 10: 'talking', 11: 'about', 12: '<pad>'}
How can I map each sequence in text_ids to its corresponding value(s) in word_index, while making sure that each /s token really becomes a space in the resulting string? I also need to append <pad> tokens to every string whose sequence is shorter than the longest integer sequence.
Expected output:
text_ids text
0 [1, 2, 3, 2, 7, 2, 8, 2, 0] <sos> he loves cats <eos><pad><pad><pad>
1 [1, 2, 4, 2, 7, 2, 8, 2, 0] <sos> she loves cats <eos><pad><pad><pad>
2 [1, 2, 5, 2, 6, 2, 8, 2, 0] <sos> they love cats <eos><pad><pad><pad>
3 [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0] <sos> we love talking about cats <eos>
Solution 1:[1]
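Another option: explode the lists, map each id through word_index, then join the tokens back together per row. Replacing '/s' with a space and joining on the empty string gives the spacing from the expected output. Padding is not handled here.
df["text"] = (df["text_ids"]
    .explode()
    .map(word_index)
    .replace("/s", " ")   # '/s' is the space token
    .groupby(level=0)     # regroup by original row
    .agg("".join))        # tokens already carry their spaces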
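The question also asks for <pad> tokens, so here is a minimal sketch that pads each id list to the longest length before decoding. The names PAD_ID, max_len and decode are illustrative helpers and do not come from the original answer. The sketch assumes that id 12 is the pad token, as in the word_index shown in the question.

```python
import pandas as pd

df = pd.DataFrame(data={'text_ids': [
    [1, 2, 3, 2, 7, 2, 8, 2, 0],
    [1, 2, 4, 2, 7, 2, 8, 2, 0],
    [1, 2, 5, 2, 6, 2, 8, 2, 0],
    [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0],
]})
word_index = {0: '<eos>', 1: '<sos>', 2: '/s', 3: 'he', 4: 'she', 5: 'they',
              6: 'love', 7: 'loves', 8: 'cats', 9: 'we', 10: 'talking',
              11: 'about', 12: '<pad>'}

PAD_ID = 12  # assumption: the id of '<pad>' in word_index
max_len = max(len(ids) for ids in df['text_ids'])  # longest sequence in the column


def decode(ids):
    """Pad ids to max_len, then map each id to its token ('/s' becomes a space)."""
    padded = list(ids) + [PAD_ID] * (max_len - len(ids))
    tokens = (word_index[i] for i in padded)
    return ''.join(' ' if tok == '/s' else tok for tok in tokens)


df['text'] = df['text_ids'].apply(decode)
print(df['text'].tolist())
```

The first three rows come out as in the expected output, for example '<sos> he loves cats <eos><pad><pad><pad>'. The last id list has no /s between 8 and 0, so its text ends in 'cats<eos>' with no space before <eos>.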
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mark Moretto |
