Encoding each value in a pandas cell

I have a dataset

Inp1    Inp2        Inp3               Output
A,B,C   AI,UI,JI    Apple,Bat,Dog      Animals
L,M,N   LI,DO,LI    Lawn, Moon, Noon   Noun
X,Y     AI,UI       Yemen,Zombie       Extras

I need to apply an ML algorithm to these values, so I need an encoding technique. I tried label encoding, but it encodes the entire cell as a single integer, e.g.:

Inp1    Inp2    Inp3    Output
5       4       8       0

But I need a separate encoding for each value in a cell, as shown below. How should I go about it?

Inp1     Inp2    Inp3      Output
7,44,87  4,65,2  47,36,20  45

The integers here are arbitrary.



Solution 1:[1]

Assuming each cell is a list (since you have multiple strings stored in each) and that you are not looking for a specific encoding, the following should work. It can also be adjusted to suit different encodings.

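import pandas as pd

# First sub-list holds the column names; the remaining sub-lists are the rows.
A = [["Inp1", "Inp2", "Inp3", "Output"],
     [["A","B","C"], ["AI","UI","JI"], ["Apple","Bat","Dog"], ["Animals"]],
     [["L","M","N"], ["LI","DO","LI"], ["Lawn", "Moon", "Noon"], ["Noun"]]]

dataframe = pd.DataFrame(A[1:], columns=A[0])

def my_encoding(column):
    # DataFrame.apply passes one column at a time by default (axis=0),
    # so each element of `column` is a cell holding a list of strings.
    encoded_column = []
    for ls in column:
        encoded_ls = []
        for s in ls:
            # Interpret the string's UTF-8 bytes as a little-endian integer.
            sbytes = s.encode('utf-8')
            sint = int.from_bytes(sbytes, 'little')
            encoded_ls.append(sint)
        encoded_column.append(encoded_ls)
    return encoded_column

print(dataframe.apply(my_encoding))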

Output:

           Inp1  ...               Output
0  [65, 66, 67]  ...  [32488788024979009]
1  [76, 77, 78]  ...         [1853189966]

If my assumptions are incorrect or this is not what you're looking for, let me know.
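
As one way of adjusting the approach above, here is a minimal sketch that starts from the comma-separated strings in the question (rather than lists) and gives every value a small integer code. The helper name encode_column and the choice of one sorted vocabulary per column are assumptions made for illustration, not something the question specifies.

```python
import pandas as pd

# Sample data from the question: each cell is a comma-separated string.
df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Inp2": ["AI,UI,JI", "LI,DO,LI", "AI,UI"],
    "Inp3": ["Apple,Bat,Dog", "Lawn, Moon, Noon", "Yemen,Zombie"],
    "Output": ["Animals", "Noun", "Extras"],
})

def encode_column(col):
    """Label-encode every token in a column of comma-separated strings.

    Hypothetical helper: builds one vocabulary per column.
    """
    # Split each cell into a list of tokens, trimming stray spaces.
    tokens = col.apply(lambda cell: [t.strip() for t in cell.split(",")])
    # Map each distinct token (in sorted order) to an integer code.
    unique_tokens = sorted({t for ls in tokens for t in ls})
    vocab = {tok: i for i, tok in enumerate(unique_tokens)}
    # Replace every token with its code, keeping one list per cell.
    return tokens.apply(lambda ls: [vocab[t] for t in ls])

# apply() passes each column to encode_column in turn.
encoded = df.apply(encode_column)
print(encoded)
```

The codes are only meaningful within a column, and each cell still holds a variable-length list, so most models will need a further step (padding or one-hot encoding) before training.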

Solution 2:[2]

Since you mentioned that you are going to apply an ML algorithm (say, classification), I think one-hot encoding is what you are looking for.

Requested format:

Inp1     Inp2    Inp3      Output
7,44,87  4,65,2  47,36,20  45

This format won't help you train your model, because each cell still holds multiple labels. You would have to pre-process it again anyway, for example with one-hot encoding (OHE).

Suggested format:

A  B  C  L  M  N  X  Y  AI  DO  JI  LI  UI  Apple  Bat  Dog  Lawn  Moon  Noon  Yemen  Zombie
1  1  1  0  0  0  0  0   1   0   1   0   1      1    1    1     0     0     0      0       0
0  0  0  1  1  1  0  0   0   1   0   1   0      0    0    0     1     1     1      0       0
0  0  0  0  0  0  1  1   1   0   0   0   1      0    0    0     0     0     0      1       1

After that, you can label encode or one-hot encode the Output field as your model requires.

Happy learning!
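
A table like the one above can be built directly in pandas. The following is a rough sketch, assuming the cells are comma-separated strings as in the question; the variable names (feature_cols, features, target, classes) are illustrative, and pd.factorize is used here as one simple way to label encode the Output column.

```python
import pandas as pd

# Sample data from the question: each cell is a comma-separated string.
df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Inp2": ["AI,UI,JI", "LI,DO,LI", "AI,UI"],
    "Inp3": ["Apple,Bat,Dog", "Lawn, Moon, Noon", "Yemen,Zombie"],
    "Output": ["Animals", "Noun", "Extras"],
})

feature_cols = ["Inp1", "Inp2", "Inp3"]

# Build one 0/1 indicator column per distinct value in each input column.
parts = []
for name in feature_cols:
    # Normalise "Lawn, Moon, Noon" to "Lawn,Moon,Noon" so values carry no spaces.
    cleaned = df[name].str.replace(r"\s*,\s*", ",", regex=True)
    # Split on commas and create an indicator column for every value seen.
    parts.append(cleaned.str.get_dummies(sep=","))
features = pd.concat(parts, axis=1)

# Label encode the target; codes are assigned in order of first appearance.
target, classes = pd.factorize(df["Output"])

print(features)
print(target, list(classes))
```

If the same value can occur in more than one input column, the concatenated frame would end up with duplicate column names; prefixing each part (for example with DataFrame.add_prefix) is one way to keep them apart.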

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  thomas
Solution 2  Bhanuchander Udhayakumar