'Manual format data Python

Please help encode on my sequence dataset manually not with sklearn, previously I used the sklearn library to encode like



k1,    m2,   A3,  A4,   A5, P1
A1,    k2,   A7,  A9,   A9, P2
A99,   m77,  A22,  A22,   A22, P9



Solution 1:[1]

First do it for one column and later put code in function and run in for-loop with different columns


First step is to get unique values in column

unique = sorted(df['COL1'].unique())

And next use for-loop and for every value in unique you can run

df['COL1'] == value

to get column with True/False which you can convert to integer to get 1 /0

(df['COL1'] == val).astype(int)

and you can put it back to original dataframe with new name

df['COL1' + '_' + val] = (df['COL1'] == val).astype(int)

text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1,    A2,   A3,  A4,   A5, P1
A1,    A2,   A7,  A9,   A9, P2
A99,   A77,  A22,  A22,   A22, P9
A1,    A2,   A8,  A9,   A0, P7
A1,    A2,   A8,  A90,   A9, P2
A1,    A21,  A8,  A9,   A11, P1
A11,   A2,   A81,  A9,   A9, P1'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text), sep=',\s+')

unique = sorted(df['COL1'].unique())
print(unique)

for val in unique:
    col = (df['COL1'] == val).astype(int)
    #col.name += '_' + val
    print(col)
    df[col.name + '_' + val] = col

print(df)

Result:

  COL1 COL2 COL3 COL4 COL5 LABELS  COL1_A1  COL1_A11  COL1_A99
0   A1   A2   A3   A4   A5     P1        1         0         0
1   A1   A2   A7   A9   A9     P2        1         0         0
2  A99  A77  A22  A22  A22     P9        0         0         1
3   A1   A2   A8   A9   A0     P7        1         0         0
4   A1   A2   A8  A90   A9     P2        1         0         0
5   A1  A21   A8   A9  A11     P1        1         0         0
6  A11   A2  A81   A9   A9     P1        0         1         0

It need also to remove COL1 and it need to remeber unique values in some dictionary so later you can convert back to these values.


Now you have to put it in some function and run for other columns.

text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1,    A2,   A3,  A4,   A5, P1
A1,    A2,   A7,  A9,   A9, P2
A99,   A77,  A22,  A22,   A22, P9
A1,    A2,   A8,  A9,   A0, P7
A1,    A2,   A8,  A90,   A9, P2
A1,    A21,  A8,  A9,   A11, P1
A11,   A2,   A81,  A9,   A9, P1'''

import pandas as pd
import io

def convert(df, col_name):
    unique = sorted(df[col_name].unique())
    #print(unique)

    for val in unique:
        df[col_name + '_' + val] = (df[col_name] == val).astype(int)
    
    df.drop(columns=col_name, inplace=True)
    
    return unique

# ---

df = pd.read_csv(io.StringIO(text), sep=',\s+')

transformations = {}

for name in df.columns:
    if name.startswith('COL'):
        transformations[name] = convert(df, name)
        
print(transformations)        
    
print(df)    

Result:

{'COL1': ['A1', 'A11', 'A99'], 
'COL2': ['A2', 'A21', 'A77'], 
'COL3': ['A22', 'A3', 'A7', 'A8', 'A81'], 
'COL4': ['A22', 'A4', 'A9', 'A90'], 
'COL5': ['A0', 'A11', 'A22', 'A5', 'A9']}

  LABELS  COL1_A1  COL1_A11  COL1_A99  ...  COL5_A11  COL5_A22  COL5_A5  COL5_A9
0     P1        1         0         0  ...         0         0        1        0
1     P2        1         0         0  ...         0         0        0        1
2     P9        0         0         1  ...         0         1        0        0
3     P7        1         0         0  ...         0         0        0        0
4     P2        1         0         0  ...         0         0        0        1
5     P1        1         0         0  ...         1         0        0        0
6     P1        0         1         0  ...         0         0        0        1

[7 rows x 21 columns]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1