Encoding categorical variables so that both the presence and the position of characters in literal strings matter

Let's assume we have a dataframe whose last column is made up of literal strings such as the following:

import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["C", "A", "B"],
        "col2": [4, 1.7, 1],
        "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
    }
)

Note that both 1) the presence of a character in a string and 2) the position at which it is located within the string matter.

One-hot encoding the last column can be done as follows:

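import numpy as np
# Split each string into a list of its characters
col3_lst = [list(i) for i in df.col3]
# U holds the unique characters in order of first appearance
ids, U = pd.factorize(np.concatenate(col3_lst))
# One indicator column per unique character (presence only, no position)
df_new = pd.DataFrame([np.isin(U, i) for i in col3_lst], columns=U).astype(int)
pd.concat([df, df_new], axis=1).drop(["col3"], axis=1)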

which would result in:

  col1  col2  S  H  R  T  Y  P  G  C  K  A  V  E
0    C   4.0  1  1  1  1  1  1  0  0  0  0  0  0
1    A   1.7  0  0  1  1  1  1  1  1  1  1  0  0
2    B   1.0  0  0  1  0  1  1  0  1  0  1  1  1

However, as you can see, this encoding ignores the order of the characters. Is there any way to inject the position of each character within its string into the output dataframe? For example, the last string contains four C's, and we need to capture the fact that the letter occurs at the 3rd, 4th, 6th, and 7th positions. I am looking for something like the following:

  col1  col2     position_1    position_2     position_3    position_4     position_5  ....
0    C   4.0         19             8              18           20            25       ....
1    A   1.7         16             7              25           20            3        ....
2    B   1.0         22             16             3            3             25       ....

where each numerical label in an encoded column, $position_{i}$, is the position in the English alphabet of the character found at position $i$ of the string; i.e. 1 for A, 2 for B, etc.
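
To make the first target format concrete, here is a minimal sketch of that label encoding. It assumes every string contains only the uppercase letters A-Z, and it pads strings shorter than the longest one with 0, which is my own choice and not something dictated by the data. The names `max_len`, `rows`, `labels`, and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["C", "A", "B"],
        "col2": [4, 1.7, 1],
        "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
    }
)

# Length of the longest string decides how many position columns are needed
max_len = max(len(s) for s in df["col3"])

# Map each character to its alphabet index (A=1, ..., Z=26);
# assumption: pad shorter strings with 0 so every row has max_len entries
rows = [
    [ord(c) - ord("A") + 1 for c in s] + [0] * (max_len - len(s))
    for s in df["col3"]
]

labels = pd.DataFrame(
    rows,
    columns=[f"position_{i + 1}" for i in range(max_len)],
    index=df.index,
)

# Replace the raw string column with the positional label columns
result = pd.concat([df.drop(columns="col3"), labels], axis=1)
print(result)
```

Note that these integer labels impose an ordering on the letters (A < B < ... < Z) that many models would treat as meaningful, which is one reason the one-hot variant may be preferable.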

Or even better, something like the following:

  col1  col2     position_1_A   position_1_B  ...  position_2_A    position_2_B   ...  position_3_A   position_3_B  ...  position_4_A   position_4_B ...
0    C   4.0           0             0        ...        0               0        ...       0               0       ...            0            0    ...
1    A   1.7           0             0        ...        0               0        ...       0               0       ...            0            0    ...
2    B   1.0           0             0        ...        0               0        ...       0               0       ...            0            0    ...
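
The second target format can be sketched as a per-position one-hot encoding. This is only one possible approach: it assumes the alphabet is exactly the uppercase letters A-Z, and it treats positions past the end of a shorter string as missing, so they get all-zero indicator columns. The names `chars`, `alphabet`, `one_hot`, and `result` are illustrative.

```python
import string

import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["C", "A", "B"],
        "col2": [4, 1.7, 1],
        "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
    }
)

max_len = max(len(s) for s in df["col3"])

# One column per character position; None pads the shorter strings
chars = pd.DataFrame(
    [list(s) + [None] * (max_len - len(s)) for s in df["col3"]],
    columns=[f"position_{i + 1}" for i in range(max_len)],
    index=df.index,
)

# Fix the categories to A..Z so that every position yields the same
# 26 indicator columns, including letters never observed there
alphabet = list(string.ascii_uppercase)
chars = chars.astype(pd.CategoricalDtype(categories=alphabet))

# Columns are named position_<i>_<letter>; padded positions stay all zero
one_hot = pd.get_dummies(chars, dtype=int)

result = pd.concat([df.drop(columns="col3"), one_hot], axis=1)
print(result.shape)
```

With a 26-letter alphabet and a maximum length of 10 this gives 260 indicator columns, most of them zero, so for longer strings it may be worth passing `sparse=True` to `pd.get_dummies`.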

Thank you,



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source