Encoding categorical variables such that both the presence and the position of characters matter in literal strings
Let's assume we have a dataframe whose last column is made up of literal strings such as the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["C", "A", "B"],
        "col2": [4, 1.7, 1],
        "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
    }
)
Note that both 1) the presence of a character in a string and 2) the position at which it is located within the string matter.
One-hot encoding the last column can be done as follows:
col3_lst = [list(i) for i in df.col3]
ids, U = pd.factorize(np.concatenate(col3_lst))
df_new = pd.DataFrame([np.isin(U, i) for i in col3_lst], columns=U).astype(int)
pd.concat([df, df_new], axis=1).drop(["col3"], axis=1)
which would result in:
col1 col2 S H R T Y P G C K A V E
0 C 4.0 1 1 1 1 1 1 0 0 0 0 0 0
1 A 1.7 0 0 1 1 1 1 1 1 1 1 0 0
2 B 1.0 0 0 1 0 1 1 0 1 0 1 1 1
However, as you can see, this ignores the order of the characters. Is there any way to inject information about the position of each character within its string into the output dataframe? For example, the last string contains four C's, and we need to capture the fact that they occur at positions 3, 4, 6, and 7. I am looking for something like the following:
col1 col2 position_1 position_2 position_3 position_4 position_5 ....
0 C 4.0 19 8 18 20 25 ....
1 A 1.7 16 7 25 20 3 ....
2 B 1.0 22 16 3 3 25 ....
where each encoded column, $position_{i}$, holds the position in the English alphabet of the character found at position $i$ of the string; i.e. 1 for A, 2 for B, etc.
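To make the first target layout concrete, here is a minimal sketch of how such a table could be built. It assumes the strings contain only uppercase A-Z, and it pads shorter strings with 0 (a value I chose to mean "no character at this position"); the names `max_len`, `codes`, `df_pos` and `df_ordinal` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["C", "A", "B"],
        "col2": [4, 1.7, 1],
        "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
    }
)

# Length of the longest string decides how many position columns are needed
max_len = max(len(s) for s in df["col3"])

# Map each character to its 1-based alphabet index (assumes uppercase A-Z)
# and pad shorter strings with 0 (assumed marker for "no character here")
codes = [
    [ord(ch) - ord("A") + 1 for ch in s] + [0] * (max_len - len(s))
    for s in df["col3"]
]

pos_cols = [f"position_{i}" for i in range(1, max_len + 1)]
df_pos = pd.DataFrame(codes, columns=pos_cols, index=df.index)

# Replace the raw string column with the per-position codes
df_ordinal = pd.concat([df.drop(columns="col3"), df_pos], axis=1)
print(df_ordinal)
```

Note that these integer codes impose an artificial ordering on the letters (A < B < ...), which may or may not be appropriate for the model that consumes them.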
Or even better, something like the following:
col1 col2 position_1_A position_1_B ... position_2_A position_2_B ... position_3_A position_3_B ... position_4_A position_4_B ...
0 C 4.0 0 0 ... 0 0 ... 0 0 ... 0 0 ...
1 A 1.7 0 0 ... 0 0 ... 0 0 ... 0 0 ...
2 B 1.0 0 0 ... 0 0 ... 0 0 ... 0 0 ...
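As a rough sketch of this second layout, one option is to one-hot encode each character position separately against a fixed alphabet. This assumes uppercase A-Z only, and it leaves every indicator at 0 for positions beyond the end of a shorter string; the names `alphabet`, `pieces`, `onehot` and `df_onehot` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["C", "A", "B"],
        "col2": [4, 1.7, 1],
        "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
    }
)

max_len = max(len(s) for s in df["col3"])
alphabet = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # assumed alphabet

pieces = []
for i in range(max_len):
    # Character at position i + 1 of each string; None if the string is shorter
    chars = [s[i] if i < len(s) else None for s in df["col3"]]
    # A fixed category list guarantees all 26 columns exist for every position
    col = pd.Series(pd.Categorical(chars, categories=alphabet), index=df.index)
    pieces.append(pd.get_dummies(col, prefix=f"position_{i + 1}", dtype=int))

onehot = pd.concat(pieces, axis=1)
df_onehot = pd.concat([df.drop(columns="col3"), onehot], axis=1)
print(df_onehot.shape)
```

With a maximum length of 10 and 26 letters this yields 260 indicator columns, so for long strings or large alphabets a sparse representation may be preferable.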
Thank you,
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
