Is the HashingEncoder from category_encoders meant to be used over multiple variables at a time?
I am trying to understand the output of the HashingEncoder when encoding more than one variable.
I am working with the Credit Approval dataset, which can be found on the UCI Machine Learning Repository website.
My understanding is that if I want to encode a variable with, say, 10 categories into 4 features, each category is mapped by a hashing function to an index from 0 to 3, and that index determines which of the 4 features receives a 1 during the encoding. In other words, the hashing function returns the index of the feature that gets the 1.
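Roughly, my mental model is something like the sketch below (this is only illustrative; md5 is an assumption on my part and the exact hash used internally by category_encoders may differ):

import hashlib

def hash_bucket(value, n_components=4):
    # Map a category to one of n_components column indices.
    # NOTE: illustrative only; the actual hashing scheme inside
    # category_encoders may not be md5-based.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

# Every occurrence of the same category should land in the same column:
print(hash_bucket("v"), hash_bucket("h"), hash_bucket("bb"))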
And I can see that in action if I run the following code:
from category_encoders.hashing import HashingEncoder
encoder = HashingEncoder(cols=["A7"], n_components=4)
encoder.fit(X_train)
X_train_enc = encoder.transform(X_train)
The encoded dataset contains the four "hashed features" at the beginning of the dataframe, and the number 1 indicates that the category was allocated to that particular feature:
col_0 col_1 col_2 col_3 A1 A2 A3 A4 A5 A6 A8 A9 A10 \
596 0 0 1 0 a 46.08 3.000 u g c 2.375 t t
303 0 0 1 0 a 15.92 2.875 u g q 0.085 f f
204 0 0 1 0 b 36.33 2.125 y p w 0.085 t t
351 0 1 0 0 b 22.17 0.585 y p ff 0.000 f f
118 0 0 1 0 b 57.83 7.040 u g m 14.000 t t
A11 A12 A13 A14 A15
596 8 t g 396.0 4159
303 0 f g 120.0 0
204 1 f g 50.0 1187
351 0 f g 100.0 0
118 6 t g 360.0 1332
And if I explore the unique values of those features, I can see that they only take values 0 or 1:
for c in ["col_0", "col_1", "col_2", "col_3"]:
    print(X_train_enc[c].unique())
[0 1]
[0 1]
[1 0]
[0 1]
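As a sanity check of my own (not something from the docs), if every category lands in exactly one of the hashed columns, then the row sums over those columns should all be 1 for this single-variable encoding:

# Continuing from X_train_enc above: each row should have exactly one 1
# spread across the four hashed columns.
hashed_cols = ["col_0", "col_1", "col_2", "col_3"]
print(X_train_enc[hashed_cols].sum(axis=1).unique())  # I expect [1]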
Now, if I instead encode multiple categorical variables using the HashingEncoder, I obtain something that I am not sure I understand:
from category_encoders.hashing import HashingEncoder
encoder = HashingEncoder(cols=["A5",'A7', "A12", "A14"], n_components=4)
encoder.fit(X_train)
X_train_enc = encoder.transform(X_train)
The encoded dataset contains the four "hashed features" at the beginning of the dataframe, but now they take values beyond 0 and 1:
col_0 col_1 col_2 col_3 A1 A2 A3 A4 A6 A8 A9 A10 A11 \
596 0 2 2 0 a 46.08 3.000 u c 2.375 t t 8
303 0 1 2 1 a 15.92 2.875 u q 0.085 f f 0
204 1 0 2 1 b 36.33 2.125 y w 0.085 t t 1
351 0 1 2 1 b 22.17 0.585 y ff 0.000 f f 0
118 1 1 2 0 b 57.83 7.040 u m 14.000 t t 6
A13 A15
596 g 4159
303 g 0
204 g 1187
351 g 0
118 g 1332
I can corroborate this by exploring the unique values of those features:
for c in ["col_0", "col_1", "col_2", "col_3"]:
    print(X_train_enc[c].unique())
[0 1 2]
[2 1 0 3]
[2 0 1 3 4]
[0 1 2]
Do I understand correctly that a value of 2 means that the categories of two different variables were allocated to that particular feature for that row?
Is this the expected behaviour?
Somehow, I expected that I would get 4 hashed features per variable, and not 4 hashed features in total. Should this not be the case?
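If my interpretation is right and the four output columns are shared across all encoded variables, with each variable contributing one count per row, then every row should sum to 4 across the hashed columns. This is just my own hypothesis check, continuing from the multi-variable X_train_enc above:

# With 4 encoded variables sharing the same 4 output columns, I would
# expect each row to sum to 4 (one contribution per encoded variable).
hashed_cols = ["col_0", "col_1", "col_2", "col_3"]
print(X_train_enc[hashed_cols].sum(axis=1).unique())  # I expect [4]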
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow