Is the HashingEncoder from category_encoders meant to be used over multiple variables at a time?

I am trying to understand the output of the HashingEncoder when encoding more than one variable.

I am working with the Credit Approval dataset from the UCI Machine Learning Repository.

My understanding is that if I want to encode a variable with, say, 10 categories into 4 features, each category will be mapped by a hashing function to a value from 0 to 3, which is then used as the index of the feature that receives a 1 during the encoding. In other words, the hashing function returns the index of the feature to which the category is allocated.
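That idea can be sketched in a few lines. This is a simplified illustration of the hashing trick, not the library's internal code: I am assuming an md5-based hash (category_encoders' default `hash_method`) and reducing the digest modulo the number of output features.

```python
import hashlib

def hash_to_index(category, n_components=4):
    # Simplified stand-in for the encoder's hashing step: hash the category
    # string and map the digest to one of n_components bucket indices.
    digest = hashlib.md5(str(category).encode()).hexdigest()
    return int(digest, 16) % n_components

def encode(category, n_components=4):
    # Build the one-hot-style row: a 1 at the hashed index, 0 elsewhere.
    row = [0] * n_components
    row[hash_to_index(category, n_components)] = 1
    return row

print(encode("v"))  # exactly one 1 among the 4 features
```

Under this sketch, each category always lands in the same bucket, and a single encoded column produces rows that contain exactly one 1.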

And I can see that in action if I run the following code:

from category_encoders.hashing import HashingEncoder

encoder = HashingEncoder(cols=["A7"], n_components=4)

encoder.fit(X_train)

X_train_enc = encoder.transform(X_train)

The encoded dataset contains the four hashed features at the beginning of the dataframe, and a 1 indicates that the category was allocated to that particular feature:

     col_0  col_1  col_2  col_3 A1     A2     A3 A4 A5  A6      A8 A9 A10  \
596      0      0      1      0  a  46.08  3.000  u  g   c   2.375  t   t   
303      0      0      1      0  a  15.92  2.875  u  g   q   0.085  f   f   
204      0      0      1      0  b  36.33  2.125  y  p   w   0.085  t   t   
351      0      1      0      0  b  22.17  0.585  y  p  ff   0.000  f   f   
118      0      0      1      0  b  57.83  7.040  u  g   m  14.000  t   t   

     A11 A12 A13    A14   A15  
596    8   t   g  396.0  4159  
303    0   f   g  120.0     0  
204    1   f   g   50.0  1187  
351    0   f   g  100.0     0  
118    6   t   g  360.0  1332  

And if I explore the unique values of those features, I can see that they only take values 0 or 1:

for c in ["col_0", "col_1",  "col_2", "col_3"]:
    print(X_train_enc[c].unique())

[0 1]
[0 1]
[1 0]
[0 1]

Now, if I instead encode multiple categorical variables using the HashingEncoder, I obtain something that I am not sure I understand:

from category_encoders.hashing import HashingEncoder

encoder = HashingEncoder(cols=["A5",'A7', "A12", "A14"], n_components=4)

encoder.fit(X_train)

X_train_enc = encoder.transform(X_train)

The encoded dataset contains the four hashed features at the beginning of the dataframe, but now they take values beyond 0 and 1:

     col_0  col_1  col_2  col_3 A1     A2     A3 A4  A6      A8 A9 A10  A11  \
596      0      2      2      0  a  46.08  3.000  u   c   2.375  t   t    8   
303      0      1      2      1  a  15.92  2.875  u   q   0.085  f   f    0   
204      1      0      2      1  b  36.33  2.125  y   w   0.085  t   t    1   
351      0      1      2      1  b  22.17  0.585  y  ff   0.000  f   f    0   
118      1      1      2      0  b  57.83  7.040  u   m  14.000  t   t    6   

    A13   A15  
596   g  4159  
303   g     0  
204   g  1187  
351   g     0  
118   g  1332 

Which I corroborate if I explore the unique values of those features:

for c in ["col_0", "col_1",  "col_2", "col_3"]:
    print(X_train_enc[c].unique())

[0 1 2]
[2 1 0 3]
[2 0 1 3 4]
[0 1 2]

Do I understand correctly that a value of 2 means that categories from 2 different variables were hashed to that particular feature?
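Here is a small self-contained simulation of what I suspect is happening. The `bucket` helper is my own simplified stand-in for the encoder's hashing step (assuming md5, the library's default `hash_method`), and the row values are made up for illustration:

```python
import hashlib

def bucket(value, n_components=4):
    # Simplified stand-in for the encoder's hashing step (not the library's
    # exact internals): hash the value and take it modulo n_components.
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % n_components

# A hypothetical row, with one value for each of the four encoded columns.
row = {"A5": "g", "A7": "v", "A12": "t", "A14": "396.0"}

counts = [0] * 4
for value in row.values():
    counts[bucket(value)] += 1  # each column's value increments one bucket

print(counts)       # four increments spread over four buckets
print(sum(counts))  # always 4: one increment per encoded column
```

If all four columns share the same four output features, two values hashing to the same bucket would produce a 2, which would match what I see above.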

Is this the expected behaviour?

Somehow, I expected that I would get 4 hashed features per variable, and not 4 hashed features in total. Should this not be the case?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
