How to decide an optimum value for the n_features parameter in sklearn's FeatureHasher
I am using hash encoding on a categorical column with 13 distinct values. Ideally, one-hot and dummy encoding would give 13 and 12 columns respectively after encoding. But with hash encoding, the default value of n_features is 2**20, which creates over a million (1,048,576) columns.
How does one choose the value of n_features? I understand the value should be rounded to the nearest power of 2. If we consider the Iris dataset, for example, we might end up using 2 or maybe even 4 for n_features.
But what about a dataset where a column has 13 distinct values, what should n_features be then?
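To make the comparison concrete, here is a minimal sketch on a synthetic 13-category column (the series `s` and its `cat_*` values are made up for illustration), showing the one-hot / dummy column counts next to a hashed encoding with n_features=16, the nearest power of 2 above 13:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Synthetic column with 13 distinct categories (illustrative only)
s = pd.Series([f'cat_{i}' for i in range(13)] * 10, name='col')

print(pd.get_dummies(s).shape[1])                   # 13 columns (one-hot)
print(pd.get_dummies(s, drop_first=True).shape[1])  # 12 columns (dummy / drop-first)

# Hashed encoding: 16 is the nearest power of 2 >= 13
hasher = FeatureHasher(n_features=16, input_type='string', alternate_sign=False)
hashed = hasher.transform(s.apply(lambda v: [v]))   # each sample is a list of strings
print(hashed.shape)                                 # (130, 16)
```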
# Feature Hashing Code:

```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction import FeatureHasher

df = pd.read_csv(r'C:/Users/<user_name>/Downloads/datasets/countriesoftheworld.csv')

hash_encoder = FeatureHasher(n_features=????, alternate_sign=False, input_type='string')
# With input_type='string', each sample must be an iterable of strings,
# so wrap each country name in a list; otherwise the string itself is
# iterated and its individual characters are hashed.
features = hash_encoder.fit_transform(df['Country'].apply(lambda x: [x]))

print(df.shape)
# (227, 20)
```
I have not dropped the "Country" column, so that I can contrast the original columns with the encoded ones.
```python
df = df.join(pd.DataFrame(features.toarray()).add_prefix('encoded_'))

print(df.shape)
# (227, 1048596)
```
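One way to sanity-check a candidate n_features (not part of the original post, just a suggestion) is to hash only the distinct category values and count how many are forced to share a column; the count_collisions helper below is hypothetical and assumes the same 'Country' column:

```python
from sklearn.feature_extraction import FeatureHasher

def count_collisions(values, n_features):
    """Number of distinct values forced to share a hashed column."""
    hasher = FeatureHasher(n_features=n_features, input_type='string',
                           alternate_sign=False)
    matrix = hasher.transform([[v] for v in values]).toarray()
    occupied = int((matrix.sum(axis=0) > 0).sum())  # columns actually used
    return len(values) - occupied                   # 0 means no collisions

unique_countries = df['Country'].astype(str).unique()
for n in (16, 32, 64, 128, 256, 512):
    print(n, count_collisions(unique_countries, n))
```

Picking the smallest power of 2 that keeps collisions at zero (or acceptably low) is one pragmatic way to trade off dimensionality against information loss.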
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow