How to decide an optimal value for the n_features parameter in sklearn's FeatureHasher

I am using hash encoding on a categorical column with 13 distinct values. Ideally, one-hot encoding and dummy coding would give 13 and 12 columns respectively after encoding. But with hash encoding, the default value of n_features is 2**20, which ends up creating 1,048,576 columns.

How does one choose the value of n_features? I understand it should be a power of 2. Say, for the Iris dataset, we could get away with 2 or maybe 4 for n_features.

But what about a dataset where a column has 13 distinct values? What should n_features be then?
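To make the question concrete, here is a small sketch of the kind of check I have in mind. The category labels are made up, and 16 is just my assumption of "the next power of two above 13": hash the 13 values into 16 columns and count how many columns actually get used, so any collisions become visible.

from sklearn.feature_extraction import FeatureHasher

# 13 made-up labels standing in for my real categorical column
values = ['cat_{}'.format(i) for i in range(13)]

# Assumption: n_features = 16, the next power of two above 13
hasher = FeatureHasher(n_features=16, alternate_sign=False, input_type='string')

# With input_type='string' each sample must be an iterable of strings,
# so every label is wrapped in a single-element list
hashed = hasher.transform([[v] for v in values]).toarray()

# Number of columns that received at least one value;
# anything below 13 means two labels collided into the same column
print((hashed.sum(axis=0) > 0).sum())

Is that collision count the right thing to look at when choosing n_features, or is there a standard rule of thumb?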

# Feature Hashing Code:

import pandas as pd, numpy as np
from sklearn.feature_extraction import FeatureHasher

df = pd.read_csv(r'C:/Users/<user_name>/Downloads/datasets/countriesoftheworld.csv')
hash_encoder = FeatureHasher(n_features=????, alternate_sign=False, input_type='string')
# With input_type='string', each sample must be an iterable of strings,
# so each country name is wrapped in a single-element list
features = hash_encoder.fit_transform(df['Country'].apply(lambda x: [x]))

print(df.shape)

(227, 20)

I have not dropped the "Country" column, so that the original and encoded columns can be compared side by side.

df = df.join(pd.DataFrame(features.toarray()).add_prefix('encoded_'))

print(df.shape)

(227, 1048596)
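That 1,048,596 is just the 20 original columns plus the 2**20 = 1,048,576 hashed columns. As a purely illustrative assumption, if I instead picked 256 (the next power of two above the 227 country names) and continued from the code above, the hashed matrix would have 256 columns and the joined frame 276. But I still don't know whether that is a sensible way to pick the value, or how to account for collisions.

# Assumption: 256 is the next power of two above the 227 distinct countries
small_hasher = FeatureHasher(n_features=256, alternate_sign=False, input_type='string')
features_small = small_hasher.fit_transform(df['Country'].apply(lambda x: [x]))

print(features_small.shape)   # (227, 256), so the joined frame would be (227, 276)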



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
