Merit of using a small sample of data for hyperparameter tuning
In the section "Blueprint recap and conclusion" of this notebook, the author uses only 20% of the data for hyperparameter tuning with grid search, as shown in these lines of code:
# Step 1 - Data Preparation
df['text'] = df['text'].apply(clean)
df = df[df['text'].str.len() > 50]

if (runSVC):
    # Sample the data when running SVC to ensure reasonable run-times
    df = df.groupby('Component', as_index=False).apply(pd.DataFrame.sample,
                                                       random_state=42,
                                                       frac=.2)

# Step 2 - Train-Test Split
X_train, X_test, Y_train, Y_test = train_test_split(df['text'],
                                                    df['Component'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['Component'])
print('Size of Training Data ', X_train.shape[0])
print('Size of Test Data ', X_test.shape[0])

# Step 3 - Training the Machine Learning model
tfidf = TfidfVectorizer(stop_words="english")
if (runSVC):
    model = SVC(random_state=42, probability=True)
    grid_param = [{
        'tfidf__min_df': [5, 10],
        'tfidf__ngram_range': [(1, 3), (1, 6)],
        'model__C': [1, 100],
        'model__kernel': ['linear']
    }]
else:
    model = LinearSVC(random_state=42, tol=1e-5)
    grid_param = {
        'tfidf__min_df': [5, 10],
        'tfidf__ngram_range': [(1, 3), (1, 6)],
        'model__C': [1, 100],
        'model__loss': ['hinge']
    }
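For reference, the `groupby`/`sample` call in Step 1 draws the same fraction from each `Component` class, i.e. a stratified subsample that preserves the class proportions of the full frame. A minimal, self-contained sketch with a made-up two-class frame (using `group_keys=False`, the modern equivalent of the notebook's `as_index=False` pattern):

```python
import pandas as pd

# Made-up frame with a 2:1 class imbalance
df = pd.DataFrame({
    'Component': ['Core'] * 100 + ['UI'] * 50,
    'text': ['some cleaned issue text'] * 150,
})

# Draw 20% from each class so the class proportions are preserved
sample = df.groupby('Component', group_keys=False).apply(
    pd.DataFrame.sample, frac=0.2, random_state=42)

print(sample['Component'].value_counts())
# Core: 20, UI: 10 -> same 2:1 ratio as the full frame
```

A plain `df.sample(frac=0.2)` would usually land close to these proportions anyway, but the per-group version guarantees them, which matters for small minority classes.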
My question is: are the best_params found by the grid search on the smaller dataset actually the best? Would a different best_params be found if we used the full dataset?
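One way to probe this empirically is to run the same grid search twice, once on a 20% stratified sample and once on the full data, and compare the resulting `best_params_`. A hedged sketch of that comparison, using a synthetic numeric dataset and a `LinearSVC` in place of the notebook's text corpus and pipeline (the helper `best_params_on` is mine, not from the notebook):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the bug-report corpus
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, random_state=42)
df = pd.DataFrame(X)
df['label'] = y

grid_param = {'C': [1, 100]}  # same C grid as the notebook

def best_params_on(frac):
    """Tune on a stratified subsample of the given fraction."""
    sub = df.groupby('label', group_keys=False).apply(
        pd.DataFrame.sample, frac=frac, random_state=42)
    gs = GridSearchCV(LinearSVC(random_state=42, tol=1e-5),
                      grid_param, cv=5)
    gs.fit(sub.drop(columns='label'), sub['label'])
    return gs.best_params_

print(best_params_on(0.2))  # tuned on the 20% sample
print(best_params_on(1.0))  # tuned on the full data
```

If the two runs disagree, the sample was too small for that hyperparameter to be estimated stably; regularization strength (`C`) and vocabulary cut-offs (`min_df`) are exactly the kind of parameters whose optimum can shift with dataset size.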
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow