Oversampling using SMOTE not working in Pipeline
I tried using SMOTE in my pipeline, but I realised there was no oversampling performed when I evaluated the model. Does anyone have any clue?
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
    'DecisionTree__criterion': ('gini', 'entropy'),
    'DecisionTree__max_depth': (3, 5, 7, 9, 10, 30, 50, 70, 100),
    'DecisionTree__max_features': ('auto', 'sqrt', 'log2'),
    'DecisionTree__min_samples_split': (2, 4, 6)
}
numeric_features = ['Ambient', 'Process', 'Rotation_Speed', 'Torque', 'Tool_Wear']
numeric_transformer = Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])
categorical_features = ['Quality']
categorical_transformer = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("encoder", OneHotEncoder())])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
clf = imbpipeline([
    ("preprocessor", preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("DecisionTree", DecisionTreeClassifier())
])
grid_search = GridSearchCV(clf, param_grid, cv=10, scoring='f1_micro')
grid_search.fit(X, y)
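As a quick sanity check, the class balance can be inspected by applying the preprocessor and SMOTE by hand (a minimal sketch, assuming X and y are already loaded with the column names used above):
from collections import Counter
# assumes X, y and the preprocessor from the snippet above
X_pre = preprocessor.fit_transform(X)                         # preprocessing only, no resampling yet
X_res, y_res = SMOTE(random_state=42).fit_resample(X_pre, y)  # resample the preprocessed data
print("class counts before resampling:", Counter(y))
print("class counts after resampling:", Counter(y_res))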
Solution 1:[1]
I put OP's sample into an MCVE on coliru and was able to reproduce OP's issue. In my case, the execution was simply aborted due to an out-of-bounds access.
I strongly suspect that OP passed nums1, nums1.size(), nums2, nums2.size() into Solution::merge(). As the merge is destructive (it writes the result into one of the input containers), this has to be taken into account.
IMHO, OP basically seems to be aware of this, as the merge is done from back to front.
However, as nums1 is used for the output, it has to be given sufficient size before merging.
E.g. nums1.resize(m + n); at the beginning of merge will do the job.
My MCVE with fix:
#include <iostream>
#include <vector>
class Solution {
public:
    void merge(std::vector<int>& nums1, int m, std::vector<int>& nums2, int n)
    {
        nums1.resize(m + n); // <= ENSURE SUFFICIENT STORAGE IN nums1
        int p1 = m - 1, p2 = n - 1, i = m + n - 1;
        while (p2 >= 0) {
            // take the larger of the two current back elements;
            // once nums1's original elements are exhausted (p1 < 0), take from nums2
            if (p1 >= 0 && nums1[p1] > nums2[p2]) {
                nums1[i] = nums1[p1];
                p1--;
            } else {
                nums1[i] = nums2[p2];
                p2--;
            }
            i--;
        }
    }
};
int main()
{
    // sample data
    std::vector<int> nums1{ 1, 3, 5, 7, 9 };
    std::vector<int> nums2{ 2, 4, 6, 8 };
    // run merge
    Solution().merge(nums1, (int)nums1.size(), nums2, (int)nums2.size());
    // output result
    std::cout << "nums1: {";
    const char* sep = " ";
    for (const int num : nums1) {
        std::cout << sep << num;
        sep = ", ";
    }
    std::cout << " }" << std::endl;
}
Output:
nums1: { 1, 2, 3, 4, 5, 6, 7, 8, 9 }
@Armin Montigny recommended using std::merge() instead of a hand-knitted implementation.
At first I had some doubts whether it can manage the destructive merging as well, but after thinking twice I realized that it will do the job perfectly if used right. I modified my MCVE to see how this would look.
My alternative MCVE using std::merge():
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>
class Solution {
public:
    void merge(std::vector<int>& nums1, int m, std::vector<int>& nums2, int n)
    {
        nums1.resize(m + n); // <= ENSURE SUFFICIENT STORAGE IN nums1
        std::merge(
            nums1.rbegin() + n, nums1.rend(), // consider that nums1 is already resized
            nums2.rbegin(), nums2.rend(),
            nums1.rbegin(),
            std::greater<int>());
    }
};
int main()
{
    // sample data
    std::vector<int> nums1{ 1, 3, 5, 7, 9 };
    std::vector<int> nums2{ 2, 4, 6, 8 };
    // run merge
    Solution().merge(nums1, (int)nums1.size(), nums2, (int)nums2.size());
    // output result
    std::cout << "nums1: {";
    const char* sep = " ";
    for (const int num : nums1) {
        std::cout << sep << num;
        sep = ", ";
    }
    std::cout << " }" << std::endl;
}
Output:
nums1: { 1, 2, 3, 4, 5, 6, 7, 8, 9 }
Notes:
- I used rbegin() and rend() to merge the vectors from end to begin, like OP did. This is necessary so that merged elements are always written behind the still unprocessed elements that remain to be read.
- The input iterators for nums1 may look a bit surprising at first glance. Using nums1.rbegin() + n is necessary to skip the space that was just allocated to store the additional elements of nums2 (which are exactly n).
- std::merge() has to be used with a custom predicate (e.g. std::greater) because the sample data is in ascending order but is merged from back to front, which reverses that into a descending order.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |