'oversampling using smote not working in Pipeline

I tried using SMOTE into my pipeline but i realise there was no oversampling performed when i evaluated the model. Does anyone have any clue?

from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

param_grid={
    'DecisionTree__max_depth': (10,30,50,70,100),
    'DecisionTree__criterion': ('gini','entropy'),
    'DecisionTree__max_depth': (3,5,7,9,10),
    'DecisionTree__max_features': ('auto','sqrt','log2'),
    'DecisionTree__min_samples_split': (2,4,6)
}

numeric_features = ['Ambient', 'Process', 'Rotation_Speed', 'Torque', 'Tool_Wear']
numeric_transformer = Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])
categorical_features = ['Quality']
categorical_transformer = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("encoder", OneHotEncoder())])
preprocessor = ColumnTransformer(
    transformers=[
        ('num',numeric_transformer, numeric_features),
        ('cat',categorical_transformer, categorical_features)
    ]
)
clf = imbpipeline(
     [["preprocessor", preprocessor],['smote', SMOTE(random_state=42)], ["DecisionTree", DecisionTreeClassifier()]]
)

grid_search = GridSearchCV(clf, param_grid, cv=10, scoring='f1_micro')
grid_search.fit(X, y)


Solution 1:[1]

I put OPs sample into an MCVE on coliru and was able to reproduce the issue of OP. In my case, the execution simply was aborted due to an out-of-bounds access.

I highly suspect that OP passed in the nums1, nums1.size(), nums2, nums2.size() into Solution::merge(). As the merge is destructive (writes result into one of the input containers), this has to be considered.

IMHO, OP seems to be aware of this basically as the merge is done from back to front.

However, as nums1 is used for output it has to be provided with sufficient size before merging.

E.g. nums1.resize(m + n); at the begin of merge will do the job.

My MCVE with fix:

#include <iostream>
#include <vector>

class Solution {
    
  public:
  
    void merge(std::vector<int>& nums1, int m, std::vector<int>& nums2, int n)
    {
      nums1.resize(m + n); // <= ENSURE SUFFICIENT STORAGE IN nums1
      int p1 = m - 1, p2 = n - 1, i = m + n - 1;      
      while (p2 >= 0) {
        if (p1 >= 0 && nums1[p1] < nums2[p2]) {
          nums1[i] = nums2[p2];
          p2--;
        } else {
          nums1[i] = nums1[p1];
          p1--;
        }
        i--;
      }
    }
};

int main()
{
  // sample data
  std::vector<int> nums1{ 1, 3, 5, 7, 9 };
  std::vector<int> nums2{ 2, 4, 6, 8 };
  // run merge
  Solution().merge(nums1, (int)nums1.size(), nums2, (int)nums2.size());
  // output result
  std::cout << "nums1: {";
  const char* sep = " ";
  for (const int num : nums1) {
    std::cout << sep << num;
    sep = ", ";
  }
  std::cout << " }" << std::endl;
}

Output:

nums1: { 1, 2, 3, 4, 5, 6, 7, 8, 9 }

Live Demo on coliru


@Armin Montigny recommended the use of std::merge() instead of a hand-knitted implementation.

First I had some doubts whether it's able to manage destructive merging as well but after thinking twice I realized that it will do it perfectly if used right. I modified my MCVE to see how this would look like.

My alternative MCVE using std::merge():

#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

class Solution {
    
  public:
  
    void merge(std::vector<int>& nums1, int m, std::vector<int>& nums2, int n)
    {
      nums1.resize(m + n); // <= ENSURE SUFFICIENT STORAGE IN nums1
      std::merge(
        nums1.rbegin() + n, nums1.rend(), // consider that nums1 is already resized
        nums2.rbegin(), nums2.rend(),
        nums1.rbegin(),
        std::greater<int>());
    }
};

int main()
{
  // sample data
  std::vector<int> nums1{ 1, 3, 5, 7, 9 };
  std::vector<int> nums2{ 2, 4, 6, 8 };
  // run merge
  Solution().merge(nums1, (int)nums1.size(), nums2, (int)nums2.size());
  // output result
  std::cout << "nums1: {";
  const char* sep = " ";
  for (const int num : nums1) {
    std::cout << sep << num;
    sep = ", ";
  }
  std::cout << " }" << std::endl;
}

Output:

nums1: { 1, 2, 3, 4, 5, 6, 7, 8, 9 }

Live Demo on coliru

Notes:

  1. I used rbegin() and rend(), to merge the vectors from end to begin, like OP did it. This is necessary to write merged elements always after where the still unprocessed elements have to be read.
  2. The input iterators for nums1 may look a bit surprising on the 1st glance. Using nums1.rbegin() + n is necessary to address to skip the space which was already allocated to store the additional elements of nums2. (That are exactly n.)
  3. The std::merge() has to be used with a custom predicate (e.g. std::greater) as I used ascending ordered sample data but merge from back to front which reverses that into a descending order.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1