ValueError: A given column is not a column of the dataframe when using ColumnTransformer for pipeline in sklearn

Hi, I am trying to learn how pipelines work. I have read a CSV file (https://www.kaggle.com/zhangjuefei/birds-bones-and-living-habits) and want to apply a pipeline for preprocessing and classification.

I have been referring to sklearn's official documentation for pipelines. This is the code I ran in Google Colab.

import numpy as np
import pandas as pd

from sklearn import svm
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/bird.csv')

x = data1.iloc[:,1:11]   # columns 1-10 as features
y = data1.iloc[:,11:12]  # column 11 as the target
numeric_features = ['huml','humw','ulnal','ulnaw','feml','femw','tibl','tibw','tarl','tarw']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
categorical_features = ['type']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
pipeline_lr = Pipeline(steps=[
                        ('preprocessor', preprocessor),
                        ('LRClassifier',LogisticRegression(random_state=0))
                        ]
                       )
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,random_state=0)
if 'type' in y_train:
  print('Present') 
pipeline_lr.fit(x_train, y_train)

ValueError: 'type' is not in list

ValueError: A given column is not a column of the dataframe

Can anyone suggest how to fix this?



Solution 1:[1]

First, import ColumnTransformer and make_column_selector:

from sklearn.compose import ColumnTransformer, make_column_selector

Then select the numeric columns by dtype instead of listing column names by hand:

preprocessing = ColumnTransformer(transformers=[
    ('numerical', StandardScaler(),
     make_column_selector(dtype_include=np.number))], remainder='passthrough')

pipe = Pipeline([('preprocess', preprocessing)])
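As for why the error appears: in the question, x is built from columns 1 to 10 only, while 'type' ends up in y. The ColumnTransformer is fitted on x_train, so asking it for a 'type' column there fails with "A given column is not a column of the dataframe". Below is a minimal sketch of the whole pipeline with that entry removed. It assumes 'type' is the target rather than a feature, and it uses a small synthetic DataFrame (made-up values, same column names as the question) in place of bird.csv so it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for bird.csv: random values, column names from the question.
rng = np.random.default_rng(0)
numeric_features = ['huml', 'humw', 'ulnal', 'ulnaw', 'feml',
                    'femw', 'tibl', 'tibw', 'tarl', 'tarw']
data1 = pd.DataFrame(rng.normal(size=(60, 10)), columns=numeric_features)
data1.insert(0, 'id', np.arange(60))
data1['type'] = rng.choice(['SW', 'W', 'T'], size=60)  # hypothetical labels
data1.loc[3, 'huml'] = np.nan  # one missing value for the imputer to fill

x = data1.iloc[:, 1:11]  # the ten numeric columns
y = data1['type']        # 1-D target; 'type' is not a column of x

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Only columns that exist in x are referenced; there is no 'cat' entry for 'type'.
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, make_column_selector(dtype_include=np.number))])

pipeline_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('LRClassifier', LogisticRegression(random_state=0))])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
pipeline_lr.fit(x_train, y_train)
predictions = pipeline_lr.predict(x_test)
```

Passing y as a Series (data1['type']) instead of a one-column DataFrame also avoids the warning sklearn raises when the target is a column vector. A classifier target does not need one-hot encoding, so the OneHotEncoder step from the question is not needed here.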

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ElhamMotamedi