Why Does StackingClassifier Raise an Error When a Component Classifier Does Not?

I am using StackingClassifier to combine several model pipelines for predicting hospital readmission on the UCI diabetes dataset. Each pipeline works fine on its own, but I keep running into problems when I try to combine them. Why does a standalone text classifier run while the stacked classifier does not, and how can I fix it?

Here is the section that raises the error:

stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)

x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
    ], axis=1
    ),
    diabetes_data["readmitted"]                                                
)

# This line throws the error in the fit function
stack_clf.fit(x_train, y_train).score(x_test, y_test)

ValueError: could not convert string to float: 'bronchitis specified acute chronic'

Now an example of a component classifier that works just fine:

x_train, x_test, y_train, y_test = train_test_split(
    diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
    diabetes_data["readmitted"]
)

text_pipe.fit(x_train, y_train).score(x_test, y_test)

0.5935

Because it is unclear to me where in the pipeline the error is originating, I have provided the full minimal reproducible example below.

Select Columns

text_data = [
    "diag_1_desc",
    "diag_2_desc",
    "diag_3_desc"
]

scalar_data = [
    "num_medications",
    "time_in_hospital",
    "num_lab_procedures",
    "num_procedures",
    "number_outpatient",
    "number_emergency",
    "number_inpatient",
    "number_diagnoses",
]

ordinal_data = [
    "age"
]

categorical_data = [
    "race",
    "gender",
    "admission_type_id",
    "discharge_disposition_id",
    "admission_source_id",
    "insulin",
    "diabetesMed",
    "change",
    "A1Cresult",
    "metformin",
    "repaglinide",
    "nateglinide",
    "chlorpropamide",
    "glimepiride",
    "glipizide",
    "glyburide",
    "tolbutamide",
    "pioglitazone",
    "rosiglitazone",
    "acarbose",
    "miglitol",
    "tolazamide",
    "glyburide.metformin",
    "glipizide.metformin",    
]

Create Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(
    solver = "saga",
    penalty="elasticnet",
    l1_ratio=0.5,
    max_iter=1000
)

Create Column Transformers

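from sklearn import compose, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

text_trans = compose.make_column_transformer(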
    (TfidfVectorizer(ngram_range=(1,2)), "diag_1_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_2_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_3_desc"),
    remainder="passthrough",
)

scalar_trans = compose.make_column_transformer(
    (
        preprocessing.StandardScaler(),
        scalar_data
    ),
    remainder="passthrough",
)

cat_trans = compose.make_column_transformer(
    (
        preprocessing.OneHotEncoder(
            sparse=False,
            handle_unknown="ignore"
        ),
        categorical_data
    ),
    (
        preprocessing.OrdinalEncoder(),
        ordinal_data
    ),
    remainder="passthrough",
)

Create Pipeline Estimators

from sklearn.pipeline import make_pipeline

text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_trans, logreg)

estimators = [
    ("cat", cat_pipe),
    ("text", text_pipe),
    ("scalar", scalar_pipe)
]

Create and Fit Stacking Classifier

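import pandas as pd
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split

stack_clf = StackingClassifier(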
    estimators=estimators,
    final_estimator=LogisticRegression()
)

x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
    ], axis=1
    ),
    diabetes_data["readmitted"]                                                
)

stack_clf.fit(x_train, y_train).score(x_test, y_test)

ValueError: could not convert string to float: 'bronchitis specified acute chronic'

My pipeline also relies on two helper functions that I use for preprocessing the text data by removing punctuation and stopwords.

Helper Functions

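import re

import nltk
import pandas as pd

# Build these once instead of once per word.
STOPWORDS = set(nltk.corpus.stopwords.words('english'))
LEMMATIZER = nltk.stem.WordNetLemmatizer()

def preprocess_text(text):
    try:
        # Replace anything that is not a letter with a space.
        text = re.sub('[^a-zA-Z]', ' ', text)
        words = text.lower().split()
        words = [word for word in words if word not in STOPWORDS]
        words = [LEMMATIZER.lemmatize(word) for word in words if len(word) > 1]
        return ' '.join(words)
    except TypeError:
        # Non-string input (e.g. NaN) becomes an empty document.
        return ''

def preprocess_series(series):
    # Clean each element; unlike positional indexing, this keeps the original index.
    return series.apply(preprocess_text)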


Solution 1:[1]

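It looks like your component pipelines don't all work, just the text one. Your other pipelines use a column transformer with remainder='passthrough', which means they pass the text columns along untouched, and the logistic regression then balks at the raw strings. Setting remainder='drop' (the default) in each column transformer restricts each pipeline to the columns it actually transforms.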
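As a minimal sketch of this behaviour, the example below uses a small made-up frame with one text column and one numeric column in place of the diabetes data. The frame contents and the build_pipes helper are hypothetical and only serve the illustration; the column names are borrowed from the question. It first fits a scalar pipeline with remainder="passthrough" on the combined frame, which should fail in the same way, and then fits a stacked model whose transformers use remainder="drop".

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the combined diabetes data.
X = pd.DataFrame({
    "diag_1_desc": ["acute bronchitis", "chronic kidney disease",
                    "acute bronchitis", "heart failure"] * 5,
    "num_medications": [3.0, 12.0, 4.0, 15.0] * 5,
})
y = pd.Series([0, 1, 0, 1] * 5)


def build_pipes(remainder):
    """Build one text and one scalar pipeline with the given remainder."""
    text_trans = make_column_transformer(
        (TfidfVectorizer(), "diag_1_desc"), remainder=remainder
    )
    scalar_trans = make_column_transformer(
        (StandardScaler(), ["num_medications"]), remainder=remainder
    )
    return [
        ("text", make_pipeline(text_trans, LogisticRegression())),
        ("scalar", make_pipeline(scalar_trans, LogisticRegression())),
    ]


# remainder="passthrough": the scalar pipeline forwards the raw text column,
# so LogisticRegression fails when it tries to cast strings to floats.
passthrough_error = None
try:
    dict(build_pipes("passthrough"))["scalar"].fit(X, y)
except ValueError as exc:
    passthrough_error = exc
print("passthrough:", passthrough_error)

# remainder="drop" (the default): each pipeline sees only its own columns,
# so the stacked model can be fit on the combined frame.
stack_clf = StackingClassifier(
    estimators=build_pipes("drop"),
    final_estimator=LogisticRegression(),
)
score = stack_clf.fit(X, y).score(X, y)
print("drop:", score)
```

The standalone text pipeline in the question only appeared to work because it was fit on a frame containing nothing but the text columns, so there was no remainder to pass through.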

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ben Reiniger