Issues with caching a dataframe in Streamlit

I am loading a .csv file into a pandas dataframe and caching that dataframe with the @st.cache decorator. I then want to predict a class for each row using a predefined classification model (RandomForest, XGBoost).

Essentially a column will be added to the original dataframe and stored in a new variable.

However, I am having issues caching this new dataframe.

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
import streamlit as st

def main():
    st.set_page_config(layout="wide")
    st.title('Classification Problem on Home Equity dataset')

if __name__ == '__main__':
    main()

#Load prediction data
    @st.cache
    def load_predict():
        data= pd.read_csv("hmeq_Predict_2.csv")  #Currently on my local machine
        return data
    df_predict = load_predict()

# Predict on data
    @st.cache
    def predictor_func():
        y_pred_nd = pd.Series(model.predict(df_predict),name='BAD')
        Predicted_X = pd.concat([df_predict,y_pred_nd],axis=1)
        #This is the Dataframe that I want to cache
        return Predicted_X

#Run XGBoost classification , I have loaded X_train and y_train also, not shown in this example
    if classifier == "XGBoost":
        if st.sidebar.button("Run Classification", key="Classification"):
            model = XGBClassifier() 
            model.fit(X_train,y_train) 
           #I want this function to return the cached dataframe.
            Predicted_X=predictor_func()  
# This command will correctly display the Dataframe, meaning that the predictor_func() ran correctly
            st.write(Predicted_X)    

#However, when I want to display the dataframe, Predicted_X, only when I click this button
    if st.sidebar.button("Run Prediction on new Data", key="Prediction"):
        st.subheader('Check last column for prediction. ')
        st.write(Predicted_X)

This is the error I get:

NameError: name 'Predicted_X' is not defined
Traceback: File "C:\Users\vchaubal\Anaconda3\envs\Jupyter_Project_2\lib\site-packages\streamlit\script_runner.py", line 379, in _run_script
    exec(code, module.__dict__) File "C:\Users\vchaubal\Downloads\Streamlit_project.py", line 328, in <module>
    st.write(Predicted_X)

Am I missing a key concept here? Also, is there a way to cache a model from sklearn?



Solution 1:[1]

NameError: name 'Predicted_X' is not defined means that you are using a variable, Predicted_X, that has not been assigned yet (that is, no Predicted_X = ... statement has run before that line).

In your code, at

    if st.sidebar.button("Run Prediction on new Data", key="Prediction"):
        st.subheader('Check last column for prediction. ')
        st.write(Predicted_X)

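There is no guarantee that the Predicted_X = ... assignment from the previous lines has been executed: it only runs when classifier == "XGBoost" and the "Run Classification" button was clicked.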
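The underlying behaviour is that Streamlit reruns the whole script from top to bottom on every widget interaction, and a button is only "pressed" during the rerun that its click triggered. The sketch below imitates that in plain Python, without Streamlit. The function run_script and its clicked_button argument are hypothetical stand-ins for one rerun of the script and for the buttons, and the list of strings stands in for the dataframe returned by predictor_func().

```python
def run_script(clicked_button):
    """Simulate one top-to-bottom rerun of the script.

    clicked_button is a hypothetical stand-in for whichever
    st.sidebar.button(...) call returns True during this rerun.
    """
    output = []
    if clicked_button == "Classification":
        Predicted_X = ["row 1", "row 2"]  # stand-in for predictor_func()
        output.append(Predicted_X)
    if clicked_button == "Prediction":
        # Predicted_X was only assigned in the branch above, which did not
        # run during this rerun, so the name does not exist here.
        output.append(Predicted_X)
    return output


# Rerun 1: the "Run Classification" button was clicked.
first = run_script("Classification")

# Rerun 2: the "Run Prediction" button was clicked; nothing carries over.
try:
    run_script("Prediction")
    error = None
except NameError as exc:  # UnboundLocalError is a subclass of NameError
    error = exc
```

The second call fails in the same way as the traceback in the question: variables from the first rerun are gone by the time the second one starts.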

Your code should look like this:

Predicted_X = None  # Initialize Predicted_X so the name always exists

if classifier == "XGBoost":
    if st.sidebar.button("Run Classification", key="Classification"):
        model = XGBClassifier() 
        model.fit(X_train, y_train) 
        Predicted_X = predictor_func()  
        st.write(Predicted_X)    

if st.sidebar.button("Run Prediction on new Data", key="Prediction"):
    st.subheader('Check last column for prediction. ')
    # Show Predicted_X only if it has been computed
    if Predicted_X is None:
        st.write("Predicted_X has not been computed yet")
    else:
        st.write(Predicted_X)
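One caveat: this removes the NameError, but because Streamlit reruns the script on every interaction, Predicted_X is reset to None again on the rerun triggered by the second button, so that button will show the "not computed" message. To keep the result between reruns, it has to live somewhere that outlives a single rerun. In Streamlit releases that provide it, st.session_state serves this purpose. The sketch below is plain Python and only illustrates the pattern: the session_state dict is a stand-in for st.session_state, and run_script is a hypothetical stand-in for one rerun of the script.

```python
# Stand-in for st.session_state: a mapping that outlives individual reruns.
session_state = {}


def run_script(clicked_button, state):
    """Simulate one rerun; state plays the role of st.session_state."""
    output = []
    if clicked_button == "Classification":
        # Store the result in the persistent state instead of a plain variable.
        state["Predicted_X"] = ["row 1", "row 2"]  # stand-in for predictor_func()
        output.append(state["Predicted_X"])
    if clicked_button == "Prediction":
        if "Predicted_X" in state:
            output.append(state["Predicted_X"])
        else:
            output.append("Predicted_X has not been computed yet")
    return output


before = run_script("Prediction", session_state)   # nothing stored yet
run_script("Classification", session_state)        # stores the result
after = run_script("Prediction", session_state)    # result is still available
```

In a real app, the same membership test and item assignment can be written directly against st.session_state, for example st.session_state["Predicted_X"] = predictor_func().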

As for your other question:

Also, is there a way to cache a model from sklearn?

There is a way:

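import pickle

@st.cache()
def load_xgboost_model():
    # Fit once; later reruns reuse the cached model
    model = XGBClassifier()
    model.fit(X_train, y_train)
    return model

@st.cache()
def load_sklearn_model(path_to_sklearn_model):
    # Load a previously pickled model; the with block closes the file
    with open(path_to_sklearn_model, "rb") as f:
        model = pickle.load(f)
    return model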
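The second function assumes the model was pickled to disk beforehand. As a rough illustration of that save-then-load round trip, the sketch below runs without Streamlit or scikit-learn. These parts are assumptions made for the example: a plain dict stands in for the fitted model, functools.lru_cache stands in for the Streamlit cache decorator, and the file name model.pkl is hypothetical.

```python
import functools
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; a real model object is pickled the same way.
fitted_model = {"name": "demo-model", "n_estimators": 100}

# Save the model once, for example at the end of a training script.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(fitted_model, f)


@functools.lru_cache(maxsize=None)  # stand-in for the Streamlit cache decorator
def load_model(path):
    # The body only runs on the first call for a given path.
    with open(path, "rb") as f:
        return pickle.load(f)


model_a = load_model(model_path)
model_b = load_model(model_path)  # served from the cache, file is not re-read
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.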

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 vinzee