Pandas Regression Model Replacing Column Values

I have a data frame "df" with columns "bedrooms", "bathrooms", "sqft_living", and "sqft_lot".

Data frame "df"

I want to fill in the missing values in a column using a regression model, so that each missing value is predicted from the values of the other columns in the same row.

For example, sqft_living is missing in row 12. The values of bedrooms, bathrooms, and sqft_lot in that row would be used to predict the missing value.

Is there any way to do this? Any help is appreciated. Thanks!



Solution 1:[1]

import pandas as pd
from sklearn.linear_model import LinearRegression

# setup
dictionary = {'bedrooms': [3,3,2,4,3,4,3,3,3,3,3,2,3,3],
              'bathrooms': [1,2.25,1,3,2,4.5,2.25,1.5,1,2.5,2.5,1,1,1.75],
              'sqft_living': [1180, 2570,770,1960,1680,5420,1715,1060,1780,1890,'',1160,'',1370],
              'sqft_lot': [5650,7242,10000,5000,8080,101930,6819,9711,7470,6560,9796,6000,19901,9680]}
df = pd.DataFrame(dictionary)

# setup x and y for training
# drop rows where sqft_living is empty
clean_df = df[df['sqft_living'] != '']
# separate variables into x (bedrooms, bathrooms, sqft_lot) and y (sqft_living)
x = clean_df.iloc[:, [0,1,3]].values
y = clean_df['sqft_living'].values

# fit my model
lm = LinearRegression()
lm.fit(x, y)

# get the rows I am trying to do my prediction on
predict_x = df[df['sqft_living'] == ''].iloc[:, [0,1,3]].values

# perform my prediction
lm.predict(predict_x)
# I get values 1964.983 for row 10, and 1567.068 for row 12

Note that what you're asking about is called imputation. I suggest reading up on the other imputation methods, their trade-offs, and when imputing is appropriate.

Edit: Putting the predictions back into the DataFrame:
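For comparison, one of the simplest imputation methods is to fill each gap with the column median. The sketch below is not part of the regression approach above. It assumes missing entries are stored as '' (as in the setup) and uses a shortened, illustrative sqft_living column. It ignores the other columns entirely, which is the trade-off against a regression-based fill.

```python
import pandas as pd

# Shortened, illustrative column; '' marks a missing value as in the setup above
df = pd.DataFrame({"sqft_living": [1180, 2570, 770, 1960, "", 1160, "", 1370]})

# Coerce to numeric so that '' becomes NaN, which pandas treats as missing
living = pd.to_numeric(df["sqft_living"], errors="coerce")

# Fill every missing entry with the median of the known values
median_filled = living.fillna(living.median())
print(median_filled)
```

Every missing row gets the same value here, so this works best as a baseline to judge whether the regression predictions are worth the extra complexity.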

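# Get index of missing data
missing_index = df[df['sqft_living'] == ''].index
# Replace
df.loc[missing_index, 'sqft_living'] = lm.predict(predict_x)
# The column was object dtype because of the '' entries; make it numeric
df['sqft_living'] = df['sqft_living'].astype(float)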
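The same steps can be wrapped into a reusable function. This is a minimal sketch rather than a definitive implementation: the function name impute_with_regression is hypothetical, it converts '' to NaN with pd.to_numeric instead of comparing against '', and it assumes the feature columns themselves have no missing values.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_with_regression(df, target, features):
    """Return a copy of df with missing `target` values predicted from `features`."""
    out = df.copy()
    # Coerce the target to numeric so '' (or any non-number) becomes NaN
    out[target] = pd.to_numeric(out[target], errors="coerce")
    missing = out[target].isna()
    if not missing.any():
        return out
    # Train only on rows where the target is known
    model = LinearRegression()
    model.fit(out.loc[~missing, features], out.loc[~missing, target])
    # Predict the missing rows and write the predictions back
    out.loc[missing, target] = model.predict(out.loc[missing, features])
    return out


# Same data as the setup above
df = pd.DataFrame({
    "bedrooms": [3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3],
    "bathrooms": [1, 2.25, 1, 3, 2, 4.5, 2.25, 1.5, 1, 2.5, 2.5, 1, 1, 1.75],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890, "", 1160, "", 1370],
    "sqft_lot": [5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6560, 9796, 6000, 19901, 9680],
})
filled = impute_with_regression(df, "sqft_living", ["bedrooms", "bathrooms", "sqft_lot"])
print(filled.loc[[10, 12], "sqft_living"])
```

Selecting the features by name rather than by position (iloc) keeps the function working if the column order changes, and working on a copy leaves the original DataFrame untouched.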

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1