How to get the P Value in a Variable from OLSResults in Python?

The OLSResults of

import pandas as pd
import statsmodels.api as sm
from statsmodels.api import add_constant

df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
print(fit.summary())

shows the P values of each attribute to only 3 decimal places.

I need to extract the p-value for each attribute (Distance, CarrierNum, etc.) and print it in scientific notation.

I can extract the coefficients using fit.params[0], fit.params[1], etc.

I need the same for the p-values.

Also, what does it mean if all the p-values are 0?



Solution 1:[1]

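Use fit.pvalues[i], where i is the index of the term in the model: fit.pvalues[0] for the intercept, fit.pvalues[1] for Distance, and so on. Because X is a DataFrame, fit.pvalues is a pandas Series, so you can also index it by column name, e.g. fit.pvalues['Distance'], or by position with fit.pvalues.iloc[1]. Recent pandas versions deprecate positional lookup with a plain integer on a labelled Series, so the last two forms are the safer ones.

You can also list all the attributes of an object with dir(<object>), e.g. dir(fit).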

Solution 2:[2]

Instead of using fit.summary(), you could loop over fit.pvalues to print the p-values of all your features/attributes as follows:

import pandas as pd
import statsmodels.api as sm
from statsmodels.api import add_constant

df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
# One p-value per column of X, including the constant; :.3e prints scientific notation
numberOfAttributes = len(fit.pvalues)
for attributeIndex in range(numberOfAttributes):
    print(f"{fit.pvalues.iloc[attributeIndex]:.3e}")

==========================================================================

Also what does all P values being 0 mean?

It might be a good outcome. Note that the summary rounds p-values to three decimal places, so a displayed 0.000 means the value is below 0.0005, not that it is exactly zero. The p-value for each term tests the null hypothesis that its coefficient (b1, b2, ..., bn) is equal to zero, i.e. that the term has no effect in the fitted equation y = b0 + b1x1 + b2x2 + ... A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable (y).

On the other hand, a larger (insignificant) p-value suggests that there is not enough evidence that changes in the predictor are associated with changes in the response.
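To see why very different p-values can all show up as 0 in a table rounded to three decimals, here is a small sketch using only Python string formatting. The p-values are hypothetical numbers chosen for illustration, not results from the model above:

```python
# Hypothetical p-values, chosen only to illustrate the formatting.
p_values = {"Distance": 3.2e-41, "CarrierNum": 4.9e-4, "Day": 0.2174}

rows = {}
for name, p in p_values.items():
    rounded = f"{p:.3f}"      # three decimal places, as in a rounded table
    scientific = f"{p:.3e}"   # scientific notation keeps the magnitude
    rows[name] = (rounded, scientific)
    print(f"{name:<12}{rounded:>8}{scientific:>12}")
```

Both Distance and CarrierNum print as 0.000 in the rounded column, although they differ by dozens of orders of magnitude; only the scientific-notation column shows that.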

Solution 3:[3]

I have used this solution:

import pandas as pd
import statsmodels.api as sm
from statsmodels.api import add_constant

df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
model = sm.OLS(Y, X).fit()

# Build a DataFrame of feature names and their p-values, sorted in ascending order,
# so the features with the smallest p-values appear at the top.
d = {}
for i in X.columns.tolist():
    d[i] = model.pvalues[i]

df_pvalue = pd.DataFrame(list(d.items()), columns=['Var_name', 'p-Value']).sort_values(by='p-Value').reset_index(drop=True)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  xanjay
Solution 2  Marcos Pacheco Jr
Solution 3  Suhas_Pote