Getting model parameters from regression in PySpark in an efficient way for large data

I have created a function that applies OLS regression and returns only the model parameters. I used groupby with applyInPandas, but it is taking too much time. Is there a more efficient way to do this?

Note: I didn't have a natural column to group by, since all the features have many levels, but applyInPandas can't be used without a groupby, so I created a dummy column 'group' with the constant value 1.

Code

import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

df = spark.createDataFrame(pdf)

result_schema = StructType([
    StructField('index', StringType()),
    StructField('coef', DoubleType())
])

def ols(pdf):
    # Fit OLS (no intercept) of z on x and y for one group
    y = pdf[['z']]
    X = pdf[['x', 'y']]
    model = sm.OLS(y, X).fit()

    # Return the fitted coefficients as an index/coef table
    param_table = pd.DataFrame(model.params, columns=['coef']).reset_index()
    return param_table

# add a dummy constant column so groupby can be applied
df = df.withColumn('group', lit(1))

# apply the function per group
data = df.groupby('group').applyInPandas(ols, schema=result_schema)
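For context, grouping on a constant shuffles every row to a single executor, so the same fit can be sketched locally with plain pandas/NumPy once the data is collected. This is a minimal sketch, assuming the data fits in memory; `np.linalg.lstsq` stands in for `sm.OLS` here, since both solve the same least-squares problem without an intercept:

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

X = pdf[['x', 'y']].to_numpy()
y = pdf['z'].to_numpy()

# Least-squares fit without an intercept, matching sm.OLS(y, X).fit().params
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
param_table = pd.DataFrame({'index': ['x', 'y'], 'coef': coef})
print(param_table)  # coef ≈ [0.18324607, 0.77068063]
```

With a Spark DataFrame, the equivalent would be `ols(df.toPandas())`, which skips the groupby shuffle entirely when there is only one group.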

Final output sample

index   coef
x       0.183246073
y       0.770680628
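One direction for a more efficient approach (a sketch, not tested at scale): the OLS coefficients have the closed form β = (XᵀX)⁻¹Xᵀy, and XᵀX and Xᵀy are tiny (k×k and k×1 for k features), so only those aggregates need to reach the driver rather than the full dataset. The algebra, shown here in NumPy on the sample data:

```python
import numpy as np

# Sample data from the question: features x, y and target z
X = np.array([[3, 0], [6, 1], [2, 2], [0, 0], [1, 1],
              [5, 5], [2, 2], [3, 3], [4, 4], [5, 5]], dtype=float)
y = np.array([2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6])

# Normal equations: beta = (X'X)^-1 X'y
# X'X (2x2) and X'y (2x1) can be computed as distributed sums/aggregations,
# so only these small matrices ever need to be collected.
xtx = X.T @ X
xty = X.T @ y
beta = np.linalg.solve(xtx, xty)
print(beta)  # ≈ [0.18324607, 0.77068063]
```

In Spark, the per-row outer products XᵀX and Xᵀy could be accumulated with an aggregation (or via `pyspark.ml.regression.LinearRegression`, which implements this kind of distributed solve), avoiding the single-executor bottleneck of grouping on a constant.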


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
