PySpark on Colab: linear regression fails with Py4JJavaError: An error occurred while calling o651.fit

I'm on Colab Pro with this SparkContext configuration:

SparkContext.setSystemProperty('spark.executor.memory', '16g')
SparkContext.setSystemProperty('spark.driver.memory', '45G')

I have a pyspark.sql.dataframe.DataFrame with these columns:

 |-- title: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- ratingLevel: string (nullable = true)
 |-- ratingDescription: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- user_rating_score: float (nullable = true)
 |-- user_rating_size: integer (nullable = true)

I'm trying to fit a linear regression with:

train, test = df_netflix.randomSplit([0.8, 0.2], seed=42)

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.regression import LinearRegression

tokenizer = Tokenizer(inputCol="title", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="user_rating_score", maxIter=10, regParam=0.3, elasticNetParam=0.8)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fitting the model
model = pipeline.fit(train)

It raises this error:

Py4JJavaError: An error occurred while calling o530.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7) (7cf4fb6d3af8 executor driver): scala.MatchError: [null,1.0,(262144,[109503,141652],[1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

In the df_netflix DataFrame, the user_rating_score column contained NA values. I deleted them, but the error persists.

Thank you.
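The row printed in the MatchError, [null,1.0,(262144,[109503,141652],[1.0,1.0])], looks like a (label, weight, features) triple whose label is null. That suggests at least one row reaching LinearRegression still has a null user_rating_score, even after the NA cleanup. The following is a minimal sketch, not a confirmed fix: it assumes the nulls are still present in df_netflix at the point of the split. The helper name drop_null_labels is hypothetical; the only Spark call it relies on is DataFrame.na.drop(subset=...).

```python
def drop_null_labels(df, label_col="user_rating_score"):
    """Return df without the rows whose label column is null.

    Hypothetical helper for illustration. It assumes `df` is a
    pyspark.sql.DataFrame; DataFrame.na.drop(subset=[...]) removes every
    row that has a null in any of the listed columns.
    """
    return df.na.drop(subset=[label_col])
```

With the pipeline from the question, this could be applied before the split, for example train, test = drop_null_labels(df_netflix).randomSplit([0.8, 0.2], seed=42), followed by pipeline.fit(train) as before. Running df_netflix.filter(df_netflix.user_rating_score.isNull()).count() first would show whether any null labels actually remain.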



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
