Saving trained pyspark pipeline

The pipeline consists of one-hot encoding and a min-max scaler:

stages = asmbler + mm_scaler + str_indexer + ohe

pp_pl = Pipeline(stages=stages).fit(X)

After fitting the model, I'm trying to save it for later use.

The documentation (https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines) says this is possible, but I could not find a guide there. According to "Pyspark ML - How to save pipeline and RandomForestClassificationModel", I can save it by executing the following:

pp_pl.save(path)

But no matter which path I try, I cannot save it. (I've tried multiple paths; some produce an error saying the file already exists.)

Py4JJavaError: An error occurred while calling o6061.save.
: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset
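This message is commonly reported when Spark runs on Windows without a local Hadoop directory containing bin\winutils.exe. As a rough diagnostic, and only under that assumption, the sketch below checks the environment before calling save. The helper name check_hadoop_home is hypothetical and is not part of pyspark.

```python
import os
from pathlib import Path


def check_hadoop_home(env=None):
    """Return a short diagnosis of the HADOOP_HOME setting.

    `env` defaults to the real process environment; passing a dict
    makes the function easy to try out in isolation.
    """
    env = os.environ if env is None else env
    home = env.get("HADOOP_HOME")
    if not home:
        return "HADOOP_HOME is unset"
    # Assumption: on Windows, Hadoop looks for bin\winutils.exe under HADOOP_HOME.
    winutils = Path(home) / "bin" / "winutils.exe"
    if os.name == "nt" and not winutils.is_file():
        return f"HADOOP_HOME is set but {winutils} is missing"
    return f"HADOOP_HOME is set to {home}"


print(check_hadoop_home())
```

If this reports that the variable is unset, setting HADOOP_HOME before the Spark session is created is the usual next thing to try.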

I don't understand why we don't give the file a type, like .pkl. Also, where does the path start from?
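For context on both points: save writes a directory (metadata plus one subdirectory per stage), not a single file, so no extension such as .pkl is needed. A path without a scheme is resolved by Spark's configured default filesystem, which in a plain local setup usually means relative to the working directory. A minimal sketch, assuming a local filesystem and a fitted model such as pp_pl. The helper names as_local_uri and save_and_reload are hypothetical.

```python
from pathlib import Path


def as_local_uri(path):
    """Turn a local path into an absolute, explicit file:// URI."""
    return Path(path).resolve().as_uri()


def save_and_reload(fitted_pipeline, path):
    """Save a fitted PipelineModel to a directory and load it back."""
    # Imported lazily so as_local_uri can be used without Spark installed.
    from pyspark.ml import PipelineModel

    uri = as_local_uri(path)
    # overwrite() replaces an existing directory instead of raising
    # a "path already exists" error.
    fitted_pipeline.write().overwrite().save(uri)
    return PipelineModel.load(uri)


# Shows where a relative path would end up on the local filesystem.
print(as_local_uri("models/pp_pl"))
```

With the pipeline from the question, the call would look like save_and_reload(pp_pl, "models/pp_pl"). The explicit file:// URI removes any doubt about where the path starts.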

apache-spark pyspark

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
