Saving a trained PySpark pipeline

The pipeline consists of one-hot encoding and a min-max scaler:

from pyspark.ml import Pipeline

# each name is a list of stage objects, so + concatenates them
stages = asmbler + mm_scaler + str_indexer + ohe

pp_pl = Pipeline(stages=stages).fit(X)

After fitting the model, I'm trying to save it for later use.

The documentation at https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines tells me this is possible, but gives no guide. The question "Pyspark ML - How to save pipeline and RandomForestClassificationModel" says I can save it by executing the following:

pp_pl.save(path)

But no matter which path I try, I cannot save it. (I've tried multiple paths; some produce an error saying the file already exists.)

Py4JJavaError: An error occurred while calling o6061.save.
: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset
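
For the attempts that failed with a "file already exists" error, one likely cause is that a plain save refuses to write into a directory that is already there. Below is a minimal sketch, assuming the fitted model exposes the usual write().overwrite().save(path) chain of pyspark.ml writers. The helper name save_fitted_pipeline is hypothetical and not part of PySpark.

```python
def save_fitted_pipeline(model, path, overwrite=True):
    """Persist a fitted pyspark.ml PipelineModel to the directory `path`.

    Hypothetical helper: it only wraps the writer calls on the model.
    """
    writer = model.write()  # writer object for the fitted pipeline
    if overwrite:
        # replace an existing output directory instead of raising an error
        writer = writer.overwrite()
    writer.save(path)
    return path
```

With the model above this would be called as save_fitted_pipeline(pp_pl, path). It does not address the HADOOP_HOME error, which appears to come from the environment rather than from the chosen path.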

I don't understand why we don't give a file type such as .pkl. Also, where does the path start from?
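
On these two points: save writes a directory (with metadata and stages subdirectories) rather than a single file, which is why no extension such as .pkl is given. A path without a scheme is, as far as I understand, resolved against Spark's default filesystem, which in local mode means local disk relative to the driver's working directory. Below is a minimal sketch that makes the location explicit by building an absolute file:// URI. The directory name models/pp_pl and both helper names are hypothetical.

```python
from pathlib import Path


def local_model_uri(relative_dir):
    """Turn a relative directory name into an absolute file:// URI."""
    # resolve() anchors the path at the current working directory
    return Path(relative_dir).resolve().as_uri()


def load_fitted_pipeline(path):
    """Reload a pipeline that was saved earlier with save(path)."""
    # imported lazily so the sketch can be imported without Spark installed
    from pyspark.ml import PipelineModel
    return PipelineModel.load(path)


model_dir = local_model_uri("models/pp_pl")  # hypothetical directory name
print(model_dir)
```

The resulting string can be passed to pp_pl.save(...) and later to load_fitted_pipeline(...). The HADOOP_HOME message itself is commonly reported on Windows, where Spark's Hadoop libraries expect HADOOP_HOME to point at a Hadoop installation containing winutils.exe. That is an assumption about the setup here, since the post does not state the operating system.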



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source