Add fitted model as PipelineStage in Spark ML Pipeline
I have a fitted word2vec model that I want to use in various projects.
That is, I created a Word2Vec Estimator and fitted it to my dataset. This gives me a Word2VecModel, which I can save. How can I now add this model to a pipeline?
Preferably, I would still like to be able to "fit" the pipeline without the Word2VecModel being re-fitted. But this last part is optional.
Ideally I would want to do this in PySpark, but this is also optional.
Solution 1:[1]
Just add it as is. A fitted model is a Transformer, and Pipeline.fit() only fits the Estimator stages. For example, if you have
from pyspark.ml.feature import Word2VecModel
w2vmodel = Word2VecModel.load(...)
you can
from pyspark.ml import Pipeline
Pipeline(stages=[w2vmodel]).fit(df).transform(df)
Solution 2:[2]
Not having to re-fit Word2Vec is quite simple. Fit whichever other estimators you need in a Pipeline object pipe1, then create a PipelineModel object with the loaded Word2VecModel as the first stage and the fitted pipe1_model as the second stage.
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Word2VecModel, StringIndexer, OneHotEncoder
data = spark.read...
# Load the already fitted Word2Vec model; it is never re-fitted
w2vmodel = Word2VecModel.load(...)
# Fit only the remaining estimators
pipe1 = Pipeline(stages=[StringIndexer(...), OneHotEncoder(...)])
pipe1_model = pipe1.fit(data)
# Combine the fitted stages into a single transformer
fitted_pipeline = PipelineModel(stages=[w2vmodel, pipe1_model])
Now you can use fitted_pipeline to transform() your data while keeping your Word2Vec intact.
If you need Word2Vec somewhere in the middle of your pipeline, break the remaining stages into separate pipelines (one before Word2Vec and one after it), fit each of them, and compose your final object with PipelineModel.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user11116275 |
| Solution 2 |
