Add fitted model as PipelineStage in Spark ML Pipeline

I have a fitted word2vec model that I want to use in various projects.

That is, I created a Word2Vec Estimator and fitted it to my dataset. This gives me a Word2VecModel, which I can save. How can I now add this model to a pipeline?

Preferably, I would still like to be able to "fit" the pipeline without the Word2VecModel being re-fitted. But this last part is optional.

Ideally I would want to do this in pyspark. But this is also optional.

apache-spark word2vec

Solution 1:^[1]

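Just add it as is. A Word2VecModel is a Transformer, and Pipeline.fit only fits Estimator stages, so the model is used unchanged. For example, if you have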

from pyspark.ml.feature import Word2VecModel 

w2vmodel = Word2VecModel.load(...)

you can

from pyspark.ml import Pipeline

Pipeline(stages=[w2vmodel]).fit(df).transform(df)
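
To illustrate why this works, here is a minimal pure-Python sketch of the fitting rule. It is a simplified toy, not Spark's implementation: the classes FrozenLookup, MeanCenter, MeanCenterModel and the function fit_pipeline are hypothetical names invented for this example. The idea it mirrors is that a stage with fit() is fitted, while an already-fitted stage is passed through untouched.

```python
class FrozenLookup:
    """Stands in for a pre-fitted model such as Word2VecModel: transform() only."""

    def __init__(self, table):
        self.table = table

    def transform(self, rows):
        return [self.table.get(r, 0) for r in rows]


class MeanCenterModel:
    """Fitted result of MeanCenter: subtracts the learned mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [r - self.mean for r in rows]


class MeanCenter:
    """Stands in for an estimator: fit() learns the mean and returns a model."""

    def fit(self, rows):
        return MeanCenterModel(sum(rows) / len(rows))


def fit_pipeline(stages, rows):
    """Fit stages in order; stages without fit() are kept exactly as given."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)  # estimator -> fitted model
        fitted.append(stage)
        rows = stage.transform(rows)  # output feeds the next stage
    return fitted


lookup = FrozenLookup({"a": 1, "b": 3})
fitted = fit_pipeline([lookup, MeanCenter()], ["a", "b"])
```

After the call, fitted[0] is the very same lookup object that went in, and only the MeanCenter stage has been replaced by a fitted model.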

Solution 2:^[2]

Not having to re-fit Word2Vec is quite simple. Fit whichever other estimators you need in a Pipeline object pipe1, then create a PipelineModel object with the Word2VecModel as the first stage and the fitted pipe1 (pipe1_model) as the second stage.

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Word2VecModel, StringIndexer, OneHotEncoder

data = spark.read...

w2vmodel = Word2VecModel.load(...)

pipe1 = Pipeline(stages=[StringIndexer(...), OneHotEncoder(...)])

pipe1_model = pipe1.fit(data)

fitted_pipeline = PipelineModel(stages=[w2vmodel, pipe1_model])

Now you can use fitted_pipeline to transform() your data while keeping your Word2Vec intact.

If you need Word2Vec somewhere in the middle of your pipeline, break pipe1 into multiple pieces, fit each piece separately, and compose your final object with PipelineModel.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	user11116275
Solution 2
