Can PySpark ML models be run on only part of a dataframe, depending on a condition?

I have trained a logistic regression algorithm to match job titles and descriptions to a set of 4 digit numeric codes, and it does this very well. It will form part of a pipeline that first attempts to match the data by joining to a reference database; that join leaves some rows of the dataframe matched to a 4 digit code and the rest with a dummy code indicating they still need to be matched. The state of my dataframe just before I run the ML algorithm is therefore

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [
        ('1', 'BUTCHER', 'MEAT PERSON', '-1'),
        ('2', 'BAKER', 'BREAD AND PASTRY AND CAKE', '1468'),
        ('3', 'CANDLESTICK MAKER', 'LET THERE BE LIGHT', '-1')
    ],
    [
        'ID',
        'COLUMN_TO_VECTORIZE_1',
        'COLUMN_TO_VECTORIZE_2',
        'RESULTS_COLUMN'
    ]
)

where '-1' is the dummy code for 'as yet unmatched' and 'BAKER' has been matched to '1468' already.

If I wanted to match the whole dataframe, I would go on to write

data = pretrained_model_pipeline.transform(data) # vectorizes/assembles feature column, runs ML algorithm

# other code to perform index to string conversion on labels, and place labels into RESULTS_COLUMN
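For reference, pretrained_model_pipeline is fitted elsewhere; a simplified sketch of how such a pipeline might be put together (the stage choices and the labelled_training_data name are illustrative, not my exact code):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, StringIndexer, Tokenizer, VectorAssembler

# Illustrative only: tokenize and hash each text column, assemble one feature
# vector, index the 4 digit codes as the label, then fit a logistic regression.
stages, feature_cols = [], []
for c in ['COLUMN_TO_VECTORIZE_1', 'COLUMN_TO_VECTORIZE_2']:
    stages += [
        Tokenizer(inputCol=c, outputCol=c + '_TOKENS'),
        HashingTF(inputCol=c + '_TOKENS', outputCol=c + '_TF', numFeatures=1024),
    ]
    feature_cols.append(c + '_TF')

stages += [
    VectorAssembler(inputCols=feature_cols, outputCol='features'),
    StringIndexer(inputCol='RESULTS_COLUMN', outputCol='label'),
    LogisticRegression(featuresCol='features', labelCol='label'),
]

# labelled_training_data: a dataframe of already coded examples (hypothetical name)
pretrained_model_pipeline = Pipeline(stages=stages).fit(labelled_training_data)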

But that runs the whole dataframe through the algorithm, and anything previously matched does not need to be matched again.

I COULD create a new dataframe from the old one, selecting just those rows with '-1' in the 'RESULTS_COLUMN', but my limited understanding of Spark says that this would essentially double my memory usage.

Is there a way for the pretrained model to be given the whole dataframe to transform, but with some sort of mask telling it to skip rows with '-1' in the results?



Solution 1:[1]

Spark evaluates transformations lazily and doesn't load the whole dataframe into memory unless you explicitly tell it to (for example by caching), so filtering before the transform does not duplicate your data. Just this is enough:

from pyspark.sql import functions as F

data = pretrained_model_pipeline.transform(data.where(F.col('RESULTS_COLUMN') != '-1'))
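
If you then want the already matched rows and the newly predicted rows back in a single dataframe, a minimal sketch (assuming the index-to-string step writes its output to a column called PREDICTED_LABEL, a placeholder name) is:

from pyspark.sql import functions as F

# Sketch: transform only the unmatched rows, then stitch the dataframe back
# together. 'PREDICTED_LABEL' is a placeholder for whatever column the
# IndexToString step writes the predicted 4 digit code to.
matched = data.where(F.col('RESULTS_COLUMN') != '-1')
unmatched = data.where(F.col('RESULTS_COLUMN') == '-1')

predicted = (
    pretrained_model_pipeline.transform(unmatched)
    .withColumn('RESULTS_COLUMN', F.col('PREDICTED_LABEL'))
    .select(*data.columns)  # drop the feature/prediction columns so schemas match
)

data = matched.unionByName(predicted)

Both branches are built from the same lazily evaluated dataframe, so nothing is materialised twice, and unionByName lines the columns up by name rather than by position.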

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pltc