Transformer operating on multiple features in pyspark.ml

I want to write my own feature transformer for a DataFrame that adds a column computed from other columns, for example the difference between two of them. I followed this question, but the transformer there operates on one column only. pyspark.ml.Transformer takes a single string as the argument for inputCol, so I cannot specify multiple columns.

So basically, what I want to achieve is a _transform() method that resembles this one:

def _transform(self, dataset):
    out_col = self.getOutputCol()
    in_col = dataset.select([self.getInputCol()])

    # Define transformer logic
    def f(col1, col2):
        return col1 - col2
    t = IntegerType()

    return dataset.withColumn(out_col, udf(f, t)(in_col))

How can I do this?



Solution 1:[1]

You don't need to go through all this trouble to operate on multiple columns. Here's a better approach using HasInputCols (instead of HasInputCol):

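from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCols, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

class MeasurementDifferenceTransformer(Transformer, HasInputCols, HasOutputCol):
    @keyword_only
    def __init__(self, inputCols=None, outputCol=None):
        super(MeasurementDifferenceTransformer, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCols=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, dataset):
        out_col = self.getOutputCol()
        in_col = self.getInputCols()  # list of input column names

        # Define transformer logic: difference of the two input columns
        def f(col1, col2):
            return float(col1 - col2)
        t = FloatType()

        # Wrap f itself in the UDF and unpack the column names,
        # so f receives one value per input column
        return dataset.withColumn(out_col, udf(f, t)(*in_col))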
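The key step in _transform is unpacking the list of input column names into the positional arguments of the two-argument function f. Below is a minimal, Spark-free sketch of that per-row behaviour. The helper apply_to_row, the sample row, and the column names a, b, and c are hypothetical and only illustrate the idea; the check for exactly two columns is an assumption added here and is not part of the transformer above.

```python
def difference(col1, col2):
    # Same per-row logic as f in the transformer above.
    return float(col1 - col2)


def apply_to_row(func, row, input_cols):
    """Hypothetical helper: call func with the values of input_cols, in order."""
    if len(input_cols) != 2:
        raise ValueError(
            "expected exactly 2 input columns, got %d" % len(input_cols)
        )
    # The * unpacks the list into positional arguments, one per column.
    return func(*[row[name] for name in input_cols])


row = {"a": 10, "b": 4, "c": 7}  # made-up sample row

print(apply_to_row(difference, row, ["a", "b"]))  # prints 6.0
print(apply_to_row(difference, row, ["b", "a"]))  # prints -6.0
```

The order of the names matters, since the first one becomes col1 and the second col2. With a Spark DataFrame df that has numeric columns a and b (again, assumed names), the transformer would be applied in the same spirit as MeasurementDifferenceTransformer(inputCols=["a", "b"], outputCol="diff").transform(df).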

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wen Yao