Transformer operating on multiple features in pyspark.ml
I want to write my own feature transformer for a DataFrame that adds a column computed from other columns, for example the difference between two of them. I followed this question, but the transformer there operates on one column only. pyspark.ml.Transformer takes a string as the argument for inputCol, so of course I cannot specify multiple columns.
So basically, what I want to achieve is a _transform() method that resembles this one:
    def _transform(self, dataset):
        out_col = self.getOutputCol()
        in_col = dataset.select([self.getInputCol()])

        # Define transformer logic
        def f(col1, col2):
            return col1 - col2

        t = IntegerType()
        return dataset.withColumn(out_col, udf(f, t)(in_col))
How can I do this?
Solution 1:[1]
You don't need to go through all this trouble to operate on multiple columns. A better approach is to use HasInputCols (instead of HasInputCol):
    from pyspark import keyword_only
    from pyspark.ml import Transformer
    from pyspark.ml.param.shared import HasInputCols, HasOutputCol
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    class MeasurementDifferenceTransformer(Transformer, HasInputCols, HasOutputCol):
        @keyword_only
        def __init__(self, inputCols=None, outputCol=None):
            super(MeasurementDifferenceTransformer, self).__init__()
            kwargs = self._input_kwargs
            self.setParams(**kwargs)

        @keyword_only
        def setParams(self, inputCols=None, outputCol=None):
            kwargs = self._input_kwargs
            return self._set(**kwargs)

        def _transform(self, dataset):
            out_col = self.getOutputCol()
            in_col = self.getInputCols()  # list of input column names

            # Define transformer logic
            def f(col1, col2):
                return float(col1 - col2)

            t = FloatType()
            # Unpack the names so the UDF receives one argument per input column
            return dataset.withColumn(out_col, udf(f, t)(*in_col))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wen Yao |
