Create Pyspark pandas column with a list of tuples
I am attempting to do this in Pyspark's Pandas API:
Given a pyspark pandas dataframe:
```
Date        CustomerId  OPS
2018-01-01  1           0.095216
2018-01-08  1           0.204250
2018-01-15  1           0.287580
2018-01-22  1           0.725796
2018-01-29  1           0.802698
```
I want to take lags of the OPS column and create a new column that holds all the lagged OPS values for the current row.
I was able to generate the desired values with this code:

```python
list(zip(*[df_1['OPS'].shift(i).to_numpy() for i in range(1, 3)]))
```
which has the output:

```
[(nan, nan),
 (0.0952159330335296, nan),
 (0.20424959772341578, 0.0952159330335296),
 (0.2875797248969043, 0.20424959772341578),
 (0.7257961722702307, 0.2875797248969043)]
```
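For reference, the same construction does work in plain pandas, which accepts a Python list of tuples as an object-dtype column. A minimal sketch, rebuilding the sample frame from the values in the question:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question in plain pandas.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]
    ),
    "CustomerId": [1, 1, 1, 1, 1],
    "OPS": [0.095216, 0.204250, 0.287580, 0.725796, 0.802698],
})

# Zip the lag-1 and lag-2 series row by row into tuples.
lags = list(zip(*[df["OPS"].shift(i).to_numpy() for i in range(1, 3)]))

# Plain pandas stores the tuples as an object column; pyspark.pandas
# instead raises the AssertionError shown below, since it cannot map
# a Python list of tuples to a Spark column type this way.
df["Vectorized Features"] = lags
```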
However, when I try to set this result as a column of the dataframe, I get the following error:
```python
df_1['Vectorized Features'] = list(zip(*[df_1['OPS'].shift(i).to_numpy() for i in range(1, 3)]))
```

```
~/opt/anaconda3/lib/python3.8/site-packages/pyspark/pandas/frame.py in assign_columns(psdf, this_column_labels, that_column_labels)
  11789     psdf: DataFrame, this_column_labels: List[Label], that_column_labels: List[Label]
  11790 ) -> Iterator[Tuple["Series", Label]]:
> 11791     assert len(key) == len(that_column_labels)
  11792     # Note that here intentionally uses `zip_longest` that combine
  11793     # that_columns.
AssertionError:
```
Is there a way to make this work, so that I can create a pyspark pandas column from a list of tuples?
Thanks!
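One workaround sketch (my suggestion, not from the question): materialize each lag as its own numeric column with `.shift()`, which `pyspark.pandas` does support, and only combine the lags at the end. Shown here in plain pandas for brevity; the column names and `n_lags` below are illustrative, and in `pyspark.pandas` you may prefer to keep the lags as separate numeric columns rather than tuples, since that is the Spark-friendly representation:

```python
import pandas as pd

# Sample OPS values from the question.
df = pd.DataFrame({"OPS": [0.095216, 0.204250, 0.287580, 0.725796, 0.802698]})

n_lags = 2  # illustrative choice, matching range(1, 3) in the question

# One plain numeric column per lag; Series.shift() also exists in
# pyspark.pandas, so this part translates directly.
for i in range(1, n_lags + 1):
    df[f"OPS_lag_{i}"] = df["OPS"].shift(i)

# Optionally collapse the lag columns into one tuple-valued column.
# This apply works in plain pandas; whether the tuple column is useful
# downstream in Spark depends on how the features are consumed.
lag_cols = [f"OPS_lag_{i}" for i in range(1, n_lags + 1)]
df["Vectorized Features"] = df[lag_cols].apply(tuple, axis=1)
```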
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow