Create Pyspark pandas column with a list of tuples

I am attempting to do this in Pyspark's Pandas API:

Given a pyspark pandas dataframe:

Date    CustomerId  OPS
2018-01-01  1   0.095216
2018-01-08  1   0.204250
2018-01-15  1   0.287580
2018-01-22  1   0.725796
2018-01-29  1   0.802698

I want to take lags of the OPS column and create a new column that holds all the lagged OPS values for the current row.

I was able to generate the desired values with this code:

list(zip(*[df_1['OPS'].shift(i).to_numpy() for i in range(1, 3)]))

which has the output:

[(nan, nan),
 (0.0952159330335296, nan),
 (0.20424959772341578, 0.0952159330335296),
 (0.2875797248969043, 0.20424959772341578),
 (0.7257961722702307, 0.2875797248969043)]
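For reference, the shift-and-zip expression can be reproduced in plain pandas (a minimal sketch; the frame is rebuilt from the sample data above so the snippet runs without a Spark session):

```python
import pandas as pd

# Rebuild the sample frame from the question in plain pandas,
# so no Spark session is needed.
df_1 = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-15",
                            "2018-01-22", "2018-01-29"]),
    "CustomerId": [1] * 5,
    "OPS": [0.095216, 0.204250, 0.287580, 0.725796, 0.802698],
})

# shift(i) moves OPS down by i rows (filling the top with NaN), and
# zip(*...) pairs the shifted arrays row-wise, so row k holds
# (OPS[k-1], OPS[k-2]).
lags = list(zip(*[df_1["OPS"].shift(i).to_numpy() for i in range(1, 3)]))
```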

However, when I try to set this result as a column of the dataframe, I get the following error:

df_1['Vectorized Features'] = list(zip(*[df_1['OPS'].shift(i).to_numpy() for i in range(1, 3)]))
~/opt/anaconda3/lib/python3.8/site-packages/pyspark/pandas/frame.py in assign_columns(psdf, this_column_labels, that_column_labels)
  11789                 psdf: DataFrame, this_column_labels: List[Label], that_column_labels: List[Label]
  11790             ) -> Iterator[Tuple["Series", Label]]:
> 11791                 assert len(key) == len(that_column_labels)
  11792                 # Note that here intentionally uses `zip_longest` that combine
  11793                 # that_columns.

AssertionError: 

Is there a way to make this work, i.e. to create a pyspark pandas column from a list of tuples?
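One possible workaround, sketched here with plain pandas only (pyspark.pandas also provides `Series.shift` and `DataFrame.apply`, though `apply` over rows may need a return-type hint and behave differently depending on version): materialize each lag as its own column first, then combine the lag columns row-wise into tuples, so the assignment is a Series rather than a Python list of tuples.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "OPS": [0.095216, 0.204250, 0.287580, 0.725796, 0.802698],
})

# Materialize each lag as its own scalar column; Series.shift is
# supported by both plain pandas and pyspark.pandas.
n_lags = 2
lag_cols = []
for i in range(1, n_lags + 1):
    col = f"OPS_lag_{i}"  # hypothetical column names for illustration
    df_1[col] = df_1["OPS"].shift(i)
    lag_cols.append(col)

# Combine the lag columns row-wise into tuples. apply(tuple, axis=1)
# yields a Series of tuples, which can be assigned as a column.
df_1["Vectorized Features"] = df_1[lag_cols].apply(tuple, axis=1)
```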

Thanks!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow