'Is there an Apache Arrow equivalent of the Spark Pandas UDF
Spark provides a few different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the cogrouped version that takes two (co-grouped) Pandas DataFrames as input and returns a third.
For efficient translation between Spark DataFrames and Pandas DataFrames, Spark uses the Apache Arrow memory layout, however transformation is still required to go from Arrow to Pandas and back. I would really like to access the Arrow data directly, as this is how I will ultimately be working with the data in the UDF (using Polars).
It seems wasteful to go from Spark -> Arrow -> Pandas -> Arrow (Polars) on the way in and the reverse on the return.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
