'Is there an Apache Arrow equivalent of the Spark Pandas UDF

Spark provides a few different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the cogrouped version that takes two (co-grouped) Pandas DataFrames as input and returns a third.

For efficient translation between Spark DataFrames and Pandas DataFrames, Spark uses the Apache Arrow memory layout, however transformation is still required to go from Arrow to Pandas and back. I would really like to access the Arrow data directly, as this is how I will ultimately be working with the data in the UDF (using Polars).

It seems wasteful to go from Spark -> Arrow -> Pandas -> Arrow (Polars) on the way in and the reverse on the return.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source