PySpark's "DataFrameLike" type vs pandas.DataFrame

Spark 3.1 introduced type hints for Python (hooray!), but I am puzzled as to why the return type of the toPandas method is "DataFrameLike" instead of pandas.DataFrame. See here: https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.pyi

Because of this, mypy throws all sorts of errors if I try to use any pandas DataFrame method on the result of calling toPandas. For example:

df = spark_df.toPandas()
df.to_csv(out_path, index=False)

results in the error message

error: "DataFrameLike" has no attribute "to_csv" 

What's going on here?



Solution 1:[1]

I believe this issue is fixed by this recent commit (dated Dec 22, 2021): https://github.com/apache/spark/commit/a70006d9a7b578721d152d0f89d1a894de38c25d

At runtime, if you call .toPandas() and print the type of the result, it does give you a pandas DataFrame.

To read more about it, since your link is broken, here's the source code for DataFrameLike

So make sure you update PySpark to the latest version.
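If upgrading is not possible right away, one possible workaround is to tell the type checker explicitly what the object is. The following is only a minimal sketch: as_pandas is a hypothetical helper name, not part of PySpark or pandas, and it assumes the value passed in really is a pandas DataFrame at runtime. typing.cast does no checking or conversion.

```python
from typing import TYPE_CHECKING, Any, cast

if TYPE_CHECKING:
    # Imported only for the type checker; nothing is imported at runtime.
    import pandas as pd


def as_pandas(obj: Any) -> "pd.DataFrame":
    """Hypothetical helper: present obj to the type checker as a pandas.DataFrame.

    typing.cast performs no runtime check or conversion; obj is returned unchanged.
    """
    return cast("pd.DataFrame", obj)


# Usage, assuming spark_df and out_path from the question:
# df = as_pandas(spark_df.toPandas())
# df.to_csv(out_path, index=False)
```

Because the cast is purely a static annotation, it should become unnecessary once the installed PySpark stubs declare the return type as a pandas DataFrame.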

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
