PySpark's "DataFrameLike" type vs pandas.DataFrame
Spark 3.1 introduced type hints for Python (hooray!), but I am puzzled as to why the return type of the toPandas method is "DataFrameLike" instead of pandas.DataFrame. See here: https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.pyi
Because of this, mypy throws all sorts of errors if I try to use any pandas DataFrame method on the object returned by toPandas. For example:
df = spark_df.toPandas()
df.to_csv(out_path, index=False)
results in the error message
error: "DataFrameLike" has no attribute "to_csv"
What's going on here?
Solution 1:[1]
I believe this issue is fixed by this recent commit (dated Dec 22, 2021): https://github.com/apache/spark/commit/a70006d9a7b578721d152d0f89d1a894de38c25d
If you call .toPandas() and print the type of the result, you will see that it is in fact a pandas DataFrame.
To read more about it, since your link is broken, here's the source code for DataFrameLike
So make sure you update PySpark to the latest version.
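If upgrading is not an option, one possible workaround is to override the declared type with typing.cast. The sketch below is only an illustration: to_pandas_typed is a hypothetical helper name, not part of PySpark, and it assumes the object returned by toPandas really is a pandas DataFrame at runtime, as described above. cast does nothing at runtime; it only changes what mypy believes the type to be.

```python
from typing import TYPE_CHECKING, Any, cast

if TYPE_CHECKING:
    # Imported only for the type checker; nothing is imported at runtime.
    import pandas as pd


def to_pandas_typed(spark_df: Any) -> "pd.DataFrame":
    """Call toPandas() and tell mypy the result is a pandas.DataFrame.

    Hypothetical helper, not part of PySpark. cast() performs no
    conversion or check; it only overrides the statically declared type.
    """
    return cast("pd.DataFrame", spark_df.toPandas())
```

With this helper, the example from the question would become df = to_pandas_typed(spark_df) followed by df.to_csv(out_path, index=False), and mypy should no longer report the missing attribute. The parameter is typed as Any to keep the sketch independent of the installed PySpark version; in real code you would probably annotate it with the Spark DataFrame type instead.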
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
