Issues while trying to export a pyspark pandas dataframe to CSV in PySpark
df=df_full[df_full.part_col.isin(['part_a','part_b'])]
df=df[df.some_other_col =='some_value']
#df has shape of roughly 240k,200
#df_full has shape of roughly 30m, 200
df.to_pandas().reset_index().to_csv('testyyy.csv',index=False)
Any groupby operation I run is remarkably fast. The problem arises when I try to export a small subset of this large dataset to CSV: the export eventually succeeds, but it takes far too long.
Warnings:
2022-05-08 13:01:15,948 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[column_name] = series
Note: part_a and part_b are stored as two separate partitioned parquet files. I am using pyspark.pandas on Spark 3+.
So the question is: what is happening, and what is the most efficient way to export the filtered dataframe to CSV?
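One possible direction, offered as a sketch and not a confirmed fix: `to_pandas()` collects every row on the driver, and the WindowExec warning is likely caused by the `sequence` default index, which numbers rows with a window that has no partitioning. The snippet below assumes that switching `compute.default_index_type` to `distributed` and letting Spark write the CSV itself is acceptable for this workload. The helper names `use_distributed_index` and `export_csv_with_spark` and the output directory name are hypothetical.

```python
def use_distributed_index():
    """Switch pandas-on-Spark to an index that needs no single-partition window."""
    # Imported lazily; this call requires a working Spark installation.
    import pyspark.pandas as ps

    # Assumption: the "No Partition Defined for Window operation" warning
    # comes from the 'sequence' default index being computed.
    ps.set_option("compute.default_index_type", "distributed")


def export_csv_with_spark(psdf, out_dir, num_files=1):
    """Write a pandas-on-Spark DataFrame to CSV from the executors.

    Passing a path to to_csv makes Spark write a directory of part files,
    so the rows are not collected on the driver as they are with to_pandas().
    """
    # Repartition to control how many part files end up in out_dir.
    psdf.spark.repartition(num_files).to_csv(out_dir)
    return out_dir


# Hypothetical usage with the names from the question:
# use_distributed_index()   # call before df_full is created
# export_csv_with_spark(df, "testyyy_csv")
```

Note that the result is a directory containing part files, not a single `testyyy.csv`, and the index is not written unless `index_col` is passed to `to_csv`. Whether this is faster here depends on the cluster and storage, so it is worth timing against the original `to_pandas()` approach.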
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
