Visualise grouped line plots in PySpark

I have this DataFrame (sample below) and I am using PySpark in Databricks. I would like a line plot of DATE vs BALANCE for each ID, all in one single frame.

+-------------+-------------+----------------+
|         DATE|      BALANCE|              ID|
+-------------+-------------+----------------+
|   2021-07-01|     81119.73|         Ax3838J|
|   2021-07-02|     81119.73|         Ax3838J|
|   2021-07-03|     81119.73|         Ax3838J|
|   2021-07-04|     81289.62|         Ax3838J|
|   2021-07-05|     81385.62|         Ax3838J|
|   2021-07-02|     81249.76|         Bz3838J|
|   2021-07-03|     81249.76|         Bz3838J|
|   2021-07-04|     81249.76|         Bz3838J|
|   2021-07-05|     81324.28|         Bz3838J|
|   2021-07-06|     81329.28|         Bz3838J|
+-------------+-------------+----------------+

I can plot a single ID, but I have more than 10000 unique IDs. How can I visualise multiple line plots, segmented by ID? Also, is there a smart way to visualise the whole DataFrame at once?

DF_single.toPandas().plot.line(x='DATE', y='BALANCE')

[Image: DATE vs BALANCE for a single ID]

Note: The image is for one particular ID from the actual dataset.



Solution 1:[1]

You can pivot your pandas DataFrame to turn the ID labels into separate columns, each containing that ID's BALANCE values, like this:

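(
    DF_single  # use the DataFrame that holds every ID you want to plot
    .toPandas()  # collect the Spark DataFrame into pandas on the driver
    .pivot_table(index='DATE', columns='ID', values='BALANCE')  # one column per ID
    .plot()  # one line per column
)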
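To see what the pivot produces, here is a minimal, self-contained sketch that applies it to the sample rows from the question. It uses plain pandas so it runs without a Spark session; the names pdf and wide are made up for this illustration, and in Databricks pdf would be the result of calling .toPandas() on the Spark DataFrame.

```python
import pandas as pd

# Sample rows from the question, in long format (one row per DATE/ID pair).
pdf = pd.DataFrame({
    "DATE": ["2021-07-01", "2021-07-02", "2021-07-03", "2021-07-04", "2021-07-05",
             "2021-07-02", "2021-07-03", "2021-07-04", "2021-07-05", "2021-07-06"],
    "BALANCE": [81119.73, 81119.73, 81119.73, 81289.62, 81385.62,
                81249.76, 81249.76, 81249.76, 81324.28, 81329.28],
    "ID": ["Ax3838J"] * 5 + ["Bz3838J"] * 5,
})

# Wide format: DATE becomes the index and each ID becomes its own column.
wide = pdf.pivot_table(index="DATE", columns="ID", values="BALANCE")
print(wide)
```

Dates for which an ID has no row show up as NaN in that ID's column.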

The pivot_table function aggregates the values passed in values, so if your DataFrame has more than one value for a DATE/ID pair, you can choose the appropriate aggregation function and pass it through the aggfunc parameter (e.g. aggfunc=np.mean or aggfunc='mean'; the default is 'mean'). From the way you posed your question, you probably have only one value per DATE/ID pair, so aggfunc doesn't really matter in your case, but it's important to understand what pivot_table is doing.
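As a rough illustration of the aggregation, the sketch below uses made-up data (not from the question) in which one DATE/ID pair appears twice, and compares the default 'mean' with aggfunc='max'. The names dup, by_mean and by_max are hypothetical.

```python
import pandas as pd

# Hypothetical data: the pair (2021-07-01, Ax3838J) appears twice.
dup = pd.DataFrame({
    "DATE": ["2021-07-01", "2021-07-01", "2021-07-02"],
    "ID": ["Ax3838J", "Ax3838J", "Ax3838J"],
    "BALANCE": [100.0, 200.0, 300.0],
})

# Default aggregation ('mean') averages the duplicated pair.
by_mean = dup.pivot_table(index="DATE", columns="ID", values="BALANCE")

# A different aggfunc keeps the largest value instead.
by_max = dup.pivot_table(index="DATE", columns="ID", values="BALANCE", aggfunc="max")

print(by_mean.loc["2021-07-01", "Ax3838J"])  # 150.0
print(by_max.loc["2021-07-01", "Ax3838J"])   # 200.0
```

With exactly one row per DATE/ID pair, both calls would return the original values unchanged.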

Also, pandas's plot function plots lines by default, using the columns as separate series and the index as the x-axis, so there's no need to specify anything else =)
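A small sketch of that default behaviour, assuming matplotlib is installed (pandas plotting depends on it). The wide DataFrame here is hand-built for illustration rather than taken from the real dataset, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Hypothetical wide-format data: index = DATE, one column per ID.
wide = pd.DataFrame(
    {
        "Ax3838J": [81119.73, 81119.73, 81289.62],
        "Bz3838J": [81249.76, 81249.76, 81324.28],
    },
    index=pd.to_datetime(["2021-07-02", "2021-07-03", "2021-07-04"]),
)
wide.index.name = "DATE"

# No x= or y= arguments: the index is the x-axis and each column is one line.
ax = wide.plot()
ax.set_ylabel("BALANCE")
```

With many ID columns the legend can get unwieldy; passing legend=False to plot() is one way to hide it.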

You can check the doc for the pivot_table function here:

https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

Hope that helps! Good luck =)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Luis Marcanth