How to add a new column from Grouper value_counts and use in line plots?
I have a dataframe imported from Excel similar to this:
Date ID Chemical
2021-01-01 1 water
2021-01-01 1 acid
2021-01-03 3 water
2021-03-04 5 soda
2021-03-04 5 soda
2021-05-03 6 water
2021-05-03 6 soda
2021-05-05 8 soda
I am trying to plot a series of line plots (one per chemical type) showing the count of each chemical per month as a function of time (counts on the y-axis, months on the x-axis). So I think I want the above table to look like this:
Chemical Date Count
water 2021-01-31 2
2021-03-31 0
2021-05-31 1
acid 2021-01-31 1
2021-03-31 0
2021-05-31 0
soda 2021-01-31 0
2021-03-31 2
2021-05-31 2
So far I've managed to remove duplicates for the same ID number (not shown in my example) and I've got my data to look like the above but missing the "Count" heading. This has made it so I can't set the y-axis to "Count" for plotting purposes.
This is the code I've tried so far:
import numpy as np
import pandas as pd
import re
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
df_Test = pd.read_excel('Example.xlsx',
                        usecols="A:F", sheet_name='Data')
df_Test1 = df_Test.drop_duplicates(subset=["ID", "Chemical"], keep="first")
df_Test2 = df_Test1.copy()
df_Test2.loc[:, "Date"] = pd.to_datetime(df_Test2.loc[:, "Date"])
df_Test2["Chemical"].value_counts()
df_Test2.groupby(pd.Grouper(key="Date", freq="M"))["Chemical"].value_counts()
df_Test3 = df_Test2.groupby(["Chemical", pd.Grouper(key="Date", freq="M")])["Chemical"].value_counts()
print(df_Test3)
sns.lineplot(x="Date", y="Chemical", data=df_Test3)
plt.show()
This gives me the following output, and I know the plot is wrong because I'm not sure how to set the y-axis value.
Chemical Date Chemical
water 2021-01-31 water 2
2021-03-31 water 0
2021-05-31 water 1
acid 2021-01-31 acid 1
2021-03-31 acid 0
2021-05-31 acid 0
soda 2021-01-31 soda 0
2021-03-31 soda 2
2021-05-31 soda 2
How can I get the new count data to become a labeled column in the dataframe and plot it as a function of time? Also, is there a way to add missing months, so that the chemical would plot as zero for that month?
Thank you!
Solution 1:[1]
I think I managed to give you a result for the first part of your question: convert the dates to monthly periods, then group by chemical and month and count the IDs.
df = pd.DataFrame(
{
"Date": [
"2021-01-01",
"2021-01-01",
"2021-01-03",
"2021-03-04",
"2021-03-04",
"2021-05-03",
"2021-05-03",
"2021-05-05",
],
"ID": [1, 1, 3, 5, 5, 6, 6, 8],
"Chemical": ["water", "acid", "water", "soda", "soda", "water", "soda", "soda"],
}
)
df["Date"] = pd.to_datetime(df["Date"])
df["Date_month"] = df["Date"].dt.to_period("M")
out = df.groupby(["Chemical", "Date_month"])["ID"].count()
print(out)
Chemical Date_month
acid 2021-01 1
soda 2021-03 2
2021-05 2
water 2021-01 2
2021-05 1
Name: ID, dtype: int64
If you want it to be a df again, just add .reset_index() at the end of out.
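To get the "Count" heading the question asks for, the counts column can be named while resetting the index. The following is a minimal sketch using the same sample data; the names `counts` and `Count` are my own choices, and the extra `Date` column is only an assumption about what a plotting library will accept on the x-axis.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame(
    {
        "Date": [
            "2021-01-01", "2021-01-01", "2021-01-03", "2021-03-04",
            "2021-03-04", "2021-05-03", "2021-05-03", "2021-05-05",
        ],
        "ID": [1, 1, 3, 5, 5, 6, 6, 8],
        "Chemical": ["water", "acid", "water", "soda", "soda", "water", "soda", "soda"],
    }
)
df["Date"] = pd.to_datetime(df["Date"])
df["Date_month"] = df["Date"].dt.to_period("M")

# reset_index(name=...) turns the grouped Series into a DataFrame
# and gives the counts column an explicit label.
counts = (
    df.groupby(["Chemical", "Date_month"])["ID"]
    .count()
    .reset_index(name="Count")
)

# Periods are not always accepted as plot coordinates, so also keep
# a plain timestamp column (first day of each month).
counts["Date"] = counts["Date_month"].dt.to_timestamp()
print(counts)
```

With a frame shaped like this, something along the lines of `sns.lineplot(data=counts, x="Date", y="Count", hue="Chemical")` should draw one line per chemical, though months with no rows are still absent at this stage.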
As for the other part, filling the missing months with a fill_value of 0: I just didn't get that done, sorry.
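For the zero-filling part of the question, one possible approach (a sketch, not tested against the real Excel data) is to pivot the counts into a wide table with `unstack(fill_value=0)`, reindex it over every month between the first and last date, and then melt it back to long form. The names `wide`, `all_months`, and `full` are hypothetical, and I am assuming that "missing months" means every calendar month in the range, including ones with no rows at all.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame(
    {
        "Date": [
            "2021-01-01", "2021-01-01", "2021-01-03", "2021-03-04",
            "2021-03-04", "2021-05-03", "2021-05-03", "2021-05-05",
        ],
        "ID": [1, 1, 3, 5, 5, 6, 6, 8],
        "Chemical": ["water", "acid", "water", "soda", "soda", "water", "soda", "soda"],
    }
)
df["Date"] = pd.to_datetime(df["Date"])
df["Date_month"] = df["Date"].dt.to_period("M")

# Wide table: one row per month, one column per chemical.
# Month/chemical pairs that never occur become 0 instead of NaN.
wide = (
    df.groupby(["Date_month", "Chemical"])["ID"]
    .count()
    .unstack("Chemical", fill_value=0)
)

# Add rows for months that have no data at all (February and April here).
all_months = pd.period_range(wide.index.min(), wide.index.max(), freq="M")
wide = wide.reindex(all_months, fill_value=0).rename_axis("Date_month")

# Back to long form with a labeled "Count" column.
full = wide.reset_index().melt(
    id_vars="Date_month", var_name="Chemical", value_name="Count"
)
print(full)
```

The wide table can be plotted directly, for example with `wide.plot(marker="o")`, which should give one line per chemical. For seaborn, `full` has the long shape it expects, but the period column probably needs converting first with `full["Date_month"].dt.to_timestamp()`.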
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rabinzel |
