Count rolling window unique values with duplicate dates in pandas
If I have a pandas DataFrame like this:
| date | person_active |
|---|---|
| 22/2 | John |
| 22/2 | Marie |
| 22/2 | Mark |
| 23/2 | John |
| 24/2 | Mark |
| 24/2 | Marie |
how do I count the unique values in person_active over a time-based rolling window? For example, with a 2-day rolling window the result should look like this:
| date | person_active | people_active |
|---|---|---|
| 22/2 | John | 3 |
| 22/2 | Marie | 3 |
| 22/2 | Mark | 3 |
| 23/2 | John | 3 |
| 24/2 | Mark | 3 |
| 24/2 | Marie | 3 |
The main issue is that each date appears in several rows, one per person, so a simple df.rolling('2d', on='date').count() won't do the job: it counts rows rather than distinct people.
EDIT: Please consider how the solution behaves on a big dataset and how the compute time will scale. Ideally it should be usable in a real-world environment, so an approach that takes too long to compute is not that useful.
Solution 1:[1]
IIUC, try:
import pandas as pd
# convert to datetime if needed
df["date"] = pd.to_datetime(df["date"], format="%d/%m")
# convert string names to categorical codes for numerical aggregation
df["people"] = pd.Categorical(df["person_active"]).codes
# compute the rolling unique count; rows that share a date can see only part of
# that date's rows, so take the maximum per date
df["people_active"] = (
    df.rolling("2D", on="date")["people"]
    .agg(lambda x: x.nunique())
    .groupby(df["date"])
    .transform("max")
)
# drop the unnecessary helper column
df = df.drop("people", axis=1)
>>> df
        date person_active  people_active
0 1900-02-22          John            3.0
1 1900-02-22         Marie            3.0
2 1900-02-22          Mark            3.0
3 1900-02-23          John            3.0
4 1900-02-24          Mark            3.0
5 1900-02-24         Marie            3.0
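Since the question asks about scaling, here is a minimal sketch of a variant that avoids calling a Python lambda for every window. It is an illustration under assumptions, not a benchmarked result: the helper name rolling_unique_people is hypothetical, and the approach builds a dates × people table, so memory use grows with the number of distinct people.

```python
import pandas as pd


def rolling_unique_people(df, window="2D"):
    """Count distinct person_active values per date over a trailing time window.

    Hypothetical helper; assumes df["date"] is already a datetime column.
    """
    # One row per date, one column per person; each cell counts that
    # person's rows on that date. The index is sorted by date.
    presence = pd.crosstab(df["date"], df["person_active"])
    # A person is active in the window if their rolling row count is positive.
    active = presence.rolling(window).sum().gt(0)
    # Number of active people per date.
    per_date = active.sum(axis=1)
    # Broadcast the per-date count back onto the original rows.
    return df["date"].map(per_date)


# Sample data from the question.
df = pd.DataFrame({
    "date": ["22/2", "22/2", "22/2", "23/2", "24/2", "24/2"],
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m")
df["people_active"] = rolling_unique_people(df)
print(df)
```

On the sample data every row gets a count of 3, matching the expected output in the question. Whether this is faster in practice depends on the number of distinct dates and people, so it is worth timing both versions on real data.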
Solution 2:[2]
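Group by date, count the unique values per day, then sum over the rolling window. This needs date to be a datetime column, and because it adds up daily counts, a person active on more than one day of a window is counted once per day: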
df.groupby('date').nunique().rolling('2d').sum()
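A minimal sketch of what this one-liner returns on the question's sample data, assuming date has been parsed with pd.to_datetime. The person_active column is selected explicitly here (an adjustment to the one-liner) so that the result is a Series.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": ["22/2", "22/2", "22/2", "23/2", "24/2", "24/2"],
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m")

# Distinct people per calendar day: 3, 1, 2.
daily_unique = df.groupby("date")["person_active"].nunique()
# Sum of the daily counts over a trailing 2-day window: 3, 4, 3.
summed = daily_unique.rolling("2D").sum()
print(summed)
```

The value for 23/2 is 4 rather than 3, because John appears on both 22/2 and 23/2 and contributes to each day's count. The result is therefore an upper bound on the number of distinct people in the window, and it equals the true count only when nobody repeats across days.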
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | not_speshal |
| Solution 2 | Always Right Never Left |
