'Convert pandas dataframe from cumulative time series to absolute change time series
I have imported a data table from an SQL database. Each row of this table contains the response of a query to the twitter api, which is run quite frequently. The stats are cumulative. So if on the first call there are 20 impressions, and by the time we make another call the impressions have increased by 5, the response will return [20, 25].
What I want to do is group this data by day, taking the last entry for that day (this makes sense since the table is auto-incremented). However, for each day, I want to find the absolute change in the number of impressions for each post. Therefore I can get a global picture of how the account is performing over time. I have found a solution, but it seems very cumbersome. I am wondering if there is a better way to achieve this. Here is a sample of my initial dataset:
social_post_id twitter_impression_count created
0 52 6.0 2022-04-14 15:40:07
1 52 11.0 2022-04-14 16:00:08
2 52 11.0 2022-04-14 16:20:08
3 52 11.0 2022-04-14 16:40:05
4 52 11.0 2022-04-14 17:00:08
... ... ... ...
945 142 8.0 2022-05-19 09:20:17
946 142 8.0 2022-05-19 09:40:10
947 142 8.0 2022-05-19 10:00:12
948 142 8.0 2022-05-19 10:20:11
949 142 8.0 2022-05-19 10:40:12
This is the code I have used to transform my data:
day_list = sorted(list(set(df['created'].dt.date)))
current_day = df[df['created'].dt.date==day_list[0]].drop_duplicates(subset='social_post_id', keep='last').set_index('social_post_id')
for day in day_list[1:]:
next_day = df[df['created'].dt.date==day].drop_duplicates(subset='social_post_id', keep='last').set_index('social_post_id')
for index in next_day.index:
if index in current_day.index:
next_day.loc[index, 'twitter_impression_count'] = next_day.loc[index, 'twitter_impression_count'] - current_day.loc[index, 'twitter_impression_count'].sum()
current_day = pd.concat([current_day, next_day])
This outputs my table in the correct format, but as I said, I don't feel like this is a very optimum solution...
current_day.reset_index()
social_post_id twitter_impression_count created
0 52 18.0 2022-04-14 23:40:08
1 52 4.0 2022-04-15 23:20:07
2 55 8.0 2022-04-15 23:40:07
3 52 1.0 2022-04-16 15:20:08
4 60 7.0 2022-04-16 23:40:08
... ... ... ...
182 122 48.0 2022-05-18 12:20:11
183 82 5.0 2022-05-18 17:40:13
184 72 2.0 2022-05-18 21:00:11
185 142 6.0 2022-05-18 23:40:16
186 142 2.0 2022-05-19 10:40:12
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
