'Aggregate dataframe with many columns per week
I have a large dataframe with many columns and a timestamp (see picture) What I am trying to do is to aggregate the data per week for all columns. Any suggestions?
Kind regards, Daniel
Solution 1:[1]
Without a minimal reproducible example it's difficult to know exactly which function to use to summarize the data (sum, mean, median, etc.).
For now we'll assume that each row represents a day or some more granular unit (since the date column is called Timestamp and we can't see whether there are actual time values in the field).
We use a combination of tidyr, dplyr and lubridate to create a summarized data frame that sums the data in the columns.
First, we generated some raw data that is in a format similar to the data in the screen shot, and read it into R.
rawData <- "Timestamp,Var.2,Amazonas,Antioquia,Arauca
2022-01-01,0,0,0,1
2022-01-02,0,0,1,3
2022-01-03,0,1,1,2
2022-01-04,0,0,1,0
2022-01-05,0,2,0,0
2022-01-06,3,0,2,2
2022-01-07,2,3,0,2
2022-01-08,1,0,0,0
2022-01-09,0,1,3,0
2022-01-10,0,0,0,0
2022-01-11,0,2,0,5
2022-01-12,0,0,3,0
2022-01-13,0,3,0,4
2022-01-14,0,0,4,0
2022-01-15,0,0,0,3
2022-01-16,0,0,0,0
2022-01-17,0,3,0,0
2022-01-18,0,0,2,3
2022-01-19,0,0,0,0
2022-01-20,0,2,0,0
2022-01-21,0,0,5,2
2022-01-22,0,0,0,0
2022-01-23,0,1,0,0
2022-01-24,0,0,3,1
2022-01-25,0,1,0,1
2022-01-26,0,0,0,1
2022-01-27,0,2,0,0
2022-01-28,0,2,0,1
2022-01-29,0,0,1,0
2022-01-30,0,0,1,0"
df <- read.csv(text = rawData,
colClasses = c("Date","numeric","numeric","numeric","numeric"))
Next, we load the required libraries. From lubridate package we'll use the year() and week() functions to group the data by week of the year.
library(lubridate)
library(tidyr)
library(dplyr)
Finally, we use tidyr::pivot_longer() to create long format tidy data where each row represents one day's observations for one column in the wide format data frame, create the Year and Week columns, and summarise() the remaining columns in the data frame.
df %>% pivot_longer(-Timestamp,names_to="Area") %>%
mutate(Year = year(Timestamp),
Week = week(Timestamp)) %>%
group_by(Year,Week,Area) %>%
summarise(summedValue = sum(value)) -> summarisedData
head(summarisedData)
...and the first few rows of output:
> head(summarisedData)
# A tibble: 6 × 4
# Groups: Year, Week [2]
Year Week Area summedValue
<dbl> <dbl> <chr> <dbl>
1 2022 1 Amazonas 6
2 2022 1 Antioquia 5
3 2022 1 Arauca 10
4 2022 1 Var.2 5
5 2022 2 Amazonas 6
6 2022 2 Antioquia 10
>
If we need the data in the original format (wide format tidy data), we can use pivot_wider() to restore the data to its original shape.
# if necessary, pivot_wider() to restore data to original format
summarisedData %>%
pivot_wider(id_cols=c("Year","Week"),
names_from=Area,
values_from=summedValue)
...and the output:
# A tibble: 5 × 6
# Groups: Year, Week [5]
Year Week Amazonas Antioquia Arauca Var.2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2022 1 6 5 10 5
2 2022 2 6 10 9 1
3 2022 3 5 7 8 0
4 2022 4 6 3 4 0
5 2022 5 0 2 0 0
>
Solution 2:[2]
Couple of things:
[1] Never provide pictures of data; provide data.
[2] The answer depends on how you define "week". For instance, 2022-01-01 was a Saturday. The ISO 8601 standard defines a week to begin on Monday, so week 1 in 2022 started on Jan 3. Jan 1 & 2 were technically part of the last week in 2021. The week(...) function in lubridate counts in 7 day increments from whatever is the first day of the year. So obviously you will get different answers.
Here is a very simple way to do this using data.table, borrowing the minimum reproducible example kindly provided by @LenGreski (you should do this).
library(data.table)
setDT(df)[, lapply(.SD, sum), by=.(year=year(Timestamp), week=isoweek(Timestamp))]
## year week Var.2 Amazonas Antioquia Arauca
## 1: 2022 52 0 0 1 4
## 2: 2022 1 6 7 7 6
## 3: 2022 2 0 5 7 12
## 4: 2022 3 0 6 7 5
## 5: 2022 4 0 5 5 4
The first clause, setDT(df), converts df to a data.table. The remaining part applies the sum(...) function to a subset of the columns in the data.table referenced by .SD. The default subset is all columns except ones referenced in the by=... clause (so, all columns except Timestamp).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | jlhoward |
