'Lagged cumulative sum for time series based off of multiple conditions
I'd like to get the cumulative sum of the corresponding records in the smaller column for each name under Species_a and Species_b as two new columns, and have them in the same row without including the value for that row. the smaller column lists which species column has a smaller width.
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
1 versicolor virginica 2.5 3.0 2022-05-05 a
2 versicolor virginica 2.6 2.8 2022-04-04 a
3 versicolor setosa 2.2 4.4 2021-03-03 a
4 setosa virginica 4.2 2.5 2021-02-02 b
5 virginica setosa 3.0 3.4 2020-01-01 a
Ideally the format of the data would be in the same format as it is now, and the summation would be based off of the smaller, Date, Species_a, and Species_b columns alone. I tried to create a count column but I get stuck on properly accumulating based on Date being less than the current value for that column.
My desired output would be as follows:
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller smaller_sum_a smaller_sum_b
1 versicolor virginica 2.5 3.0 2022-05-05 a 2 2
2 versicolor virginica 2.6 2.8 2022-04-04 a 1 2
3 versicolor setosa 2.2 4.4 2021-03-03 a 0 0
4 setosa virginica 4.2 2.5 2021-02-02 b 0 1
5 virginica setosa 3.0 3.4 2020-01-01 a 0 0
Code:
library(tidyverse)
set.seed(12)
data_a <- iris[sample(1:nrow(iris)), ] %>%
head()
colnames(data_a) <- paste0(colnames(data_a), "_a")
data_b <- iris[sample(1:nrow(iris)), ] %>%
tail()
colnames(data_b) <- paste0(colnames(data_b), "_b")
data <- bind_cols(data_a, data_b) %>%
filter(Species_a != Species_b) %>%
select(Species_a,
Species_b,
Sepal.Width_a,
Sepal.Width_b) %>%
mutate(Date = c('2022-05-05', '2022-04-04', '2021-03-03', '2021-02-02', '2020-01-01'),
smaller = ifelse(Sepal.Width_a > Sepal.Width_b, 'b',
ifelse(Sepal.Width_a < Sepal.Width_b, 'a', NA)))
Solution 1:[1]
I don't know if this is a solution, but it might be a start.
How exactly are the new columns calculated? Looks like smaller_sum_a is the number of consecutive rows where species a has the smaller value, minus one. But the same doesn't work for smaller_sum_b I don't think? Or is it just cumulative number of days where each species is has the smaller value, minus 1, but with zero if the species isn't smaller in that row (again this doesn't check out for smaller_sum_b though...).
As for determining if Date is less than the current value, firstly you'll want to tell R that your Date column is actually a date, rather than just a character.
Easiest way to see what format it is in is to make your data (not a good name for your data btw, preferably make it something that R or the computer wouldn't use, like my_data) a tibble rather than a data.frame. tibbles tell you what format each column is in which is handy.
data %>%
tibble
# # A tibble: 5 x 6
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
# <fct> <fct> <dbl> <dbl> <chr> <chr>
# 1 versicolor virginica 2.5 3 2022-05-05 a
# 2 versicolor virginica 2.6 2.8 2022-04-04 a
# 3 versicolor setosa 2.2 4.4 2021-03-03 a
# 4 setosa virginica 4.2 2.5 2021-02-02 b
# 5 virginica setosa 3 3.4 2020-01-01 a
The bits inside the < > under the column names tell you the formats, <fct> is factor, <dbl> is numeric (see here for explanation) and <chr> is character.
So, we want to make Date into a date format, which we can do with the ymd() (year-month-day) function from lubridate. Also, I rearranged the data so the rows are in chronological order (earliest at the top), because that's how things are normally arranged, and makes more sense to me, especially if you're interested in cumulative sums.
data %>%
tibble %>%
mutate(
Date = ymd(Date)
) %>%
arrange(Date) %>%
{. ->> my_data}
my_data
# # A tibble: 5 x 6
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
# <fct> <fct> <dbl> <dbl> <date> <chr>
# 1 virginica setosa 3 3.4 2020-01-01 a
# 2 setosa virginica 4.2 2.5 2021-02-02 b
# 3 versicolor setosa 2.2 4.4 2021-03-03 a
# 4 versicolor virginica 2.6 2.8 2022-04-04 a
# 5 versicolor virginica 2.5 3 2022-05-05 a
We can see that R now recognises that the Date column is a date, and is now in the R-recognised <date> format.
Now this is where I'm not 100% sure on exactly how you want to calculate your new columns, but for example you can use ifelse() to determine if species a is smaller, and then calculate the cumulative sum of the days where it was smaller.
my_data %>%
mutate(
s_a = ifelse(smaller == 'a', 1, 0),
smaller_sum_a = cumsum(s_a),
)
# # A tibble: 5 x 8
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller s_a smaller_sum_a
# <fct> <fct> <dbl> <dbl> <date> <chr> <dbl> <dbl>
# 1 virginica setosa 3 3.4 2020-01-01 a 1 1
# 2 setosa virginica 4.2 2.5 2021-02-02 b 0 1
# 3 versicolor setosa 2.2 4.4 2021-03-03 a 1 2
# 4 versicolor virginica 2.6 2.8 2022-04-04 a 1 3
# 5 versicolor virginica 2.5 3 2022-05-05 a 1 4
As long as either a) the Date column is in an R-recognised <date> format, or b) it is arranged chronologically, you can use the less than or greater than operators < & > to determine if dates are before/after a given row.
This is a good resource for understanding how R treats dates and times, and is well worth a read https://r4ds.had.co.nz/dates-and-times.html
Solution 2:[2]
Here is my current solution, I'd like to not use plyr if I can help it since I heard it breaks some of dplyr's functions. I feel like there is definitely a more efficient and modern way of solving this issue but I can't seem to find it.
library(plyr)
library(lubridate)
# creating counts for smaller sums for red side
data$Date <- lubridate::parse_date_time(x = data$Date, # standardizing date (outside of the reproducible example there are two date types)
orders = c("%m/%d/%Y", "%Y-%m-%d"))
A_rn <- mutate(filter(select(data,
Species_a,
Date,
smaller),
smaller == 'a'),
smaller_ct_a = 1)
# creating counts for smaller sums for b
BtoA_rn <- mutate(filter(select(data,
Species_b,
Date,
smaller),
smaller == 'b'), # calling Species_b Species_a for easier joining
Species_a = Species_b,
smaller_ct_a = 1) %>%
select(Species_a, Date, smaller, smaller_ct_a)
# cumsum for both a and b
A <- ddply(bind_rows(A_rn, BtoA_rn) %>%
arrange(Date),
.(Species_a), transform,
smaller_sum_a = lag(cumsum(replace_na(smaller_ct_a, 0)))) %>%
select(-smaller_ct_a)
# naming adjustment
B <- A %>% filter(smaller == "b") %>%
select(-smaller)
names(B) <- gsub(x = names(B), pattern = "_a", replacement = "_b")
A <- A %>% filter(smaller == "a") %>%
select(-smaller)
data <- left_join(data, A, by = c("Species_a", "Date")) %>%
left_join(B, by = c("Species_b", "Date"))
data[is.na(data)] <- 0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | hugh-allan |
| Solution 2 |
