'R, interval, mutate
I have a dataset about animals.
library(tidyverse)
a <- c("Date", "Specie", "Number")
b <- c("2020-01-01", "Dog", "3")
c <- c("2020-01-02", "Dog", "4")
d <- c("2020-01-03", "Dog", "5")
e <- c("2020-01-04", "Dog", "6")
f <- c("2020-01-01", "Cat", "3")
g <- c("2020-01-02", "Cat", "7")
h <- c("2020-01-03", "Cat", "8")
i <- c("2020-01-04", "Cat", "10")
df <- as.data.frame(rbind(b, c, d, e, f, g, h, i))
names(df) <- a
df$Date <- as.Date(df$Date)
df$Number <- as.integer(df$Number)
start <- as.Date("2020-01-02")
end <- as.Date("2020-01-04")
df %>%
filter(Date >= start & Date <= end) %>%
group_by(Specie) %>%
summarise(new = prod(10 + Number), .groups = "drop")
The goal is to create a new variable that gives me: (using tidyverse) For each specie, between 2020-01-02 and 2020-01-04 (included), I want a new variable that is the product of (10+number of dead animals that day). For-example, for dogs it would be (10+4)(10+5)(10+6). Same for all specie.
Please note that for some specie, I don't have the number of dead animals during all the days of the interval.
- Is dropping them the best option? If yes, how do you do it.
Note that the code filters and hence keeps only my dataset for the dates specify. I want to return the output that the code delivers but in my original dataset.
- That is, the output I get should be a new variable (mutate) for all species. And not a subset of my dataset. I did a left-join to merge the original dataset with the new (filtered) one. It works, but I think there's a more efficient way.
Thank's for your help much appreciated.
Solution 1:[1]
If there are some numbers missing in Number, I could think of three ways for handling those missing values:
- Set them to
NAand useprod(..., na.rm = TRUE)to remove them in the calculation of the product. - Set them to
0and useprod(..., na.rm = FALSE)to at least inflate the product by a factor of10. - If you want to preserve some mean (arithmetic or geometric) of the factors (10+a_i), set the missing values to that mean minus 10.
In cases 2 and 3 you can for example give a lower bound for the product: prod >= 10^n. (Taking the logarithm to base 10 on both sides yields log(prod) = sum(log) >= n.) But maybe you want to reserve case 2 for those rows that really have zero dead animals.
Regarding your second point, use mutate with ifelse to flag your wanted dates (instead of filter) and additionally group by this new flag. Then use again mutate instead of summarise.
df %>% mutate(
new = ifelse(Date >= start & Date <= end, 1, NA)
) %>% group_by(
Specie,
new
) %>% mutate(
new = ifelse(!is.na(new), prod(10 + Number, na.rm = FALSE), new)
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | iln17 |
