'R: Create numbering within each group
The data that I have:
x = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012)
)
The goal is to create the 'number' variable which shows the same number for each unique ID in sequence starting from 1.
goal = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012),
number = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)
And then if within each ID group, the studies are incomplete (e.g., for number = 2, the studies are only A and B, instead of A, B, C), then how to remove the obs associated with that ID (e.g., remove obs that have a number of '2')?
Thanks!
Updated follow-up question on part B:
Once we have the goal dataset, I would like to remove the obs grouped by ID, that meet the following requirements in terms of the study variable:
A and D are required, one of B and C is required (so either B or C), and sometimes each letter will appear more than once.
x = tibble(
study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A", "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
ID = c(001, 001, 001, 001, 005, 005, 007, 007, 007, 012, 012, 012, 012, 012, 013, 013, 013, 018, 018, 018),
number = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6)
)
So in the goal dataset above, I would like to remove:
(1) Obs #5 and 6 which share a group number of 2, because they don't have A, B or C, and D in the study variable.
(2) Obs #18, 19, 20 which share a group number of 6, for the same reason as (1).
I would like to keep the rest of the obs because within each number group, they have A, B or C, and D. I cannot use filter(n() > 3) here, because that would delete obs with the number 5.
Solution 1:[1]
We could use cur_group_id()
library(dplyr)
x %>%
group_by(ID) %>%
mutate(number = cur_group_id())
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
OR
library(dplyr)
x %>%
mutate(number = cumsum(ID != lag(ID, default = first(ID)))+1)
study ID number
<chr> <dbl> <dbl>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
Solution 2:[2]
A and D are required, one of B and C is required (so either B or C)
df %>%
group_by(ID) %>%
mutate(
flag =
(
any(study %in% c("A")) &
any(study %in% c("D"))
) &
(
any(study %in% c("B")) |
any(study %in% c("C"))
)
) %>%
filter(flag)
# A tibble: 12 × 4
# Groups: ID [3]
study ID number flag
<chr> <dbl> <dbl> <lgl>
1 A 1 1 TRUE
2 B 1 1 TRUE
3 C 1 1 TRUE
4 D 1 1 TRUE
5 A 12 4 TRUE
6 B 12 4 TRUE
7 C 12 4 TRUE
8 D 12 4 TRUE
9 D 12 4 TRUE
10 A 13 5 TRUE
11 B 13 5 TRUE
12 D 13 5 TRUE
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | rdelrossi |
