'R - Filtering rows - Keep data between two repeated values
I would like to keep only rows before and after a specific values of a column in my data frame. My data frame below has a structure, where you can see that I have some sort of blocks if I can say so. For example, the data I am interested with always (or most of the time to be specific) starts at group and ends at section, anything else outside I would like to remove.
# Sample Data
df <- data.frame(
Name = c ("x1","NA","group","Jean","Philippe","Celine","Dion","section","NA",
"y2","z1","NA","group","Rosa","Albert","Stromae","section","NA","abc","something",
"group","Sam","Liz"),
value = as.character(seq(1:23))
)
df
Name value
1 x1 1
2 NA 2
3 group 3
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
8 section 8
9 NA 9
10 y2 10
11 z1 11
12 NA 12
13 group 13
14 Rosa 14
15 Albert 15
16 Stromae 16
17 section 17
18 NA 18
19 abc 19
20 something 20
21 group 21
22 Sam 22
23 Liz 23
Since the block group:section does not always have the same information, I don't know how can I tell R to keep rows between group andsection even if they are repeated. I only came up with this, which just keeps the rows the first time R sees group andsection.
df[which(df$Name=="group")[1]:which(df$Name=="section")[1],]
Name value
3 group 3
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
8 section 8
Update: Also, sometimes in my data I will have a block that starts with group but does not have an ending section. I would like to keep this information too. Based on your solutions, I added a row with section everytime I don't have it, then apply what you proposed. I don't know if there is another way to take into account this case without adding a new row to the data.
The desired output would be
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
14 Rosa 14
15 Albert 15
16 Stromae 16
22 Sam 22
23 Liz 23
Thank you guys in advance for your help.
Solution 1:[1]
There are multiple instances of those cases. So, we may need to loop to get the : as : is not vectorized
i1 <- which(df$Name %in% 'group')
i2 <- which(df$Name %in% 'section')
df[unlist(Map(`:`, i1+1, i2-1)),]
-output
Name value
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
14 Rosa 14
15 Albert 15
16 Stromae 16
Solution 2:[2]
You can use cumsum like so:
df %>%
mutate(cum = lag(cumsum(Name == "group")) + cumsum(Name == "section")) %>%
filter(cum %% 2 == 1)
Or, using sequence in base R:
start <- which(df$Name == "group")
end <- which(df$Name == "section")
df[sequence(end-start-1, start+1),]
output
# A tibble: 7 x 3
Name value cum
<chr> <chr> <int>
1 Jean 4 1
2 Philippe 5 1
3 Celine 6 1
4 Dion 7 1
5 Rosa 14 3
6 Albert 15 3
7 Stromae 16 3
Solution 3:[3]
Another possible solution:
library(tidyverse)
df %>%
mutate(aux1 = if_else(Name == "group", 1, NA_real_),
aux2 = if_else(Name == "section", 2, NA_real_),
aux = coalesce(aux1, aux2)) %>%
fill(aux) %>%
filter(aux == 1 & Name != "group") %>%
select(Name, value)
#> Name value
#> 1 Jean 4
#> 2 Philippe 5
#> 3 Celine 6
#> 4 Dion 7
#> 5 Rosa 14
#> 6 Albert 15
#> 7 Stromae 16
Solution 4:[4]
The following use a sort of state machine where the last state is found in the .x term within the accumulate cycle and the input is in .y:
library(dplyr)
library(purrr)
df |>
mutate(state = accumulate(tail(Name, -1), ~{
if (.y == "section")
"end"
else
if (.y == "group")
"start"
else
if (.x == "start")
"within"
else
if (.x == "end")
"outside"
else
.x
}, .init = "outside"))
##> Name value state
##>1 x1 1 outside
##>2 NA 2 outside
##>3 group 3 start
##>4 Jean 4 within
##>5 Philippe 5 within
##>6 Celine 6 within
##>7 Dion 7 within
##>8 section 8 end
##>9 NA 9 outside
##>10 y2 10 outside
##>11 z1 11 outside
##>12 NA 12 outside
##>13 group 13 start
##>14 Rosa 14 within
##>15 Albert 15 within
##>16 Stromae 16 within
##>17 section 17 end
##>18 NA 18 outside
##>19 abc 19 outside
##>20 something 20 outside
Then you can filter by state == "within" to obtain:
#> Name value
#> 1 Jean 4
#> 2 Philippe 5
#> 3 Celine 6
#> 4 Dion 7
#> 5 Rosa 14
#> 6 Albert 15
#> 7 Stromae 16
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | akrun |
| Solution 2 | |
| Solution 3 | PaulS |
| Solution 4 |
