'Filter out sequences of records in R/Python
I have a table of sequential records of different individuals that look something like this:
(Setting this up in R, although Python solutions are also OK.)
set.seed(22)
df <- data.frame(
score = c(1, 4, 15, 12, 6, 7, 12, 7, 8, 5, 12, 19, 1, 4, 7, 12, 3, 5, 9, 4),
info = sample(1:100, 20)
)
| score | info |
|---|---|
| 1 | 9 |
| 4 | 88 |
| 15 | 74 |
| 12 | 94 |
| 6 | 44 |
| 7 | 59 |
| 12 | 81 |
| 7 | 67 |
| 8 | 48 |
| 5 | 16 |
| 12 | 58 |
| 19 | 72 |
| 1 | 62 |
| 4 | 31 |
| 7 | 65 |
| 12 | 49 |
| 3 | 21 |
| 5 | 68 |
| 9 | 33 |
| 4 | 32 |
The goal is to filter out certain rows of the table based on the score sequence. Here are the rules in human logic. Take the scores as a list:
1, 4, 15, 12, 6, 7, 12, 7, 8, 5 12, 19, 1, 4, 7 12, 3, 5, 9, 4
Step 1: Divide the list into sub-lists by instances of score == 12.
(The last sub-list is included regardless if it ends with the score == 12. And if the list do not contain any instances of score == 12, there would be one sub-list, i.e.: the entire list.)
[1, 4, 15, 12], [6, 7, 12], [7, 8, 5, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]
Step 2: Filter out all sub-lists that do not contain one or more instances of scores == 4.
[1, 4, 15, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]
Step 3: Then filter the data frame to include the rows corresponding to these scores only.
| score | info |
|---|---|
| 1 | 9 |
| 4 | 88 |
| 15 | 74 |
| 12 | 94 |
| 19 | 72 |
| 1 | 62 |
| 4 | 31 |
| 7 | 65 |
| 12 | 49 |
| 3 | 21 |
| 5 | 68 |
| 9 | 33 |
| 4 | 32 |
Note that the scores are neither unique nor ordinal.
What would be the best way to implement this routine in either R or Python, hopefully without using a loop?
Solution 1:[1]
Here is a base R method.
- First create a
vecvector containing the index position wheredf$score == 12. Also include 0 andnow(df)to it, which will be used later inlapply - Create sublists
mylistthat ends withscore == 12.deframeis used to create a named vector from a two-column dataframe usinglapply - Use the
%in%operator to see if 4 is contained in the sublist, remove ones that don't contain 4 - And use
enframeto create a two-column dataframe from a named vector.
vec <- sort(c(0, which(df$score == 12), nrow(df)))
mylist <- lapply(1:(length(vec) - 1), function(x) deframe(df[2:1])[(vec[x]+1):vec[x+1]])
enframe(mylist[sapply(mylist, function(x) 4 %in% x)] %>% unlist(), value = "score", name = "info")[2:1]
# A tibble: 13 × 2
score info
<int> <chr>
1 1 9
2 4 88
3 15 74
4 12 94
5 19 72
6 1 62
7 4 31
8 7 65
9 12 49
10 3 21
11 5 68
12 9 33
13 4 32
Solution 2:[2]
Using base R, we can use cumsum on a logical vector (score == 12) to create a grouping column and then filter if there is a 4 %in% score to keep only those groups and finally remove the grp column created - dplyr is used for the piping and the lag
library(dplyr)
df %>%
group_by(grp = lag(cumsum(score == 12), default = 0)) %>%
filter(4 %in% score) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 13 × 2
score info
<dbl> <int>
1 1 9
2 4 88
3 15 74
4 12 94
5 19 72
6 1 62
7 4 31
8 7 65
9 12 49
10 3 21
11 5 68
12 9 33
13 4 32
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
