'Which function do I need to apply to get tail of 20 for "a" group not "b" group
I have modified my data frame into group "a" and "b", but currently I want to get all the tail of 20 for group a and not group b. Here is the sample data:
#> id time status displacement group
#> 1 15 1 2 3.4 a
#> 2 15 1 2 3.4 a
#> 3 15 1 2 3.4 a
#> 4 15 1 2 3.4 a
#> 5 15 1 2 3.4 a
#> 6 15 1 2 3.4 a
#> 7 15 1 2 3.4 a
#> 8 15 1 2 3.4 a
#> 9 15 1 2 3.4 b
#> 10 15 1 2 3.4 b
#> 11 15 1 2 3.4 b
#> 12 15 1 2 3.4 b
#> 13 15 1 2 3.4 b
#> 14 15 1 2 3.4 a
#> 15 15 1 2 3.4 a
#> 16 15 1 2 3.4 a
#> 17 15 1 2 3.4 a
#> 18 15 1 2 3.4 a
#> 19 15 1 2 3.4 a
#> 20 15 1 2 3.4 a
#> 21 15 1 2 3.4 a
#> 22 15 1 2 3.4 a
#> 23 15 1 2 3.4 a
#> 24 15 1 2 3.4 a
#> 25 15 1 2 3.4 a
#> 26 15 1 2 3.4 b
#> 27 15 1 2 3.4 b
#> 28 15 1 2 3.4 b
#> 29 15 1 2 3.4 b
#> 30 15 1 2 3.4 b
and so on with this pattern
I only want to get the tail of each group a of tail rows (say 5 rows), and group b remain the same.
Desire output:
#> id time status displacement group
#> 4 15 1 2 3.4 a
#> 5 15 1 2 3.4 a
#> 6 15 1 2 3.4 a
#> 7 15 1 2 3.4 a
#> 8 15 1 2 3.4 a
#> 9 15 1 2 3.4 b
#> 10 15 1 2 3.4 b
#> 11 15 1 2 3.4 b
#> 12 15 1 2 3.4 b
#> 13 15 1 2 3.4 b
#> 14 15 1 2 3.4 a
#> 15 15 1 2 3.4 a
#> 16 15 1 2 3.4 a
#> 17 15 1 2 3.4 a
#> 18 15 1 2 3.4 a
#> 19 15 1 2 3.4 b
#> 20 15 1 2 3.4 b
#> 21 15 1 2 3.4 b
#> 22 15 1 2 3.4 b
#> 23 15 1 2 3.4 b
and so on with this pattern
I know I have to use the group_by function to group all the same group together However, I know if I group them up and tail the number, it will apply all the group
How can I achieve it? Thanks
Solution 1:[1]
Using DF shown reproducibly in the Note at the end
1) filter use filter to choose those rows not in group a or have a row number within the last 5 of that group.
library(dplyr, exclude = c("filter", "lag"))
DF %>%
group_by(group) %>%
dplyr::filter(cur_group()[[1]] != "a" | row_number() > n() - 5) %>%
ungroup
giving:
# A tibble: 15 x 5
id time status displacement group
<int> <int> <int> <dbl> <chr>
1 15 1 2 3.4 b
2 15 1 2 3.4 b
3 15 1 2 3.4 b
4 15 1 2 3.4 b
5 15 1 2 3.4 b
6 15 1 2 3.4 a
7 15 1 2 3.4 a
8 15 1 2 3.4 a
9 15 1 2 3.4 a
10 15 1 2 3.4 a
11 15 1 2 3.4 b
12 15 1 2 3.4 b
13 15 1 2 3.4 b
14 15 1 2 3.4 b
15 15 1 2 3.4 b
2) group_modify An alternative is to use group_modify. Since the grouping variable is not passed to the function in group_modify we make a copy of it and then delete the copy at the end. This gives the same result as (1).
library(dplyr, exclude = c("filter", "lag"))
DF %>%
group_by(group2 = group) %>%
group_modify(~ { if (.$group[1] == "a") slice_tail(., n = 5) else . }) %>%
ungroup %>%
select(- group2)
3) by This base R approach uses by and returns the tail if the group is a and return all the rows otherwise. by produces a by class list so put it back together again afterwards.
by(DF, DF$group, function(x) if (x$group[1] == "a") tail(x, 5) else x) |>
do.call(what = "rbind")
4) subset This is another base R approach. We define a function which is like seq_along but returns the result in reverse order and then use subset to choose groups not equal to a or for which the reverse sequence number is less than or equal to 5.
rev_seq <- function(x) rev(seq_along(x))
subset(DF, group != "a" | ave(seq_along(group), group, FUN = rev_seq) <= 5)
Note
Lines <- "id time status displacement group
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 b
15 1 2 3.4 b
15 1 2 3.4 b
15 1 2 3.4 b
15 1 2 3.4 b
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 a
15 1 2 3.4 b
15 1 2 3.4 b
15 1 2 3.4 b
15 1 2 3.4 b
15 1 2 3.4 b"
DF <- read.table(text = Lines, header = TRUE)
Added
If the comment below means that we want to keep only the last 5 rows of each set of consecutive rows of group a plus all rows of other groups then make these changes:
# 1
DF %>%
group_by(subgroup = data.table::rleid(group)) %>%
dplyr::filter(group != "a" | row_number() > n() - 5) %>%
ungroup %>%
select(- subgroup)
# 2
DF %>%
group_by(subgroup = data.table::rleid(group)) %>%
group_modify(~ { if (.$group[1] == "a") slice_tail(., n = 5) else . }) %>%
ungroup %>%
select(- subgroup)
# 3
by(DF, data.table::rleid(DF$group),
function(x) if (x$group[1] == "a") tail(x, 5) else x) |>
do.call(what = "rbind")
# 4
rev_seq <- function(x) rev(seq_along(x))
subset(DF, group != "a" |
ave(seq_along(group), data.table::rleid(group), FUN = rev_seq) <= 5)
or maybe not use dplyr at all and just use data.table
library(data.table)
DT <- data.table(DF)
DT[, if (group[1] != "a") .SD else tail(.SD, 5), by = .(group, rleid(group))][, -2]
Solution 2:[2]
If we assume your data is stored in a data.frame called dt:
a_rows <- which(dt$group == "a")
b_rows <- which(dt$group == "b")
rows <- sort(c(a_rows[(length(a_rows) - 5):length(a_rows)], b_rows)
dt[rows, ]
Solution 3:[3]
You may subtract the cumsum of which are in group a from the total sum and compare the result with the desired tail length atail (in example obviously 7) to create boolean subset.
atail <- 7
dat[with(dat, sum(group == 'a') - cumsum(group == 'a') + 1) <= atail |
dat$group == 'b', ]
# id time status displacement group
# 9 15 1 2 3.4 b
# 10 15 1 2 3.4 b
# 11 15 1 2 3.4 b
# 12 15 1 2 3.4 b
# 13 15 1 2 3.4 b
# 19 15 1 2 3.4 a
# 20 15 1 2 3.4 a
# 21 15 1 2 3.4 a
# 22 15 1 2 3.4 a
# 23 15 1 2 3.4 a
# 24 15 1 2 3.4 a
# 25 15 1 2 3.4 a
# 26 15 1 2 3.4 b
# 27 15 1 2 3.4 b
# 28 15 1 2 3.4 b
# 29 15 1 2 3.4 b
# 30 15 1 2 3.4 b
Data:
dat <- structure(list(id = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L,
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L,
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L), time = c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), status = c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), displacement = c(3.4,
3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4,
3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4,
3.4, 3.4, 3.4), group = c("a", "a", "a", "a", "a", "a", "a",
"a", "b", "b", "b", "b", "b", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "b", "b", "b", "b", "b")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30"))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 |
