'Which function do I need to apply to get tail of 20 for "a" group not "b" group

I have modified my data frame into group "a" and "b", but currently I want to get all the tail of 20 for group a and not group b. Here is the sample data:

#>      id time status displacement group
#> 1    15   1     2       3.4        a
#> 2    15   1     2       3.4        a
#> 3    15   1     2       3.4        a
#> 4    15   1     2       3.4        a
#> 5    15   1     2       3.4        a
#> 6    15   1     2       3.4        a
#> 7    15   1     2       3.4        a
#> 8    15   1     2       3.4        a
#> 9    15   1     2       3.4        b
#> 10   15   1     2       3.4        b
#> 11   15   1     2       3.4        b
#> 12   15   1     2       3.4        b
#> 13   15   1     2       3.4        b
#> 14   15   1     2       3.4        a
#> 15   15   1     2       3.4        a
#> 16   15   1     2       3.4        a
#> 17   15   1     2       3.4        a
#> 18   15   1     2       3.4        a
#> 19   15   1     2       3.4        a
#> 20   15   1     2       3.4        a
#> 21   15   1     2       3.4        a
#> 22   15   1     2       3.4        a
#> 23   15   1     2       3.4        a
#> 24   15   1     2       3.4        a
#> 25   15   1     2       3.4        a
#> 26   15   1     2       3.4        b
#> 27   15   1     2       3.4        b
#> 28   15   1     2       3.4        b
#> 29   15   1     2       3.4        b
#> 30   15   1     2       3.4        b
and so on with this pattern

I only want to get the tail of each group a of tail rows (say 5 rows), and group b remain the same.

Desire output:

#>    id time status displacement group
#> 4  15   1     2       3.4        a
#> 5  15   1     2       3.4        a
#> 6  15   1     2       3.4        a
#> 7  15   1     2       3.4        a
#> 8  15   1     2       3.4        a
#> 9  15   1     2       3.4        b
#> 10 15   1     2       3.4        b
#> 11 15   1     2       3.4        b
#> 12 15   1     2       3.4        b
#> 13 15   1     2       3.4        b
#> 14 15   1     2       3.4        a
#> 15 15   1     2       3.4        a
#> 16 15   1     2       3.4        a
#> 17 15   1     2       3.4        a
#> 18 15   1     2       3.4        a
#> 19 15   1     2       3.4        b
#> 20 15   1     2       3.4        b
#> 21 15   1     2       3.4        b
#> 22 15   1     2       3.4        b
#> 23 15   1     2       3.4        b
and so on with this pattern

I know I have to use the group_by function to group all the same group together However, I know if I group them up and tail the number, it will apply all the group

How can I achieve it? Thanks



Solution 1:[1]

Using DF shown reproducibly in the Note at the end

1) filter use filter to choose those rows not in group a or have a row number within the last 5 of that group.

library(dplyr, exclude = c("filter", "lag"))

DF %>%
  group_by(group) %>%
  dplyr::filter(cur_group()[[1]] != "a" | row_number() > n() - 5) %>%
  ungroup

giving:

# A tibble: 15 x 5
      id  time status displacement group
   <int> <int>  <int>        <dbl> <chr>
 1    15     1      2          3.4 b    
 2    15     1      2          3.4 b    
 3    15     1      2          3.4 b    
 4    15     1      2          3.4 b    
 5    15     1      2          3.4 b    
 6    15     1      2          3.4 a    
 7    15     1      2          3.4 a    
 8    15     1      2          3.4 a    
 9    15     1      2          3.4 a    
10    15     1      2          3.4 a    
11    15     1      2          3.4 b    
12    15     1      2          3.4 b    
13    15     1      2          3.4 b    
14    15     1      2          3.4 b    
15    15     1      2          3.4 b    

2) group_modify An alternative is to use group_modify. Since the grouping variable is not passed to the function in group_modify we make a copy of it and then delete the copy at the end. This gives the same result as (1).

library(dplyr, exclude = c("filter", "lag"))

DF %>%
  group_by(group2 = group) %>%
  group_modify(~ { if (.$group[1] == "a") slice_tail(., n = 5) else . }) %>%
  ungroup %>%
  select(- group2)

3) by This base R approach uses by and returns the tail if the group is a and return all the rows otherwise. by produces a by class list so put it back together again afterwards.

by(DF, DF$group, function(x) if (x$group[1] == "a") tail(x, 5) else x) |>
  do.call(what = "rbind")

4) subset This is another base R approach. We define a function which is like seq_along but returns the result in reverse order and then use subset to choose groups not equal to a or for which the reverse sequence number is less than or equal to 5.

rev_seq <-  function(x) rev(seq_along(x))
subset(DF, group != "a" | ave(seq_along(group), group, FUN = rev_seq) <= 5)

Note

Lines <- "id time status displacement group
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        b
15   1     2       3.4        b
15   1     2       3.4        b
15   1     2       3.4        b
15   1     2       3.4        b
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        a
15   1     2       3.4        b
15   1     2       3.4        b
15   1     2       3.4        b
15   1     2       3.4        b
15   1     2       3.4        b"
DF <- read.table(text = Lines, header = TRUE)

Added

If the comment below means that we want to keep only the last 5 rows of each set of consecutive rows of group a plus all rows of other groups then make these changes:

# 1
DF %>%
  group_by(subgroup = data.table::rleid(group)) %>%
  dplyr::filter(group != "a" | row_number() > n() - 5) %>%
  ungroup  %>%
  select(- subgroup)

# 2
DF %>%
  group_by(subgroup = data.table::rleid(group)) %>%
  group_modify(~ { if (.$group[1] == "a") slice_tail(., n = 5) else . }) %>%
  ungroup %>%
  select(- subgroup)

# 3
by(DF, data.table::rleid(DF$group), 
     function(x) if (x$group[1] == "a") tail(x, 5) else x) |>
  do.call(what = "rbind")

# 4
rev_seq <-  function(x) rev(seq_along(x))
subset(DF, group != "a" | 
  ave(seq_along(group), data.table::rleid(group), FUN = rev_seq) <= 5)

or maybe not use dplyr at all and just use data.table

library(data.table)

DT <- data.table(DF)
DT[, if (group[1] != "a") .SD else tail(.SD, 5), by = .(group, rleid(group))][, -2]

Solution 2:[2]

If we assume your data is stored in a data.frame called dt:

a_rows <- which(dt$group == "a")
b_rows <- which(dt$group == "b")

rows <- sort(c(a_rows[(length(a_rows) - 5):length(a_rows)], b_rows)

dt[rows, ]

Solution 3:[3]

You may subtract the cumsum of which are in group a from the total sum and compare the result with the desired tail length atail (in example obviously 7) to create boolean subset.

atail <- 7
dat[with(dat, sum(group == 'a') - cumsum(group == 'a') + 1) <= atail |
      dat$group == 'b', ]
#    id time status displacement group
# 9  15    1      2          3.4     b
# 10 15    1      2          3.4     b
# 11 15    1      2          3.4     b
# 12 15    1      2          3.4     b
# 13 15    1      2          3.4     b
# 19 15    1      2          3.4     a
# 20 15    1      2          3.4     a
# 21 15    1      2          3.4     a
# 22 15    1      2          3.4     a
# 23 15    1      2          3.4     a
# 24 15    1      2          3.4     a
# 25 15    1      2          3.4     a
# 26 15    1      2          3.4     b
# 27 15    1      2          3.4     b
# 28 15    1      2          3.4     b
# 29 15    1      2          3.4     b
# 30 15    1      2          3.4     b

Data:

dat <- structure(list(id = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L), time = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), status = c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), displacement = c(3.4, 
3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 
3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 
3.4, 3.4, 3.4), group = c("a", "a", "a", "a", "a", "a", "a", 
"a", "b", "b", "b", "b", "b", "a", "a", "a", "a", "a", "a", "a", 
"a", "a", "a", "a", "a", "b", "b", "b", "b", "b")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30"))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3