R: dplyr and row_number() does not enumerate as expected

I want to number each row of a data frame/tibble that results from a grouped summary. The index should follow a defined order. If I use row_number(), it does enumerate, but only within each group. I want it to enumerate across the whole table, without considering the former grouping.

Here is an example. To keep it simple, I use a minimal data frame:

library(dplyr)

df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
                 , x2 = rep(letters[1:2], 2)
                 , y = floor(abs(rnorm(4)*10))
)
df0
#   x1 x2  y
# 1  A  a 12
# 2  A  b 24
# 3  B  a  0
# 4  B  b 12

Now, I group this table:

 df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))

This gives me an object of class tibble:

 # A tibble: 4 x 3
 # Groups:   x1 [?]
 #   x1    x2        y
 #   <fct> <fct> <dbl>
 # 1 A     a        12
 # 2 A     b        24
 # 3 B     a         0
 # 4 B     b        12

I want to add a row number to this table using row_number():

 df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
 df2
 # A tibble: 4 x 4
 # Groups:   x1 [2]
 #   x1    x2        y index
 #   <fct> <fct> <dbl> <int>
 # 1 A     b        24     1
 # 2 A     a        12     2
 # 3 B     b        12     1
 # 4 B     a         0     2

row_number() enumerates within the former grouping, which was not my intention. This can be avoided by converting the tibble to a data frame first:

 df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
 df2
 #   x1 x2  y index
 # 1  A  b 24     1
 # 2  A  a 12     2
 # 3  B  b 12     3
 # 4  B  a  0     4

My question is: is this behaviour intended? If so, isn't it dangerous to carry earlier processing steps along in the tibble? Which kinds of processing are carried along? For now, I convert the tibble to a data frame to avoid this kind of unexpected result.



Solution 1:[1]

As camille nicely showed, there are good reasons for summarize() to keep the remaining layers of grouping, and this is documented behaviour, so it is not really dangerous or unexpected per se.
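
To see which grouping survives a summarize() call, you can inspect the result directly. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where the .groups argument exists); the y values are fixed to the ones printed in the question rather than drawn with rnorm(), purely so the result is reproducible.

```r
library(dplyr)

# Fixed y values (copied from the question's printed df0) instead of rnorm()
df0 <- data.frame(
  x1 = rep(LETTERS[1:2], each = 2),
  x2 = rep(letters[1:2], 2),
  y  = c(12, 24, 0, 12)
)

grouped <- df0 %>% group_by(x1, x2)

# "drop_last" spelled out explicitly: summarize() peels off the last
# grouping variable (x2) and keeps the rest (x1)
df1 <- grouped %>% summarize(y = sum(y), .groups = "drop_last")

group_vars(grouped)  # grouped by x1 and x2
group_vars(df1)      # only x1 remains
is_grouped_df(df1)   # still a grouped tibble
```

Because df1 is still grouped by x1, a later mutate(index = row_number()) restarts the count for each value of x1.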

However, one additional tip: if you are just going to call ungroup() after summarize(), you might as well use summarize(.groups = "drop"), which returns an ungrouped tibble and saves you a line of code.

library(tidyverse)

df0 <- data.frame(
  x1 = rep(LETTERS[1:2], each = 2),
  x2 = rep(letters[1:2], 2),
  y = floor(abs(rnorm(4) * 10))
)

df0 %>% 
  group_by(x1,x2) %>% 
  summarize(y=sum(y), .groups = "drop") %>% 
  arrange(desc(y)) %>% 
  mutate(index = row_number())
#> # A tibble: 4 x 4
#>   x1    x2        y index
#>   <chr> <chr> <dbl> <int>
#> 1 A     b         8     1
#> 2 A     a         2     2
#> 3 B     a         2     3
#> 4 B     b         1     4

Created on 2022-02-06 by the reprex package (v2.0.1)
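
For comparison, here is a sketch of the ungroup() variant mentioned above. It should give the same numbering as .groups = "drop"; again the y values are fixed (taken from the question's printed output) so the result does not depend on rnorm(), and .groups = "drop_last" is written out only to make the intermediate grouping explicit.

```r
library(dplyr)

# Fixed y values (copied from the question's printed df0) instead of rnorm()
df0 <- data.frame(
  x1 = rep(LETTERS[1:2], each = 2),
  x2 = rep(letters[1:2], 2),
  y  = c(12, 24, 0, 12)
)

df2 <- df0 %>%
  group_by(x1, x2) %>%
  summarize(y = sum(y), .groups = "drop_last") %>%  # result is still grouped by x1
  ungroup() %>%                                     # remove the remaining grouping
  arrange(desc(y)) %>%
  mutate(index = row_number())                      # now counts over the whole table

df2$index           # runs from 1 to 4 across all rows
is_grouped_df(df2)  # FALSE
```

Unlike as.data.frame(), ungroup() keeps the result a tibble and only removes the grouping.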

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dan Adams