'Taking only the maximum values of duplicate IDs for all columns of a data frame in R

I have data frame of 24525 rows and 22 columns. Last column is the ID column, other are numeric. Number of unique IDs is 18414 and some IDs are repeated more than 2 times.

I need to remove duplicate IDs and keep only the maximum value for each ID and each column in another data frame.

I tried sorting each column in a for loop and remove duplicates to keep the maximum but it did not work and I am not sure.

Is anyone knows a way to do this task?

Thank you in advance



Solution 1:[1]

Fake data:

mt <- mtcars
mt$cyl <- as.character(mt$cyl)

Base R

aggregate(. ~ cyl, data = mt, FUN = max)
#   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
# 1   4 33.9 146.7 113 4.93 3.190 22.90  1  1    5    2
# 2   6 21.4 258.0 175 3.92 3.460 20.22  1  1    5    6
# 3   8 19.2 472.0 335 4.22 5.424 18.00  0  1    5    8

(Thanks to @GregorThomas for the tweaks on this.)

tidyverse

library(dplyr)
mt %>%
  group_by(cyl) %>%
  summarize_all(max)
# # A tibble: 3 x 11
#   cyl     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4      33.9  147.   113  4.93  3.19  22.9     1     1     5     2
# 2 6      21.4  258    175  3.92  3.46  20.2     1     1     5     6
# 3 8      19.2  472    335  4.22  5.42  18       0     1     5     8

Update: as suggested by @akrun, dplyr::summarize_all has been superseded. From the ?summarize_all:

Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.

Updated code:

mt %>%
  group_by(cyl) %>%
  summarize(across(everything(), max))
# # A tibble: 3 x 11
#   cyl     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4      33.9  147.   113  4.93  3.19  22.9     1     1     5     2
# 2 6      21.4  258    175  3.92  3.46  20.2     1     1     5     6
# 3 8      19.2  472    335  4.22  5.42  18       0     1     5     8

data.table

library(data.table)
setDT(mt)
mt[, lapply(.SD, max), by=.(cyl)]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1