'How to choose one row out of several of equal value in a column, depending on the highest value of another column?
So I have a data frame such as this, containing species names, IDs, markers, and size of DNA sequences:
Species | ID| marker| size
-----------------------------------------------------
Tilapia guineensis | 1| co1 | 400
Tilapia guineensis | 1| co2 | 300
Tilapia guineensis | 1| co2 | 700
Tilapia guineensis | 1| co2 | 900
Tilapia zillii | 2| co5 | 600
Tilapia zillii | 2| coi8| 200
Tilapia zillii | 2| coi8| 500
Eutrigla gurnardus | 5| co1 | 100
Eutrigla gurnardus | 5| co2 | 200
Sprattus sprattus | 6| co3 | 300
Sprattus sprattus | 6| co4 | 400
Sardinia pichardus | 7| co1 | 800
Sardinia pichardus | 7| co2 | 800
I would like to keep only one row for each ID, but I want to keep the row that has the highest value in the size column, regardless of the species and marker columns. If the rows have equal ID and equal size, I would like to just keep one randomly. My output would be:
Species | ID| marker| size
-----------------------------------------------------
Tilapia guineensis | 1| co2 | 900
Tilapia zillii | 2| co5 | 600
Eutrigla gurnardus | 5| co2 | 200
Sprattus sprattus | 6| co4 | 400
Sardinia pichardus | 7| co2 | 800
Solution 1:[1]
A possible solution:
library(dplyr)
df %>%
group_by(ID) %>%
slice_max(size)
#> # A tibble: 4 × 4
#> # Groups: ID [4]
#> Species ID marker size
#> <chr> <int> <chr> <int>
#> 1 Tilapia guineensis 1 co2 900
#> 2 Tilapia zillii 2 co5 600
#> 3 Eutrigla gurnardus 5 co2 200
#> 4 Sprattus sprattus 6 co4 400
Solution 2:[2]
data.table
option:
library(data.table)
setDT(df)[, .SD[which.max(size)], by=ID]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Quinten |