'How to keep unique list-column values using data.table in R?

I have a dataframe that identifies a set of values with an id. Let's say:

library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

As you can see, an id identifies many row of values, not a single one. What I want to do is to keep only the ids that identify the first occurrence of a group of values. For example, a and c in the dataframe above identify the same group of values, so I'd like the result to be:

dt[1:4] # desired output
#>    id value1 value2
#> 1:  a      1      0
#> 2:  a      1      3
#> 3:  b      1      0
#> 4:  b      2      3

I figured out that a simple way of doing so would be to nest the group of values into dataframes as well, and then keep only the unique entries based on this new nested dataframe column. Nesting is done quite simply:

dt <- dt[, .(data = list(.SD)), by = id]
dt
#>    id              data
#> 1:  a <data.table[2x2]>
#> 2:  b <data.table[2x2]>
#> 3:  c <data.table[2x2]>

But the actual "keeping unique occurrences" is quite tricky, apparently. I tried two different approaches, but both of them failed. First, using unique.data.table. But the by argument doesn't yet support list columns:

dt <- unique(dt, by = "data")
#> Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE): Column 1 passed to [f]order is type 'list', not yet supported.

And then using .I[], but then I can't pass a list column to the by argument:

dt <- dt[dt[, .I[1], by = data]$V1]
#> Error: column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]

I spent quite some time on this, but I can't seem to figure out how to achieve what I want. I'm not necessarily attached to the "nest -> keep unique dataframe" path, but this is the only way of solving the problem I could come up with.



Solution 1:[1]

We may use duplicated with unnest

library(tidyr)
dt[, .(data = list(.SD)), by = id][!duplicated(data)] %>%
   unnest(data)

-output

# A tibble: 4 × 3
  id    value1 value2
  <chr>  <dbl>  <dbl>
1 a          1      0
2 a          1      3
3 b          1      0
4 b          2      3

Solution 2:[2]

In the spirit of @akrun answer, but keeping data.table as the only dependency:

library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

dt <- dt[, .(data = list(.SD)), by = id]
dt <- dt[!duplicated(data)]
dt[, unlist(data, recursive = FALSE), by = id]
#>    id value1 value2
#> 1:  a      1      0
#> 2:  a      1      3
#> 3:  b      1      0
#> 4:  b      2      3

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 akrun
Solution 2 dhersz