'Unnest/Unlist column with variable lengths in R

I'm trying to wrangle a json file into a data frame but I'm having trouble when unnesting to a wider format because the columns contain lists of unequal sizes. In addition, it appears that one column of lists contains what should be the name of the new column and the other column contains the data.

I apologize for not including a repex but I don't know how to create a data frame with lists. I've included a screen shot to hopefully highlight the issue.

nested data frame

When I attempt to unnest with raw_json2 %>% unnest_wider(value) I get the following error: Error: Can't combine `..1$...1` <character> and `..27$...1` <list>.

It appears that the column type contains what should be the column names, and that 'value' contains the values for those columns. So essentially columns would be added to my data frame from type and data added from value.

EDIT------------------------------ Sample data. I put the first 30 as there some tricky lists at the end.

> dput(head(raw_json2, n=30))

structure(list(id = c(112357710L, 112357713L, 112357714L, 112357717L, 
112357719L, 112357723L, 112357727L, 112357730L, 112357732L, 112357736L, 
112357737L, 112357738L, 112357744L, 112357745L, 112357746L, 112357747L, 
112357759L, 112357760L, 112357761L, 112357764L, 112357765L, 112357766L, 
112357767L, 112357775L, 112357777L, 112357780L, 112357782L, 112357783L, 
112357784L, 112357791L), start_sec = c("00:00:19.000000", "00:45:34.000000", 
"00:02:19.000000", "00:00:49.000000", "00:00:46.000000", "00:00:59.000000", 
"00:01:17.000000", "00:01:29.000000", "00:01:43.000000", "00:02:02.000000", 
"00:02:04.000000", "00:02:06.000000", "00:02:14.000000", "00:02:36.000000", 
"00:02:22.000000", "00:02:46.000000", "00:02:52.000000", "00:02:48.000000", 
"00:03:00.000000", "00:03:02.000000", "00:03:10.000000", "00:03:12.000000", 
"00:04:01.000000", "00:03:27.000000", "00:04:15.000000", "00:03:53.000000", 
"00:04:00.000000", "00:04:03.000000", "00:05:05.000000", "00:05:06.000000"
), end_sec = c("00:00:19.000000", "00:45:34.000000", "00:02:19.000000", 
"00:00:49.000000", "00:00:46.000000", "00:00:59.000000", "00:01:17.000000", 
"00:01:29.000000", "00:01:43.000000", "00:02:02.000000", "00:02:04.000000", 
"00:02:06.000000", "00:02:14.000000", "00:02:36.000000", "00:02:22.000000", 
"00:02:46.000000", "00:02:52.000000", "00:02:48.000000", "00:03:00.000000", 
"00:03:02.000000", "00:03:10.000000", "00:03:12.000000", "00:04:01.000000", 
"00:03:27.000000", "00:04:15.000000", "00:03:53.000000", "00:04:00.000000", 
"00:04:03.000000", "00:05:05.000000", "00:05:06.000000"), type = list(
    c("teamNames", "list"), "list", "teamNames", "teamNames", 
    c("teamNames", "list"), "teamNames", "teamNames", "teamNames", 
    "teamNames", c("teamNames", "list"), "teamNames", "teamNames", 
    "teamNames", "teamNames", "teamNames", "teamNames", "teamNames", 
    "teamNames", "teamNames", "teamNames", "teamNames", "teamNames", 
    "teamNames", c("teamNames", "list"), "teamNames", c("teamNames", 
    "list"), c("teamNames", "list", "chartPoint", "chartPoint"
    ), "teamNames", "teamNames", "teamNames"), value = list(c("Real Madrid", 
"kickoff"), "1", "Real Madrid", "Real Madrid", c("Real Madrid", 
"throw in"), "Real Madrid", "Barcelona", "Real Madrid", "Barcelona", 
    c("Real Madrid", "free kick"), "Real Madrid", "Real Madrid", 
    "Real Madrid", "Real Madrid", "Barcelona", "Barcelona", "Barcelona", 
    "Real Madrid", "Real Madrid", "Barcelona", "Barcelona", "Real Madrid", 
    "Real Madrid", c("Real Madrid", "throw in"), "Real Madrid", 
    c("Real Madrid", "corner kick"), list("Real Madrid", "save", 
        list(x = 483.51837158203, y = 397.89303588867, x2 = 0L, 
            y2 = 0L, type = "point", sector = 1L, orientation = "left"), 
        list(x = 274.94967651367, y = 404.6828918457, x2 = 0L, 
            y2 = 0L, type = "point", sector = 3L, orientation = "left")), 
    "Barcelona", "Barcelona", "Barcelona")), row.names = c(NA, 
-30L), class = c("tbl_df", "tbl", "data.frame"))


Solution 1:[1]

Here's one way. The tricky part indeed is that a) your value column has list elements of different length and b) the names stored in type also differ per row, soe we can't use them for naming the new unnested columns (hence I decided for a generic naming):

df |>
  mutate(value2 = map2(value, max(lengths(type)), ~ `length<-`(.x, .y)),
         value2 = map(value2, as.list),
         value2 = map(value2, setNames, paste0("value_", 1:4))) |> 
  unnest_wider(value2)

What we do here is:

  • First we need to make sure that all your list elements have the same length, which is the maximum of all possible lenghts (in your case 4). Inspiration is taken from (how else could it be) akrun's solution here: https://stackoverflow.com/a/57152605/2725773 (give him credit for it!)
  • We then need to convert each row of value2 into a list, because they are of different type (e.g. row 27 again has lists, where most other columns don't).
  • We then give each of the 4 list columns a name.
  • Then we can unnest_wider

The result (showing rows 1:5 and 26:30):

# A tibble: 10 × 9
          id start_sec       end_sec         type      value      value_1     value_2     value_3          value_4         
       <int> <chr>           <chr>           <list>    <list>     <chr>       <chr>       <list>           <list>          
 1 112357710 00:00:19.000000 00:00:19.000000 <chr [2]> <chr [2]>  Real Madrid kickoff     <chr [1]>        <chr [1]>       
 2 112357713 00:45:34.000000 00:45:34.000000 <chr [1]> <chr [1]>  1           NA          <chr [1]>        <chr [1]>       
 3 112357714 00:02:19.000000 00:02:19.000000 <chr [1]> <chr [1]>  Real Madrid NA          <chr [1]>        <chr [1]>       
 4 112357717 00:00:49.000000 00:00:49.000000 <chr [1]> <chr [1]>  Real Madrid NA          <chr [1]>        <chr [1]>       
 5 112357719 00:00:46.000000 00:00:46.000000 <chr [2]> <chr [2]>  Real Madrid throw in    <chr [1]>        <chr [1]>       
 6 112357780 00:03:53.000000 00:03:53.000000 <chr [2]> <chr [2]>  Real Madrid corner kick <chr [1]>        <chr [1]>       
 7 112357782 00:04:00.000000 00:04:00.000000 <chr [4]> <list [4]> Real Madrid save        <named list [7]> <named list [7]>
 8 112357783 00:04:03.000000 00:04:03.000000 <chr [1]> <chr [1]>  Barcelona   NA          <chr [1]>        <chr [1]>       
 9 112357784 00:05:05.000000 00:05:05.000000 <chr [1]> <chr [1]>  Barcelona   NA          <chr [1]>        <chr [1]>       
10 112357791 00:05:06.000000 00:05:06.000000 <chr [1]> <chr [1]>  Barcelona   NA          <chr [1]>        <chr [1]>  

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1