Extracting data from irregular lists using purrr::map()

Given a list with several elements, the goal is to get them into a data frame. The map_dfr() function from the purrr package is highly useful with regular lists, but gives an error with irregular lists.

For instance, following this tutorial, this call works:

library(purrr)
library(repurrrsive) # The data comes from this package


map_dfr(got_chars, magrittr::extract, c("name", "culture", "gender", "id", "born", "alive"))

# A tibble: 30 x 6
   name               culture  gender    id born                                   alive
   <chr>              <chr>    <chr>  <int> <chr>                                  <lgl>
 1 Theon Greyjoy      Ironborn Male    1022 In 278 AC or 279 AC, at Pyke           TRUE 
 2 Tyrion Lannister   ""       Male    1052 In 273 AC, at Casterly Rock            TRUE 
 3 Victarion Greyjoy  Ironborn Male    1074 In 268 AC or before, at Pyke           TRUE 
 4 Will               ""       Male    1109 ""                                     FALSE
 5 Areo Hotah         Norvoshi Male    1166 In 257 AC or before, at Norvos         TRUE 
 6 Chett              ""       Male    1267 At Hag's Mire                          FALSE
 7 Cressen            ""       Male    1295 In 219 AC or 220 AC                    FALSE
 8 Arianne Martell    Dornish  Female   130 In 276 AC, at Sunspear                 TRUE 
 9 Daenerys Targaryen Valyrian Female  1303 In 284 AC, at Dragonstone              TRUE 
10 Davos Seaworth     Westeros Male    1319 In 260 AC or before, at King's Landing TRUE 
# … with 20 more rows

However, if an element is removed from one of the list entries (here, "gender" from the first character), the function fails.

got_chars[[1]]["gender"]<-NULL
map_dfr(got_chars, magrittr::extract, c("name", "culture", "gender", "id", "born", "alive"))

#Error: Argument 3 is a list, must contain atomic vectors

The desired output would be an NA value for the missing element. What would an elegant solution be? I suspect the solution involves purrr::possibly(), but I haven't figured it out yet.


Solution 1:^[1]

One way is to define a partial()ly-specified pluck() that extracts a name of interest, returning NA if it's missing. Pass the modified pluck() to a double-map, with the inner map traversing the names to extract and the outer map traversing your got_chars list:

v <- set_names(c("name", "culture", "gender", "id", "born", "alive"))
map_dfr( got_chars, ~map(v, partial(pluck, .x, .default=NA)) )
# # A tibble: 30 x 6
#    name             culture  gender    id born                             alive
#    <chr>            <chr>    <chr>  <int> <chr>                            <lgl>
#  1 Theon Greyjoy    Ironborn NA      1022 In 278 AC or 279 AC, at Pyke     TRUE 
#  2 Tyrion Lannister ""       Male    1052 In 273 AC, at Casterly Rock      TRUE 
#  3 Victarion Greyj… Ironborn Male    1074 In 268 AC or before, at Pyke     TRUE 
#  4 Will             ""       Male    1109 ""                               FALSE
#  5 Areo Hotah       Norvoshi Male    1166 In 257 AC or before, at Norvos   TRUE 
#  6 Chett            ""       Male    1267 At Hag's Mire                    FALSE
#  7 Cressen          ""       Male    1295 In 219 AC or 220 AC              FALSE
#  8 Arianne Martell  Dornish  Female   130 In 276 AC, at Sunspear           TRUE 
#  9 Daenerys Targar… Valyrian Female  1303 In 284 AC, at Dragonstone        TRUE 
# 10 Davos Seaworth   Westeros Male    1319 In 260 AC or before, at King's … TRUE 
# # … with 20 more rows

To clarify, .x iterates over got_chars because it lives inside a lambda function specified with ~, so it corresponds to the outer map. The function for the inner map is specified with partial(), which attaches the currently looked-at got_chars element (i.e., the .x) as the first argument to pluck(). The modified pluck() then accepts the name to extract as its (new) first argument, so it can be passed to the inner map as-is, without any extra ~ needed.
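
To see the partial() step in isolation, here is a minimal sketch on a single toy record. The names rec and pluck_rec are made up for illustration, and the record only stands in for one element of got_chars:

```r
library(purrr)

# Toy record standing in for one element of got_chars (hypothetical data)
rec <- list(name = "Theon Greyjoy", id = 1022L)

# Pre-fill pluck()'s first argument and its .default;
# the resulting function only needs the name to extract
pluck_rec <- partial(pluck, rec, .default = NA)

pluck_rec("name")    # field present: its value is returned
pluck_rec("gender")  # field absent: falls back to .default

# The inner map from the solution, applied to this one record
v <- set_names(c("name", "gender", "id"))
row <- map(v, pluck_rec)
str(row)
```

Because v is named by set_names(), the inner map returns a named list, which is what lets map_dfr() turn each record into one row.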

Solution 2:^[2]

One inherent problem is the behavior of [ (or its alias magrittr::extract) in the absence of the element we're trying to extract:

list(a = 1)["b"]
# $<NA>
# NULL

magrittr::extract(list(a = 1), "b")
# $<NA>
# NULL
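
As a side note on the possibly() idea from the question: possibly() only substitutes its fallback when the wrapped function throws an error, and [ does not throw for a missing name. A small sketch on a toy list (safe_extract is a hypothetical name), which suggests that wrapping the extraction alone would not produce the NA:

```r
library(purrr)

# Wrap `[` so that an error would be replaced by NA
safe_extract <- possibly(`[`, otherwise = NA)

# The name "b" is missing, but `[` raises no error,
# so the `otherwise` value is never used
res <- safe_extract(list(a = 1), "b")
str(res)
```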

We could define:

extract_if_present <- function(x, y) {
  x[intersect(y, names(x))]
}

that behaves like:

extract_if_present(list(a = 1), "b")
# named list()

Then row-binding with missing elements "just works":

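# got_chars with "gender" removed from the first entry
got_chars_mutilated <- got_chars
got_chars_mutilated[[1]]["gender"] <- NULL

map_dfr(
  got_chars_mutilated,
  extract_if_present,
  c("name", "culture", "gender", "id", "born", "alive")
)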
# # A tibble: 30 x 6
#    name               culture     id born                                   alive gender
#    <chr>              <chr>    <int> <chr>                                  <lgl> <chr> 
#  1 Theon Greyjoy      Ironborn  1022 In 278 AC or 279 AC, at Pyke           TRUE  NA    
#  2 Tyrion Lannister   ""        1052 In 273 AC, at Casterly Rock            TRUE  Male  
#  3 Victarion Greyjoy  Ironborn  1074 In 268 AC or before, at Pyke           TRUE  Male  
#  4 Will               ""        1109 ""                                     FALSE Male  
#  5 Areo Hotah         Norvoshi  1166 In 257 AC or before, at Norvos         TRUE  Male  
#  6 Chett              ""        1267 At Hag's Mire                          FALSE Male  
#  7 Cressen            ""        1295 In 219 AC or 220 AC                    FALSE Male  
#  8 Arianne Martell    Dornish    130 In 276 AC, at Sunspear                 TRUE  Female
#  9 Daenerys Targaryen Valyrian  1303 In 284 AC, at Dragonstone              TRUE  Female
# 10 Davos Seaworth     Westeros  1319 In 260 AC or before, at King's Landing TRUE  Male  
# # … with 20 more rows

Note that the column order is no longer the requested one: it depends on the order of the rows and on which elements each row is missing.
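
If the requested column order matters, one possible fix is to subset the result by the vector of names afterwards. This is only a sketch on toy data (chars and wanted are hypothetical names), and it assumes every requested field appears in at least one record:

```r
library(purrr)

# Toy records standing in for got_chars; the first one lacks "gender"
chars <- list(
  list(name = "Theon Greyjoy", id = 1022L),
  list(name = "Tyrion Lannister", gender = "Male", id = 1052L)
)
wanted <- c("name", "gender", "id")

extract_if_present <- function(x, y) {
  x[intersect(y, names(x))]
}

out <- map_dfr(chars, extract_if_present, wanted)
names(out)  # order follows first appearance across rows

# Restore the requested column order
out <- out[, wanted]
names(out)
```

If a field is absent from every record, the subsetting step would fail, so that case needs separate handling.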

Solution 3:^[3]

Love that tutorial! At the end of the tutorial the author says:

When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.

You can use this more verbose approach and set the default to NA for any element that may be missing:

got_chars %>% {
  tibble(
    name = map_chr(., "name"),
    culture = map_chr(., "culture"),
    gender = map_chr(., "gender", .default = NA),
    id = map_chr(., "id"),
    born = map_chr(., "born"),
    alive = map_chr(., "alive")
  )
}
# # A tibble: 30 x 6
#    name               culture    gender id    born                                     alive
#    <chr>              <chr>      <chr>  <chr> <chr>                                    <chr>
#  1 Theon Greyjoy      "Ironborn" NA     1022  "In 278 AC or 279 AC, at Pyke"           TRUE 
#  2 Tyrion Lannister   ""         Male   1052  "In 273 AC, at Casterly Rock"            TRUE 
#  3 Victarion Greyjoy  "Ironborn" Male   1074  "In 268 AC or before, at Pyke"           TRUE 
#  4 Will               ""         Male   1109  ""                                       FALSE
#  5 Areo Hotah         "Norvoshi" Male   1166  "In 257 AC or before, at Norvos"         TRUE 
#  6 Chett              ""         Male   1267  "At Hag's Mire"                          FALSE
#  7 Cressen            ""         Male   1295  "In 219 AC or 220 AC"                    FALSE
#  8 Arianne Martell    "Dornish"  Female 130   "In 276 AC, at Sunspear"                 TRUE 
#  9 Daenerys Targaryen "Valyrian" Female 1303  "In 284 AC, at Dragonstone"              TRUE 
# 10 Davos Seaworth     "Westeros" Male   1319  "In 260 AC or before, at King's Landing" TRUE 
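
In the output above every column is character, because map_chr() is used throughout. In the spirit of the tutorial's advice to specify types explicitly, a variant could pick a typed mapper per column. This is a sketch on toy data (chars and typed are hypothetical names), not the answer's original code:

```r
library(purrr)
library(tibble)

# Toy records standing in for got_chars; the first one lacks "gender"
chars <- list(
  list(name = "Theon Greyjoy", id = 1022L, alive = TRUE),
  list(name = "Will", gender = "Male", id = 1109L, alive = FALSE)
)

# One typed mapper per column, each with a default of the matching NA type
typed <- tibble(
  name   = map_chr(chars, "name",   .default = NA_character_),
  gender = map_chr(chars, "gender", .default = NA_character_),
  id     = map_int(chars, "id",     .default = NA_integer_),
  alive  = map_lgl(chars, "alive",  .default = NA)
)
typed
```

This keeps id as an integer and alive as a logical, at the cost of spelling out the type of each column.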

Solution 4:^[4]

All of these options are pretty fast, but if speed is an issue, here are the benchmarks.


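library(purrr)
library(repurrrsive)
library(tidyr)    # tibble(), as_tibble(), unnest_auto()
library(ggplot2)  # autoplot()

bm <- microbenchmark::microbenchmark(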
  jennybryan1 = {
    got_chars_mutilated <- got_chars
    got_chars_mutilated[[1]]["gender"] <- NULL
    tibble(got = got_chars_mutilated) %>% 
      unnest_auto(got)
  },
  jennybryan2 = {
    c("name", "culture", "gender", "id", "born", "alive") %>% 
      set_names() %>% 
      map(~ map(got_chars_mutilated, .x, .default = NA)) %>%
      map(simplify) %>% 
      as_tibble()
    },
  ArtemSokolov = {
    v <- set_names(c("name", "culture", "gender", "id", "born", "alive"))
    map_dfr( got_chars, ~map(v, partial(pluck, .x, .default=NA)) )
    },
  Aurèle = {
    extract_if_present <- function(x, y) {
      x[intersect(y, names(x))]
    }
    map_dfr(
      got_chars,
      extract_if_present,
      c("name", "culture", "gender", "id", "born", "alive")
    )
  },
  jeffs = {
    got_chars %>% {
      tibble(
        name = map_chr(., "name"),
        culture = map_chr(., "culture"),
        gender = map_chr(., "gender", .default = NA),
        id = map_chr(., "id"),
        born = map_chr(., "born"),
        alive = map_chr(., "alive")
      )
    } 
  }, 
  times=1000L
)
autoplot(bm)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	Aurèle
Solution 3	Jeff Parker
Solution 4
