Extracting data from irregular lists using purrr::map()
Given a list with several elements, the goal is to get them into a data frame. The map_dfr() function from the purrr package works well with regular lists, but gives an error with irregular lists.
For instance, the following, taken from a tutorial, works:
library(purrr)
library(repurrrsive) # The data comes from this package
map_dfr(got_chars, magrittr::extract, c("name", "culture", "gender", "id", "born", "alive"))
# A tibble: 30 x 6
name culture gender id born alive
<chr> <chr> <chr> <int> <chr> <lgl>
1 Theon Greyjoy Ironborn Male 1022 In 278 AC or 279 AC, at Pyke TRUE
2 Tyrion Lannister "" Male 1052 In 273 AC, at Casterly Rock TRUE
3 Victarion Greyjoy Ironborn Male 1074 In 268 AC or before, at Pyke TRUE
4 Will "" Male 1109 "" FALSE
5 Areo Hotah Norvoshi Male 1166 In 257 AC or before, at Norvos TRUE
6 Chett "" Male 1267 At Hag's Mire FALSE
7 Cressen "" Male 1295 In 219 AC or 220 AC FALSE
8 Arianne Martell Dornish Female 130 In 276 AC, at Sunspear TRUE
9 Daenerys Targaryen Valyrian Female 1303 In 284 AC, at Dragonstone TRUE
10 Davos Seaworth Westeros Male 1319 In 260 AC or before, at King's Landing TRUE
# … with 20 more rows
However, if an element is removed from the list, the function fails.
got_chars[[1]]["gender"]<-NULL
map_dfr(got_chars, magrittr::extract, c("name", "culture", "gender", "id", "born", "alive"))
#Error: Argument 3 is a list, must contain atomic vectors
The desired output would be an NA value for the missing element. What would an elegant solution be? I suspect the solution involves purrr::possibly(), but I haven't figured it out yet.
Solution 1:[1]
One way is to define a partial()ly-specified pluck() that extracts a name of interest, returning NA if it's missing. Pass the modified pluck() to a double-map, with the inner map traversing the names to extract and the outer map traversing your got_chars list:
v <- set_names(c("name", "culture", "gender", "id", "born", "alive"))
map_dfr( got_chars, ~map(v, partial(pluck, .x, .default=NA)) )
# # A tibble: 30 x 6
# name culture gender id born alive
# <chr> <chr> <chr> <int> <chr> <lgl>
# 1 Theon Greyjoy Ironborn NA 1022 In 278 AC or 279 AC, at Pyke TRUE
# 2 Tyrion Lannister "" Male 1052 In 273 AC, at Casterly Rock TRUE
# 3 Victarion Greyj… Ironborn Male 1074 In 268 AC or before, at Pyke TRUE
# 4 Will "" Male 1109 "" FALSE
# 5 Areo Hotah Norvoshi Male 1166 In 257 AC or before, at Norvos TRUE
# 6 Chett "" Male 1267 At Hag's Mire FALSE
# 7 Cressen "" Male 1295 In 219 AC or 220 AC FALSE
# 8 Arianne Martell Dornish Female 130 In 276 AC, at Sunspear TRUE
# 9 Daenerys Targar… Valyrian Female 1303 In 284 AC, at Dragonstone TRUE
# 10 Davos Seaworth Westeros Male 1319 In 260 AC or before, at King's … TRUE
# # … with 20 more rows
To clarify, .x iterates over got_chars because it lives inside a lambda function specified with ~, so it corresponds to the outer map. The function for the inner map is specified with partial(), which attaches the currently looked-at got_chars element (i.e., the .x) as the first argument to pluck(). The modified pluck() then accepts the name to extract as its (new) first argument, so it can be passed to the inner map as-is, without any extra ~ needed.
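As a minimal sketch of the same mechanics on toy data (the `chars` list and the `pluck_b` helper below are hypothetical, not part of repurrrsive), partial() pre-fills the record and the default, leaving only the field name to supply:

```r
library(purrr)

# Hypothetical toy data: the second record has no "gender" field
chars <- list(
  list(name = "A", gender = "Female", id = 1L),
  list(name = "B", id = 2L)
)

# Self-named vector, so the inner map returns a named list
fields <- set_names(c("name", "gender", "id"))

# pluck() with the record and the default already filled in;
# the only argument left to pass is the field name
pluck_b <- partial(pluck, chars[[2]], .default = NA)
pluck_b("name")    # field is present
pluck_b("gender")  # field is missing, so the default is returned

# The same idea for every record, written with explicit functions
# instead of ~ and partial(): one named list per record
rows <- map(chars, function(rec) {
  map(fields, function(f) pluck(rec, f, .default = NA))
})
```

Each element of `rows` now has the same three names, which is what lets a row-binding step line the records up.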
Solution 2:[2]
One inherent problem is the behavior of [ (or its alias magrittr::extract) in the absence of the element we're trying to extract:
list(a = 1)["b"]
# $<NA>
# NULL
magrittr::extract(list(a = 1), "b")
# $<NA>
# NULL
We could define:
extract_if_present <- function(x, y) {
x[intersect(y, names(x))]
}
that behaves like:
extract_if_present(list(a = 1), "b")
# named list()
Then row-binding with missing elements "just works" (got_chars_mutilated is got_chars with "gender" removed from the first element):
map_dfr(
got_chars_mutilated,
extract_if_present,
c("name", "culture", "gender", "id", "born", "alive")
)
# # A tibble: 30 x 6
# name culture id born alive gender
# <chr> <chr> <int> <chr> <lgl> <chr>
# 1 Theon Greyjoy Ironborn 1022 In 278 AC or 279 AC, at Pyke TRUE NA
# 2 Tyrion Lannister "" 1052 In 273 AC, at Casterly Rock TRUE Male
# 3 Victarion Greyjoy Ironborn 1074 In 268 AC or before, at Pyke TRUE Male
# 4 Will "" 1109 "" FALSE Male
# 5 Areo Hotah Norvoshi 1166 In 257 AC or before, at Norvos TRUE Male
# 6 Chett "" 1267 At Hag's Mire FALSE Male
# 7 Cressen "" 1295 In 219 AC or 220 AC FALSE Male
# 8 Arianne Martell Dornish 130 In 276 AC, at Sunspear TRUE Female
# 9 Daenerys Targaryen Valyrian 1303 In 284 AC, at Dragonstone TRUE Female
# 10 Davos Seaworth Westeros 1319 In 260 AC or before, at King's Landing TRUE Male
# # … with 20 more rows
The column order is no longer the requested one: it depends on the order of the rows and on which elements each row is missing.
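If the original column order matters, one possible fix is to reindex the result by the requested names afterwards. This is a small sketch on made-up data, assuming dplyr is installed (map_dfr() needs it); the names `chars` and `wanted` are hypothetical:

```r
library(purrr)

# Hypothetical toy data: the first record has no "gender" field
chars <- list(
  list(name = "A", id = 1L),
  list(name = "B", gender = "Male", id = 2L)
)
wanted <- c("name", "gender", "id")

extract_if_present <- function(x, y) {
  x[intersect(y, names(x))]
}

# Columns appear in the order they are first seen across rows
res <- map_dfr(chars, extract_if_present, wanted)

# Restore the requested order; intersect() skips any field
# that was missing from every record
res <- res[intersect(wanted, names(res))]
```

Using intersect() rather than indexing by `wanted` directly avoids an error when a requested field is absent from every record.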
Solution 3:[3]
Love that tutorial! At the end of the tutorial the author says:
When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.
You can use the more verbose way and set the default to NA explicitly:
library(tibble) # for tibble()
got_chars %>% {
tibble(
name = map_chr(., "name"),
culture = map_chr(., "culture"),
gender = map_chr(., "gender", .default = NA_character_),
id = map_chr(., "id"),
born = map_chr(., "born"),
alive = map_chr(., "alive")
)
}
# # A tibble: 30 x 6
# name culture gender id born alive
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Theon Greyjoy "Ironborn" NA 1022 "In 278 AC or 279 AC, at Pyke" TRUE
# 2 Tyrion Lannister "" Male 1052 "In 273 AC, at Casterly Rock" TRUE
# 3 Victarion Greyjoy "Ironborn" Male 1074 "In 268 AC or before, at Pyke" TRUE
# 4 Will "" Male 1109 "" FALSE
# 5 Areo Hotah "Norvoshi" Male 1166 "In 257 AC or before, at Norvos" TRUE
# 6 Chett "" Male 1267 "At Hag's Mire" FALSE
# 7 Cressen "" Male 1295 "In 219 AC or 220 AC" FALSE
# 8 Arianne Martell "Dornish" Female 130 "In 276 AC, at Sunspear" TRUE
# 9 Daenerys Targaryen "Valyrian" Female 1303 "In 284 AC, at Dragonstone" TRUE
# 10 Davos Seaworth "Westeros" Male 1319 "In 260 AC or before, at King's Landing" TRUE
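Along the lines of that quote, a typed variant could pick the mapper that matches each field. The sketch below uses a hypothetical two-record list rather than got_chars, and assumes id is an integer and alive a logical, as the tibble output in the question shows:

```r
library(purrr)
library(tibble)

# Hypothetical toy data: the first record has no "gender" field
chars <- list(
  list(name = "A", id = 1L, alive = TRUE),
  list(name = "B", gender = "Male", id = 2L, alive = FALSE)
)

typed <- tibble(
  name   = map_chr(chars, "name"),
  # a typed NA keeps the column character even when the field is missing
  gender = map_chr(chars, "gender", .default = NA_character_),
  id     = map_int(chars, "id"),
  alive  = map_lgl(chars, "alive")
)
```

Compared with map_chr() everywhere, this keeps id and alive in their original types instead of converting them to character.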
Solution 4:[4]
All of these options are pretty fast, but if speed is an issue, here are the benchmarks.
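library(tidyr)   # unnest_auto()
library(tibble)  # tibble(), as_tibble()
library(ggplot2) # autoplot()
bm <- microbenchmark::microbenchmark(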
jennybryan1 = {
got_chars_mutilated <- got_chars
got_chars_mutilated[[1]]["gender"] <- NULL
tibble(got = got_chars_mutilated) %>%
unnest_auto(got)
},
jennybryan2 = {
c("name", "culture", "gender", "id", "born", "alive") %>%
set_names() %>%
map(~ map(got_chars_mutilated, .x, .default = NA)) %>%
map(simplify) %>%
as_tibble()
},
ArtemSokolov = {
v <- set_names(c("name", "culture", "gender", "id", "born", "alive"))
map_dfr( got_chars, ~map(v, partial(pluck, .x, .default=NA)) )
},
Aurèle = {
extract_if_present <- function(x, y) {
x[intersect(y, names(x))]
}
map_dfr(
got_chars,
extract_if_present,
c("name", "culture", "gender", "id", "born", "alive")
)
},
jeffs = {
got_chars %>% {
tibble(
name = map_chr(., "name"),
culture = map_chr(., "culture"),
gender = map_chr(., "gender", .default = NA_character_),
id = map_chr(., "id"),
born = map_chr(., "born"),
alive = map_chr(., "alive")
)
}
},
times=1000L
)
autoplot(bm)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Aurèle |
| Solution 3 | Jeff Parker |
| Solution 4 | |

