How to iterate over a nested list in R and assign values to a data.frame
I am trying to parse all sitemaps in a sitemap index. I was able to create an object x that holds all three sitemaps from the index.
I can create a separate object for each nested XML document and then rbind() them together, but I believe a function would be easier. I tried writing a for loop and using sapply, but I get an error because I am passing a list of lists into sapply.
My aim is to take all of the xml_children and put them into one data frame; doing it my way for a list of 50 XML documents would be very tedious.
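library(xml2)   # read_xml(), xml_children()
library(dplyr)  # %>%, rename()
# xml_to_dataframe() is the custom helper defined further down; define it first
sitemap_index <- read_xml("https://www.bodystore.com/sitemap_index.xml")
sitemap_urls <- xml_children(sitemap_index) %>% xml_to_dataframe() %>% rename(url = loc)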
x contains the parsed sitemaps read from all the URLs in the sitemap index
x <- lapply(sitemap_urls$url, read_xml)
#creating an empty dataframe
all_sitemaps <- data.frame()
#saving each part of the list
x1 <- x[[1]] %>% xml_children() %>% xml_to_dataframe()
x2 <- x[[2]] %>% xml_children() %>% xml_to_dataframe()
x3 <- x[[3]] %>% xml_children() %>% xml_to_dataframe()
all_sitemaps <- rbind(x1,x2,x3)
xml_to_dataframe is a custom function that parses xml into a dataframe
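xml_to_dataframe <- function(nodeset){
  # inherits() is safer than class() != ..., which breaks for objects with several classes
  if(!inherits(nodeset, 'xml_nodeset')){
    stop('Input should be "xml_nodeset" class')
  }
  # One named list per node: child tag names become column names
  lst <- lapply(nodeset, function(x){
    tmp <- xml2::xml_text(xml2::xml_children(x))
    names(tmp) <- xml2::xml_name(xml2::xml_children(x))
    return(as.list(tmp))
  })
  # rbind.fill() pads columns that are missing in some rows with NA
  result <- do.call(plyr::rbind.fill, lapply(lst, function(x)
    as.data.frame(x, stringsAsFactors = FALSE)))
  return(dplyr::as_tibble(result))
}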
Thank you very much for help
Solution 1:[1]
A combination of lapply, rbind, and do.call should work (all base functions):
# From URL to data.frame
fun_smu2df <- function(smu)
{
  # read_xml() and xml_children() come from the xml2 package
  xml_to_dataframe(
    xml_children(
      read_xml(
        smu
      )))
}
# Stack the list of data.frames
all_sitemaps <- do.call(rbind, lapply(sitemap_urls$url, fun_smu2df))
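To try the same lapply() plus do.call(rbind, ...) pattern without touching the network, here is a minimal sketch. It assumes the xml2 package is installed; the two in-memory sitemaps and the helper names node_to_row and sitemap_to_df are made up for illustration and stand in for the real downloads and for xml_to_dataframe.

```r
library(xml2)

# Two tiny in-memory sitemaps standing in for the downloaded ones (made-up data).
sitemap_docs <- c(
  "<urlset><url><loc>https://example.com/a</loc><lastmod>2020-01-01</lastmod></url></urlset>",
  paste0(
    "<urlset>",
    "<url><loc>https://example.com/b</loc><lastmod>2020-01-02</lastmod></url>",
    "<url><loc>https://example.com/c</loc><lastmod>2020-01-03</lastmod></url>",
    "</urlset>"
  )
)

# Hypothetical helper: turn one <url> node into a one-row data.frame,
# using the child tag names as column names.
node_to_row <- function(node) {
  kids <- xml_children(node)
  as.data.frame(as.list(setNames(xml_text(kids), xml_name(kids))),
                stringsAsFactors = FALSE)
}

# Hypothetical helper: parse one sitemap and stack its rows.
sitemap_to_df <- function(src) {
  nodes <- xml_children(read_xml(src))
  do.call(rbind, lapply(nodes, node_to_row))
}

# Same shape as the answer: one data.frame per sitemap, then stack them all.
demo_sitemaps <- do.call(rbind, lapply(sitemap_docs, sitemap_to_df))
print(demo_sitemaps)
```

Note that base rbind() expects every data.frame to have the same columns, which is why the question's xml_to_dataframe uses plyr::rbind.fill for nodes whose children differ.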
Of course, real code should take into account the cases where reading fails, where the XML is malformed, where the result cannot be represented as a data.frame, and so on. A more robust version would be:
# From URL to data.frame
fun_smu2df <- function(smu)
{
  # Step 1: download and parse the XML; on error, warn and skip this URL
  xmlp <- tryCatch(
    expr = read_xml(smu),
    error = identity
  )
  if (inherits(xmlp, "error")) {
    warning("Reading URL \"", as.character(smu), "\" failed.\n previous: ", xmlp$message)
    return(NULL)
  }
  # Step 2: extract the child nodes (xml_children() from xml2)
  xmlc <- tryCatch(
    expr = xml_children(xmlp),
    error = identity
  )
  if (inherits(xmlc, "error")) {
    warning("Reading XML children from URL \"", as.character(smu), "\" failed.\n previous: ", xmlc$message)
    return(NULL)
  }
  # Step 3: convert the nodes to a data.frame
  xdf <- tryCatch(
    expr = xml_to_dataframe(xmlc),
    error = identity
  )
  if (inherits(xdf, "error")) {
    warning("Converting XML children from URL \"", as.character(smu), "\" to \"data.frame\" failed.\n previous: ", xdf$message)
    return(NULL)
  }
  return(xdf)
}
# Stack the list of data.frames
list_sitemaps <- lapply(sitemap_urls$url, fun_smu2df) # This should not bomb
dt_sitemaps <- tryCatch(
  expr = do.call(rbind, list_sitemaps),
  error = identity
)
if (inherits(dt_sitemaps, "error")) {
  warning("Stacking the \"data.frame\" objects failed.\n previous: ", dt_sitemaps$message)
  dt_sitemaps <- NULL
} else if (is.null(dt_sitemaps)) {
  warning("No \"data.frame\" objects were retrieved from the specified URLs.")
}
I know it's not pretty (well, I think it's pretty :-D), but it won't bite you in the ass when a single URL fails for whatever reason.
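If the three near-identical tryCatch blocks feel repetitive, the warn-and-return-NULL step could be factored into one small helper. This is only a sketch in base R: try_step is a hypothetical name, and fake_smu2df and the example.com URLs are stand-ins invented so the demo runs offline.

```r
# Hypothetical helper: evaluate one step; on error, warn and return NULL.
# `expr` is evaluated lazily inside tryCatch(), so its errors are caught here.
try_step <- function(expr, what, smu) {
  tryCatch(
    expr,
    error = function(e) {
      warning(what, " for \"", smu, "\" failed.\n previous: ",
              conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# Stand-in for the real download-and-parse step (made up for this demo):
# it fails for any input containing "bad".
fake_smu2df <- function(smu) {
  if (grepl("bad", smu)) stop("cannot read ", smu)
  data.frame(loc = paste0(smu, "/page"), stringsAsFactors = FALSE)
}

demo_urls <- c("https://example.com/1",
               "https://example.com/bad",
               "https://example.com/3")

# Warnings are suppressed here only to keep the demo output quiet.
demo_list <- suppressWarnings(
  lapply(demo_urls, function(u) try_step(fake_smu2df(u), "Reading sitemap", u))
)

# rbind() skips the NULL left behind by the failed URL.
demo_stacked <- do.call(rbind, demo_list)
print(demo_stacked)
```

The trade-off is that a single wrapper reports one generic message per step, whereas the longer version above can word each failure separately.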
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
