How to save multiple matches in one column? rvest, R and stringr
This question is a follow-up to an earlier Stack Overflow question.
I have two example HTML files: url1.html and url2.html.
url3.html is another example, with more IPC entries.
url2.html contains no field (51) information, while url1.html does.
I'm using this code in R:
library(rvest)
library(tidyverse)
library(stringr)

# One row per saved page; each page is read into a single string
x <- data.frame(
  URL = c(1:2),
  page = c(paste(readLines("url1.html"), collapse = "\n"),
           paste(readLines("url2.html"), collapse = "\n"))
)

# First IPC entry (div id "classificacao0")
for (i in 1:nrow(x)) {
  html <- x$page[i] %>% unclass() %>% unlist()
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    html_nodes(xpath = '//div[@id="classificacao0"]') %>%
    html_text(trim = TRUE) %>%
    str_replace_all(., "[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> tmp
  if (length(tmp) == 0) tmp <- "ND"
  x$ipc_0[i] <- tmp %>% str_replace_all(., "\\s+", " ") %>% str_replace_all(., " \\)", "\\)")
}

# Second IPC entry (div id "classificacao1")
for (i in 1:nrow(x)) {
  html <- x$page[i] %>% unclass() %>% unlist()
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    html_nodes(xpath = '//div[@id="classificacao1"]') %>%
    html_text(trim = TRUE) %>%
    str_replace_all(., "[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> tmp
  if (length(tmp) == 0) tmp <- "ND"
  x$ipc_1[i] <- tmp %>% str_replace_all(., "\\s+", " ") %>% str_replace_all(., " \\)", "\\)")
}
Result: partially correct
Desired result: create a new data frame with the following structure.
| URL | IPC |
|---|---|
| 1 | B62B 1/16 (1968.09)... |
| 1 | B62B 1/00 (1968.09)... |
| 2 | ND |
Problem: Some URLs contain field (51) and others do not. When field (51) is present, the page can contain n divs whose ids follow the same pattern as xpath='//div[@id="classificacao0"]', with the number at the end of the classificacao id running from 0 to n. How can I optimize this code to capture the necessary information without writing a separate for (variable in vector) loop for each n?
Any idea how to solve this problem?
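One possible direction, offered as a minimal sketch and not a tested solution for the real pages: match every div whose id begins with "classificacao" in a single XPath query using starts-with(), then return one row per match. The helper name extract_ipc and the two inline pages are hypothetical stand-ins for url1.html and url2.html. The sketch also swaps the two whitespace replacements for stringr::str_squish(), which collapses runs of whitespace to a single space instead of deleting newlines outright. Note that an XPath beginning with // searches from the document root, so the table[2] step in the original loops does not narrow the search and is omitted here.

```r
library(rvest)
library(stringr)

# Hypothetical helper: return every IPC entry found in one HTML string,
# or "ND" when the page has no div whose id starts with "classificacao".
extract_ipc <- function(html) {
  ipc <- read_html(html, encoding = "ISO-8859-1") %>%
    html_elements(xpath = '//div[starts-with(@id, "classificacao")]') %>%
    html_text(trim = TRUE) %>%
    str_squish() %>%                 # collapse newlines, tabs and repeated spaces
    str_replace_all(" \\)", ")")     # drop the space before a closing parenthesis
  if (length(ipc) == 0) "ND" else ipc
}

# Made-up markup standing in for url1.html and url2.html (assumption).
pages <- c(
  '<html><body><div id="classificacao0">B62B 1/16 (1968.09 )</div><div id="classificacao1">B62B 1/00\n (1968.09)</div></body></html>',
  '<html><body><p>no classification</p></body></html>'
)

# One row per match; URL is recycled across all matches from the same page.
result <- do.call(rbind, lapply(seq_along(pages), function(i) {
  data.frame(URL = i, IPC = extract_ipc(pages[i]))
}))

print(result)
```

With the real files, pages would be the x$page column from the question. Because the result is in long format, the number of classificacao divs per page does not need to be known in advance.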
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow

