Webscraping - unable to get the full content of the page with R
I'm trying to webscrape the job ads from this page: https://con.arbeitsagentur.de/prod/jobboerse/jobsuche-ui/?was=Soziologie%20(grundst%C3%A4ndig)%20(weiterf%C3%BChrend)&wo=&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&page=1&size=50&aktualitaet=100
However, I'm unable to get the information from the individual job ads. I tried rvest, xml2 and V8, but I'm a beginner in web scraping and can't solve this problem. It seems that the page behind the link doesn't contain the information about the individual job ads, so navigating with XPath doesn't work properly.
Does anyone have an idea how to solve this?
Thanks :)
Solution 1:[1]
I have been able to extract the job descriptions with the following code:
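library(RSelenium)

# Start a Selenium server with Firefox in Docker.
# shell() exists only on Windows; on Linux/macOS use system() instead.
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')

# Connect to the Selenium server and open the search page
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Soziologie%20(grundst%C3%A4ndig)%20(weiterf%C3%BChrend)&id=10000-1189146489-S")
Sys.sleep(10)  # give the browser time to render the result list

# Collect the result entries and read the link to each job ad
list_Button <- remDr$findElements("class name", "ergebnisliste-item")
Sys.sleep(3)
list_Link_Job_Descriptions <- lapply(X = list_Button, FUN = function(x) x$getElementAttribute("href"))

# Visit every job ad and store the text of its description
list_Text_Job_Description <- list()
for (i in seq_along(list_Link_Job_Descriptions)) {  # seq_along() is safe when no links were found
  print(i)
  remDr$navigate(list_Link_Job_Descriptions[[i]][[1]])
  Sys.sleep(1)  # wait for the detail page to render
  web_Obj2 <- remDr$findElement("id", "jobdetails-beschreibung")
  list_Text_Job_Description[[i]] <- web_Obj2$getElementText()
}

remDr$close()  # end the browser session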
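The fixed Sys.sleep() calls above only work if the page happens to finish rendering within that time. As a rough sketch of an alternative, the helper below polls until an element lookup succeeds or a timeout expires. Note that wait_for is a hypothetical helper written for this illustration, not part of RSelenium, and the default timeout and interval are arbitrary assumptions.

```r
# Hypothetical helper (not part of RSelenium): call `fn` repeatedly until it
# returns a non-empty result, or stop with an error after `timeout` seconds.
wait_for <- function(fn, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElement() raises an error when nothing matches; treat that as "not yet"
    result <- tryCatch(fn(), error = function(e) NULL)
    # findElements() returns an empty list when nothing matches
    is_empty <- is.null(result) || (is.list(result) && length(result) == 0)
    if (!is_empty) return(result)
    if (Sys.time() >= deadline) stop("Timed out waiting for the page to render")
    Sys.sleep(interval)
  }
}

# Assumed usage with the remDr session from the code above:
# list_Button <- wait_for(function() remDr$findElements("class name", "ergebnisliste-item"))
# web_Obj2 <- wait_for(function() remDr$findElement("id", "jobdetails-beschreibung"))
```

With this approach the script continues as soon as the elements exist instead of always waiting the full sleep time, and it fails with a clear message if they never appear.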
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |