Finding the correct attributes to scrape within a page using rvest
I am trying to scrape the hyperlinks behind a table on a webpage, but selecting the HTML nodes and their attributes returns no results. I don't know whether the data is stored in a meta tag, or how to even check that.
Using SelectorGadget, I think the CSS selector is "td", but I can also see "tr" in the page. In the dev tools I can see the link under the href attribute, but the following code does not return it:
library(rvest)
url = "https://www.firstrand.co.za/investors/debt-investor-centre/jse-listed-instruments/"
read_html(url) %>%
html_nodes(css = "td") %>%
html_nodes(css = "a") %>%
html_attr('href')
Solution 1:[1]
If you look behind the scenes (for example, at the requests the page makes in the browser's dev tools), you will see that the page gets its data from a JSON file. You can read that file directly and reshape it to get the URLs along with all the other information shown on the page.
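To illustrate why the rvest code in the question comes back empty, here is a minimal sketch. The HTML string is a hypothetical stand-in for what the server might send before any script runs; it is not the real page source, and the assumption is simply that the table rows are added later by JavaScript.

```r
library(rvest)

# Hypothetical stand-in for the HTML as served: the table exists,
# but it has no rows yet because a script would fill them in later.
static_html <- '<html><body>
<table class="jse-table"><tbody></tbody></table>
<script>/* rows would be added here from a JSON file after load */</script>
</body></html>'

links <- read_html(static_html) %>%
  html_nodes("td a") %>%
  html_attr("href")

# No <td> or <a> nodes exist in the static markup, so nothing is found
length(links)
```

read_html() only parses the markup it is given and does not execute scripts, so links inserted by JavaScript never reach html_nodes(). Reading the JSON directly, as in the code that follows, sidesteps this.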
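library(tidyverse)
library(jsonlite)

# Read the JSON that feeds the page into a nested R list
l <- read_json("https://www.firstrand.co.za/DI/debtInstruments.json")

df <- l %>%
  enframe() %>%               # top-level list -> tibble with name/value columns
  unnest_longer(value) %>%    # one row per element of each value
  unnest_wider(value) %>%     # named fields of each element -> columns
  mutate(url = paste0("https://www.firstrand.co.za/DI/", fileName))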
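The reshaping steps are easier to follow on a small in-memory list. The sketch below uses a made-up list (the group names, the code field, and the file names are all hypothetical); only the fileName field and the URL prefix come from the answer above. It assumes the JSON is an object whose values are arrays of records, which is the shape this pipeline is suited to, but the real file may differ.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical nested list standing in for the parsed JSON
toy <- list(
  senior = list(
    list(code = "AAA1", fileName = "AAA1 supplement.pdf"),
    list(code = "AAA2", fileName = "AAA2 supplement.pdf")
  ),
  subordinated = list(
    list(code = "BBB1", fileName = "BBB1 supplement.pdf")
  )
)

toy_df <- toy %>%
  enframe() %>%               # two rows: one per top-level group
  unnest_longer(value) %>%    # three rows: one per record
  unnest_wider(value) %>%     # record fields become the columns code, fileName
  mutate(url = paste0("https://www.firstrand.co.za/DI/", fileName))

toy_df
```

Each step changes only one dimension of the data: enframe() makes the tibble, unnest_longer() adds rows, and unnest_wider() adds columns.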
Solution 2:[2]
Here's a partial answer.
The href values can be extracted with RSelenium, but they come back as relative paths and still need a regex clean-up to become working URLs.
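library(RSelenium)
library(rvest)  # provides read_html(), html_nodes(), html_attr() and %>%

# Start a Selenium-driven Firefox session
driver = rsDriver(
  port = 4847L,
  browser = "firefox")
remDr <- driver[["client"]]

url = "https://www.firstrand.co.za/investors/debt-investor-centre/jse-listed-instruments/"
remDr$navigate(url)

# Parse the rendered page source, then pull the links out of the table
href = remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.jse-table') %>%
  html_nodes('a') %>%
  html_attr('href')
href = unique(href)
head(href)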
[1] "../../../DI/FRB23 Pricing Supplement 20170920.pdf" "../../../DI/APS - FRB22 - 08.12.2016.pdf"
[3] "../../../DI/FRB28 Pricing Supplement 02122020 Amended.pdf" "../../../DI/FRB24 Amended Pricing Supplement 13042021.pdf"
[5] "../../../DI/FRB25 Amended Pricing Supplement 13042021 Tranche 2.pdf" "../../../DI/FRB25 Amended Pricing Supplement 13042021 Tranche 3.pdf"
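One possible version of the remaining clean-up is sketched below. It assumes that the leading "../" segments resolve to the site root (the page sits three directories deep, and the result matches the "https://www.firstrand.co.za/DI/" prefix used in Solution 1), and that spaces in the file names should be percent-encoded. The names site_root, absolute and full_url are illustrative, and the resulting links have not been checked against the live site.

```r
# First three values from the head(href) output above
href <- c(
  "../../../DI/FRB23 Pricing Supplement 20170920.pdf",
  "../../../DI/APS - FRB22 - 08.12.2016.pdf",
  "../../../DI/FRB28 Pricing Supplement 02122020 Amended.pdf"
)

# Assumption: the run of "../" at the start resolves to the site root
site_root <- "https://www.firstrand.co.za/"

# Replace the leading "../" segments with the site root
absolute <- sub("^(\\.\\./)+", site_root, href)

# Percent-encode spaces; reserved characters such as ":" and "/" are kept
full_url <- vapply(absolute, utils::URLencode, character(1), USE.NAMES = FALSE)

full_url
```

The regex is anchored at the start of the string, so it only strips the relative prefix and leaves the rest of the path alone.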
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mkpt_uk |
| Solution 2 | Nad Pat |

