Scraping a website with getURL() returns a string of URLs, not the website content. How do I get the contents of the site? (RStudio, Windows 10)
I am completely new to scraping and am working on a Windows 10 PC. I am trying to run this code from class to scrape the content of the party platforms from the URLs below:
library(RCurl)  # getURL() comes from the RCurl package

years <- c(1968, 1972, 1976)
urlsR <- paste("https://maineanencyclopedia.com/republican-party-platform-",
               years, "/", sep = "")
urlsD <- paste("https://maineanencyclopedia.com/democratic-party-platform-",
               years, "/", sep = "")
urls <- c(urlsR, urlsD)
scraped_platforms <- getURL(urls)
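For context, getURL() returns a named character vector with one element per URL; the output in this question shows each element coming back as an empty string, i.e. the requests are failing silently. A minimal, network-free illustration of that return shape (setNames() is base R; the two URLs are copied from above):

```r
# Illustration only: the shape getURL() hands back when every request fails.
failed <- setNames(
  rep("", 2),
  c("https://maineanencyclopedia.com/republican-party-platform-1968/",
    "https://maineanencyclopedia.com/republican-party-platform-1972/")
)
failed  # empty strings, named by URL -- matching the output shown below
```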
When I print scraped_platforms, the result is what is shown below rather than the content of the party platforms from the website:
https://maineanencyclopedia.com/republican-party-platform-1968/
""
https://maineanencyclopedia.com/republican-party-platform-1972/
""
https://maineanencyclopedia.com/republican-party-platform-1976/
""
https://maineanencyclopedia.com/democratic-party-platform-1968/
""
https://maineanencyclopedia.com/democratic-party-platform-1972/
""
https://maineanencyclopedia.com/democratic-party-platform-1976/
""
I've seen that Windows 10 might be incompatible with getURL (see: How to get getURL to work on R on Windows 10? [tlsv1 alert protocol version]). Even after looking online, though, I'm still unclear on how to fix my specific code.
List of links used here:
https://maineanencyclopedia.com/republican-party-platform-1968/
https://maineanencyclopedia.com/republican-party-platform-1972/
https://maineanencyclopedia.com/republican-party-platform-1976/
https://maineanencyclopedia.com/democratic-party-platform-1968/
https://maineanencyclopedia.com/democratic-party-platform-1972/
https://maineanencyclopedia.com/democratic-party-platform-1976/
Solution 1:
I don't know the getURL() function, but in R there is a very handy package for scraping: rvest.
You can use your urls object, which holds all the URLs, and loop over them:
library(rvest)
library(dplyr)

# Empty tibble to collect one row per (title, paragraph) pair
df <- tibble(Title = character(), Text = character())

for (url in urls) {
  page <- read_html(url)  # fetch each page once
  t <- page %>% html_nodes(".entry-title") %>% html_text2()  # platform title
  p <- page %>% html_nodes("p") %>% html_text2()             # all paragraphs
  df <- rbind(df, tibble(Title = t, Text = p))  # the title is recycled across paragraphs
}
df
The output is a bit unorganized, but you can adjust the for loop to tidy it up.
Here is also a slightly nicer presentation of the data:
df2 <- df %>%
  group_by(Title) %>%
  slice(-1) %>%                                         # drop the first paragraph of each platform
  mutate(Text_all = paste0(Text, collapse = "\n")) %>%  # join the remaining paragraphs into one string
  dplyr::select(-Text) %>%
  distinct()                                            # one row per platform
df2
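To see what this pipeline does without re-scraping, here is the same group_by/slice/collapse pattern applied to a small made-up tibble (the titles and text below are placeholders, not real platform data):

```r
library(dplyr)

# Placeholder data with the same shape as df: a title row followed by paragraphs
toy <- tibble(
  Title = rep(c("Platform A", "Platform B"), each = 3),
  Text  = c("A title line", "A para 1", "A para 2",
            "B title line", "B para 1", "B para 2")
)

toy2 <- toy %>%
  group_by(Title) %>%
  slice(-1) %>%                                         # drop the first row of each group
  mutate(Text_all = paste0(Text, collapse = "\n")) %>%  # collapse paragraphs into one string
  dplyr::select(-Text) %>%
  distinct()

toy2  # two rows, one per Title, with the paragraphs joined by "\n"
```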
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
