Handling 404 and other bad URLs when reading in R with read_html
Summary: handling errors and bad pages using tryCatch with R's read_html function.
We are using R's read_html function to connect to some NCAA sports websites, and need to identify when a page is faulty. Here are a few example URLs to faulty pages:
- www.newburynighthawks.com (does not exist)
- http://www.clarkepride.com/sports/womens-basketball/roster/2020-21 (404 not found)
- https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19 (not found)
- www.lambuth.edu/athletics/index.html (does not exist)
- https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19 (page not found)
Each of these URLs has its own failure mode when passed to read_html. To handle these issues, I've written a function that uses tryCatch to check the validity of each page:
library(httr)   # for GET() and timeout()
library(rvest)  # for read_html(), html_nodes(), html_text(); re-exports %>%

check_url_validity <- function(this_url) {
  good_url <- FALSE
  # go to url to check for a rosters page
  bad_page_titles <- c('Page Not Found', 'Page not found', '404')
  result <- tryCatch({
    team_page <- this_url %>% GET(., timeout(2)) %>% read_html
    team_page_title <- team_page %>% html_nodes('title') %>% html_text
    team_page_body <- team_page %>% html_nodes('body') %>% html_text
    good_page <- !grepl('Page not found', team_page_title) &&
      !grepl('Page Not Found', team_page_title) &&
      !grepl('404', team_page_title) &&
      team_page_title != "" &&
      !grepl('Error 404', team_page_body)
    if (good_page) { good_url <- TRUE }
  }, error = function(e) { NA })
  return(good_url)
}
Testing this function on the URLs listed above produces the following:
these_urls <- c(
  'www.newburynighthawks.com',
  'http://www.clarkepride.com/sports/womens-basketball/roster/2020-21',
  'https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19',
  'www.lambuth.edu/athletics/index.html',
  'https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19'
)
for (this_url in these_urls) {
  print(check_url_validity(this_url))
}
Some of these pages (http://www.newburynighthawks.com/) are easily identified as bad inside the tryCatch, because there is no page at all. Others (http://www.clarkepride.com/sports/womens-basketball/roster/2020-21) rely on string matching against the body text to detect that the page is bad. The overall issue is that this is a hacky solution: we are dealing with ~1000 different URLs, and we keep adding conditions to the expression that determines whether good_page is TRUE or FALSE. We are currently up to five conditions, most of which use grepl to match phrases like '404' and 'Not Found' in the title and body.
Is there a better solution than string matching for 404 and Not Found in the body, to know that these pages are not good pages?
Solution 1:
The code below does not try to read the page's contents; instead it issues a HEAD request using the httr package. This is faster and returns all the necessary information.
library(httr)

check_url_validity <- function(this_url) {
  r <- tryCatch(
    HEAD(this_url),
    error = function(e) e
  )
  if (inherits(r, "error")) {
    "does not exist"
    # conditionMessage(r)
  } else {
    http_status(r)$reason
  }
}
lapply(urls_vec, check_url_validity)
#[[1]]
#[1] "does not exist"
#
#[[2]]
#[1] "Not Found"
#
#[[3]]
#[1] "Not Found"
#
#[[4]]
#[1] "does not exist"
#
#[[5]]
#[1] "OK"
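One caveat worth noting, which goes beyond the original answer: some servers reject HEAD requests with 403 or 405 even though an equivalent GET would succeed. A hedged sketch of a fallback, assuming the same httr functions; the specific fallback status codes (403, 405) are my assumption, not from the original answer:

```r
library(httr)

# Sketch: fall back to GET when a server rejects the HEAD method.
# The 403/405 fallback codes are an assumption, not from the original answer.
check_url_with_fallback <- function(this_url) {
  r <- tryCatch(HEAD(this_url, timeout(5)), error = function(e) e)
  if (!inherits(r, "error") && status_code(r) %in% c(403, 405)) {
    # Retry with a lightweight GET; some hosts disallow HEAD
    r <- tryCatch(GET(this_url, timeout(5)), error = function(e) e)
  }
  if (inherits(r, "error")) "does not exist" else http_status(r)$reason
}
```

This keeps the same return values as the answer's function, so it can be dropped in as a replacement if HEAD-hostile servers turn up in the ~1000 URLs.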
To return NA/FALSE/TRUE instead, the function below follows the same approach.
check_url_validity2 <- function(this_url) {
  r <- tryCatch(
    HEAD(this_url),
    error = function(e) e
  )
  if (inherits(r, "error")) {
    NA
  } else {
    status_code(r) < 300
  }
}
lapply(urls_vec, check_url_validity2)
#[[1]]
#[1] NA
#
#[[2]]
#[1] FALSE
#
#[[3]]
#[1] FALSE
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] TRUE
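Since the question mentions ~1000 URLs, vapply() may be more convenient than lapply() here: it returns a typed, named logical vector that is easy to filter. This is a sketch under that assumption, reusing check_url_validity2 from above and urls_vec as defined in the Data section below:

```r
library(httr)

# Same function as in the answer above
check_url_validity2 <- function(this_url) {
  r <- tryCatch(HEAD(this_url), error = function(e) e)
  if (inherits(r, "error")) NA else status_code(r) < 300
}

# Named logical vector: one TRUE/FALSE/NA per URL
results <- vapply(urls_vec, check_url_validity2, logical(1))
# Keep only URLs that responded with a success status
good_urls <- names(results)[!is.na(results) & results]
```

The names on the result vector come from urls_vec itself, so the good URLs can be recovered directly without a second lookup.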
Data
urls_vec <- c(
  "www.newburynighthawks.com",
  "http://www.clarkepride.com/sports/womens-basketball/roster/2020-21",
  "https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19",
  "www.lambuth.edu/athletics/index.html",
  "https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19"
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
