Handling 404 and other bad URLs when reading in R with read_html

Summary: handling errors and bad pages with tryCatch and R's read_html function.

We are using R's read_html function to connect to some NCAA sports websites, and we need to identify when a page is faulty. Here are a few example URLs that point to faulty pages:

 - www.newburynighthawks.com (does not exist)
 - http://www.clarkepride.com/sports/womens-basketball/roster/2020-21 (404 not found)
 - https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19 (not found)
 - www.lambuth.edu/athletics/index.html (does not exist)
 - https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19 (page not found)

Each of these URLs fails in its own way when passed to read_html. To handle these failures, I've written a function that uses tryCatch to check whether a page is valid:

library(httr)   # GET(), timeout()
library(rvest)  # read_html(), html_nodes(), html_text(), %>%

check_url_validity <- function(this_url) {
  good_url <- FALSE

  # fetch the page (2-second timeout) and pull out its title and body text
  result <- tryCatch({
    team_page <- this_url %>% GET(., timeout(2)) %>% read_html
    team_page_title <- team_page %>% html_nodes('title') %>% html_text
    team_page_body <- team_page %>% html_nodes('body') %>% html_text

    # the page is "good" only if neither the title nor the body looks like an error page
    good_page <- !grepl('Page not found', team_page_title) &&
      !grepl('Page Not Found', team_page_title) &&
      !grepl('404', team_page_title) &&
      team_page_title != "" &&
      !grepl('Error 404', team_page_body)

    if (good_page) { good_url <- TRUE }
  }, error = function(e) { NA })  # connection failures land here; good_url stays FALSE

  return(good_url)
}

Testing this function on the URLs listed above gives the following:

these_urls = c(
'www.newburynighthawks.com', 
'http://www.clarkepride.com/sports/womens-basketball/roster/2020-21',
'https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19',
'www.lambuth.edu/athletics/index.html',
'https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19'
)

for (this_url in these_urls) {
  print(check_url_validity(this_url))
}

Some of these pages (http://www.newburynighthawks.com/) are easily identified as bad inside the tryCatch, because there is no page at all. Others (http://www.clarkepride.com/sports/womens-basketball/roster/2020-21) rely on string matching against the body text to detect that the page is bad. The overall issue is that this is a hacky solution: we are dealing with roughly 1,000 different URLs, and we keep adding conditions to the expression that determines whether good_page is TRUE or FALSE. We are currently up to five conditions, most of which use grepl to match phrases like "404" and "Not Found" in the title and body.
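While it remains string matching, the growing chain of conditions can at least be collapsed into one vector of patterns and a single regular expression, so each new phrase only requires extending the vector. A minimal sketch (the pattern list is just the phrases from the function above; `looks_bad` is a hypothetical helper name):

```r
# phrases that indicate an error page; extend this vector as new cases appear
bad_page_patterns <- c("Page Not Found", "Page not found", "404", "Error 404")

# collapse into one alternation regex: "Page Not Found|Page not found|..."
bad_page_regex <- paste(bad_page_patterns, collapse = "|")

looks_bad <- function(text) grepl(bad_page_regex, text)

looks_bad("Error 404 - the page you requested does not exist")  # TRUE
looks_bad("Women's Basketball Roster 2020-21")                  # FALSE
```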

Is there a better solution than string matching for "404" and "Not Found" in the body to determine that these pages are bad?



Solution 1:

The code below does not try to read the page's contents; instead it issues a HEAD request using the httr package. A HEAD request returns only the response headers, so it is faster and still provides all the information needed (the HTTP status).

library(httr)

check_url_validity <- function(this_url){
  # HEAD() errors for unreachable hosts; capture the condition instead of failing
  r <- tryCatch(HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    "does not exist"
    #conditionMessage(r)
  } else {
    # human-readable reason for the status code, e.g. "OK" or "Not Found"
    http_status(r)$reason
  }
}

lapply(urls_vec, check_url_validity)
#[[1]]
#[1] "does not exist"
#
#[[2]]
#[1] "Not Found"
#
#[[3]]
#[1] "Not Found"
#
#[[4]]
#[1] "does not exist"
#
#[[5]]
#[1] "OK"

To return NA/FALSE/TRUE instead, the function below follows the same approach.

check_url_validity2 <- function(this_url){
  r <- tryCatch(HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    NA
  }else{
    # TRUE for 1xx/2xx responses; 3xx and above count as bad
    status_code(r) < 300
  }
}

lapply(urls_vec, check_url_validity2)
#[[1]]
#[1] NA
#
#[[2]]
#[1] FALSE
#
#[[3]]
#[1] FALSE
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] TRUE
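Since check_url_validity2 returns exactly one logical (possibly NA) per URL, vapply is a natural alternative to lapply here: it yields a flat, named logical vector that is easy to filter. A sketch of the idiom, using a hypothetical offline stub (check_stub) in place of the real network check:

```r
# stand-in for check_url_validity2 so this sketch runs without network access:
# URLs without a scheme come back NA, and only "roster" pages count as good
check_stub <- function(this_url) {
  if (!grepl("^https?://", this_url)) return(NA)
  grepl("roster", this_url)
}

urls <- c(
  "www.lambuth.edu/athletics/index.html",
  "http://www.clarkepride.com/sports/womens-basketball/roster/2020-21"
)

# FUN.VALUE = logical(1) asserts each call returns a single logical (NA is logical)
results <- vapply(urls, check_stub, FUN.VALUE = logical(1))

# keep only the URLs that came back TRUE
good_urls <- names(results)[!is.na(results) & results]
```

In the real pipeline, `vapply(urls_vec, check_url_validity2, logical(1))` would replace the stub call.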

Data

urls_vec <- c(
  "www.newburynighthawks.com", 
  "http://www.clarkepride.com/sports/womens-basketball/roster/2020-21", 
  "https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19", 
  "www.lambuth.edu/athletics/index.html", 
  "https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19"
)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
