How to download CSV from masseyratings.com in R

Consider the URL https://masseyratings.com/cb/ncaa-d1/ratings

If one clicks on "More" and chooses "Export", a CSV file of the ratings is downloaded.

How would I use rvest, httr, etc. to download this file directly from R? (Ideally I would even skip the step of saving the file and just convert the CSV to a data frame right away, but I would be satisfied either way.) I have tried to trace what is happening using the developer tools in Chrome and Firefox, but none of the examples with which I am familiar seem to apply to whatever is happening here.

Obviously it's not too difficult to just download the file and read it into R, but I would really like to automate the process.

The HTML code for the page includes this:

<select class='mopulldown' id='pulldownlinks'>
  <option value=''>More
  <option value='cb/ncaa-d1/ratings?c=1'>Conferences
  <option value='/map.php?s=379387&t=11590'>Map
  <option value='/scores.php?s=cb2022&sub=11590'>Scores/Schedule Data
  <option value='/cb2021/ncaa-d1/ratings'>cb2021
  <option value='/team.php?t=11590&s=cb2022&all=1'>Rating Archive
  <option value='/scoredist?s=cb2022&sub=11590&x=s'>Score Distribution
  <option value='/extgms?s=cb2022&sub=11590'>Extreme Games
  <option value='/path?s=cb2022'>Transitive Path
  <option value='exportCSV'>Export
</select>

and it's the last selection that triggers the download of the CSV file.


Solution 1:[1]

When I attempt to scrape something like that, I usually open the browser's developer tools (often F12) and watch the network traffic when I click the button. Often it points to a GET or POST request that returns JSON containing the data I want. Using that request's URL directly often removes the need for any HTML manipulation at all.

In this case, no new request is made when clicking More or Export; instead, the data has already been loaded from a plain URL:

url <- "https://masseyratings.com/json/rate.php?argv=kiqB7tdov4KNhxOtPC9JHgV-OZNKvSJFAtoC3YxpTt4s72nWxxwgp35IAExoj-CvP3XmvNm8l6ksrUVUer342g..&task=json"
res <- httr::GET(url)

Confirm status 200:

res
# Response [https://masseyratings.com/json/rate.php?argv=kiqB7tdov4KNhxOtPC9JHgV-OZNKvSJFAtoC3YxpTt4s72nWxxwgp35IAExoj-CvP3XmvNm8l6ksrUVUer342g..&task=json]
#   Date: 2022-03-11 15:08
#   Status: 200
#   Content-Type: application/json
#   Size: 86.3 kB
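For an automated script, the status check can be done in code rather than by eye. The following is a minimal sketch, not part of the original answer: `fetch_json` is a hypothetical helper name, and it assumes the endpoint keeps returning JSON for the URL it is given (the `argv` token in the URL above may not stay valid indefinitely).

```r
# Hypothetical helper: fetch a URL and return the parsed JSON as a nested list.
fetch_json <- function(url) {
  stopifnot(is.character(url), length(url) == 1)
  res <- httr::GET(url)
  # Turn any HTTP error status (4xx/5xx) into an R error
  httr::stop_for_status(res)
  # Read the body as text, then parse it without simplifying to vectors
  txt <- httr::content(res, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt, simplifyVector = FALSE)
}
```

Calling `dat <- fetch_json(url)` should then give the same kind of nested list as `httr::content(res)`.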

Look at the data:

dat <- httr::content(res)
str(dat, max.level=1)
# List of 11
#  $ TI          :List of 5
#  $ CI          :List of 20
#  $ RI          : list()
#  $ DI          :List of 358
#  $ timestamp   : num 1.65e+12
#  $ rating      :List of 4
#  $ prevnextpage: int 0
#  $ seas        : chr "cb2022"
#  $ soid        : int 379387
#  $ suboid      : int 11590
#  $ subname     : chr " : NCAA D1"

From here, there is likely a lot that needs to be done to convert that to a data.frame. (FYI, the exported data is imperfect, as it has missing column names and looks as if it does not contain all of the data within dat.)
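As a rough starting point for that conversion, here is a sketch that works on a small made-up list, since the layout of the elements of `dat$DI` is not shown above. It assumes each element is a list of cells making up one row; `rows` and `row_to_df` are hypothetical names, and the real data may nest further (inspect it with `str(dat$DI[[1]])` before adapting this).

```r
# Hypothetical stand-in for dat$DI: a list of rows, each row a list of cells.
rows <- list(
  list("Team A", 1L, 9.87),
  list("Team B", 2L, 9.12),
  list("Team C", 3L, NULL)   # NULL mimics a missing JSON value
)

# Convert one row to a one-row data.frame, replacing NULL cells with NA
row_to_df <- function(row) {
  cells <- lapply(row, function(x) if (is.null(x)) NA else x)
  # Placeholder column names, since none are assumed to be available
  names(cells) <- paste0("V", seq_along(cells))
  as.data.frame(cells, stringsAsFactors = FALSE)
}

# Stack the one-row data.frames into a single data.frame
df <- do.call(rbind, lapply(rows, row_to_df))
str(df)
```

With the real data, `rows` would be replaced by `dat$DI`, and the placeholder column names by whatever labels can be recovered from the other elements of `dat`.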

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 r2evans