403 Error Web Scraping with R with a Specified User Agent

I'm trying to scrape data from the website GovSalaries, which doesn't appear to violate their terms of service (although I didn't look very hard). However, I keep getting a 403 error. I tried adding the headers that a web browser sends (copied from its HTTP request) to my code, with no luck. Below is an example:

library(httr)

# mimic a desktop Chrome browser
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
url = "https://govsalaries.com/salaries/FD/food-and-drug-administration"
my_page = GET(url, user_agent(ua))  # still returns status 403

I'm guessing it may have to do with cookies? Any help would be appreciated.
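For completeness, copying a few more of the browser's headers with httr's add_headers() also comes back 403; the header values below approximate what Chrome sends and are not exactly what the site requires:

```r
library(httr)

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
url = "https://govsalaries.com/salaries/FD/food-and-drug-administration"

# send browser-like headers alongside the user agent; the site may still
# reject the request if it fingerprints clients server-side
my_page = GET(
  url,
  user_agent(ua),
  add_headers(
    Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    `Accept-Language` = "en-US,en;q=0.9",
    Referer = "https://govsalaries.com/"
  )
)
status_code(my_page)
```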

UPDATE: I tried to start a session() with the main page, but am still getting a 403 error. The idea was that I could then jump_to() the page of interest. Thanks to @r2evans for suggesting this approach (it didn't work in this example, but it seems like a good strategy).

## example
library(rvest)
library(httr)

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
url = "https://govsalaries.com/"
my_session = session(url, user_agent(ua))  # still 403
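For reference, the intended follow-up would have looked like this (session_jump_to() is the rvest >= 1.0 name for jump_to(); it is never reached here because session() itself already returns 403):

```r
library(rvest)
library(httr)

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
my_session = session("https://govsalaries.com/", user_agent(ua))

# jump_to() reuses the session's cookies and headers for the next request
target = session_jump_to(my_session, "/salaries/FD/food-and-drug-administration")
html_table(target)
```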


Solution 1:

Here is a partial answer.

# start browser
library(RSelenium)
library(rvest)
library(magrittr)

driver = rsDriver(
  port = 4847L,
  browser = "firefox")

#navigate
remDr <- driver[["client"]]
url = "https://govsalaries.com/salaries/FD/food-and-drug-administration"
remDr$navigate(url)

#get the links 
link = remDr$getPageSource()[[1]] %>% 
  read_html() %>% html_nodes('.table-sm') %>% html_nodes('a') %>% html_attr('href')
link = unique(link)
link = paste0('https://govsalaries.com', link)
head(link)
[1] "https://govsalaries.com/califf-robert-58583316"           "https://govsalaries.com/woodcock-janet-58598162"         
[3] "https://govsalaries.com/midthun-karen-58591778"           "https://govsalaries.com/jenkins-john-k-58588605"         
[5] "https://govsalaries.com/moscicki-richard-arthur-58592168" "https://govsalaries.com/shuren-jeffrey-e-58595518"

After extracting the links, a captcha starts appearing as soon as we loop through them to scrape the wages.
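The loop that triggers the captcha looks roughly like this (a sketch; the Sys.sleep() throttle is my own addition and did not prevent the captcha here):

```r
library(rvest)
library(magrittr)

wages = list()
for (l in link) {
  remDr$navigate(l)
  Sys.sleep(2)  # polite delay between requests; did not help against the captcha
  wages[[l]] = remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
}
```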


I then closed the browser, started a fresh one, and navigated to one of the links extracted above:

url = "https://govsalaries.com/califf-robert-58583316"
remDr$navigate(url)

# get the tables
tab = remDr$getPageSource()[[1]] %>% 
  read_html() %>% html_table()

The first two tables are probably the ones you want:

tab[[1]]
# A tibble: 12 x 2
   X1                                       X2                            
   <chr>                                    <chr>                         
 1 "Year"                                   "2015"                        
 2 "Full Name"                              "Robert Califf"               
 3 "Original Job Title"                     "MEDICAL OFFICER"             
 4 "Job Title"                              "Medical Officer"             
 5 "Get Medical Officer\nSalary Statistics" ""                            
 6 "State"                                  "Federal"                     
 7 "Employer"                               "FOOD AND DRUG ADMINISTRATION"
 8 "Location"                               "SILVER SPRING"               
 9 "Annual Wage"                            "$300,000"                    
10 "Bonus"                                  "N/A"                         
11 "Pay Plan"                               "RF"                          
12 "Grade"                                  "00"  

tab[[2]]
# A tibble: 5 x 2
  X1                  X2                          
  <chr>               <chr>                       
1 Employer Name       FOOD AND DRUG ADMINISTRATION
2 Year                2015                        
3 Number of Employees 17,594                      
4 Average Annual Wage $110,621                    
5 Median Annual Wage  $110,902  
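If you prefer the per-person detail table as a single wide row, the two-column tibble can be pivoted; this sketch assumes the X1/X2 column names shown above:

```r
library(tidyr)

# turn the key/value pairs of tab[[1]] into one wide row,
# with the X1 entries as column names and X2 entries as values
person = pivot_wider(tab[[1]], names_from = X1, values_from = X2)
person$`Annual Wage`
```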

But when I navigated to another link, I hit a captcha again. One obvious workaround is to start a new browser session for every link; otherwise you will need to come up with some other way around the captcha.
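A sketch of that restart-per-link workaround (slow, and it assumes the captcha only appears on the second page load within a session; the random port is an assumption to avoid clashes between runs):

```r
library(RSelenium)
library(rvest)
library(magrittr)

tabs = list()
for (l in link) {
  # fresh browser per link, so each page load is the first one in its session
  driver = rsDriver(port = sample(4000:5000, 1), browser = "firefox")
  remDr = driver[["client"]]
  remDr$navigate(l)
  tabs[[l]] = remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  remDr$close()
  driver[["server"]]$stop()
}
```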

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
