403 Error Web Scraping in R with a Specified User-Agent
I'm trying to scrape data from the website GovSalaries, which doesn't appear to violate their terms of service (although I didn't look that hard). However, I keep getting a 403 error. I tried adding headers to my code that I found by inspecting the HTTP request in a web browser, with no luck. Below is an example:
library(httr)

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
url = "https://govsalaries.com/salaries/FD/food-and-drug-administration"
my_page = GET(url, user_agent(ua))
I'm guessing it may have to do with cookies? Any help would be appreciated.
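For reference, one common next step is to send a fuller set of browser-like headers alongside the user agent via `httr::add_headers()`. This is only a sketch; the header values below are assumptions copied from a typical Chrome request, and as the question notes, adding headers alone did not get past the 403 here.

```r
# Sketch: send additional browser-like headers with httr.
# Header values are illustrative, not taken from the original post.
library(httr)

ua  = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
url = "https://govsalaries.com/salaries/FD/food-and-drug-administration"

my_page = GET(
  url,
  user_agent(ua),
  add_headers(
    Accept              = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    `Accept-Language`   = "en-US,en;q=0.9",
    Referer             = "https://govsalaries.com/"
  )
)
status_code(my_page)  # may still be 403 if the block is bot detection, not missing headers
```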
UPDATE: I tried starting a "session" with the main page, but am still getting a 403 error. The idea was that I could then jump_to the page of interest. Thanks to @r2evans for suggesting this approach (it didn't work in this example, but it seems like a good strategy).
## example
library(rvest)
library(httr)

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
url = "https://govsalaries.com/"
my_session = session(url, user_agent(ua))
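To complete the session idea described above, here is a sketch of the jump-to step: establish the session on the landing page so any cookies from the first response are stored, then follow an internal link within that session. In current rvest the function is `session_jump_to()` (`jump_to()` in older versions); the target path is the FDA page from the question.

```r
# Sketch: reuse cookies from the landing page by jumping within the session.
library(rvest)
library(httr)

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
my_session = session("https://govsalaries.com/", user_agent(ua))

# session_jump_to() is the rvest >= 1.0 name; older versions use jump_to()
target = session_jump_to(my_session, "/salaries/FD/food-and-drug-administration")
```

As the update notes, this approach still returned a 403 for this site, which suggests the block is based on more than cookies and headers.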
Solution 1:
Here is a partial answer.
# load packages
library(RSelenium)
library(rvest)

# start browser
driver = rsDriver(
  port = 4847L,
  browser = "firefox")
# navigate
remDr <- driver[["client"]]
url = "https://govsalaries.com/salaries/FD/food-and-drug-administration"
remDr$navigate(url)

# get the links
link = remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.table-sm') %>%
  html_nodes('a') %>%
  html_attr('href')
link = unique(link)
link = paste0('https://govsalaries.com', link)
head(link)
[1] "https://govsalaries.com/califf-robert-58583316" "https://govsalaries.com/woodcock-janet-58598162"
[3] "https://govsalaries.com/midthun-karen-58591778" "https://govsalaries.com/jenkins-john-k-58588605"
[5] "https://govsalaries.com/moscicki-richard-arthur-58592168" "https://govsalaries.com/shuren-jeffrey-e-58595518"
After extracting the links, a captcha starts appearing as soon as we loop through them to scrape the wages. I then closed the browser, started a fresh one, and navigated to one of the links extracted earlier:
url = "https://govsalaries.com/califf-robert-58583316"
remDr$navigate(url)

# get the tables
tab = remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()
The first two tables are probably the ones of interest:
tab[[1]]
# A tibble: 12 x 2
X1 X2
<chr> <chr>
1 "Year" "2015"
2 "Full Name" "Robert Califf"
3 "Original Job Title" "MEDICAL OFFICER"
4 "Job Title" "Medical Officer"
5 "Get Medical Officer\nSalary Statistics" ""
6 "State" "Federal"
7 "Employer" "FOOD AND DRUG ADMINISTRATION"
8 "Location" "SILVER SPRING"
9 "Annual Wage" "$300,000"
10 "Bonus" "N/A"
11 "Pay Plan" "RF"
12 "Grade" "00"
tab[[2]]
# A tibble: 5 x 2
X1 X2
<chr> <chr>
1 Employer Name FOOD AND DRUG ADMINISTRATION
2 Year 2015
3 Number of Employees 17,594
4 Average Annual Wage $110,621
5 Median Annual Wage $110,902
But when I navigated to another link, I again hit a captcha. One obvious workaround is to start a new browser for every link; otherwise you would need to come up with some other way around the captcha.
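The "fresh browser per link" workaround mentioned above could be sketched as follows. This is an assumption-laden sketch, not part of the original answer: the port numbers are arbitrary, the `Sys.sleep()` delay is a guess at how long the page needs to render, and the site may still serve captchas eventually.

```r
# Sketch: open a new Selenium session per profile page, scrape, then close.
# Slow, and may still trigger captchas; ports and delays are arbitrary choices.
library(RSelenium)
library(rvest)

scrape_one = function(url, port) {
  driver = rsDriver(port = port, browser = "firefox")
  remDr  = driver[["client"]]
  remDr$navigate(url)
  Sys.sleep(2)  # give the page time to render
  tabs = remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  remDr$close()
  driver[["server"]]$stop()
  tabs
}

# `link` is the vector of profile URLs extracted earlier
results = Map(scrape_one, link, 4850L + seq_along(link))
```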
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow