Cannot scrape this site with requests
I am trying to scrape some of the tables on this website. When I make the request, the HTML returned is different from what the browser shows.
I inspected the browser's Network tab to see whether a JSON response fills in the data, but couldn't find one. I realized that the HTML returned by requests is the same as the HTML document shown in the Network tab (in the browser inspector), which, as I said, differs from the one shown in the Elements tab (the one with the full HTML I want).
I know I am missing something and would appreciate it if you could explain how this website works:
- Why is the response different?
- Is there any chance of achieving this without a real browser? (I would prefer not to use Selenium, for efficiency reasons.)
This is my code
import requests
from bs4 import BeautifulSoup
url = 'https://coriolis.io/outfit/fer_de_lance?code=A4pktfFalfdpsff30x27272727040404040404B22b2b27m1m1.AwRj4yvI.Aw18WQ%3D%3D.H4sIAAAAAAAAA42SvS9DYRTGTz%2B1vW1v79VWBfXRi8TQNLYabKIDibGryWLoQMQi7AYRwWAwGA1GQ2OyNjEYDCL%2BCBN1jueIvmmbSu5N7pMn7%2Fm9zz335BAPEdF3BNI%2BhVgnAaLUYZLI2YNz7y0irxkkkgAvG%2FIAEtv4ErHfy0T5uzDIJwRJkPMG2oHY3qdItggyd20TFZQsNUZAhnjMkPtKPlf%2BQrKvOCzVPkQkzIsdyK7son65hKsRrpqrR5B4kkUSZw7RtLoZdbPq5tRJlDc7uJuZJypr9OjaBEox3jJJRUikjqaityl8X5uQONdNHS%2BF9Xf6oYQfyPIDJbug0H9QilcMdKWD0VNXp1N4jBNNvqQBpf1Ath8ow7UeyGpgsI6Kd5xA3eHV3vpNFEm6G4663yRPnbhdSRcDkoZ53dS3IcFCWySqG2NVMQ3nDZLTnrymrlqWFwx%2BrnEPsOmpcZDqXN03T53kfJP5frJlyJYh1YnQwOcHDPQx9E8DAAA%3D.EweloBhAOEoUwIYHMA28QgIwV3fEQA%3D%3D'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
s = requests.Session()
n = s.get(url, verify=False, headers=headers)
soup = BeautifulSoup(n.content, 'html.parser')
tables = soup.find_all(class_='group half')  # Present in the browser but not in the returned response
Solution 1:[1]
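Use Selenium for Python to scrape the site. The Elements tab shows the DOM after the page's JavaScript has run, whereas requests only receives the initial HTML document and does not execute any scripts.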
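from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = "your site url"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
page = driver.page_source  # HTML of the page as rendered by the browser
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
# ....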
The response to a plain HTTP request can vary with the request headers, but Selenium drives a real browser, so it returns the same page you get when opening the site yourself.
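To make the gap between the raw response and the browser's view concrete, here is a small sketch. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The two HTML strings are hypothetical stand-ins, not the site's real markup: one mimics an initial document that only carries a mount point and a script, the other mimics the DOM after that script has run. The names ClassCounter and count_elements are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains all wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_elements(html, wanted="group half"):
    """Return how many elements in html carry every class in wanted."""
    parser = ClassCounter(wanted)
    parser.feed(html)
    return parser.count


# Hypothetical initial document: what a plain HTTP client would receive.
INITIAL_HTML = (
    '<html><body><div id="app"></div>'
    '<script src="app.js"></script></body></html>'
)

# Hypothetical DOM after the script has run: what the Elements tab would show.
RENDERED_HTML = (
    '<html><body><div id="app">'
    '<div class="group half"><table></table></div>'
    '<div class="group half"><table></table></div>'
    '</div></body></html>'
)

print(count_elements(INITIAL_HTML))   # 0
print(count_elements(RENDERED_HTML))  # 2
```

A parser that only sees the initial document finds nothing to extract, which matches the empty result of find_all(class_='group half') in the question.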
You can also call add_argument('--headless') on a ChromeOptions object so that Selenium runs without opening a browser window.
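The headless option could be wired up roughly as follows. This is a minimal sketch, assuming Selenium 4.6 or later (which can locate a matching driver on its own) and a local Chrome installation. The helper names build_chrome_args and fetch_rendered_html, the 15-second timeout, the fixed window size, and the wait on the .group.half selector (derived from class_='group half' in the question) are illustrative assumptions, not part of the original answer.

```python
def build_chrome_args(headless=True):
    """Return Chrome command-line flags for the session (hypothetical helper)."""
    # A fixed window size is optional; it keeps the layout predictable.
    args = ["--window-size=1920,1080"]
    if headless:
        args.append("--headless")
    return args


def fetch_rendered_html(url, css_selector=".group.half", timeout=15.0):
    """Load url in Chrome and return the HTML once css_selector is present."""
    # Imported here so build_chrome_args works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the script-generated elements exist before reading the page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can be passed to BeautifulSoup(page, 'html.parser') exactly as in the snippet above. The explicit wait matters because page_source read immediately after get() may still be missing content that the page's scripts add later.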
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
