How to resolve the 'human or bot' check when scraping the web using BeautifulSoup?
I use the following code to scrape a webpage, but for the past two days it has returned no data. When I printed the soup object, I saw the text shown further below.
Code:
import json

import requests
from bs4 import BeautifulSoup

page_number = 1
url = ('https://www.kickstarter.com/discover/advanced'
       '?category_id=1&sort=end_date&seed=2639586&page=' + str(page_number))
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
data = [
    (json.loads(div["data-project"]), div["data-ref"])
    for div in soup.find_all("div")
    if div.get("data-project")
]
print("data: {}".format(data))
print(soup)
data is empty.
The soup object returns:
<!DOCTYPE html>
<html lang="en"> <head> <meta charset="utf-8"/> <meta content="width=device-width, initial-scale=1" name="viewport"/> <title>Access to this page has been denied.</title> <link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet"/> <style> html, body { margin: 0; padding: 0; font-family: 'Open Sans', sans-serif; color: #000; } a { color: #c5c5c5; text-decoration: none; } .container { align-items: center; display: flex; flex: 1; justify-content: space-between; flex-direction: column; height: 100%; } .container > div { width: 100%; display: flex; justify-content: center; } .container > div > div { display: flex; width: 80%; } .customer-logo-wrapper { padding-top: 2rem; flex-grow: 0; background-color: #fff; visibility: hidden; } .customer-logo { border-bottom: 1px solid #000; } .customer-logo > img { padding-bottom: 1rem; max-height: 50px; max-width: 100%; } .page-title-wrapper { flex-grow: 2; } .page-title { flex-direction: column-reverse; } .content-wrapper { flex-grow: 5; } .content { flex-direction: column; } .page-footer-wrapper { align-items: center; flex-grow: 0.2; background-color: #000; color: #c5c5c5; font-size: 70%; } @media (min-width: 768px) { html, body { height: 100%; } } </style> <!-- Custom CSS --> </head> <body> <section class="container"> <div class="customer-logo-wrapper"> <div class="customer-logo"> <img alt="Logo" src=""/> </div> </div> <div class="page-title-wrapper"> <div class="page-title"> <h1>Backer or bot?</h1> </div> </div> <div class="content-wrapper"> <div class="content"> <div id="px-captcha"> </div> <p id="paragraph-one">Complete this security check to prove that you’re a human. 
Once you’ve passed this page, you might need to navigate away from your current screen on Kickstarter to refresh and move on.</p> <p id="paragraph-two">To avoid seeing this page again, double-check that JavaScript and cookies are enabled on your web browser and that you’re not blocking them from loading with an extension (e.g., ad blockers).</p> <p> Reference ID: #24b7d759-75b5-11ec-b3ec-6a776b594763 </p> </div> </div> <div class="page-footer-wrapper"> <div class="page-footer"> <p> Powered by <a href="https://www.perimeterx.com/whywasiblocked">PerimeterX</a> , Inc. </p> </div> </div> </section> <!-- Px --> <script> window._pxAppId = 'PXUy3R669N'; window._pxJsClientSrc = '/Uy3R669N/init.js'; window._pxFirstPartyEnabled = true; window._pxVid = ''; window._pxUuid = '24b7d759-75b5-11ec-b3ec-6a776b594763'; window._pxHostUrl = '/Uy3R669N/xhr'; </script> <script> var s = document.createElement('script'); s.src = '/Uy3R669N/captcha/captcha.js?a=c&u=24b7d759-75b5-11ec-b3ec-6a776b594763&v=&m=0'; var p = document.getElementsByTagName('head')[0]; p.insertBefore(s, null); if (true ){s.onerror = function () {s = document.createElement('script'); var suffixIndex = '/Uy3R669N/captcha/captcha.js?a=c&u=24b7d759-75b5-11ec-b3ec-6a776b594763&v=&m=0'.indexOf('/captcha.js'); var temperedBlockScript = '/Uy3R669N/captcha/captcha.js?a=c&u=24b7d759-75b5-11ec-b3ec-6a776b594763&v=&m=0'.substring(suffixIndex); s.src = '//captcha.px-cdn.net/PXUy3R669N' + temperedBlockScript; p.parentNode.insertBefore(s, p);};}</script> <!-- Custom Script --> <script src="https://a.kickstarter.com/px/translations.js"></script> </body> </html>
The important part of the above text is:
Complete this security check to prove that you're a human. Once you've passed this page, you might need to navigate away from your current screen on Kickstarter to refresh and move on.
How can I resolve this issue and scrape the page again? It seems they recently added this security check, because I did not have any problems two days ago.
Solution 1:[1]
Just add a proper User-Agent to the request headers.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:95.0) Gecko/20100101 Firefox/95.0",
}
url = 'https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2639586&page=1'
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
data = [
    (json.loads(div["data-project"]), div["data-ref"])
    for div in soup.find_all("div")
    if div.get("data-project")
]
print("data: {}".format(data))
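It can also help to detect the block page explicitly before parsing, so a failed scrape raises an error instead of silently returning an empty list. Below is a minimal sketch; `is_blocked` is a hypothetical helper, and the marker strings are taken from the PerimeterX challenge page shown in the question (they may change if the site updates its bot check):

```python
# Heuristic check for the bot-check page returned by the site.
# These marker strings appear in the blocked response shown above.
BLOCK_MARKERS = (
    "Access to this page has been denied",
    "px-captcha",
)

def is_blocked(html: str) -> bool:
    """Return True if the response body looks like the bot-check page."""
    return any(marker in html for marker in BLOCK_MARKERS)

# Example usage with the request from the solution above:
# page = requests.get(url, headers=headers)
# if is_blocked(page.text):
#     raise RuntimeError("Bot check triggered; retry with different headers.")
```

This keeps the scraper honest: an empty `data` list then means "no matching divs" rather than "silently blocked".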
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | baduker |
