Retrieving data from an API via FormRequest() on a dynamically populated web page
I'm trying to scrape news items listed under a category directory of a news website.
The page containing the individual news items is here: https://t24.com.tr/haber/15-temmuz-darbe-girisimi
As you scroll down, the page is populated with the target information. As seen in the XHR feed, the news titles and their relative links are loaded by an API request to https://t24.com.tr/graphql
Studying the headers and body in Postman, I found that the following request info is required:
headers = {
'content-type': 'application/json',
'content-length' : 244
}
body = {
"query": "{storiesByCategory(first: 12, after: \"WyIyMDE5LTA2LTI2VDA3OjQxOjE1LjAwMFoiXQ==\", category: \"15-temmuz-darbe-girisimi\") {cursors{after,hasNext},results{id,slug,title,image,imageAlt,excerpt,publishedAt,category{slug,name,color}}}}"
}
[Screenshot: the page populated with data from the API]

[Screenshot: target items to be scraped in the API response body]

I wrote a spider that makes a request to the API endpoint with the necessary request headers and body.
My problem: the spider gets constant 400 errors, while a Postman request with the same headers and body returns the JSON containing the news items. I need to scrape that JSON so I can follow the links it contains.
Scrapy logs
2022-02-16 15:02:02 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: erdo)
2022-02-16 15:02:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-28-generic-x86_64-with-glibc2.29
2022-02-16 15:02:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-02-16 15:02:02 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'erdo',
'DOWNLOAD_DELAY': 5,
'NEWSPIDER_MODULE': 'erdo.spiders',
'SPIDER_MODULES': ['erdo.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36'}
2022-02-16 15:02:02 [scrapy.extensions.telnet] INFO: Telnet Password: cbfda52138ef1cdc
2022-02-16 15:02:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-02-16 15:02:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-16 15:02:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-16 15:02:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-16 15:02:02 [scrapy.core.engine] INFO: Spider opened
2022-02-16 15:02:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-16 15:02:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-16 15:02:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://t24.com.tr/haber/15-temmuz-darbe-girisimi> (referer: None)
2022-02-16 15:02:09 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://t24.com.tr/graphql> (referer: https://t24.com.tr/haber/15-temmuz-darbe-girisimi/21)
2022-02-16 15:02:09 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://t24.com.tr/graphql>: HTTP status code is not handled or not allowed
(... the same POST → 400 "Ignoring response" pair repeats for referers /20 down to /2; log-stats lines report 10 and 20 pages crawled at 0 items/min ...)
2022-02-16 15:04:12 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://t24.com.tr/graphql> (referer: https://t24.com.tr/haber/15-temmuz-darbe-girisimi/1)
2022-02-16 15:04:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://t24.com.tr/graphql>: HTTP status code is not handled or not allowed
2022-02-16 15:04:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-16 15:04:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15127,
'downloader/request_count': 22,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 21,
'downloader/response_bytes': 46145,
'downloader/response_count': 22,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/400': 21,
'elapsed_time_seconds': 129.98732,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 2, 16, 12, 4, 12, 326876),
'httpcompression/response_bytes': 226813,
'httpcompression/response_count': 1,
'httperror/response_ignored_count': 21,
'httperror/response_ignored_status_count/400': 21,
'log_count/DEBUG': 22,
'log_count/INFO': 33,
'memusage/max': 64782336,
'memusage/startup': 61345792,
'request_depth_max': 1,
'response_received_count': 22,
'scheduler/dequeued': 22,
'scheduler/dequeued/memory': 22,
'scheduler/enqueued': 22,
'scheduler/enqueued/memory': 22,
'start_time': datetime.datetime(2022, 2, 16, 12, 2, 2, 339556)}
2022-02-16 15:04:12 [scrapy.core.engine] INFO: Spider closed (finished)
Here is my spider's code:
from ast import For
import scrapy
from scrapy.utils.response import open_in_browser
from scrapy.http import FormRequest
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Request
import json
from scrapy import Spider
from scrapy_splash import SplashRequest


class TarapSpider(Spider):
    name = 't24_15t'
    allowed_domains = ['t24.com.tr']
    start_urls = ['https://t24.com.tr/haber/15-temmuz-darbe-girisimi']
    root_link = 'https://t24.com.tr/haber/15-temmuz-darbe-girisimi'
    urll = "https://t24.com.tr/graphql"
    headerz = {
        'content-type': 'application/json',
        'content-length': 244
    }

    def parse(self, response):
        counter = 0
        check = True
        while check == True:
            counter += 1
            referer = f"{self.root_link}/{str(counter)}"
            curr_header = self.headerz
            curr_header["referer"] = referer
            yield FormRequest(
                url=self.urll,
                callback=self.tarse,
                method="POST",
                headers=curr_header,
                formdata=json.loads(r"""{"query":"{storiesByCategory(first: 12, after: \"WyIyMDE5LTA2LTI2VDA3OjQxOjE1LjAwMFoiXQ==\", category: \"15-temmuz-darbe-girisimi\") {cursors{after,hasNext},results{id,slug,title,image,imageAlt,excerpt,publishedAt,category{slug,name,color}}}}"}"""),
                dont_filter=True
            )
            if counter == 21:
                check = False

    def tarse(self, response):
        print(response.body)
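Editorial note on a likely cause of the 400s: `FormRequest` serializes its `formdata` argument as `application/x-www-form-urlencoded`, while a GraphQL endpoint typically expects a raw JSON body (this is a diagnosis based on how Scrapy and GraphQL generally behave, not something confirmed against this site). A minimal standard-library sketch of the difference, with a shortened example query:

```python
import json
from urllib.parse import urlencode

payload = {"query": "{storiesByCategory(first: 12)}"}  # shortened example query

# What FormRequest(formdata=payload) puts on the wire (form-encoded):
form_body = urlencode(payload)
print(form_body)  # query=%7BstoriesByCategory%28first%3A+12%29%7D

# What a JSON API expects as the request body:
json_body = json.dumps(payload)
print(json_body)  # {"query": "{storiesByCategory(first: 12)}"}
```

In Scrapy, sending real JSON means using `scrapy.Request` with `body=json.dumps(payload)`, or `scrapy.http.JsonRequest`, which sets the JSON body and `Content-Type` header for you.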
Solution 1:[1]
You need to check whether the results are correct.
What I did:
- Set the base date.
- On every iteration, subtract one month from the date (this is just a guess; I'm not really sure where this value should come from, so check whether it yields correct results).
- Format the payload with the date.
- Create a Scrapy request with the payload as the body.
- Added custom settings to prevent blocking.
- Added the headers as they appear in the browser.
What you can do (only if you want to, it's not a must):
- Check the results for hasNext so you know when to stop.
- Instead of a loop, make the function request the next page (as described in the first point): the callback will be the same function, so it will go through the pages in order and stop when it reaches the stopping condition.
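The cursor-following idea in the last suggestion could be sketched as a small helper that inspects a parsed response body and returns the cursor for the next request. The response dict below is a simulated example shaped like the GraphQL query in the question, not real API data:

```python
import json

def next_cursor(api_response):
    """Return the `after` cursor for the next page, or None when paging is done."""
    cursors = api_response["data"]["storiesByCategory"]["cursors"]
    return cursors["after"] if cursors["hasNext"] else None

# Simulated response, shaped like the query's selection set (not real data):
page = json.loads('''{"data": {"storiesByCategory": {
    "cursors": {"after": "WyIyMDIwLTAxLTAxVDAwOjAwOjAwLjAwMFoiXQ==", "hasNext": true},
    "results": [{"slug": "example-story"}]}}}''')
print(next_cursor(page))  # WyIyMDIwLTAxLTAxVDAwOjAwOjAwLjAwMFoiXQ==
```

In the spider, the callback would parse `response.body` with `json.loads`, call something like `next_cursor`, and yield a new request only when a cursor comes back.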
The code:
import base64
import datetime
import json

import scrapy
from scrapy import Spider
from dateutil.relativedelta import relativedelta


class TarapSpider(Spider):
    name = 't24_15t'
    allowed_domains = ['t24.com.tr']
    start_urls = ['https://t24.com.tr/haber/15-temmuz-darbe-girisimi']
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 4
    }
    url = "https://t24.com.tr/graphql"
    root_link = 'https://t24.com.tr/haber/15-temmuz-darbe-girisimi'
    form_date = datetime.datetime.strptime("2019-06-26 07:41:15.000", "%Y-%m-%d %H:%M:%S.%f")
    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/json",
        "DNT": "1",
        "Host": "t24.com.tr",
        "Origin": "https://t24.com.tr",
        "Pragma": "no-cache",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "same-origin",
        "Sec-Fetch-Site": "same-origin",
        "Sec-GPC": "1",
        "TE": "trailers",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "X-KL-Ajax-Request": "Ajax_Request"
    }

    def parse(self, response):
        for counter in range(1, 22):
            referer = f"{self.root_link}/{counter}"
            curr_header = self.headers
            curr_header["referer"] = referer
            # Build the "after" cursor: a base64-encoded JSON array holding a timestamp
            form_ts = f"[\"{self.form_date.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3]}Z\"]"
            form_ts = base64.b64encode(form_ts.encode("ascii")).decode("ascii")
            payload = {"query": "{storiesByCategory(first: 12, after: \"" + form_ts + "\", category: \"15-temmuz-darbe-girisimi\") {cursors{after,hasNext},results{id,slug,title,image,imageAlt,excerpt,publishedAt,category{slug,name,color}}}}"}
            # Step the date back one month for the next page (a guess, see above)
            self.form_date = self.form_date - relativedelta(months=1)
            yield scrapy.Request(
                url=self.url,
                callback=self.tarse,
                method="POST",
                headers=curr_header,
                body=json.dumps(payload),  # raw JSON body, not form-encoded
                dont_filter=True
            )

    def tarse(self, response):
        print(response.body)
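As a sanity check on this answer's cursor construction: for the base date, the date-formatting and base64 steps reproduce exactly the `after` value captured in Postman in the question:

```python
import base64
from datetime import datetime

# Same base date and formatting as the spider above
form_date = datetime.strptime("2019-06-26 07:41:15.000", "%Y-%m-%d %H:%M:%S.%f")
form_ts = f"[\"{form_date.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3]}Z\"]"
cursor = base64.b64encode(form_ts.encode("ascii")).decode("ascii")
print(cursor)  # WyIyMDE5LTA2LTI2VDA3OjQxOjE1LjAwMFoiXQ==
```

Whether earlier pages really correspond to one-month steps of this timestamp is the part the answer flags as a guess; the reliable alternative is to read `cursors.after` from each response instead of computing dates.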
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SuperUser |
