Scrapy: send multiple requests
I'm working on code that must read and process date and time information from a remote JSON API on demand. The code I wrote is as follows:
```python
import scrapy

class TimeSpider(scrapy.Spider):
    name = 'getTime'
    allowed_domains = ['worldtimeapi.org']
    start_urls = ['http://worldtimeapi.org']

    def parse(self, response):
        time_json = 'http://worldtimeapi.org/api/timezone/Asia/Tehran'
        for i in range(5):
            print(i)
            yield scrapy.Request(url=time_json, callback=self.parse_json)

    def parse_json(self, response):
        print(response.json())
```
And the output it gives is as follows:
```
0
1
2
3
4
{'abbreviation': '+0430', 'client_ip': '45.136.231.43', 'datetime': '2022-04-22T22:01:44.198723+04:30', 'day_of_week': 5, 'day_of_year': 112, 'dst': True, 'dst_from': '2022-03-21T20:30:00+00:00', 'dst_offset': 3600, 'dst_until': '2022-09-21T19:30:00+00:00', 'raw_offset': 12600, 'timezone': 'Asia/Tehran', 'unixtime': 1650648704, 'utc_datetime': '2022-04-22T17:31:44.198723+00:00', 'utc_offset': '+04:30', 'week_number': 16}
```
As you can see, the program calls the `parse_json` function only once, even though the loop yields five requests and should trigger the callback on every iteration.
Can anyone help me solve this problem?
Solution 1:[1]
Additional requests are being dropped by scrapy's default duplicates filter.
The simplest way to avoid this is to pass the dont_filter argument:
```python
yield scrapy.Request(url=time_json, callback=self.parse_json, dont_filter=True)
```
From the docs:
> dont_filter (bool) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
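To see why only one request got through, here is a minimal sketch of the idea behind Scrapy's fingerprint-based duplicate filter (not Scrapy's actual implementation; the class name `SimpleDupeFilter` is made up for illustration). Scrapy hashes each request to a fingerprint and drops any request whose fingerprint has already been seen; passing `dont_filter=True` tells the scheduler to skip this check entirely:

```python
import hashlib

class SimpleDupeFilter:
    """Toy duplicate filter mimicking the idea behind Scrapy's dedup logic."""
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url, dont_filter=False):
        # Scrapy's scheduler bypasses the filter when dont_filter is set.
        if dont_filter:
            return False
        # Requests are reduced to a fingerprint (a hash); identical
        # requests collide and are treated as duplicates.
        fp = hashlib.sha1(url.encode()).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

f = SimpleDupeFilter()
url = 'http://worldtimeapi.org/api/timezone/Asia/Tehran'
print([f.request_seen(url) for _ in range(3)])                    # [False, True, True]
print([f.request_seen(url, dont_filter=True) for _ in range(3)])  # [False, False, False]
```

In the spider above, all five yielded requests share the same URL and therefore the same fingerprint, so only the first survives the filter; with `dont_filter=True`, all five reach `parse_json`.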
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | stranac |
