'Web scraping pagination with scrapy

I'm trying web scraping with scrapy. But I got "duplicates" warning. Can't jump next page.

How can I scrape all pages with pagination?

example site: teknosa.com

scraping url: https://www.teknosa.com/bilgisayar-tablet-c-116

pagination structure: ?s=%3Arelevance&page=0 (1,2,3,4,5, and more..)

My pagination code:

    next_page = soup.find('button', {'title': 'Daha Fazla Ürün Gör'})['data-gotohref']
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)


Solution 1:[1]

You can make the pagination in start_urls and increase or decrease range of page numbers.

import scrapy
from scrapy.crawler import CrawlerProcess

class CarsSpider(scrapy.Spider):
    name = 'car'
    start_urls=['https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page='+str(x)+'' for x in range(1,11)]
        
    def parse(self, response):
       print(response.url)

if __name__ == "__main__":
    process =CrawlerProcess()
    process.crawl()
    process.start()

Output:

https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=1
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10

Multiple urls, pagination using for loop

import scrapy

class CarsSpider(scrapy.Spider):
    name = 'car'
    
    def start_requests(self):
        urls=['url_1', 'url_2', 'url_3', ...]
        for url in urls:
            yield scrapy.Request(url=url,callback=self.parse)
           
    def parse(self, response):
        ...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1