Web scraping with pagination doesn't return all results

I am trying to scrape Indeed.com but am having a problem with pagination. Here is my code:

import scrapy


class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analytics&l=New%20York%2C%20NY&vjk=7b2f6385304ffc78']

    def parse(self, response):
        jobs = response.xpath("//td[@id='resultsCol']")
        for job in jobs:
            yield {
                'Job_title': job.xpath(".//td[@class='resultContent']/div/h2/span/text()").get(),
                'Company_name': job.xpath(".//span[@class='companyName']/a/text()").get(),
                'Company_rating': job.xpath(".//span[@class='ratingNumber']/span/text()").get(),
                'Company_location': job.xpath(".//div[@class='companyLocation']/text()").get(),
                'Estimated_salary': job.xpath(".//span[@class='estimated-salary']/span/text()").get()
            }

        next_page = response.urljoin(response.xpath("//a[@aria-label='Next']/@href").get())

        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

The problem is that, according to Indeed, 28,789 jobs match my query, but when I save the scraped items to a CSV file there are only 76 rows. I also tried next_page = response.urljoin(response.xpath("//ul[@class='pagination-list']/li[position() = last()]/a/@href").get()), but the result was similar. So my question is: what am I doing wrong in handling the pagination?
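
One thing worth checking in the pagination code is the order of the join and the None check. The sketch below is not a confirmed fix, only an illustration under the assumption that Scrapy's response.urljoin behaves like the standard library's urllib.parse.urljoin for a missing href. The helper name next_page_url and the "/jobs?q=analytics&start=10" href are hypothetical and used only for illustration.

```python
from typing import Optional
from urllib.parse import urljoin


def next_page_url(base_url: str, href: Optional[str]) -> Optional[str]:
    """Hypothetical helper: absolute next-page URL, or None when no link was found."""
    if not href:
        return None
    return urljoin(base_url, href)


base = "https://www.indeed.com/jobs?q=analytics&l=New%20York%2C%20NY"

# Joining a missing href returns the base URL unchanged. That string is truthy,
# so an `if next_page:` check placed after the join never signals the last page.
print(urljoin(base, None) == base)

# Checking the href before joining gives a usable stop condition.
print(next_page_url(base, None))

# Hypothetical relative href, resolved against the current page URL.
print(next_page_url(base, "/jobs?q=analytics&start=10"))
```

In the spider, the equivalent would be to call .get() on the "Next" link first and only yield scrapy.Request(response.urljoin(href), callback=self.parse) when href is not None. Separately, since an id is normally unique within a page, //td[@id='resultsCol'] probably matches a single container, in which case the loop would yield one item per page rather than one per job. Iterating over whatever element wraps each individual posting inside that container (not shown in the question) would be the next thing to verify.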

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
