Scrapy Next Page Links

The website (https://www.bernama.com/en/crime_courts/) uses the same class name, class="page-link", for all of its pagination links. My aim is to follow the Next button on the right-hand side, but I can't differentiate between the Previous button, the number buttons, and the Next button. My current code tries to take the last element of the array but fails.

start_urls = {
    'https://www.bernama.com/en/crime_courts/'
}

def parse(self, response):
    for news in response.css('div.col-7.col-md-12.col-lg-12.mb-3'):
        yield {
            'title' : news.css('a.text-dark.text-decoration-none::text').get(),
            'link' : news.css('a.text-dark.text-decoration-none::attr(href)').get()
        }

    next_page = response.css('a.page-link::attr(href)').getall()

    if next_page[-1] != "#":
        yield response.follow(next_page, callback = self.parse)


Solution 1:[1]

You simply forgot [-1] in

yield response.follow(next_page[-1])

Here is full working code, but I use shorter CSS selectors.

The next pages use relative URLs, so it needs response.urljoin() to create absolute URLs.
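Scrapy's response.urljoin() delegates to the standard library's urllib.parse.urljoin(), so the resolution rules can be sketched without a running spider. The "?page=2" href below is an assumption about the site's link format, used only for illustration; the "#" case matches the disabled pagination links mentioned in the question.

```python
from urllib.parse import urljoin

base = "https://www.bernama.com/en/crime_courts/"

# A query-only relative href resolves against the current page's path.
print(urljoin(base, "?page=2"))  # https://www.bernama.com/en/crime_courts/?page=2

# A bare "#" (a disabled Previous/Next placeholder) resolves back to the base
# URL itself, which is why following it would just re-request the same page.
print(urljoin(base, "#"))        # https://www.bernama.com/en/crime_courts/
```

This is also why the code checks `next_page[-1] != "#"` before following: a "#" href points nowhere new.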

import scrapy

class MySpider(scrapy.Spider):
    
    name = 'my_spyder'
    
    start_urls = ['https://www.bernama.com/en/crime_courts/']
    
    def parse(self, response):
        print("url:", response.url)
        
        for news in response.css('h6 a'):
            yield {
                'title': news.css('::text').get(),
                'link' : response.urljoin(news.css('::attr(href)').get()),
                #'link' : response.urljoin(news.attrib['href']),
                'page' : response.url,
            }
    
        next_page = response.css('a.page-link::attr(href)').getall()
    
        if next_page[-1] != "#":
            yield response.follow(next_page[-1])
    
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()

BTW:

Both versions of the CSS selectors also pick up links to videos on YouTube, but those are the same on every page, so the CSV ends up containing the same links many times.
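One way to avoid those repeated rows, besides scoping the selector as shown next, is to track already-seen links across pages. A minimal sketch of the idea follows; the helper name and the example URLs are my own placeholders, not part of the original answer (in a real spider the set would live as an instance attribute, e.g. `self.seen`).

```python
def drop_duplicates(links, seen=None):
    """Return links in original order, skipping any seen before."""
    seen = set() if seen is None else seen
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique

# Placeholder links: the YouTube URL repeats on every page.
links = [
    "https://www.bernama.com/en/news.php?id=1",
    "https://www.youtube.com/watch?v=example",
    "https://www.bernama.com/en/news.php?id=2",
    "https://www.youtube.com/watch?v=example",
]
print(drop_duplicates(links))  # the repeated YouTube link appears once
```

Scrapy's built-in dupefilter only de-duplicates requests, not yielded items, so item-level filtering like this (or a scoped selector) is still needed.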

If you need only the news without the videos, then you may need to first get the section that contains the row with the text More news, and then search only inside that section.

sections = response.xpath('//div[@class="row"]/div[div[@class="row"]//span[contains(text(), "More news")]]')
#print(sections)

for news in sections[0].css('h6 a'):
    yield {
        'title': news.css('::text').get(),
        'link' : response.urljoin(news.css('::attr(href)').get()),
        'page' : response.url,
    }

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1