Scrapy Next Page Links
The website (https://www.bernama.com/en/crime_courts/) uses the same class, class="page-link", for all of its pagination links. My aim is to get the Next button on the right-hand side, but I can't differentiate between the Previous button, the numbered page buttons, and the Next button. My current code tries to get the last element of the array but fails.
start_urls = {
    'https://www.bernama.com/en/crime_courts/'
}

def parse(self, response):
    for news in response.css('div.col-7.col-md-12.col-lg-12.mb-3'):
        yield {
            'title': news.css('a.text-dark.text-decoration-none::text').get(),
            'link': news.css('a.text-dark.text-decoration-none::attr(href)').get()
        }

    next_page = response.css('a.page-link::attr(href)').getall()
    if next_page[-1] != "#":
        yield response.follow(next_page, callback=self.parse)
Solution 1
You simply forgot the [-1] index in:

yield response.follow(next_page[-1])
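The underlying problem is plain Python list indexing: getall() returns a list of href strings, and the whole list cannot be passed to response.follow(). A minimal sketch with hypothetical href values mirroring the pagination bar (Previous, numbered pages, then Next):

```python
# Hypothetical hrefs, in the order getall() would return them:
# Previous ("#" when disabled), numbered page links, then Next.
next_page = ["#", "?page=1", "?page=2", "?page=3", "?page=2"]

# Indexing with [-1] selects the last entry, which is the Next link.
last_href = next_page[-1]
print(last_href)  # -> ?page=2
```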
Below is full working code, though I use shorter CSS selectors. The next-page links are relative URLs, so response.urljoin() is needed to create absolute URLs.
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spyder'
    start_urls = ['https://www.bernama.com/en/crime_courts/']

    def parse(self, response):
        print("url:", response.url)

        for news in response.css('h6 a'):
            yield {
                'title': news.css('::text').get(),
                'link': response.urljoin(news.css('::attr(href)').get()),
                #'link': response.urljoin(news.attrib['href']),
                'page': response.url,
            }

        next_page = response.css('a.page-link::attr(href)').getall()
        if next_page[-1] != "#":
            yield response.follow(next_page[-1])

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in Scrapy 2.1
})
c.crawl(MySpider)
c.start()
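As a side note, response.urljoin() resolves a relative href against the current page URL, much like the standard library's urllib.parse.urljoin(). A minimal sketch with a hypothetical relative link (the path is made up for illustration):

```python
from urllib.parse import urljoin

page_url = "https://www.bernama.com/en/crime_courts/"
relative_href = "news.php?id=12345"  # hypothetical relative link

# Resolve the relative href against the page URL to get an absolute URL.
absolute = urljoin(page_url, relative_href)
print(absolute)  # -> https://www.bernama.com/en/crime_courts/news.php?id=12345
```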
BTW:
Both versions of the CSS selectors also pick up links to YouTube videos. These links are the same on every page, so the CSV ends up containing the same links many times.
If you need only the news without the videos, you may need to first get the section that contains the row with the text "More news", and then search only within that section.
sections = response.xpath('//div[@class="row"]/div[div[@class="row"]//span[contains(text(), "More news")]]')
#print(sections)

for news in sections[0].css('h6 a'):
    yield {
        'title': news.css('::text').get(),
        'link': response.urljoin(news.css('::attr(href)').get()),
        'page': response.url,
    }
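An alternative to positional indexing is matching the pagination link by its text rather than its place in the list. The idea is sketched here in plain Python over hypothetical (text, href) pairs; in Scrapy itself this would be an XPath along the lines of contains(text(), "Next"), which I have not verified against the live page:

```python
# Hypothetical (text, href) pairs for the pagination bar.
page_links = [
    ("Previous", "#"),
    ("1", "?page=1"),
    ("2", "?page=2"),
    ("Next", "?page=2"),
]

# Pick the href whose link text is "Next", independent of its position.
next_href = next((href for text, href in page_links if text == "Next"), None)
print(next_href)  # -> ?page=2
```

This is more robust if the site ever drops the Previous link on the first page, which would change the list positions.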
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
