How to use Scrapy to do pagination and visit all links found on each page
I have the following spider, and I am trying to combine pagination with Rules for visiting the links on each page.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and an underscore)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse', follow=True),
    )

    def parse(self, response):
        # just get all the text
        all_text = response.xpath("//text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url
        }

        # visit next page
        # next_page_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        # if next_page_url is not None:
        #     yield scrapy.Request(response.urljoin(next_page_url))
I would like to implement the following behavior:
Start with page 1 https://ausschreibungen-deutschland.de/1/, visit all 10 links and get the text. (already implemented)
Go to page 2 https://ausschreibungen-deutschland.de/2/, visit all 10 links and get the text.
Go to page 3 https://ausschreibungen-deutschland.de/3/, visit all 10 links and get the text.
Go to page 4 ...
How would I combine these two concepts?
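For reference, here is a minimal sketch of one common way to combine the two in a single CrawlSpider: a second Rule that only follows the numbered pagination pages, plus a renamed callback, because CrawlSpider uses parse internally and overriding it breaks rule processing. The /[0-9]+/$ pattern for the pagination pages is an assumption about the site's URL layout and may need adjusting.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagingSpider(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    rules = (
        # Assumed pagination pattern: overview pages such as /1/, /2/, /3/, ...
        # No callback here, so these pages are only followed, not scraped.
        Rule(LinkExtractor(allow=r"/[0-9]+/$"), follow=True),
        # Detail links (a number followed by an underscore) are scraped.
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Same extraction as in the question: collect all text on the page.
        all_text = response.xpath("//text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url
        }

With both rules in place, the spider discovers and follows every pagination page on its own, so no manual next-page request is needed; the detail pages are handed to parse_item as they are found.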
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.