Increment pagination value Scraping+Selenium
I'm trying to scrape a dynamically generated website using Selenium + Scrapy.
I have scraped the items from the first page successfully, but when I try to get to the next page, it seems that the browser Selenium opens always requests the same page.
What I'm trying to do:
1. Execute the parse function to extract the data from the first page.
2. Once the extraction is finished, find the next button and get its href attribute.
3. Call the same function again, passing it the new URL.
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class IndeedSpiderSpider(scrapy.Spider):
    name = "indeed"
    allowed_domains = ["es.indeed.com"]

    def start_requests(self):
        url = "https://es.indeed.com/jobs?q&l=Barcelona"
        yield scrapy.Request(url=url, callback=self.parse_jobs)

    def parse_jobs(self, response):
        driver = webdriver.Firefox()
        driver.get("https://es.indeed.com/jobs?q&l=Barcelona")
        driver.implicitly_wait(10)
        offersnames = driver.find_elements(By.XPATH, "//td/div/h2/span")
        for i in range(len(offersnames)):
            yield {
                "name": offersnames[i].text
            }
        next_page_element = driver.find_element(By.CSS_SELECTOR, "ul.pagination-list > li:last-child > a")
        next_page_url = next_page_element.get_attribute("href")
        if next_page_url:
            next_page = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page, callback=self.parse_jobs)
        driver.quit()
Solution 1:[1]
Instead of scrapy.Request, you should use a click event. The general structure is: wait for the page to load fully, wait for the button to appear on the page, then click the button and read the updated page. Something like:
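import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.example.com")
# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...) instead
btn = driver.find_element(By.ID, 'input-search')
btn.click()
time.sleep(2)  # crude fixed pause while the page updates
print(driver.page_source.encode('utf-8'))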
I haven't run this code, but it should be something along these lines.
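The wait-then-click structure described in this answer can also be sketched with explicit waits instead of a fixed sleep, keeping a single browser session open across all pages. This is only a sketch under assumptions, not a tested solution for the site: the function names paginate and scrape_offers, the timeout and max_pages values, and the idea that the URL changes after each click are all assumptions for illustration. The two selectors are copied from the question and may no longer match the site's markup.

```python
def paginate(read_page, go_next, max_pages=20):
    """Collect items page by page: read the current page, then advance.

    read_page() returns the items on the current page; go_next() moves the
    browser to the next page and returns False when there is none.
    """
    items = []
    for _ in range(max_pages):  # hard cap so a broken "next" link cannot loop forever
        items.extend(read_page())
        if not go_next():
            break
    return items


def scrape_offers(start_url, timeout=10):
    """Drive one Firefox session through the result pages (needs Selenium 4 and geckodriver)."""
    # Imported here so paginate() can be used and tested without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Selectors copied from the question; the site's markup may have changed.
    offer_xpath = "//td/div/h2/span"
    next_css = "ul.pagination-list > li:last-child > a"

    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, timeout)

    def read_page():
        # Wait until at least one offer title is present, then read all of them.
        wait.until(EC.presence_of_element_located((By.XPATH, offer_xpath)))
        return [el.text for el in driver.find_elements(By.XPATH, offer_xpath)]

    def go_next():
        try:
            button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_css)))
        except TimeoutException:
            return False  # no clickable "next" link: assume this was the last page
        old_url = driver.current_url
        button.click()
        try:
            wait.until(EC.url_changes(old_url))  # assumption: each page has its own URL
        except TimeoutException:
            return False
        return True

    try:
        driver.get(start_url)
        return paginate(read_page, go_next)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Inside the spider, the callback could then loop over scrape_offers(response.url) and yield {"name": name} for each result, rather than opening a new browser for every scrapy.Request. The loop is kept separate from the browser calls so that the stopping logic can be checked with plain functions.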
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
