Scrapy returns the text of an element in shell but not in the code
I'm new to web scraping and have been learning Scrapy and Selenium for the last couple of days. I'm trying to extract some info from this source: https://www.kodda.co.kr/kr/information/member.php. Specifically, I need the company name, CEO name, and email. So far I have been able to write code that clicks buttons to navigate to the various pages. Now I want to extract the info from the table on a given page.
For example, given this table:
[Screenshot of the table's HTML code]
I'd like to extract the text inside the first <td> tag. When I run this code in the Scrapy shell: response.xpath('//table[@class="sub-table"]/tbody/tr[1]/td[1]/text()').get() it returns what's inside the first <td> (which is what I want). But when I put this exact code in a .py file and run it, it returns empty (""):
import scrapy

class CompanyInfoSpider(scrapy.Spider):
    name = 'company_info'
    allowed_domains = ['https://www.kodda.co.kr/kr/information/member.php']
    start_urls = ['http://https://www.kodda.co.kr/kr/information/member.php/']

    def parse(self, response):
        print(response.xpath('//table[@class="sub-table"]/tbody/tr[1]/td[1]/text()').get())
I tried the same thing using Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(desired_capabilities=show_browser(False))
driver.get("https://www.kodda.co.kr/kr/information/member.php")
driver.implicitly_wait(10)
column_element = driver.find_element(By.XPATH, '//table[@class="sub-table"]/tbody/tr[1]/td[1]')
column_text = column_element.text
time.sleep(10)
print(column_text)
But this also returns empty (""). I've been googling for hours but couldn't find any possible reason.
Note: I've also tried explicit wait:
ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
wait = WebDriverWait(driver, 50, ignored_exceptions=ignored_exceptions)
wait.until(lambda wd: column_text != "")
I also tried wait.until(expected_conditions.visibility_of_all_elements_located((By.CLASS_NAME, "sub-table"))), but both attempts still returned empty ("").
Solution 1:[1]
Solved the issue! I tracked it down to the text property. For some reason that I didn't care to search for, text returns nothing when I use it to extract the text of this element. Instead, get_attribute("innerText") worked!
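A minimal sketch of that workaround, assuming the cause is that Selenium's text property only reports text the browser actually renders, so a hidden cell comes back as "". The helper name element_text and the main() wrapper are hypothetical and not part of the original answer; the URL and XPath are the ones from the question.

```python
def element_text(element):
    """Return the element's text, falling back to the innerText DOM property."""
    text = element.text  # empty when the element is not rendered/visible
    if text:
        return text
    # Ask the browser for the DOM property instead of the rendered text.
    inner = element.get_attribute("innerText")
    return (inner or "").strip()


def main():
    # Imported here so the helper above can be reused without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.kodda.co.kr/kr/information/member.php")
        cell = driver.find_element(
            By.XPATH, '//table[@class="sub-table"]/tbody/tr[1]/td[1]'
        )
        print(element_text(cell))
    finally:
        driver.quit()
```

Calling main() runs it against the live page. The fallback only kicks in when text is empty, so visible elements behave exactly as before.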
Solution 2:[2]
You are using the wrong locator.
This locator matches 10 elements on the page, but these elements are not visible, at least not the first one. Since you are using the driver.find_element method, it returns the first match of the passed locator on the page.
Also, you should use Expected Conditions explicit waits rather than implicitly_wait, since the latter waits only for the element to exist; it does not wait for the element to be completely rendered. With implicitly_wait you get column_element at a stage when it is still not fully rendered and not yet populated with the text content it will finally have.
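As a rough illustration of that advice, here is a sketch of an explicit wait that re-reads the text on every poll, unlike the question's lambda, which compares a column_text value captured once before the wait started. The condition factory first_visible_text and the main() wrapper are hypothetical names; the 20 second timeout is an arbitrary assumption. Because the locator can match several elements, the condition scans all matches and returns the first one that has visible text.

```python
def first_visible_text(locator):
    """Build a wait condition that returns the first non-empty visible text.

    `locator` is a (by, value) tuple such as (By.XPATH, "//td").
    """
    def condition(driver):
        # Re-locate on every poll so freshly rendered content is picked up.
        for element in driver.find_elements(*locator):
            text = element.text.strip()
            if text:
                return text
        return False  # a falsy result makes WebDriverWait poll again

    return condition


def main():
    # Imported here so the condition above can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.kodda.co.kr/kr/information/member.php")
        locator = (By.XPATH, '//table[@class="sub-table"]/tbody/tr[1]/td[1]')
        # Raises TimeoutException if no match shows text within 20 seconds.
        print(WebDriverWait(driver, 20).until(first_visible_text(locator)))
    finally:
        driver.quit()
```

If every match stays hidden, the wait times out instead of silently returning "", which makes the locator problem easier to spot.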
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ouflak |
| Solution 2 |
