Scrape Job description Indeed Selenium [closed]
A similar question exists, but I couldn't find the exact answer, so could you please help me?
I copied the following code from the internet to scrape job offers from Indeed. The problem is that the code cannot scrape the full job descriptions.
The question is: how do I open the Indeed pages that contain the full description, and how do I then retrieve that description?
Do you have any idea how to solve this?
for i in range(0, 50, 10):
    driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start=' + str(i))
    jobs = []
    driver.implicitly_wait(20)
    for job in driver.find_elements_by_class_name('result'):
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')
        try:
            title = soup.find(class_="jobTitle").text
        except:
            title = 'None'
        try:
            location = soup.find(class_="companyLocation").text
        except:
            location = 'None'
        try:
            company = soup.find(class_="companyName").text.replace("\n", "").strip()
        except:
            company = 'None'
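The field-extraction pattern above (one try/except per field, falling back to 'None') can be exercised offline against a static snippet. The class names are taken from the code above; the surrounding HTML structure is an assumption for illustration only:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one Indeed result card (markup assumed for illustration;
# the real page structure changes often).
card_html = """
<div class="result">
  <h2 class="jobTitle">ML Engineer</h2>
  <span class="companyName">Acme AI</span>
  <div class="companyLocation">Bengaluru, Karnataka</div>
</div>
"""

soup = BeautifulSoup(card_html, "html.parser")

def text_or_none(node):
    # Mirrors the try/except fallback: a missing field becomes 'None'.
    return node.text.strip() if node else "None"

title = text_or_none(soup.find(class_="jobTitle"))
company = text_or_none(soup.find(class_="companyName"))
location = text_or_none(soup.find(class_="companyLocation"))
print(title, "|", company, "|", location)
```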
The problem comes from the following part:
        sum_div = job.find_elements_by_class_name('summary')
        #sum_div = job.find_element_by_class_name('job_seen_beacon')
        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name('popover-x-button-close')
            close_button.click()
            sum_div.click()
        driver.implicitly_wait(2)
        try:
            job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
            print(job_desc)
        except:
            job_desc = 'None'
        df = df.append({'Title': title, 'Location': location, "Company": company,
                        "Description": job_desc}, ignore_index=True)
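As an aside, the pagination in the outer loop relies on Indeed's `start` query parameter, which advances in steps of 10 results per page. The URLs the loop generates can be checked without a browser:

```python
# Reproduces the URL construction from the loop above: `start` steps by 10,
# so range(0, 50, 10) covers the first five result pages.
base = "https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start="
urls = [base + str(i) for i in range(0, 50, 10)]
print(urls[0])
print(urls[-1])
```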
Solution 1 [1]
Final answer to my question: since the Indeed page is dynamic, it is necessary to first build a list of the URL of each job offer, then open those URLs one by one and extract the job description from each page.
for i in range(0, 2, 1):
    driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start=' + str(i))
    jobs = []
    driver.implicitly_wait(20)
    for job in driver.find_elements_by_class_name('result'):
        # We parse each result card with BeautifulSoup; we could do it with
        # Selenium alone, but that is much slower. Selenium is kept for the
        # dynamic clicks it can handle.
        soup = BeautifulSoup(job.get_attribute('innerHTML'), 'html.parser')
        try:
            title = soup.find(class_="jobTitle").text
        except:
            title = 'None'
        try:
            location = soup.find(class_="companyLocation").text
        except:
            location = 'None'
        try:
            company = soup.find(class_="companyName").text.replace("\n", "").strip()
        except:
            company = 'None'
        #sum_div = job.find_elements_by_class_name('summary')
        sum_div = job.find_element_by_class_name('job_seen_beacon')
        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name('popover-x-button-close')[0]
            close_button.click()
            sum_div.click()
        driver.implicitly_wait(2)
The next step identifies each job's URL, which is contained in the card's first <a> tag, in its "href" attribute.
        liens = job.find_element_by_tag_name("a")
        links = liens.get_attribute("href")
        df = df.append({'Title': title, 'Location': location, "Company": company, "Links": links}, ignore_index=True)
Links_list = df['Links'].tolist()
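The href extraction above can also be done directly in BeautifulSoup once the page HTML is in hand. This is a minimal sketch on a static snippet; the markup is an assumption for illustration, since real Indeed markup changes often:

```python
from bs4 import BeautifulSoup

# Stand-in for a results page: each card's first <a> carries the job URL
# (markup assumed for illustration).
page_html = """
<div class="result"><a href="https://www.indeed.co.in/viewjob?jk=111">Job A</a></div>
<div class="result"><a href="https://www.indeed.co.in/viewjob?jk=222">Job B</a></div>
"""
soup = BeautifulSoup(page_html, "html.parser")

# One URL per card: find each result card, then read the href of its first <a>.
links = [card.find("a")["href"] for card in soup.find_all(class_="result")]
print(links)
```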
We now open each link from the list and get the job description:
descriptions = []
for i in Links_list:
    print(i)
    driver.get(i)
    jd = driver.find_element_by_xpath('//div[@id="jobDescriptionText"]').text
    descriptions.append(jd)
df['Descriptions'] = descriptions
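One caveat about the `df.append` calls used throughout: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. A common replacement is to collect rows in a plain list and build the frame once at the end; a minimal sketch with made-up data:

```python
import pandas as pd

# Accumulate one dict per scraped job, then build the DataFrame in one call
# instead of appending row by row (faster, and works on pandas >= 2.0).
rows = []
for title, company in [("ML Engineer", "Acme AI"), ("Data Scientist", "Beta Labs")]:
    rows.append({"Title": title, "Company": company})
df = pd.DataFrame(rows)
print(df.shape)
```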
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
