Scrape job descriptions from Indeed with Selenium [closed]

A similar question exists, but I couldn't find the exact answer in it, so could you please help me?

I copied the following code from the internet to scrape job offers from Indeed. The problem is that it cannot scrape the full job descriptions.

The question is: how do I open the Indeed pages that contain the full description, and then retrieve that description?

Do you have any idea how to solve this?
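As an aside, the paginated search URLs that the code below builds by string concatenation can also be assembled with `urllib.parse.urlencode`, which handles the escaping for you. The helper name and page counts here are just a sketch of mine, not part of the original code; note that `urlencode` emits `+` instead of `%20`, which Indeed accepts equally:

```python
from urllib.parse import urlencode

def search_urls(query, location, pages=5, per_page=10):
    """Build paginated Indeed search URLs ('start' advances by 10 per page)."""
    base = 'https://www.indeed.co.in/jobs'
    return [base + '?' + urlencode({'q': query, 'l': location, 'start': i})
            for i in range(0, pages * per_page, per_page)]

urls = search_urls('artificial intelligence', 'India')
```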

    for i in range(0, 50, 10):
        driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start=' + str(i))
        jobs = []
        driver.implicitly_wait(20)

        for job in driver.find_elements_by_class_name('result'):
            result_html = job.get_attribute('innerHTML')
            soup = BeautifulSoup(result_html, 'html.parser')

            try:
                title = soup.find(class_="jobTitle").text
            except:
                title = 'None'

            try:
                location = soup.find(class_="companyLocation").text
            except:
                location = 'None'

            try:
                company = soup.find(class_="companyName").text.replace("\n", "").strip()
            except:
                company = 'None'

The problem comes from the following part:

            sum_div = job.find_element_by_class_name('summary')
            #sum_div = job.find_element_by_class_name('job_seen_beacon')

            try:
                sum_div.click()
            except:
                close_button = driver.find_elements_by_class_name('popover-x-button-close')[0]
                close_button.click()
                sum_div.click()

            driver.implicitly_wait(2)

            try:
                job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
                print(job_desc)
            except:
                job_desc = 'None'

            df = df.append({'Title': title, 'Location': location, "Company": company,
                            "Description": job_desc}, ignore_index=True)
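As a side note, the BeautifulSoup extraction logic above can be exercised offline against a saved result card, which makes it easy to verify the class names without launching a browser. The HTML below is a hand-written stand-in for a real Indeed card, not actual site markup:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one result card (class names match the code above).
card_html = '''
<div class="result">
  <h2 class="jobTitle">AI Engineer</h2>
  <span class="companyName">ExampleCorp</span>
  <div class="companyLocation">Mumbai, Maharashtra</div>
</div>
'''

soup = BeautifulSoup(card_html, 'html.parser')
title = soup.find(class_="jobTitle").text
company = soup.find(class_="companyName").text.replace("\n", "").strip()
location = soup.find(class_="companyLocation").text
```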



Solution 1:

Final answer to my question: as the Indeed page is dynamic, it is necessary to first build a list with the URL of each job offer, then open these URLs one by one and get the job description from each page.


for i in range(0,2,1):
    driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start='+str(i))
    jobs = []
    driver.implicitly_wait(20)
     

    for job in driver.find_elements_by_class_name('result'):

        soup = BeautifulSoup(job.get_attribute('innerHTML'), 'html.parser')
        # We parse each card with BeautifulSoup: extracting the text fields
        # through Selenium alone is much slower. Selenium is still used below,
        # because it can handle the dynamic clicks.
        
        try:
            title = soup.find(class_="jobTitle").text

        except:
            title = 'None'

        try:
            location = soup.find(class_="companyLocation").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="companyName").text.replace("\n","").strip()
        except:
            company = 'None'


        #sum_div = job.find_element_by_class_name('summary')
        sum_div = job.find_element_by_class_name('job_seen_beacon')

        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name('popover-x-button-close')[0]
            close_button.click()
            sum_div.click()

        driver.implicitly_wait(2)


It continues by extracting each job's URL, which is contained in the card's `<a>` tag in its "href" attribute.

        liens = job.find_element_by_tag_name("a")
        links = liens.get_attribute("href")

        df = df.append({'Title': title, 'Location': location, "Company": company,
                        "Links": links}, ignore_index=True)


Links_list = df['Links'].tolist()


We now open each link from the list and get the job description:


descriptions=[]
for i in Links_list:
    print(i)
    
    driver.get(i)
    jd = driver.find_element_by_xpath('//div[@id="jobDescriptionText"]').text
    descriptions.append(jd)    


df['Descriptions'] = descriptions
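One caveat for readers on current library versions: `DataFrame.append` was deprecated and then removed in pandas 2.0, so the row-by-row appends above fail there. The usual replacement is to collect the rows in a plain list and build the frame once at the end. A minimal sketch with made-up example rows, reusing the column names from above:

```python
import pandas as pd

rows = []
# Inside the scraping loop, collect one dict per job card:
rows.append({'Title': 'ML Engineer', 'Location': 'Bengaluru',
             'Company': 'ExampleCorp', 'Links': 'https://example.com/job/1'})
rows.append({'Title': 'Data Scientist', 'Location': 'Pune',
             'Company': 'ExampleCorp', 'Links': 'https://example.com/job/2'})

# After the loop, build the DataFrame in one go:
df = pd.DataFrame(rows)
```

Building the frame once is also noticeably faster, since each `append` call copied the entire frame.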

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
