Indeed Jobs BeautifulSoup Python Webscraping script returning duplicates and missing jobs data

I'm new to Python and Stack Overflow, so please bear with me! I'm working on a BeautifulSoup web-scraping script to scrape jobs from Indeed; my code is below. The script runs, but I'm having two issues with the output.

First, many jobs that I see on the webpages don't appear in my data output. Second, and more importantly, there are many duplicates of the same few jobs. The total job count is correct given how many pages get crawled (15 jobs x 5 pages in the example below), but because of the duplicates the unique job count is far lower (~15-18% of expected).

I tested by printing the output of individual pages (each item in divs) and found that, no matter the page number, the same jobs' HTML gets extracted on every iteration of the loop, which causes the duplicates. Yet opening the URL in a browser with different page numbers gives me different job listings.

I have no clue why this is happening, whether there is an issue with the script syntax or I'm reading in the HTML wrong. Any help would really be appreciated! Happy to provide more information if needed. Thanks!

import requests
import pandas as pd
from bs4 import BeautifulSoup

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
    url = 'https://www.indeed.com/jobs?q=CDL%20Truck%20driver&l=Portland%2C%20OR&radius=50&jt=fulltime&taxo2=D8UDE&taxo3=MN4VU&start={page}'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

joblist = []

def transform(soup):
    # Each job card on the results page is a div with class 'job_seen_beacon'.
    divs = soup.find_all('div', 'job_seen_beacon')

    for item in divs:
        title = item.find('h2', 'jobTitle')
        if title is not None:
            title = title.text.strip()
        else:
            title = None

        company = item.find('span', 'companyName')
        if company is not None:
            company = company.text.strip()
        else:
            company = None

        location = item.find('div', 'companyLocation')
        if location is not None:
            location = location.text.strip()
        else:
            location = None

        # find() returns None when no salary is listed, so .text raises AttributeError.
        try:
            salary = item.find('div', 'metadata salary-snippet-container').text.strip()
        except AttributeError:
            salary = ''

        job_summary = item.find('div', 'job-snippet')
        if job_summary is not None:
            job_summary = job_summary.text.strip().replace('\n', '')
        else:
            job_summary = None

        job = {
            'title': title,
            'company': company,
            'location': location,
            'salary': salary,
            'job_summary': job_summary
        }

        joblist.append(job)

    return


for i in range(0,41,10):
    c = extract(i)
    transform(c)

df = pd.DataFrame(joblist)
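
A likely cause of the repeated results, judging only from the code as posted: the URL string in extract has no f prefix, so the text {page} is sent to the server literally instead of being replaced by the loop value, and every request then asks for the same page. Separately, requests.get(url, headers) passes the dictionary as the second positional argument, which is params, so the User-Agent ends up in the query string instead of the request headers; the keyword form is requests.get(url, headers=headers). The sketch below uses only the standard library to show the difference between the literal and the formatted URL. The helper name build_url and the BASE_URL constant are hypothetical, introduced here for illustration, and the query values are copied from the question.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # hypothetical constant for this sketch


def build_url(start):
    """Return the search URL for result offset `start` (0, 10, 20, ...)."""
    params = {
        "q": "CDL Truck driver",
        "l": "Portland, OR",
        "radius": 50,
        "jt": "fulltime",
        "taxo2": "D8UDE",
        "taxo3": "MN4VU",
        "start": start,
    }
    # urlencode escapes spaces and commas, so no manual %20 / %2C is needed.
    return f"{BASE_URL}?{urlencode(params)}"


page = 20

# Plain string: the placeholder is never filled in, whatever `page` is.
literal = 'https://www.indeed.com/jobs?q=CDL%20Truck%20driver&start={page}'

# f-string: the current value of `page` is substituted into the URL.
formatted = f'https://www.indeed.com/jobs?q=CDL%20Truck%20driver&start={page}'

print(literal)
print(formatted)
print(build_url(page))
```

With a URL built this way, each pass of the loop should request a different offset. Whether that also recovers the missing jobs depends on what the site returns to a scripted request, which this sketch does not check.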

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
