BeautifulSoup request is returning an empty list from LinkedIn.com/jobs

I'm new to BeautifulSoup and web scraping, so please bear with me.

I'm using BeautifulSoup to pull all job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct 'li' element and its class. The code runs without errors, but it returns an empty list []. I don't want to use any APIs because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_="jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)


Solution 1:[1]

As @baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.

Whenever you open a page in your browser, the browser renders the visuals, makes extra network calls, and runs JavaScript. The first thing it does is load the initial response, which is all you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').

The page you see in your browser is the result of many, many more requests, which you can see for yourself in the network tab of your browser's devtools.

The reason your list is empty is that the HTML you get back is very minimal. You can print it out to the console and compare it to what the browser shows.
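For example, a quick way to see how little comes back (reusing the code from the question):

html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text

# The raw response is a fraction of the fully rendered page, and the
# job cards you inspected in devtools simply aren't in it yet.
print(len(html_text))
print(html_text[:500])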

To make things easier, instead of using requests you can use Selenium, which is essentially a library for programmatically controlling a browser. Selenium will make all those requests for you like a normal browser and let you access the page source as you were expecting it to look.

This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed it up, like running in headless mode (i.e., not rendering the page graphically), but it won't be as fast as figuring out how to do it on your own with requests.
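Here's a minimal sketch of that approach, assuming Selenium 4 with Chrome (Selenium 4.6+ downloads the driver for you). The class name is copied straight from the question and may have changed since, and LinkedIn may still gate results behind a login, so treat this as a starting point rather than a guaranteed fix:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless=new')  # skip rendering the page graphically

driver = webdriver.Chrome(options=options)
driver.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer')

# driver.page_source is the DOM after the browser has run the JavaScript,
# so BeautifulSoup now sees roughly what you saw in inspect element.
soup = BeautifulSoup(driver.page_source, 'lxml')
jobs = soup.find_all('li', class_="jobs-search-results__list-item occludable-update p0 relative ember-view")
print(len(jobs))

driver.quit()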

If you want to do it using requests, you're going to need to do a lot of snooping through the network traffic, maybe with a tool like Postman, to see how to simulate the necessary steps to get the data from whatever page.

For example, some websites have a handshake process when logging in. A website I've worked on goes like this (a rough requests sketch follows the list):

  1. (Step 0, really) Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included
  2. Fetch the initial HTML and get a unique key from a hidden element in a <form>
  3. Using this key, make a POST request to the URL from that form
  4. Get a session id key from the response
  5. Set up another POST request that combines the username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools
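Here's what that kind of flow can look like with requests. Every URL, field name, and key below is a placeholder for whatever your target site actually uses, so treat this purely as a shape to follow:

import requests
from bs4 import BeautifulSoup

BASE = 'https://example.com'  # placeholder site

session = requests.Session()
# Step 0: the site ignores requests without a browser-like User-Agent.
session.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'

# Steps 1-2: fetch the initial HTML and pull the unique key out of the <form>.
login_page = session.get(BASE + '/login')
soup = BeautifulSoup(login_page.text, 'lxml')
form = soup.find('form')
key = form.find('input', {'type': 'hidden'})['value']

# Step 3: POST that key to the URL from the form.
resp = session.post(BASE + form['action'], data={'key': key})

# Step 4: get the session id from the response (the shape depends on the site).
session_id = resp.json()['session_id']

# Step 5: the final POST combining username, password, and session id.
# The URL here is the kind you dig out of the JavaScript via the devtools network inspector.
resp = session.post(BASE + '/auth', data={
    'username': 'me',
    'password': 'secret',
    'sessionid': session_id,
})
print(resp.status_code)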

So really, I stick strictly with Selenium if the site is too complicated and I'm only getting the data once or not very often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.

Hope some of this made sense to you. Happy scraping!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Diego Cuadros