How to scrape Chegg Textbook Solution pages using Python?
Long story short, I was revisiting exercises in an old VBA textbook for practice (specifically VBA for Modelers, 5th Edition, by S. Christian Albright).
I wanted to check my answers against the solutions, which led me to Chegg, and I thought I could try to scrape the code blocks from the solution pages (example linked below).
Sample Chegg Textbook Solution Page - code block and HTML in red rectangles
I've been trying to get more acquainted with Python and thought this would be a good project to learn more about web scraping.
Below is the code I began with, once I realized it would not be as simple as scraping the HTML from each solution page. As a first step I just wanted to find all div elements on the page itself, before going further, looping through each exercise page, and scraping the code blocks.
#!/usr/bin/python3
# scrapeChegg.py - Scrapes all answer code blocks from each problem exercise
# in each chapter for a textbook (VBA for Modelers - 5th Edition)

import bs4, os, requests

# Starting URL point
url = 'https://www.chegg.com/homework-help/open-new-workbook-get-vbe-insert-module-enter-following-code-chapter-5-problem-1e-solution-9781285869612-exc'

# Retrieve sol'n HTML
head = {'User Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0'}
res = requests.get(url, headers=head)
try:
    res.raise_for_status()  # raise if the request failed
    cheggSoup = bs4.BeautifulSoup(res.text, 'html.parser')
    print(cheggSoup.find_all('div'))
except Exception as exc:
    print('Issue occurred: %s' % (exc))
Within one of the div results, the output was as follows:
<p>
Access to this page has been denied because we believe you are using automation tools to browse the
website.
</p>
<p>
This may happen as a result of the following:
</p>
<ul>
<li>
Javascript is disabled or blocked by an extension (ad blockers for example)
</li>
<li>
Your browser does not support cookies
</li>
</ul>
<p>
Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking
them from loading.
</p>
<p>
Reference ID: #5ca2ea20-0052-11ec-8c04-7749576e4445
</p>
</div>
So based on the above, I can see that the page is blocking me as an automation tool. I've looked at similar questions about scraping Chegg, and a lot of the solutions are beyond my current knowledge (e.g. several had additional key/value pairs in the head dict that I was not sure how to interpret).
Essentially my question is: how can I gain more knowledge (or what resources should I look into, e.g. HTTP, scraping with Python) to make this project work, if it is possible at all? If anyone has made something like this work before, I would appreciate any advice on what to study or how to make this specific project successful. Thanks!
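As an aside, a block page like the one above can be detected in code, so a script can fail loudly instead of silently parsing the error HTML. A minimal sketch (the marker strings are copied from the denied-access output above; the function name is just illustrative):

```python
# Heuristic check for Chegg's bot-block interstitial, based on the
# phrases visible in the denied-access page shown above.
def looks_blocked(html: str) -> bool:
    markers = (
        "Access to this page has been denied",
        "you are using automation tools",
    )
    return any(marker in html for marker in markers)

denied = ("<p>Access to this page has been denied because we believe "
          "you are using automation tools to browse the website.</p>")
print(looks_blocked(denied))                       # True
print(looks_blocked("<h1>VBA for Modelers</h1>"))  # False
```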
Solution 1:[1]
Try adding the missing - (hyphen) in the User-Agent HTTP header name:
import requests
from bs4 import BeautifulSoup
url = "https://www.chegg.com/homework-help/open-new-workbook-get-vbe-insert-module-enter-following-code-chapter-5-problem-1e-solution-9781285869612-exc"
# Retrieve sol'n HTML
head = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0",
}
res = requests.get(url, headers=head)
soup = BeautifulSoup(res.content, "html.parser")
print(soup.h1.text)
Prints:
VBA for Modelers (5th Edition) Edit editionThis problem has been solved:Solutions for Chapter 5Problem 1E: Open a new workbook, get into the VBE, insert a module, and enter the following code:Sub Variables()?Dim nPounds As Integer, dayOfWeek As Integer?nPounds = 17.5?dayOfWeek = “Monday”?MsgBox nPounds & “ pounds were ordered on ” & dayOfWeekEnd SubThere are two problems here. One causes the program to fail, and the other causes an incorrect result. Explain what they are and then fix them.…
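The hyphen matters more than it looks: because the question's code spells the key 'User Agent' (no hyphen), requests treats it as an unrelated custom header and still merges in its own default 'User-Agent: python-requests/x.y', which bot detection flags immediately. A minimal sketch showing what is actually sent, using requests' Session.prepare_request (example.com is just a placeholder URL; nothing is sent over the network):

```python
import requests

session = requests.Session()

# Header name with the typo from the question: missing hyphen.
typo = session.prepare_request(
    requests.Request('GET', 'https://example.com/',
                     headers={'User Agent': 'Mozilla/5.0'})
)
print(typo.headers.get('User-Agent'))   # default python-requests/... UA still sent
print(typo.headers.get('User Agent'))   # Mozilla/5.0 goes out as an extra header

# Correct header name: overrides the session default.
fixed = session.prepare_request(
    requests.Request('GET', 'https://example.com/',
                     headers={'User-Agent': 'Mozilla/5.0'})
)
print(fixed.headers.get('User-Agent'))  # Mozilla/5.0
```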
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Andrej Kesely |
