How to scrape an article which requires login to view full content using python?
I am trying to scrape an article from The Wall Street Journal, which requires logging in to view the whole content. So I have written the code below using Python Requests:
import requests
from bs4 import BeautifulSoup
import re
import base64
import json
username = "<username>"  # your WSJ username
password = "<password>"  # your WSJ password
base_url= "https://accounts.wsj.com"
session = requests.Session()
r = session.get("{}/login".format(base_url))
soup = BeautifulSoup(r.text, "html.parser")
# the login page embeds its OAuth parameters as Base64-encoded JSON
credentials_search = re.search(r"Base64\.decode\('(.*)'", r.text, re.IGNORECASE)
base64_decoded = base64.b64decode(credentials_search.group(1))
credentials = json.loads(base64_decoded)
connection = "<connection_name>"  # connection name expected by the login endpoint
r = session.post(
    'https://sso.accounts.dowjones.com/usernamepassword/login',
    data={
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": credentials["clientID"],
        "state": credentials["internalOptions"]["state"],
        "nonce": credentials["internalOptions"]["nonce"],
        "scope": credentials["internalOptions"]["scope"],
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login"
    })
soup = BeautifulSoup(r.text, "html.parser")
# collect the hidden form fields returned by the SSO endpoint
login_result = dict([
    (t.get("name"), t.get("value"))
    for t in soup.find_all('input')
    if t.get("name") is not None
])
r = session.post(
    'https://sso.accounts.dowjones.com/login/callback',
    data=login_result,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"},
)
# article get request
r = session.get(
    "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"}
)
print(r.text)
I am able to log in through the request, but I am still not getting the full article to scrape. Can anyone help me with this? Thanks in advance :-)
Solution 1:[1]
An easy and reliable solution is to use the Selenium WebDriver. With Selenium you launch an automated browser window that opens the website, and from there you can have it locate the form elements and log in. The content then loads as usual, just as when you view the page manually in your browser. You can then soup that page with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r"C:\Program Files (x86)\geckodriver.exe")
# Download the driver for your desired browser and place it in any path
# (use a raw string so the backslashes are not treated as escape sequences)
# for Chrome it's: driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")
# open your website link
driver.get("https://www.your-url.com")
# then soup the page with BS
html = driver.page_source
page_soup = BeautifulSoup(html, "html.parser")
From there you can use the "page_soup" as you usually would.
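For example, once the page is souped you can pull the headline and article paragraphs out of it. This is a minimal sketch using inline sample markup in place of `driver.page_source`; the tag and class names here are assumptions for illustration, so inspect the real page and adjust the selectors:

```python
from bs4 import BeautifulSoup

# Sample markup standing in for driver.page_source; the real site's
# structure will differ, so adapt the selectors after inspecting it.
html = """
<html><body>
  <h1 class="headline">Example headline</h1>
  <div class="article-content">
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </div>
</body></html>
"""

page_soup = BeautifulSoup(html, "html.parser")
headline = page_soup.find("h1", class_="headline").get_text(strip=True)
paragraphs = [p.get_text(strip=True)
              for p in page_soup.select("div.article-content p")]

print(headline)    # Example headline
print(paragraphs)  # ['First paragraph of the article.', 'Second paragraph of the article.']
```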
Any questions? :)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SYNEC |
