Python - Web Scraping Entire Page

Looking for a general way to scrape an entire web page, with this one as an example:

https://www.boxscoregeeks.com/players?sort=wins_produced&direction=desc&season=2021

Tried the following:

import requests
import pandas as pd
import bs4

html = requests.get("https://www.boxscoregeeks.com/players?sort=wins_produced&direction=desc&season=2021", headers={"User-Agent": "XY"}).content
df_list = pd.read_html(html)
soup = bs4.BeautifulSoup(html, "html.parser")

In both cases I get a lot of information but none from the big table in the middle of the page.

How do I in general scrape an entire web page as it appears to a human user like me?



Solution 1:[1]

There is no one-size-fits-all solution; you always have to inspect the website and how it loads its content.
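One quick way to inspect a page is to test whether a value you can see in the browser is present in the raw HTML the server returns. The following is a minimal sketch of that check; `appears_in_static_html` and `sample_html` are hypothetical names, and the sample markup is an invented stand-in for a page whose table is filled in by JavaScript, not the real response from the site.

```python
def appears_in_static_html(html: str, needle: str) -> bool:
    """Return True if `needle` occurs in the raw HTML, ignoring case."""
    return needle.casefold() in html.casefold()


# Hypothetical stand-in for a response whose table is populated by JavaScript
sample_html = """
<html><body>
  <h1>Players</h1>
  <div id="player-table"></div>
  <script src="/assets/app.js"></script>
</body></html>
"""

# The heading is part of the static HTML
print(appears_in_static_html(sample_html, "Players"))
# A table value is not, which suggests it is loaded later by a script
print(appears_in_static_html(sample_html, "Nikola Jokic"))
```

With a real page you would pass the text of the response, for example `requests.get(url, headers={"User-Agent": "XY"}).text`. If a value from the table is missing there, the data most likely arrives through a separate request that you can find in the network tab of the browser's developer tools.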

The content is loaded dynamically via JavaScript, so you won't get it with requests and BeautifulSoup alone. Take a look at the site's API instead:

import pandas as pd
import requests

jsonData = requests.get('https://www.boxscoregeeks.com/api/player_seasons').json()

pd.DataFrame(jsonData)

# or sort it by wins produced

pd.DataFrame(jsonData).sort_values(by='wins_produced', ascending=False)

Output

id | name | games | minutes | per48_position_adj_prod | wins_produced | per48_wins_produced | per48_points | per48_rebounds | per48_assists | per48_points_over_par | exact_position | team_abbreviations | firstname | lastname | is_rookie | updated_at | position | secondary_position | url
191209 | Nikola Jokic | 61 | 2020.7 | 0.688922 | 15.5068 | 0.36835 | 37.7691 | 19.9297 | 11.687 | 8.37679 | 5 | den | Nikola | Jokic | False | March 14, 2022 15:42 UTC | C | C | /players/1500-nikola-jokic
191158 | Chris Paul | 58 | 1916.08 | 0.46432 | 13.3606 | 0.334697 | 21.6943 | 6.51329 | 15.5066 | 7.33018 | 1 | pho | Chris | Paul | False | March 14, 2022 15:41 UTC | PG | PG | /players/211-chris-paul
190781 | Giannis Antetokounmpo | 56 | 1836.3 | 0.57629 | 12.9864 | 0.339459 | 43.5484 | 16.7554 | 8.65218 | 7.47829 | 4.35506 | mil | Giannis | Antetokounmpo | False | March 14, 2022 15:41 UTC | PF | C | /players/1344-giannis-antetokounmpo
191216 | Robert Williams | 54 | 1611.42 | 0.65683 | 12.9212 | 0.384891 | 16.0257 | 15.7278 | 3.30641 | 8.8912 | 4.62594 | bos | Robert | Williams | False | March 14, 2022 15:43 UTC | C | PF | /players/3372-robert-williams
191258 | Rudy Gobert | 52 | 1662.03 | 0.693075 | 12.8982 | 0.372504 | 23.162 | 22.1223 | 1.70394 | 8.50596 | 5 | uth | Rudy | Gobert | False | March 14, 2022 15:41 UTC | C | C | /players/1378-rudy-gobert
191049 | Tyrese Haliburton | 62 | 2174.88 | 0.364672 | 11.3213 | 0.249862 | 20.6577 | 5.53961 | 10.704 | 4.69182 | 1.74806 | sac,ind | Tyrese | Haliburton | False | March 14, 2022 15:40 UTC | SG | PG | /players/4157-tyrese-haliburton
191100 | Dejounte Murray | 58 | 2016.35 | 0.396915 | 11.2593 | 0.268031 | 28.4712 | 11.7599 | 12.9501 | 5.25687 | 1.03584 | sas | Dejounte | Murray | False | March 14, 2022 15:40 UTC | PG | SG | /players/3188-dejounte-murray
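Once the JSON is in a DataFrame, the usual pandas tools apply. As a small offline sketch, the records below are copied from the output above and trimmed to four fields; the assumption is that the endpoint returns a list of flat dictionaries of this shape, which is what `pd.DataFrame(jsonData)` implies. The variable names `records` and `top_centers` are illustrative only.

```python
import pandas as pd

# Three rows from the output above, reduced to a few fields.
# Assumption: the API returns a list of flat dicts like these.
records = [
    {"name": "Chris Paul", "games": 58, "wins_produced": 13.3606, "position": "PG"},
    {"name": "Nikola Jokic", "games": 61, "wins_produced": 15.5068, "position": "C"},
    {"name": "Rudy Gobert", "games": 52, "wins_produced": 12.8982, "position": "C"},
]

df = pd.DataFrame(records)

# Keep only centers, pick a few columns, and rank by wins produced
top_centers = (
    df.loc[df["position"] == "C", ["name", "games", "wins_produced"]]
    .sort_values(by="wins_produced", ascending=False)
    .reset_index(drop=True)
)
print(top_centers)
```

The same filter and sort should work on the full response if its fields match the column names shown in the output.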

Another option is to use Selenium to render the page in a real browser first and then scrape the rendered page_source.

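from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.boxscoregeeks.com/players?sort=wins_produced&direction=desc&season=2021'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(10)  # crude wait for the JavaScript to fill in the table
# Parse the first <table> of the rendered HTML
df = pd.read_html(StringIO(driver.page_source))[0]
driver.quit()  # end the browser session
df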
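A fixed `time.sleep(10)` either wastes time or is too short on a slow connection. A possible refinement is an explicit wait that returns as soon as the table rows exist. This is only a sketch under stated assumptions: `fetch_rendered_html` is a hypothetical helper, the default selector `table tbody tr` is a guess at the page's markup, and `webdriver.Chrome()` without arguments assumes Selenium 4.6 or newer, where Selenium Manager locates the driver.

```python
def fetch_rendered_html(url: str, css_selector: str = "table tbody tr",
                        timeout: float = 15.0) -> str:
    """Load `url` in Chrome and return the page source once `css_selector` matches.

    Raises selenium.common.exceptions.TimeoutException if nothing matches in time.
    """
    # Imported lazily so this module can be loaded on machines without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Selenium 4.6+ (Selenium Manager)
    try:
        driver.get(url)
        # Poll until at least one matching element is present in the DOM
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always end the browser session
```

The returned string can then be parsed as before, for example with `pd.read_html(StringIO(fetch_rendered_html(url)))[0]`. If the selector does not match the real page, adjust it after inspecting the rendered table in the browser.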

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1