Python - Web Scraping Entire Page
Looking for a general way to scrape an entire web page, with this one as an example:
https://www.boxscoregeeks.com/players?sort=wins_produced&direction=desc&season=2021
Tried the following:
import requests
import pandas as pd
import bs4
html = requests.get("https://www.boxscoregeeks.com/players?sort=wins_produced&direction=desc&season=2021", headers={"User-Agent": "XY"}).content
df_list = pd.read_html(html)
soup = bs4.BeautifulSoup(html)
In both cases I get a lot of information, but none of it comes from the big table in the middle of the page.
How do I, in general, scrape an entire web page as it appears to a human user like me?
Solution 1:[1]
There is no one-size-fits-all solution; you always have to inspect the website and its behavior.
The content here is loaded dynamically via JavaScript, so you won't get it that simply with requests and BeautifulSoup. Instead, take a look at their API (such endpoints can usually be found in your browser's dev tools, under the network tab):
import pandas as pd
import requests
jsonData = requests.get('https://www.boxscoregeeks.com/api/player_seasons').json()
pd.DataFrame(jsonData)
# or sort it by wins produced
pd.DataFrame(jsonData).sort_values(by='wins_produced', ascending=False)
Output
| id | name | games | minutes | per48_position_adj_prod | wins_produced | per48_wins_produced | per48_points | per48_rebounds | per48_assists | per48_points_over_par | exact_position | team_abbreviations | firstname | lastname | is_rookie | updated_at | position | secondary_position | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 191209 | Nikola Jokic | 61 | 2020.7 | 0.688922 | 15.5068 | 0.36835 | 37.7691 | 19.9297 | 11.687 | 8.37679 | 5 | den | Nikola | Jokic | False | March 14, 2022 15:42 UTC | C | C | /players/1500-nikola-jokic |
| 191158 | Chris Paul | 58 | 1916.08 | 0.46432 | 13.3606 | 0.334697 | 21.6943 | 6.51329 | 15.5066 | 7.33018 | 1 | pho | Chris | Paul | False | March 14, 2022 15:41 UTC | PG | PG | /players/211-chris-paul |
| 190781 | Giannis Antetokounmpo | 56 | 1836.3 | 0.57629 | 12.9864 | 0.339459 | 43.5484 | 16.7554 | 8.65218 | 7.47829 | 4.35506 | mil | Giannis | Antetokounmpo | False | March 14, 2022 15:41 UTC | PF | C | /players/1344-giannis-antetokounmpo |
| 191216 | Robert Williams | 54 | 1611.42 | 0.65683 | 12.9212 | 0.384891 | 16.0257 | 15.7278 | 3.30641 | 8.8912 | 4.62594 | bos | Robert | Williams | False | March 14, 2022 15:43 UTC | C | PF | /players/3372-robert-williams |
| 191258 | Rudy Gobert | 52 | 1662.03 | 0.693075 | 12.8982 | 0.372504 | 23.162 | 22.1223 | 1.70394 | 8.50596 | 5 | uth | Rudy | Gobert | False | March 14, 2022 15:41 UTC | C | C | /players/1378-rudy-gobert |
| 191049 | Tyrese Haliburton | 62 | 2174.88 | 0.364672 | 11.3213 | 0.249862 | 20.6577 | 5.53961 | 10.704 | 4.69182 | 1.74806 | sac,ind | Tyrese | Haliburton | False | March 14, 2022 15:40 UTC | SG | PG | /players/4157-tyrese-haliburton |
| 191100 | Dejounte Murray | 58 | 2016.35 | 0.396915 | 11.2593 | 0.268031 | 28.4712 | 11.7599 | 12.9501 | 5.25687 | 1.03584 | sas | Dejounte | Murray | False | March 14, 2022 15:40 UTC | PG | SG | /players/3188-dejounte-murray |
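Once the JSON is in a DataFrame, the usual pandas operations apply for picking and ordering columns. A minimal sketch using a couple of hand-made rows in the same shape as the API output above (in practice, `jsonData` would come from the `requests.get(...).json()` call):

```python
import pandas as pd

# hand-made rows mirroring the API's JSON shape (values copied from the table above)
jsonData = [
    {"name": "Chris Paul", "games": 58, "wins_produced": 13.3606},
    {"name": "Nikola Jokic", "games": 61, "wins_produced": 15.5068},
]

# sort descending by wins_produced and keep only the columns of interest
df = pd.DataFrame(jsonData).sort_values(by="wins_produced", ascending=False)
print(df[["name", "wins_produced"]].to_string(index=False))
```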
An alternative is to use Selenium to render the website first and then scrape the rendered page_source:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.boxscoregeeks.com/players?sort=wins_produced&direction=desc&season=2021'

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(10)  # give the JavaScript time to render the table

df = pd.read_html(driver.page_source)[0]
driver.quit()
df
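One caveat with the Selenium route: newer pandas versions (2.1+) deprecate passing a literal HTML string to `pd.read_html`, so it is safer to wrap `driver.page_source` in a `StringIO`. A minimal sketch with a toy HTML table standing in for the rendered page source (pandas needs lxml or bs4+html5lib installed as its HTML parser):

```python
from io import StringIO

import pandas as pd

# toy HTML standing in for driver.page_source
page_source = """
<table>
  <tr><th>name</th><th>wins_produced</th></tr>
  <tr><td>Nikola Jokic</td><td>15.5068</td></tr>
</table>
"""

# wrap the string in StringIO so newer pandas versions accept it without a FutureWarning
df = pd.read_html(StringIO(page_source))[0]
print(df)
```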
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
