Pyppeteer: {'waitUntil': 'networkidle0'} not waiting until the page is loaded
If I use await page.waitFor(9000) or some other hard-coded delay, my function waits until the page loads.
However, await page.goto(url, {'waitUntil': 'networkidle0'}) returns before the entire page has loaded, so the script fails.
Here is the entire code:
```python
import requests
from bs4 import BeautifulSoup
import time
import os
import pyppeteer
from pyppeteer import launch
import asyncio
import subprocess

AGENT_DIR = os.path.dirname(__file__) + r'\data\agents'
SAVE_FILE = os.path.join(AGENT_DIR, 'latest.txt')
URL = 'https://techblog.willshouse.com/2012/01/03/most-common-user-agents/'

def get_latest_agents():
    '''Get the most common latest user agents
    from the {URL} site and save them to the text file {SAVE_FILE}.
    '''
    async def scrape():
        url = URL
        browser = await launch(headless=False)
        page = await browser.newPage()
        await page.goto(url, {'waitUntil': 'networkidle0'})
        await page.waitFor(9000)
        content = await page.content()
        soup = BeautifulSoup(content, 'html.parser')
        agents = soup.select('.get-the-list')[0].text
        # agents = agents.split('\n')
        print(agents)
        await browser.close()

    loop = asyncio.get_event_loop()
    response = loop.run_until_complete(scrape())

if __name__ == '__main__':
    # first kill all chrome.exe as pyppeteer doesn't close properly
    subprocess.call(['taskkill', '/F', '/im', 'chrome.exe'])
    get_latest_agents()
```
Thank you.
Solution 1:[1]
The code here is overcomplicated. Pyppeteer already has selectors built in, so there's no need for BeautifulSoup, requests, or the other unused libraries and variables, which may be adding to the confusion.
BeautifulSoup is a static HTML parser typically paired with requests, whereas Pyppeteer is a driver that works with the browser in real time. The only reason to use BeautifulSoup is if all of the data is available statically, in which case there's no need for Pyppeteer at all.
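If the list does turn out to be present in the static HTML, the requests + BeautifulSoup route alone would be enough. Here is a minimal sketch of just the parsing step, run against an inline HTML snippet standing in for the fetched page (the real page serves the agents inside an element with the class `get-the-list`):

```python
from bs4 import BeautifulSoup

# Stand-in for `requests.get(URL).text` -- an illustrative snippet, not the
# actual markup of the live page.
html = """
<html><body>
  <textarea class="get-the-list">Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)</textarea>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Same selector as in the question; .text pulls the element's inner text.
agents = soup.select(".get-the-list")[0].text.strip().split("\n")
print(agents)  # one user-agent string per list entry
```

Whether this works on the live site depends on the element being rendered without JavaScript, which is exactly the point of the answer above.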
Pyppeteer offers a function, page.waitForSelector, which does exactly what you need: it blocks until the selector you want data from is ready. Once it is, you can extract the value with page.Jeval or a similar function that runs code in the browser console.
"networkidle2" can only slow you down since waitForSelector may well find the data you need well before only 2 network requests are outstanding.
Here's a simple example:
```python
import asyncio
from pyppeteer import launch

URL = "https://techblog.willshouse.com/2012/01/03/most-common-user-agents/"

async def scrape():
    browser = await launch(headless=False)
    page, = await browser.pages()  # reuse the tab the browser opens at launch
    await page.goto(URL, {"waitUntil": "domcontentloaded"})
    await page.waitForSelector(".get-the-list", timeout=1e5)
    agents = await page.Jeval(".get-the-list", "e => e.value")
    await browser.close()
    return agents

if __name__ == "__main__":
    print(asyncio.get_event_loop().run_until_complete(scrape()))
```
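As a side note, on Python 3.7+ asyncio.run is a tidier entry point than fetching the event loop by hand. A minimal sketch, with a trivial stand-in coroutine in place of the real scrape:

```python
import asyncio

async def scrape():
    # Stand-in for the pyppeteer coroutine above.
    await asyncio.sleep(0)
    return "agents"

# asyncio.run creates the event loop, runs the coroutine, and closes
# the loop in one call.
result = asyncio.run(scrape())
print(result)
```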
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
