Using requests_html in Google Cloud Functions or Cloud Run
I have a fairly basic scraping application that I want to run in a Google Cloud environment. I am using the requests_html async API, and it works fine in my local environment, but I cannot for the life of me figure out how to run it in Google Cloud, having fiddled with it for days now. The purpose of the application is simply to render some JavaScript pages (contained in the urls array) using html.arender and then, with BeautifulSoup, extract the content of some specific tags (from the tags array).
The error message I keep getting is:
"signal only works in main thread of the main interpreter"
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/functions_framework/__init__.py", line 99, in view_func
    return function(request._get_current_object())
  File "/workspace/main.py", line 53, in main
    results = asyncio.run(collect(urls,tags))
  File "/opt/python3.9/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/python3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/workspace/main.py", line 32, in collect
    return await asyncio.gather(*tasks)
  File "/workspace/main.py", line 18, in getPage
    await r.html.arender(timeout=40,sleep=1)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 615, in arender
    self.browser = await self.session.browser
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 714, in browser
    self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 307, in launch
    return await Launcher(options, **kwargs).launch()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 159, in launch
    signal.signal(signal.SIGINT, _close_process)
  File "/opt/python3.9/lib/python3.9/signal.py", line 56, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter
This is my code below:
from requests_html import AsyncHTMLSession
import asyncio
from bs4 import BeautifulSoup as bs

urls = ['list of urls']
tags = ['p', 'h1']

async def getPage(s, url):
    r = await s.get(url)
    await r.html.arender(timeout=60, sleep=3, scrolldown=2)
    p = bs(r.html.html, "html.parser")
    elmList = []
    elmList.append(url)
    for t in tags:
        elements = p.findAll(t)
        for e in elements:
            elmList.append(e.text)
    return elmList

async def collect(urls):
    s = AsyncHTMLSession()
    tasks = (getPage(s, url) for url in urls)
    return await asyncio.gather(*tasks)

results = asyncio.run(collect(urls))
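As an aside, the BeautifulSoup step in getPage can be exercised without a browser by feeding it static HTML. The sketch below mirrors the per-tag text collection using only the standard library (html.parser.HTMLParser stands in for BeautifulSoup so it runs without third-party packages; note it yields matches in document order rather than tag-by-tag):

```python
from html.parser import HTMLParser

# Collect the text content of specific tags, mirroring the
# p.findAll(t) loop in getPage. The standard-library HTMLParser is
# swapped in for BeautifulSoup so this snippet is self-contained.
class TagTextCollector(HTMLParser):
    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self._depth = 0      # > 0 while inside a wanted tag
        self._chunks = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            if self._depth == 0:
                self._chunks = []
            self._depth += 1

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in self.tags and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._chunks))

def extract(html, tags):
    collector = TagTextCollector(tags)
    collector.feed(html)
    return collector.texts

print(extract("<h1>Title</h1><p>Body text</p>", ["p", "h1"]))
# → ['Title', 'Body text']
```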
I have also attempted the same thing using the "non-async" HTMLSession and rewriting the code to process one URL at a time, but I get the exact same "signal only works in main thread" error in that case too.
I have also tried running this in both the Cloud Functions and Cloud Run environments, with the same result.
Additionally, after trawling forums for advice, I have experimented with manually setting the event loop like so, but this had no effect:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
If anyone has ideas for how to accomplish this with other libraries/methods, please do let me know; it does not necessarily even have to be asynchronous. The only requirement is JavaScript rendering of the pages.
Solution 1:[1]
Have you tried keep_page=True as a parameter for r.html.arender?
Why I'm asking: the error seems to happen when the browser used for rendering the JavaScript is being shut down. Maybe keep_page=True avoids this.
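The suggestion above amounts to a one-line change to the arender call in getPage; a minimal sketch (whether keep_page=True actually sidesteps the signal registration in a Cloud environment is untested):

```python
# Hypothetical variant of the arender call in getPage: keep_page=True asks
# requests_html to keep the Chromium page open instead of closing it after
# rendering, which may avoid the shutdown path the traceback points at.
await r.html.arender(timeout=60, sleep=3, scrolldown=2, keep_page=True)
```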
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mvtango |
