Using requests_html in Google Cloud Functions or Cloud Run

I have a fairly basic scraping application that I want to run in a Google Cloud environment. I am using the requests_html async API, and it works fine in my local environment, but I cannot for the life of me figure out how to run it in Google Cloud, having fiddled with it for days now. The purpose of the application is simply to render some JavaScript pages (listed in the urls array) with html.arender and then extract the text of some specific tags (from the tags array) with BeautifulSoup.

The error message I keep getting is:

"signal only works in main thread of the main interpreter"

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/functions_framework/__init__.py", line 99, in view_func
    return function(request._get_current_object())
  File "/workspace/main.py", line 53, in main
    results = asyncio.run(collect(urls,tags))
  File "/opt/python3.9/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/python3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/workspace/main.py", line 32, in collect
    return await asyncio.gather(*tasks)
  File "/workspace/main.py", line 18, in getPage
    await r.html.arender(timeout=40,sleep=1)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 615, in arender
    self.browser = await self.session.browser
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 714, in browser
    self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 307, in launch
    return await Launcher(options, **kwargs).launch()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 159, in launch
    signal.signal(signal.SIGINT, _close_process)
  File "/opt/python3.9/lib/python3.9/signal.py", line 56, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter

This is my code below:

from requests_html import AsyncHTMLSession
import asyncio
from bs4 import BeautifulSoup as bs

urls = ['list of urls']
tags = ['p', 'h1']

async def getPage(s, url):
    r = await s.get(url)
    await r.html.arender(timeout=60, sleep=3, scrolldown=2)
    p = bs(r.html.html, "html.parser")
    elmList = []
    elmList.append(url)
    for t in tags:
        elements = p.findAll(t)
        for e in elements:
            elmList.append(e.text)
    return elmList

async def collect(urls):
    s = AsyncHTMLSession()
    tasks = (getPage(s, url) for url in urls)
    return await asyncio.gather(*tasks)

results = asyncio.run(collect(urls))

I have also attempted the same thing with the "non-async" HTMLSession, rewriting the code to process one URL at a time, but I get the exact same "signal only works in main thread" error.

I have also tried running this in both the Cloud Functions and Cloud Run environments, with the same result.

Additionally, after trawling forums for advice, I have experimented with manually setting the event loop like so, but it had no effect:

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

If anyone has ideas for how to accomplish this with other libraries or methods, please do let me know. It does not even have to be asynchronous; the only requirement is JavaScript rendering of the pages.



Solution 1:[1]

Have you tried passing keep_page=True to r.html.arender?

Why I'm asking: the error seems to happen when the browser used to render the JavaScript is shut down. Maybe keep_page=True avoids this.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: mvtango