Splash's response does not render JavaScript as HTML

I'm trying to understand why Splash didn't give me a rendered HTML response:

  • First, I successfully logged in with a Scrapy FormRequest
  • Then I sent a SplashRequest, loaded through the render.html endpoint

But when I print response.body, the page isn't rendered.

Extra info:

  • The page adds more results when scrolling down.
  • page.com isn't the real website.
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest


class LoginSpider(scrapy.Spider):
    name = 'page'
    start_urls = ['https://www.page.com']

    def parse(self, response):
        # Log in with a regular Scrapy form POST (no Splash needed here)
        return scrapy.FormRequest(
            'https://www.page.com/login/loginInitAction.do?method=processLogin',
            formdata={'username': 'userid', 'password': 'key', 'remember': 'on'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Render the search page through Splash's render.html endpoint
        yield SplashRequest(
            "https://www.page.com/search/all/simple?typeaheadTermType=&typeaheadTermId=&searchType=21&keywords=&pageValue=22",
            self.parse_page2,
            meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 10, 'render_all': 1, 'html': 1},
                }
            },
        )

    def parse_page2(self, response):
        print(response.body)

Console output:

2017-10-28 11:53:43 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-10-28 11:53:43 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-10-28 11:53:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-10-28 11:53:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-28 11:53:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-28 11:53:43 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-28 11:53:43 [scrapy.core.engine] INFO: Spider opened
2017-10-28 11:53:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-28 11:53:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-28 11:53:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.page.com/technology/home.jsp> from <GET https://www.page.com>
2017-10-28 11:53:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.page.com/technology/home.jsp> (referer: None)
2017-10-28 11:53:45 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.page.com/login/loginInitAction.do?method=processLogin> (referer: https://www.page.com/technology/home.jsp)
2017-10-28 11:53:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.page.com/search/all/simple?typeaheadTermType=&typeaheadTermId=&searchType=21&keywords=&pageValue=1 via http://192.168.0.20:8050/render.html> (referer: None)


Solution 1:[1]

To log in you need to send a session cookie, but scrapy-splash doesn't handle cookies when the render.html endpoint is used. Try something like the following to make cookies work:

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))

  return {
    url = splash:url(),
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

class MySpider(scrapy.Spider):

    # ...
    def parse(self, response):
        # ...
        yield SplashRequest(url, self.parse_result,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},
        )

This example is adapted from the scrapy-splash README; see it for a better understanding of why this is needed.
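The question's extra info also notes that the page loads more results on scroll, and render.html's wait argument alone won't trigger that. With the execute endpoint, the Lua script itself can scroll the page before returning the HTML. Below is a minimal sketch (not from the original answer); the `splash_scroll_args` helper and the `scrolls` parameter are illustrative assumptions, not part of the Splash or scrapy-splash API:

```python
# Sketch: Lua script for Splash's 'execute' endpoint that carries the
# session cookies AND scrolls to the bottom repeatedly, so lazily
# loaded results appear before the HTML is captured.
SCROLL_SCRIPT = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go(splash.args.url))
  assert(splash:wait(1))
  -- scroll down a few times to trigger lazy loading
  for i = 1, splash.args.scrolls do
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    assert(splash:wait(1))
  end
  return {
    url = splash:url(),
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

def splash_scroll_args(scrolls=3):
    """Build the args dict for SplashRequest(..., endpoint='execute').

    `scrolls` is a made-up argument for this sketch; any extra key in
    args is forwarded to the script as splash.args.scrolls.
    """
    return {'lua_source': SCROLL_SCRIPT, 'scrolls': scrolls}
```

In the spider this would be used as `yield SplashRequest(url, self.parse_result, endpoint='execute', cache_args=['lua_source'], args=splash_scroll_args())`, mirroring the structure of the example above.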

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mikhail Korobov