Python Selenium - Scraping JavaScript pagination

I've been building this scraper (with some massive help from users here) to collect data on some companies' debt to the public sector. So far I can reach the site, enter the desired search parameters, and scrape the first 50 results (out of 300). The problem is that the page's pagination has the following characteristics:

  1. It does not possess a next page button
  2. The URL doesn't change with the pagination
  3. The pagination is driven by JavaScript

Here's the code so far:

from selenium import webdriver

path_driver = "C:/Users/CS330584/Documents/Documentos de Defesa da Concorrência/Automatização de Processos/chromedriver.exe"
website = "https://sat.sef.sc.gov.br/tax.NET/Sat.Dva.Web/ConsultaPublicaDevedores.aspx"
value_search = "300"
final_table = []


driver = webdriver.Chrome(path_driver)
driver.get(website)
search_max = driver.find_element_by_id("Body_Main_Main_ctl00_txtTotalDevedores")
search_max.send_keys(value_search)
btn_consult = driver.find_element_by_id("Body_Main_Main_ctl00_btnBuscar")
btn_consult.click()

driver.implicitly_wait(10)

cnpjs = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[1]")
empresas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[2]")
dividas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[3]")
for i in range(len(empresas)):
    temp_data = {
        'CNPJ': cnpjs[i].text,
        'Empresas': empresas[i].text,
        'Divida': dividas[i].text,
    }
    final_table.append(temp_data)

How can I navigate through the pages in order to scrape their data? Thank you all for the help!



Solution 1:[1]

If you inspect the page and look at what happens when you click on one of the page-number links, you'll see that the link's href actually executes some JavaScript. It looks like this:

<a href="javascript:GridView_ScrollToTop(&quot;Body_Main_Main_grpDevedores_gridView&quot;);__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$5')"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">6</font></font></a>

If you take that JavaScript out of the href attribute (and turn each &quot; entity back into a quotation mark), you'll see two function calls that look like this:

GridView_ScrollToTop("Body_Main_Main_grpDevedores_gridView");
__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$5');

Now I didn't take the time to analyze these functions in depth, but you don't really need to. You see the first call causes the browser to scroll to the top, and the second call actually causes the next page of data to load on the page. For your purposes, you only care about the second call.
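As a small illustration of pulling that second call apart, here is a sketch that extracts the two `__doPostBack` arguments from an href string like the one above. The name `parse_postback` and the regular expression are my own, not something the page or Selenium provides, and the sketch assumes both arguments are single-quoted and contain no embedded quotes.

```python
import html
import re

# Matches __doPostBack('<target>','<argument>') with single-quoted arguments.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (target, argument) from a javascript: href, or None if absent."""
    decoded = html.unescape(href)  # turns &quot; back into " if still encoded
    match = POSTBACK_RE.search(decoded)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For the link shown above, the second element of the returned tuple is the `Page$...` argument, which is the only part that changes from one page link to the next.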

You can mess around with this in the browser: just perform your search and then, in the JS console, paste in the JS call, exchanging the number for the page you want to look at.

If you can do it via JS in the console on the webpage, you can do it with Selenium. You would do something like this to "click" each page link:

for i in range(1, 7):
    # Build the postback call for page i and run it in the page's context
    js = "__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$" + str(i) + "');"
    driver.execute_script(js)
    # do scraping stuff
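To make the "scraping stuff" step a bit more concrete, here is a rough sketch that combines the loop above with the row-reading logic from the question. It rests on several assumptions: the search leaves page 1 displayed, `Page$N` selects the N-th page as in the loop above, and header and pager rows have fewer than three `td` cells. The helper names (`build_postback_js`, `read_rows`, `scrape_pages`) are hypothetical. It passes the locator strategy as the plain string `"xpath"`, which is the value of Selenium's `By.XPATH`, and it polls until the rows change rather than using Selenium's `WebDriverWait`, which a more robust version would probably prefer.

```python
import time

GRID_TARGET = "ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView"
ROW_XPATH = "//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr"


def build_postback_js(page):
    """Return the JS call that asks the grid to display the given page."""
    return f"__doPostBack('{GRID_TARGET}','Page${page}');"


def read_rows(driver):
    """Read CNPJ, company and debt from every data row currently displayed."""
    rows = []
    for tr in driver.find_elements("xpath", ROW_XPATH):
        cells = tr.find_elements("xpath", "./td")
        if len(cells) < 3:  # assumed: header/pager rows have fewer td cells
            continue
        rows.append({
            "CNPJ": cells[0].text,
            "Empresas": cells[1].text,
            "Divida": cells[2].text,
        })
    return rows


def scrape_pages(driver, last_page, timeout=10.0, poll=0.5):
    """Scrape page 1 (already displayed) and then pages 2..last_page."""
    previous = read_rows(driver)
    table = list(previous)
    for page in range(2, last_page + 1):
        driver.execute_script(build_postback_js(page))
        deadline = time.monotonic() + timeout
        current = read_rows(driver)
        # Poll until the grid shows rows that differ from the previous page.
        while (not current or current == previous) and time.monotonic() < deadline:
            time.sleep(poll)
            current = read_rows(driver)
        if not current or current == previous:
            raise TimeoutError(f"page {page} did not load within {timeout} s")
        table.extend(current)
        previous = current
    return table
```

After running the search from the question, something like `final_table = scrape_pages(driver, 6)` would cover the 300 results at 50 per page. If the page reloads while rows are being read, Selenium may raise a stale-element error; this sketch does not handle that case.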

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 incrediblejonas