Improving speed when using sapply with rvest

I am new to R and have just learned rvest. I'm trying to scrape this website: https://www.displayspecifications.com/en

I want to collect the specifications of every TV model for each brand.

What I’m trying to do is:

  1. Go to the brand page (e.g. LG) and scrape the name of every model to use as the name column
  2. Scrape the link of every model so the URL can be used to scrape its specification
  3. Create a function that scrapes the specific line of the specification that I want from a model URL
  4. Run the function using sapply to retrieve the specification of every TV model for the specs column

I used SelectorGadget to help me identify the elements I'm trying to capture. That part works fine.

#step 1

library(rvest)

link = "https://www.displayspecifications.com/en/brand/a1025"
page = read_html(link)
# Model names: the text of each model link on the brand page
name = page %>%
  html_nodes(".model-listing-container-80 a") %>%
  html_text()

#step 2

model_links = page %>%
  html_nodes(".model-listing-container-80 a") %>%
  html_attr("href")

Then I wrote a function that scrapes the specs from a single model URL, so that I can apply it to every model link.

#step 3

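get_specs = function(model_link) {
  # Download and parse the page for one model
  model_page = read_html(model_link)
  # The spec summary is stored in the og:description meta tag
  model_specs = model_page %>%
    html_nodes('[property="og:description"]') %>%
    html_attr('content')
  return(model_specs)
}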
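
One possible reason for a scrape like this to hang is a single request that never returns. The following is a minimal sketch of a variant with an explicit timeout, assuming the httr package is installed. The names extract_description and get_specs_timeout and the 10-second default are hypothetical choices for illustration, not part of the original code, and whether a stalled request is the actual cause here is untested.

```r
library(rvest)
library(httr)

# Pull the og:description text out of an already parsed page.
extract_description = function(model_page) {
  model_page %>%
    html_nodes('[property="og:description"]') %>%
    html_attr("content")
}

# Hypothetical variant of get_specs: raises an error if the request
# takes longer than `seconds` instead of waiting indefinitely.
get_specs_timeout = function(model_link, seconds = 10) {
  response = GET(model_link, timeout(seconds))
  # Turn HTTP errors (404, 500, ...) into R errors.
  stop_for_status(response)
  model_page = read_html(content(response, as = "text", encoding = "UTF-8"))
  extract_description(model_page)
}
```

Splitting the parsing step from the download step also makes it possible to check the selector against a saved or hand-written HTML string without touching the network.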

It works fine when I call it on a single model URL, which gives me the details I want, like this:

[1] "Specifications of LG 43NANO776QA. Display: 42.5 in, VA, Direct LED, 3840 x 2160 pixels, Viewing angles (H/V): 178 ° / 178 °, Refresh rate: 50 Hz / 60 Hz, TV tuner: Analog (NTSC/PAL/SECAM), DVB-C, DVB-S2, DVB-T, DVB-T2, DVB-S, Cores: 4, Dimensions: 967 x 564 x 57.7 mm, Weight: 9.2 kg. LG 43NANO776QA is also known as LG 43NANO779QA, LG 43NANO773QA."

Since I need to retrieve the specs of all models, I use sapply:

#step 4

specs = sapply(model_links, FUN = get_specs)

The problem is in step 4. It takes forever, and I'm not even sure whether it will ever produce any output. By forever I mean that I left the computer on overnight and RStudio was still running with no output. I ran it again for 4 hours today and then decided to stop it. Is something wrong with the code? Is there an alternative to sapply?
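
One way to find out whether the run is progressing at all is to replace the single sapply call with an explicit loop that reports each link as it goes, skips links that fail, and pauses between requests. This is a minimal base R sketch under those assumptions. The helper name scrape_all and the one-second default delay are hypothetical, and the sketch only makes the run observable and fault tolerant; it does not by itself make each request faster.

```r
# Hypothetical helper: apply `fetch` to each link, one at a time.
scrape_all = function(links, fetch, delay = 1) {
  results = rep(NA_character_, length(links))
  for (i in seq_along(links)) {
    # Progress line, so a stalled link is visible in the console.
    message(sprintf("[%d/%d] %s", i, length(links), links[i]))
    results[i] = tryCatch(
      {
        value = fetch(links[i])
        # Keep one string per link even if the selector matches 0 or 2+ nodes.
        if (length(value) == 0) NA_character_ else as.character(value[1])
      },
      error = function(e) {
        # Record the failure and continue with the next link.
        message("  failed: ", conditionMessage(e))
        NA_character_
      }
    )
    # Pause between requests to avoid hammering the server.
    Sys.sleep(delay)
  }
  results
}

# Usage with the objects defined in steps 2 and 3, on a small subset first:
# specs = scrape_all(head(model_links, 5), get_specs)
```

Trying a handful of links first shows how long one request takes, which gives a rough estimate for the full list before committing to a run over every model.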



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
