gsub() only works after copying the vector back from the output of dput()

I have the following problem: I scraped prices from multiple webpages. Because the price on some webpages is scraped with html_text(), it contains extra characters such as the currency or ".-" after the price.

Now if I try to remove these things from the price itself using gsub(), it doesn't fully work. Also if I then try to convert the prices to integer using as.integer(), it gives me just NA's for every price.

The strange thing is that if I use dput() to print the content of the vector to the console, copy that output, and save it as a new vector (like vec<-c("5.-","10.-","9.-")), it suddenly works and I can properly use gsub() and as.integer(). Does anyone know why this could be happening?
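One way to investigate this kind of behaviour is to compare the code points of a scraped string with those of a typed one. The sketch below assumes, without confirmation from the page itself, that the scraped text contains characters that look like ordinary ones but are not, for example a non-breaking space (U+00A0) or an en dash (U+2013) instead of "-". The value in scraped and the helper clean_price() are made up for illustration.

```r
# Hypothetical scraped value: non-breaking space, "5", ".", en dash.
# This is an assumption about what the page might return, not real data.
scraped <- "\u00a05.\u2013"
typed   <- "5.-"

# Printed, the two strings look almost the same, but the code points differ.
utf8ToInt(scraped)  # 160 53 46 8211
utf8ToInt(typed)    # 53 46 45

# Hypothetical helper: keep only the first run of digits, whatever
# characters surround it, then convert to integer.
clean_price <- function(x) {
  as.integer(sub("^[^0-9]*([0-9]+).*$", "\\1", x))
}

clean_price(c(scraped, typed, "10.-"))  # 5 5 10
```

Matching on the digits instead of listing the characters to remove avoids having to know exactly which look-alike characters the page uses.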

The code I use to scrape the prices is:

input_galaxus2<-paste0('https://www.galaxus.ch/',input_galaxus$`Galaxus Artikel`)

sess <- session(input_galaxus2[1])             #to start the session
for (j in input_galaxus2){
  sess <- sess %>% session_jump_to(j)         #jump to URL
  
  i=i+1
  try(vec_galaxus[i] <- read_html(sess) %>%   #can read direct from sess
        html_nodes('.sc-algx62-1.cwhzPP') %>%
        html_text())
  Sys.sleep(runif(1, min=1, max=2))
}

In the loop, j is the full product URL: the base URL followed by a product number such as 14513912, 14513929 or 8606656.

Edit: so the product links are for example: https://www.galaxus.ch/14513912, https://www.galaxus.ch/14513929 and https://www.galaxus.ch/8606656



Solution 1:[1]

library(tidyverse)
library(rvest)
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

"https://www.galaxus.ch/8606656" %>%
  read_html() %>%
  html_nodes('.sc-algx62-1.cwhzPP') %>%
  html_text() %>%
  str_extract("[0-9]+") %>%
  as.integer()
#> [1] 385

Created on 2022-03-09 by the reprex package (v2.0.0)

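To keep the cents as well, use as.numeric() instead of as.integer() and widen the pattern to something like [0-9.,]+. Note that as.numeric() only accepts a dot as the decimal separator, so a comma would have to be replaced first.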
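As a minimal sketch of the variant with cents, here is a base R version that does not need stringr. The price strings are invented examples, not values scraped from the site, and the pattern is only one possible choice.

```r
# Made-up price strings in a few plausible formats (not scraped data).
prices <- c("CHF 385.50", "5.-", "1,95 EUR")

# Extract the first number: digits, optionally followed by a dot or comma
# and more digits. regmatches() silently drops elements with no match,
# so this assumes every string contains a number.
num <- regmatches(prices, regexpr("[0-9]+([.,][0-9]+)?", prices))

# as.numeric() needs a dot as decimal separator, so replace a comma first.
result <- as.numeric(sub(",", ".", num, fixed = TRUE))
result  # 385.50 5.00 1.95
```

Requiring digits after the separator means that a trailing ".-" is left out of the match, so "5.-" becomes 5.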

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 danlooo