Amazon reviews web scraping in R: how to avoid an error when one of the reviews is from another country?
To get some interesting data for NLP, I have just started doing some basic web scraping in R. My goal is to gather as many product reviews from Amazon as I can. My first basic trials succeeded, but now I am running into an error.
As you can see from the URL in my reprex, there are 3 pages of reviews for the product. If I scrape the first and second pages, everything works fine. The third page contains a review from a foreign customer.
When I try to scrape page three, I get an error indicating that my tibble columns do not have compatible sizes. How can this be explained, and how can I avoid the error?
The error also disappears if I delete review_star and review_title from the scrape function.
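One plausible reading of the symptom, shown here as a minimal sketch with made-up vectors rather than real scraped data: if two of the selectors match fewer elements than the others, tibble() receives vectors of different lengths and refuses to combine them. The values below are invented purely for illustration.

```r
library(tibble)

# Made-up stand-ins for the vectors returned by html_text():
# two titles were matched, but three review texts.
review_title <- c("Title 1", "Title 2")
review_text  <- c("Text 1", "Text 2", "Text 3")

# Comparing the lengths shows which selector missed a review.
sizes <- lengths(list(review_title = review_title, review_text = review_text))
print(sizes)

# tibble() only recycles vectors of length 1, so this call fails;
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  tibble(review_title, review_text),
  error = function(e) conditionMessage(e)
)
print(result)
```

Running the same lengths() check on the five vectors inside scrape_amazon() should show whether review_title and review_star are the shorter ones on page three.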
library(pacman)
pacman::p_load(RCurl, XML, dplyr, rvest)

#### SCRAPE
scrape_amazon <- function(page_num){
  url_reviews <- paste0("https://www.amazon.de/Lavendel-%C3%96L-Fein-kbA-%C3%84therisch/product-reviews/B00EXBKQDS/ref=cm_cr_getr_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=",page_num)
  doc <- read_html(url_reviews)

  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # date
  date <- doc %>%
    html_nodes("#cm_cr-review_list .review-date") %>%
    html_text() %>%
    gsub(".*on ", "", .)

  # author
  author <- doc %>%
    html_nodes("#cm_cr-review_list .a-profile-name") %>%
    html_text()

  # Return a tibble
  tibble(review_title,
         review_text,
         review_star,
         date,
         author,
         page = page_num) %>% return()
}

# extract testing
df <- scrape_amazon(page_num = 3)
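A possible way around the mismatch, sketched under assumptions: select each review container first, then pull one value per container with html_node() (singular), which returns a missing node, and therefore NA after html_text(), when a field is absent. The HTML below is a hypothetical, heavily simplified fixture and not Amazon's real markup; the container hook `review`, the short class names, and the alternative star hook `cmps-review-star-rating` are assumptions that would need to be checked against the live page.

```r
library(rvest)
library(tibble)

# Hypothetical fixture: the second review has no title link and uses a
# different star-rating hook, mimicking a review from another marketplace.
html <- '
<div id="cm_cr-review_list">
  <div data-hook="review">
    <a class="review-title">Sehr gut</a>
    <i data-hook="review-star-rating">5,0 von 5 Sternen</i>
    <span class="review-text">Riecht angenehm.</span>
  </div>
  <div data-hook="review">
    <i data-hook="cmps-review-star-rating">4.0 out of 5 stars</i>
    <span class="review-text">Nice scent.</span>
  </div>
</div>'

doc <- read_html(html)

# One node per review; every column below is derived from this node set,
# so all columns have the same length as the number of reviews.
reviews <- html_nodes(doc, "[data-hook='review']")

df <- tibble(
  # html_node() yields exactly one result per review, NA if nothing matches.
  review_title = reviews %>% html_node(".review-title") %>% html_text(),
  # "$=" matches attribute values ending in the given string, so both
  # star-rating hooks in the fixture are covered.
  review_star  = reviews %>% html_node("[data-hook$='review-star-rating']") %>% html_text(),
  review_text  = reviews %>% html_node(".review-text") %>% html_text()
)

print(df)
```

The design choice here is that a missing field becomes an NA in its row instead of silently shortening a column, so the rows stay aligned with the reviews they came from.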
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
