Is there an abstract pattern I can use to web scrape journal abstracts in rvest?
I am new to web scraping, so please forgive me if what I am looking for is not possible. I want to extract all the journal article abstracts from a large database.
I am able to generate all the links from the database.
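library(rvest)  # read_html(), html_node(), html_table(), and the %>% pipe
# Read the PanglaoDB papers page and pull out the table of publications
pangiaoDB <- read_html('https://panglaodb.se/papers.html')
table <- pangiaoDB %>%
  html_node(xpath = '/html/body/div[2]/div[2]/table') %>%
  html_table()
# Build one DOI resolver link per paper; paste0() is vectorised, so no lapply() is needed
url <- paste0('https://doi.org/', table$DOI)
head(url)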
The table has over 800 unique journals.
length(unique(table$Journal))
length(table$Journal)
The abstracts are tucked away in various ways, but for the most part I have found them at
xpath = '//*[@id="3179475"]/section' and xpath = '//*[@id="Abs1"]'. The latter is less of an issue, but how can I generate a relative XPath for abstracts found under the former path?
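One way to avoid depending on a page-specific numeric id is to try a short list of relative XPath patterns in order and keep the first one that matches. The sketch below is only an illustration: the helper name extract_abstract, the pattern list, and the inline demo page are all assumptions, and real publisher pages may need other patterns.

```r
library(rvest)

# Candidate relative XPaths, tried in order (assumed patterns, not an exhaustive list)
abstract_xpaths <- c(
  "//*[starts-with(@id, 'Abs')]",            # ids such as Abs1
  "//section[contains(@class, 'abstract')]", # <section> with an abstract-like class
  "//*[contains(@class, 'abstract')]"        # any element with an abstract-like class
)

# Hypothetical helper: return the text of the first matching node, or NA if none match
extract_abstract <- function(page, xpaths = abstract_xpaths) {
  for (xp in xpaths) {
    nodes <- html_nodes(page, xpath = xp)
    if (length(nodes) > 0) {
      return(html_text(nodes[[1]], trim = TRUE))
    }
  }
  NA_character_
}

# Made-up page that mimics the numeric-id layout from the question
demo_page <- read_html(paste0(
  '<html><body><div id="3179475">',
  '<section class="article-section__abstract"><p>Cells were profiled.</p></section>',
  '</div></body></html>'
))
extract_abstract(demo_page)
```

On a real article the call would be extract_abstract(read_html(link)) for each DOI link; pages that match none of the patterns come back as NA and can be inspected by hand.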
Solution 1:[1]
Here is the code I ended up using after getting feedback from @Axeman:
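library(stringr)  # str_remove(); the %>% pipe comes from rvest
dois <- str_remove(unlist(url), pattern = 'https://doi.org/')  # flatten the list of links back to bare DOIs
data <- sapply(dois, rcrossref::cr_abstract) %>%  # look up one DOI per call
  str_remove(pattern = 'Abstract')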
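With several hundred DOIs, some lookups will probably fail, for example when Crossref holds no abstract for a paper. Below is a minimal sketch of a more forgiving loop, assuming that cr_abstract() is called with one DOI at a time and signals an error when it finds nothing. The helper name fetch_abstracts and its fetch argument are hypothetical, and the leading "Abstract" label is stripped with base R sub() rather than str_remove().

```r
# Hypothetical helper: look up each DOI separately and keep going on failure.
# `fetch` defaults to rcrossref::cr_abstract but can be swapped out for testing.
fetch_abstracts <- function(dois, fetch = rcrossref::cr_abstract) {
  vapply(dois, function(doi) {
    tryCatch({
      txt <- paste(fetch(doi), collapse = " ")
      # Drop a leading "Abstract" label, if present
      sub("^\\s*Abstract\\s*", "", txt)
    }, error = function(e) NA_character_)  # failed lookups become NA
  }, character(1), USE.NAMES = FALSE)
}

# Example call (needs network access and the rcrossref package):
# data <- fetch_abstracts(table$DOI)
```

DOIs that come back as NA can then be passed to an rvest-based fallback instead of aborting the whole run.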
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Noah_Seagull |
