How to scrape elements of nested drop-downs with RSelenium?
I'm trying to scrape this website with RSelenium. On the left side of the website there are "nested" drop-down lists. For each list, I can only get the XPath of its elements.
So I tried using a for loop for the first drop-down list, as below:
for (i in 1:6) {
  q <- enexpr(i)
  xpath_1 <- glue("/html/body/div[1]/div[3]/div/div[2]/div[1]/div[{enexpr(q)}]/h2/a")
  driver$findElement("xpath", xpath_1)$clickElement()
  result[i, 1] <- driver$findElement("xpath", xpath_1)$getElementText()
}
That gives me the first 6 drop-down elements in a data frame. However, for the second (nested) drop-down, I need to connect its elements to their parents in the result data frame:
for (i in 1:6) {
  q <- enexpr(i)
  xpath_1 <- glue("/html/body/div[1]/div[3]/div/div[2]/div[1]/div[{enexpr(q)}]/h2/a")
  driver$findElement("xpath", xpath_1)$clickElement()
  result[i, 1] <- driver$findElement("xpath", xpath_1)$getElementText()
  for (a in 1:17) {
    b <- enexpr(a)
    xpath_2 <- glue("/html/body/div[1]/div[3]/div/div[2]/div[1]/div[1]/div/article[{enexpr(b)}]/h3/a")
    driver$findElement("xpath", xpath_2)$clickElement()
    result[a, 2] <- driver$findElement("xpath", xpath_2)$getElementText()
  }
}
In the result, column 2 contains only the sub-elements of the first drop-down, although the sub-elements of the other drop-downs follow the same XPath pattern. My aim is to get a table that pairs each drop-down with its related sub-elements, as below:
Col1 Col2
a 1
a 2
a 3
b 8
b 9
Can anyone help me figure out what I can do?
Solution 1:[1]
Here is a partial answer.
To get the text of the parents, we can do:
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

remDr$getPageSource()[[1]] %>% read_html() %>%
  html_nodes('body > div.main_wrapper > div.inner_content.long > div > div.analiz_wrapper > div.analiz_left.css_scroll > div:nth-child(n)') %>%
  html_nodes('h2') %>% html_text()
[1] "Laborator müayinələr" "Funksional müayinələr" "Poliklinik müayinələr" "Həkim konsultasiyaları" "Hədiyyə kartları" "Endirimli müayinələr"
To get the text of the parent > child elements, we can do the following. (This fetches the text for the first parent only; you need to write a loop to get the text from the other parents.)
remDr$getPageSource()[[1]] %>% read_html() %>%
html_nodes('body > div.main_wrapper > div.inner_content.long > div > div.analiz_wrapper > div.analiz_left.css_scroll > div:nth-child(1) > div > article:nth-child(n)') %>%
html_nodes('h3') %>% html_text()
[1] "COVID-19 testi" "Qanın müayinəsi" "Sidiyin müayinəsi" "Nəcisin müayinəsi"
[5] "Spermanın müayinəsi" "Urogenital sıyrıntının müayinəsi" "Likvorun müayinəsi" "Saçın müayinəsi"
[9] "Abortiv materialın müayinəsi" "Ana südünün müayinəsi" "Plevral mayenin müayinəsi" "Prostat vəzi şirəsinin müayinəsi"
[13] "Bəlğəmin müayinəsi" "Bioptatın müayinəsi" "Konyuktivadan sıyrıntı" "Yaradan sıyrıntının müayinəsi"
[17] "Digər biomateriallar"
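The loop over the other parents could look roughly like this. It is a minimal sketch, assuming the page source keeps the same layout as the selectors above (each parent `div` holds an `h2` and a set of `article > h3` children). `page_source` is a small hypothetical stand-in for `remDr$getPageSource()[[1]]`, and the letters and numbers are the placeholder values from the question, not real data from the site.

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource()[[1]]; not the real page.
page_source <- '
<div class="analiz_left css_scroll">
  <div><h2><a>a</a></h2>
    <div>
      <article><h3><a>1</a></h3></article>
      <article><h3><a>2</a></h3></article>
      <article><h3><a>3</a></h3></article>
    </div>
  </div>
  <div><h2><a>b</a></h2>
    <div>
      <article><h3><a>8</a></h3></article>
      <article><h3><a>9</a></h3></article>
    </div>
  </div>
</div>'

page <- read_html(page_source)

# One node per top-level drop-down.
parents <- html_nodes(page, "div.analiz_left.css_scroll > div")

# For each parent, pair its h2 text with every h3 found beneath it.
rows <- lapply(parents, function(parent) {
  parent_text <- html_text(html_node(parent, "h2"), trim = TRUE)
  children <- html_text(html_nodes(parent, "article > h3"), trim = TRUE)
  data.frame(
    Col1 = rep(parent_text, length(children)),
    Col2 = children,
    stringsAsFactors = FALSE
  )
})

# Stack the per-parent data frames into one table.
result <- do.call(rbind, rows)
print(result)
```

Because the selectors are applied relative to each parent node, every child stays attached to its own parent, which is what the table in the question asks for. Since the text is read from the page source rather than from visible elements, this should not need any clicking, provided the nested items are already present in the HTML.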
Finally, to get the text from parent > child > child (this too gets the text only from the first parent's second child; you need to write a loop for the others):
remDr$getPageSource()[[1]] %>% read_html() %>%
html_nodes('body > div.main_wrapper > div.inner_content.long > div > div.analiz_wrapper > div.analiz_left.css_scroll > div:nth-child(1) > div > article:nth-child(2) > div > ul > li:nth-child(n)') %>%
html_text()
[1] "Hematoloji müayinələr" "Biokimyəvi müayinələr" "Hormonal müayinələr"
[4] "İmmunoloji müayinələr" "Allerqoloji müayinələr" "İnfeksion xəstəliklərin diaqnostikası"
[7] "Bakterioloji müayinə" "Genetik analizlər"
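The same idea can be extended to the third level with a nested loop. Again, this is only a sketch under assumptions: `page_source` is a hypothetical stand-in for `remDr$getPageSource()[[1]]`, the placeholder texts are invented for illustration, and the `article > div > ul > li` nesting is taken from the selector above.

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource()[[1]]; not the real page.
page_source <- '
<div class="analiz_left css_scroll">
  <div><h2>a</h2>
    <div>
      <article><h3>1</h3><div><ul><li>x</li><li>y</li></ul></div></article>
      <article><h3>2</h3><div><ul><li>z</li></ul></div></article>
    </div>
  </div>
  <div><h2>b</h2>
    <div>
      <article><h3>8</h3><div><ul><li>w</li></ul></div></article>
    </div>
  </div>
</div>'

page <- read_html(page_source)
parents <- html_nodes(page, "div.analiz_left.css_scroll > div")

# Outer loop: parents (h2). Inner loop: their articles (h3) and list items (li).
rows <- lapply(parents, function(parent) {
  parent_text <- html_text(html_node(parent, "h2"), trim = TRUE)
  articles <- html_nodes(parent, "article")
  do.call(rbind, lapply(articles, function(article) {
    child_text <- html_text(html_node(article, "h3"), trim = TRUE)
    items <- html_text(html_nodes(article, "ul > li"), trim = TRUE)
    data.frame(
      Col1 = rep(parent_text, length(items)),
      Col2 = rep(child_text, length(items)),
      Col3 = items,
      stringsAsFactors = FALSE
    )
  }))
})

result <- do.call(rbind, rows)
print(result)
```

Note that in this sketch an article without any `li` items contributes no rows; if such children should still appear, a row with `NA` in `Col3` would have to be added for them.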
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nad Pat |
