How to scrape elements of nested drop-down with RSelenium?

I'm trying to scrape this website with RSelenium. On the left side of the website there are "nested" drop-down lists. For each list I can only get the XPath of its elements.

So I tried using a for loop for the first drop-down list, as below:

for (i in 1:6) {
  q <- enexpr(i)
  xpath_1 <- glue("/html/body/div[1]/div[3]/div/div[2]/div[1]/div[{enexpr(q)}]/h2/a")
  driver$findElement("xpath", xpath_1)$clickElement()
  result[i,1] <- driver$findElement("xpath", xpath_1)$getElementText()
}

That gives me the first 6 drop-down elements as a data frame. However, for the second (nested) drop-down I need to pair each sub-element with its parent in the result data frame:

for (i in 1:6) {
  q <- enexpr(i)
  xpath_1 <- glue("/html/body/div[1]/div[3]/div/div[2]/div[1]/div[{enexpr(q)}]/h2/a")
  driver$findElement("xpath", xpath_1)$clickElement()
  result[i,1] <- driver$findElement("xpath", xpath_1)$getElementText()
  
  for (a in 1:17) {
    b <- enexpr(a)
    xpath_2 <- glue("/html/body/div[1]/div[3]/div/div[2]/div[1]/div[1]/div/article[{enexpr(b)}]/h3/a")
    driver$findElement("xpath", xpath_2)$clickElement()
    result[a,2] <- driver$findElement("xpath", xpath_2)$getElementText() 
  } 
}

In the result, column 2 contains only the sub-elements of the first drop-down, although the sub-elements of the other drop-downs follow the same XPath pattern. My aim is to get a table that pairs each drop-down with its sub-elements, as below:

Col1      Col2
 a         1 
 a         2
 a         3
 b         8
 b         9

Can anyone help me figure out what I can do?



Solution 1:[1]

Here is a partial answer.

To get the text of the parents we can do:

remDr$getPageSource()[[1]] %>% read_html() %>% 
  html_nodes('body > div.main_wrapper > div.inner_content.long > div > div.analiz_wrapper > div.analiz_left.css_scroll > div:nth-child(n)') %>%
   html_nodes('h2') %>% html_text()
[1] "Laborator müayinələr"   "Funksional müayinələr"  "Poliklinik müayinələr"  "Həkim konsultasiyaları" "Hədiyyə kartları"       "Endirimli müayinələr"

To get the text of parents > child we can do the following. (This fetches the text for the first parent only; you need to write a loop to get the text from the other parents.)

remDr$getPageSource()[[1]] %>% read_html() %>% 
  html_nodes('body > div.main_wrapper > div.inner_content.long > div > div.analiz_wrapper > div.analiz_left.css_scroll > div:nth-child(1) > div > article:nth-child(n)') %>%
  html_nodes('h3') %>% html_text()

 [1] "COVID-19 testi"                   "Qanın müayinəsi"                  "Sidiyin müayinəsi"                "Nəcisin müayinəsi"
 [5] "Spermanın müayinəsi"              "Urogenital sıyrıntının müayinəsi" "Likvorun müayinəsi"               "Saçın müayinəsi"
 [9] "Abortiv materialın müayinəsi"     "Ana südünün müayinəsi"            "Plevral mayenin müayinəsi"        "Prostat vəzi şirəsinin müayinəsi"
[13] "Bəlğəmin müayinəsi"               "Bioptatın müayinəsi"              "Konyuktivadan sıyrıntı"           "Yaradan sıyrıntının müayinəsi"
[17] "Digər biomateriallar"

Finally, to get the text of parents > child > child we can do the following. (This too covers only one case, the second child of the first parent; you need to write a loop for the others.)

remDr$getPageSource()[[1]] %>% read_html() %>% 
  html_nodes('body > div.main_wrapper > div.inner_content.long > div > div.analiz_wrapper > div.analiz_left.css_scroll > div:nth-child(1) > div > article:nth-child(2) > div > ul > li:nth-child(n)') %>%
  html_text()
[1] "Hematoloji müayinələr"                 "Biokimyəvi müayinələr"                 "Hormonal müayinələr"
[4] "İmmunoloji müayinələr"                 "Allerqoloji müayinələr"                "İnfeksion xəstəliklərin diaqnostikası"
[7] "Bakterioloji müayinə"                  "Genetik analizlər"
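
The loop over the other parents could look roughly like the sketch below. It is not part of the original answer: scrape_nested() is a hypothetical helper name, and the code assumes that each parent <div> under div.analiz_left.css_scroll holds one h2 heading plus its article > h3 sub-headings, as the selectors above suggest. A small inline HTML string (toy_html, also made up) stands in for the real page so the example runs without a browser.

```r
library(rvest)

# Hypothetical helper: pair every parent heading (h2) with the
# sub-headings (h3) found inside the same parent <div>.
scrape_nested <- function(page_source) {
  page <- read_html(page_source)
  # One node per top-level drop-down (assumed page structure)
  parents <- html_nodes(page, "div.analiz_left.css_scroll > div")

  rows <- lapply(parents, function(parent) {
    parent_text <- html_text(html_node(parent, "h2"), trim = TRUE)
    # Searching from `parent` keeps only this parent's children
    child_text <- html_text(html_nodes(parent, "article > h3"), trim = TRUE)
    # Keep parents without children as a single row with NA
    if (length(child_text) == 0) child_text <- NA_character_
    data.frame(Col1 = parent_text, Col2 = child_text,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy stand-in for remDr$getPageSource()[[1]]
toy_html <- '
<div class="analiz_left css_scroll">
  <div><h2><a>a</a></h2><div>
    <article><h3><a>1</a></h3></article>
    <article><h3><a>2</a></h3></article>
  </div></div>
  <div><h2><a>b</a></h2><div>
    <article><h3><a>8</a></h3></article>
  </div></div>
</div>'

result <- scrape_nested(toy_html)
print(result)
```

With a live session the call would presumably be scrape_nested(remDr$getPageSource()[[1]]), provided the collapsed sub-lists are already present in the page source. Because each child lookup starts from its own parent node rather than from a fixed div:nth-child(1), every row of Col2 stays attached to the right Col1 value. The third level could be added the same way by looping over the article nodes inside each parent.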

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nad Pat