How to obtain full HTML with all Shadow DOM elements expanded

I'm stuck with this one. I'm writing a web crawler that needs to get the HTML of each page. The problem arises with content rendered by JavaScript: for that I need something like Selenium to obtain the full HTML.

That's fine and works pretty well for pages built with, for example, Angular. The trouble starts with pages written in Polymer or any other framework that uses Shadow DOM and web components. In that case I only get the content up to the first shadow root. The code I use:

driver.execute_script("return document.body.innerHTML")

Yeah... So I would like to build a string with all custom elements expanded inline. All I get is:

<some-app page="homepage"></some-app><iron-a11y-announcer></iron-a11y-announcer>

As you can imagine, that's not enough. I know that I can access each shadow root recursively, like this:

document.querySelector("some-app").shadowRoot

I want to make it generic. Any ideas? Any ready-made solutions?



Solution 1:[1]

One dirty solution I made:

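from selenium.webdriver.common.by import By

# `driver` is assumed to be an already created Chrome WebDriver instance.
# tagName is upper-case for HTML elements but lower-case for SVG elements,
# so the comparison below is done in lower case.
TAGS_TO_SKIP = {"template", "svg", "g", "path", "style", "img", "video"}
counter = 0  # number of elements visited so far

def expand_element(element):
    """Recursively visit element and its descendants, entering shadow roots."""
    global counter
    tag = element.get_attribute("tagName")
    if tag.lower() in TAGS_TO_SKIP:
        return
    print(tag)
    counter += 1
    if counter % 100 == 0:
        print("===================", counter, "==================")

    # Direct light-DOM children by default.
    subelements = element.find_elements(By.XPATH, "./*")
    shadowroot = expand_shadow_element(element)
    if shadowroot:
        # The element hosts a shadow root: walk its direct children instead.
        # (querySelectorAll("*") would return every descendant, so the
        # recursion would visit nested elements more than once.)
        subelements = driver.execute_script('return Array.from(arguments[0].children)', shadowroot)

    for obj in subelements:
        expand_element(obj)

def expand_shadow_element(element):
    """Return the element's open shadow root, or None if it has none."""
    return driver.execute_script('return arguments[0].shadowRoot', element)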

This works only with the Chrome WebDriver, and I still have to concatenate the results, but it's the basic mechanism.
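
The concatenation step could also be done in the browser in a single execute_script call, which avoids one WebDriver round trip per element. What follows is only a rough sketch, not the code from the answer above: the names EXPAND_SHADOW_JS and get_expanded_html are made up for illustration, it assumes `driver` is any object with Selenium's execute_script method, and it can only see open shadow roots (for closed ones `shadowRoot` is null). Each shadow tree is written out inside a `<template shadowrootmode="open">` wrapper, borrowed from declarative Shadow DOM syntax, so it stays distinguishable from the light DOM. The serialization is approximate: comments are dropped and the contents of existing `<template>` elements are not included.

```python
# JavaScript that serializes a DOM subtree to a string, inlining open shadow roots.
EXPAND_SHADOW_JS = r"""
const VOID = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                      'input', 'link', 'meta', 'source', 'track', 'wbr']);
const RAW_TEXT = new Set(['script', 'style']);
const escText = s => s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
const escAttr = s => s.replace(/&/g, '&amp;').replace(/"/g, '&quot;');

function serialize(node) {
  if (node.nodeType === Node.TEXT_NODE) {
    // Text inside <script>/<style> must stay unescaped.
    const parent = node.parentNode;
    const raw = parent && RAW_TEXT.has(parent.localName);
    return raw ? node.data : escText(node.data);
  }
  if (node.nodeType !== Node.ELEMENT_NODE) return '';  // skip comments etc.

  const tag = node.localName;
  let html = '<' + tag;
  for (const attr of Array.from(node.attributes)) {
    html += ' ' + attr.name + '="' + escAttr(attr.value) + '"';
  }
  html += '>';
  if (VOID.has(tag)) return html;  // void elements have no end tag

  if (node.shadowRoot) {
    // Open shadow root: emit its tree before the light-DOM children.
    html += '<template shadowrootmode="open">';
    for (const child of Array.from(node.shadowRoot.childNodes)) {
      html += serialize(child);
    }
    html += '</template>';
  }
  for (const child of Array.from(node.childNodes)) {
    html += serialize(child);
  }
  return html + '</' + tag + '>';
}

return serialize(arguments[0] || document.documentElement);
"""


def get_expanded_html(driver, root=None):
    """Return the HTML of `root` (default: the whole document) as one string,
    with the content of every open shadow root inlined.

    `driver` is assumed to expose Selenium's execute_script(script, *args).
    """
    return driver.execute_script(EXPAND_SHADOW_JS, root)
```

After driver.get(url), calling get_expanded_html(driver) should return one string for the whole page, and passing a WebElement as root limits the output to that subtree. I have not checked this against Polymer pages specifically, so treat it as a starting point.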

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1