Need help parsing link from iframe using BeautifulSoup and Python3

I have this URL, and I'm trying to get the video's source link, but it's located within an iframe. The video URL is https://ndisk.cizgifilmlerizle.com... inside an iframe with the class vjs_iframe. My code is below:

import requests
from bs4 import BeautifulSoup
url = "https://m.wcostream.com/my-hero-academia-season-4-episode-5-english-dubbed"
r = requests.Session() 
headers = {"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0"} # Noticed that website responds better with headers
req = r.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
iframes = soup.find_all("iframe") # Returns an empty list
vjs_iframe = soup.find_all(class_="vjs_iframe") # Also returns an empty list

I don't know how to get the URL within the iframe, since not even the iframe's source is loaded on the first request. Is getting the https://ndisk.cizgifilmlerizle.com... URL even possible with BeautifulSoup, or would I need another library like Selenium or something else? Thanks in advance!



Solution 1:[1]

This website is quite tricky: the iframe is generated by a semi-obfuscated script just after the <meta itemprop="embedURL"> tag (reformatted here for readability):

<script>var nUk = ""; var BBX = ["RnF6Mjk4NzU0MkRXcw==", "Tnl0Mjk4NzU4N3NidA==", 
// TONS OF STRINGS IN THE ARRAY
];
BBX.forEach(function EtT(value) {
   nUk += String.fromCharCode(
      parseInt(atob(value).replace(/\D/g,'')) - 2987482); 
});
document.write(decodeURIComponent(escape(nUk)));</script>

The variable names are auto-generated: reload the page and they change, but the obfuscation technique stays the same. Each element of the array (value in the forEach loop) encodes one character. Here is what the script does:

  • it decodes the base64 string (atob)
  • removes all non-digit characters (the replace thing), so you have a number
  • subtracts 2987482 (another auto-generated number which changes at every request)
  • converts it to a character (the fromCharCode call)
  • merges all characters
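The same steps can be replicated in Python. Here is a sketch (decode_iframe is my name for the helper), run against the two array segments quoted above, with the offset from that same page load:

```python
import base64
import re

def decode_iframe(segments, offset):
    # Mirror the page's forEach: atob -> strip non-digits -> subtract offset -> chr
    chars = []
    for seg in segments:
        digits = re.sub(r"\D", "", base64.b64decode(seg).decode())
        chars.append(chr(int(digits) - offset))
    return "".join(chars)

# The two segments shown in the script above, with that request's offset:
print(decode_iframe(["RnF6Mjk4NzU0MkRXcw==", "Tnl0Mjk4NzU4N3NidA=="], 2987482))
# prints "<i" -- the start of the injected "<iframe ..." markup
```

Remember that both the segments and the offset are regenerated on every request, so they have to be scraped fresh from the page each time.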

If you run that code in a Firefox/Chromium console, omitting the document.write() and printing the variable instead, you will see the iframe markup that document.write() injects into the page.

You should be able to pass that JavaScript to an interpreter to obtain the iframe markup and capture the URL; then you can scrape that URL.

To scrape this website in Python, you need a JavaScript interpreter, or you can work really hard with regexps, e.g. soup.find(string=re.compile('.*atob\(')), and then replicate what the JavaScript does in the browser. It's really overkill; do it only for learning purposes. If your goal is just to download what's inside the iframe, it may be easier to find another website.
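As a rough sketch of the regexp route, run here against a hard-coded copy of the script above (on the live site the variable names, segments, and offset all differ on every request, so you would feed in the script text scraped from the page instead):

```python
import re

# Hard-coded stand-in for the obfuscated <script> body scraped from the page
script = ('var nUk = ""; var BBX = ["RnF6Mjk4NzU0MkRXcw==", "Tnl0Mjk4NzU4N3NidA=="]; '
          "BBX.forEach(function EtT(value) { nUk += String.fromCharCode("
          "parseInt(atob(value).replace(/\\D/g,'')) - 2987482); });")

# Pull out the base64 segments and the per-request offset with regexps
segments = re.findall(r'"([A-Za-z0-9+/=]+)"', script)
offset = int(re.search(r'-\s*(\d+)\s*\)', script).group(1))

print(segments)  # ['RnF6Mjk4NzU0MkRXcw==', 'Tnl0Mjk4NzU4N3NidA==']
print(offset)    # 2987482
```

With segments and offset in hand, the decoding loop from above reproduces the iframe markup.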

I recommend using the lxml parser, if you are able to install that library. Moreover, I really recommend scrapy; it's a gorgeous piece of software.

Solution 2:[2]

My approach to scraping their content is as follows. I don't know if you still need this, but I was searching for problems with that https://ndisk.cizgifilmlerizle.com website and saw this question; figured it might help someone else. It's crude, but it gets the job done.

import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from time import sleep
import os
import string


#   tab 5, space, up arrow 2, space
def do_keys(key, num_times, action_chain):
    for x in range(num_times):
        action_chain.send_keys(key)


def cls():
    print("\033[2J")


# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    count = 274
    # Stuck on 274 - 500.  273 also failed.
    attempts = 0
    while count < 501:
        url = f"https://www.wcostream.com/naruto-shippuden-episode-{count}"
        video_dir = f"{os.path.dirname(os.path.realpath(__file__))}\\videos\\"
        default_video_name = f"{video_dir}getvid.mp4"
        if not os.path.exists(video_dir):
            os.mkdir(video_dir)
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--mute-audio')
        options.add_experimental_option("prefs", {
            "download.default_directory": video_dir,
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
        })
        browser = webdriver.Chrome(options=options)
        # browser = webdriver.Chrome()
        browser.get(url)
        sleep(1)
        title_element = None
        try:
            title_element = browser.find_element(By.XPATH,
                                 "//*[@id=\"content\"]/table/tbody/tr/td[1]/table/tbody/tr/td/table[1]/tbody/tr[2]/td/div[2]/b/b[1]")
        except Exception as e:
            title_element = browser.find_element(By.XPATH,
                                                 "//*[@id=\"content\"]/table/tbody/tr/td[1]/table/tbody/tr/td/table[1]/tbody/tr[2]/td/div[2]/b[2]")

        title = title_element.text.lower().translate(str.maketrans('', '', string.punctuation)).replace(' ', '_')
        new_video_name = f"{video_dir}episode_{count}_{title}.mp4"
        cls()
        print(f"Title: {title}")

        # Below is working.
        browser.switch_to.frame(browser.find_element(By.XPATH, "//*[@id=\"frameNewcizgifilmuploads0\"]"))

        results = browser.page_source

        soup = BeautifulSoup(results, "html.parser")
        video_url = soup.find("video").get("src")

        print(f"URL:\t{video_url}")
        browser.get(video_url)

        element = browser.find_element(By.TAG_NAME, "video")
        sleep(1)
        actions = ActionChains(browser)
        actions.send_keys(Keys.SPACE)
        actions.perform()

        sleep(1)
        do_keys(Keys.TAB, 5, actions)
        do_keys(Keys.SPACE, 1, actions)
        do_keys(Keys.UP, 2, actions)
        do_keys(Keys.SPACE, 1, actions)
        actions.perform()
        start = time.time()
        print(f"Downloading: {new_video_name}")

        browser_open = True
        timeout = 0
        while browser_open:
            if os.path.isfile(default_video_name):
                if os.path.exists(new_video_name):
                    os.remove(default_video_name)
                    end = time.time()
                    print(f"Already Exists! [{end - start}s]")
                else:
                    os.rename(default_video_name, new_video_name)
                    end = time.time()
                    print(f"Download complete! [{end - start}s]")
                count += 1
                browser_open = False
                browser.close()
            try:
                _ = browser.window_handles
            except Exception as e:
                browser_open = False
            if timeout > 50:
                attempts += 1
                print(f"Download Timed Out.  Trying again. [{attempts}]")
                browser_open = False
                browser.close()
            else:
                attempts = 0
            timeout += 1
            sleep(1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: danieli
Solution 2: Bill Kleinhomer