Parsing the HTML of Instagram's log-in page with BeautifulSoup on Python 3.9.10
Basically, I am trying to build a program that can identify log-in pages by URL. My idea is to parse each page in search of text boxes (and then identify them by name and type). Here is the code:
import requests
from bs4 import BeautifulSoup


# Parse the page HTML (soup) and collect the names of text-like inputs.
def parse(soup):
    found = []
    for a in soup.find_all('input'):
        # .get() avoids a KeyError on inputs without a type or name attribute.
        if a.get('type') in ['text', 'password', 'email'] and a.get('name'):
            found.append(a['name'])
    return found


# Get the site's HTML.
def get_site_content(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html5lib')
    textBoxes = parse(soup)
    print("Found in: " + url)
    print(textBoxes)


if __name__ == '__main__':
    get_site_content('https://login.facebook.com')
    get_site_content('https://www.instagram.com/accounts/login/')
    get_site_content('https://instagram.com')
    get_site_content('https://instagram.com/login')
    get_site_content('https://login.yahoo.com')
This seems to work just fine, but for some reason I've had problems with Instagram's log-in page. Here is the output:
Found in: https://login.facebook.com
['email', 'pass']
Found in: https://www.instagram.com/accounts/login/
[]
Found in: https://instagram.com
[]
Found in: https://instagram.com/login
[]
Found in: https://login.yahoo.com
['username', 'passwd']
Process finished with exit code 0
After trying different libraries for fetching the HTML and different parsers, I've come to understand that the problem is with the html = requests.get(url) line: it just doesn't get the full HTML.
Any ideas on how to fix this?
Thanks in advance!
By the way, if you have a better idea for what I am trying to accomplish, I would love to hear it :)
Solution 1:[1]
Alright, so thanks to @user:14460824 (HedgHog), I have come to realize that the problem was that the page needs to be rendered, since its content is generated dynamically by JavaScript. Personally, I didn't like Selenium and used requests-html instead. It renders the page much like Selenium does, but it feels easier to use. Once I figure out how to tell whether a web page is rendered dynamically by JavaScript, this library will also make it easy to render only those pages, so I won't waste resources. Here is the code:
from requests_html import HTMLSession
import requests


# Parse the page HTML and collect the names of text-like inputs.
def parse(html):
    found = []
    for a in html.find('input'):
        # .get() avoids a KeyError on inputs without a type attribute.
        if a.attrs.get('type') in ['text', 'password', 'email'] and 'name' in a.attrs:
            found.append(a.attrs['name'])
    return found


# Get the site's HTML.
def get_site_content(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        # TODO: find a way to tell whether the page is rendered dynamically
        # from JavaScript, and call render() only in that case.
        response.html.render(timeout=20)  # for now, render all pages
        return response.html
    except requests.exceptions.RequestException as e:
        print(e)
        return None


def find_textboxes(url):
    html = get_site_content(url)
    # get_site_content returns None when the request fails.
    textBoxes = parse(html) if html is not None else []
    print("Found in: " + url)
    print(textBoxes)


if __name__ == '__main__':
    find_textboxes('https://login.facebook.com')
    find_textboxes('https://www.instagram.com/accounts/login/')
    find_textboxes('https://instagram.com')
    find_textboxes('https://login.yahoo.com')
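The open question in the code above is how to tell whether a page needs rendering at all. One possible heuristic, offered as a sketch rather than a definitive answer: parse the unrendered HTML first and render only when it contains no text boxes. The example below uses only the standard library's html.parser so it runs without a network connection. The names InputCollector, static_textboxes, and needs_rendering are hypothetical, and the rule itself is an assumption: it will misjudge pages that ship a partial static form and add fields later with JavaScript.

```python
from html.parser import HTMLParser

TEXTBOX_TYPES = {"text", "password", "email"}


class InputCollector(HTMLParser):
    """Collect the names of text-like <input> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # Attribute values can be None (e.g. <input disabled>), hence "or".
        input_type = (attributes.get("type") or "").lower()
        if input_type in TEXTBOX_TYPES and attributes.get("name"):
            self.names.append(attributes["name"])


def static_textboxes(html_text):
    """Return the text-box names present in the unrendered HTML."""
    collector = InputCollector()
    collector.feed(html_text)
    return collector.names


def needs_rendering(html_text):
    """Assumed heuristic: no text boxes in the static HTML, so try rendering."""
    return not static_textboxes(html_text)


if __name__ == "__main__":
    # Made-up sample documents standing in for real responses.
    static_page = '<form><input type="email" name="email"><input type="password" name="pass"></form>'
    js_page = '<body><div id="root"></div><script src="/bundle.js"></script></body>'
    print(needs_rendering(static_page))
    print(needs_rendering(js_page))
```

In get_site_content, the render call could then be guarded with `if needs_rendering(response.text):`, so pages that already serve their form statically skip the browser step.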
Solution 2:[2]
The content is provided dynamically by JavaScript, which requests does not execute. To get the rendered page_source, use Selenium.
You could also select your elements more specifically:
for a in soup.select('input[name]'):
Example
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))


def parse(soup):
    found = []
    # input[name] only matches inputs that have a name attribute.
    for a in soup.select('input[name]'):
        if a.get('type') in ['text', 'password', 'email']:
            found.append(a['name'])
    return found


def get_site_content(url):
    driver.get(url)
    time.sleep(2)  # give the page's JavaScript time to build the form
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    textBoxes = parse(soup)
    print("Found in: " + url)
    print(textBoxes)


if __name__ == '__main__':
    get_site_content('https://login.facebook.com')
    get_site_content('https://www.instagram.com/accounts/login/')
    get_site_content('https://instagram.com')
    get_site_content('https://instagram.com/login')
    get_site_content('https://login.yahoo.com')
    driver.quit()  # close the browser when done
Output
Found in: https://login.facebook.com
['email', 'pass']
Found in: https://www.instagram.com/accounts/login/
['username', 'password']
Found in: https://instagram.com
['username', 'password']
Found in: https://instagram.com/login
['username', 'password']
Found in: https://login.yahoo.com
['username', 'passwd']
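With the fields collected, the original goal of deciding whether a URL is a log-in page still needs a rule. A minimal sketch, assuming parse were changed to return (name, type) pairs instead of names only (a hypothetical change, as is the function name looks_like_login): treat a page as a log-in page when it has a password input alongside at least one text or email input.

```python
def looks_like_login(fields):
    """Guess whether a page is a log-in page.

    fields: list of (name, type) pairs for the page's inputs. This shape is
    an assumption; the parse functions above return names only.
    """
    types = [input_type.lower() for _, input_type in fields]
    has_password = "password" in types
    has_identifier = any(t in ("text", "email") for t in types)
    return has_password and has_identifier


if __name__ == "__main__":
    # Hand-written sample pairs, not scraped results.
    print(looks_like_login([("username", "text"), ("password", "password")]))
    print(looks_like_login([("q", "text")]))
```

This rule is deliberately loose: a sign-up form would match it too, so field names or the number of password inputs may be needed to separate the two cases.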
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user10696838 |
| Solution 2 |
