How can I scrape Medium content and get all the h1 and p tags as strings?
I have been trying to scrape Medium content but was unable to get all the h1 tags. I was able to get every p tag through to the end, but the h1 tags that appear in between the text are missing.
I want to scrape all the content in order of appearance, including the subheadings in h1 tags.
This is what I have done:
# import stuff
import requests
import bs4
import os
import shutil
from PIL import Image

article_URL = 'https://medium.com/bhavaniravi/build-your-1st-python-web-app-with-flask-b039d11f101c' #@param {type:"string"}
# article_URL = 'https://www.tmz.com/2020/07/29/dr-dre-answers-wife-divorce-petition-prenup/'
response = requests.get(article_URL)
soup = bs4.BeautifulSoup(response.text,'html')
paragraphs = soup.find_all(['li', 'p', 'strong', 'em'])
title = soup.find(['h1','title']).get_text()
print(title)
txt_list = []
tag_list = []
with open('content2.txt', 'w') as f:
    f.write(title + '\n\n')
    for p in paragraphs:
        if p.href:
            pass
        else:
            if len(p.get_text()) > 100: # this filters out things that are most likely not part of the core article
                # print(p.href)
                tag_list.append(p.name)
                txt_list.append(p.get_text())
    txt_list2 = []
    tag_list2 = []
    for i in range(len(txt_list)):
        # if '\n' not in txt_list[i]:
        print(txt_list[i])
        # print(len(txt_list[i]))
        # print(tag_list[i])
        print()
        comp1 = txt_list[i].split()[0:5]
        comp2 = txt_list[i-1].split()[0:5]
        if comp1 == comp2:
            pass
        else:
            pass
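For reference, here is a minimal sketch of the behaviour being asked for: headings and paragraphs collected together, in order of appearance. Two things in the code above would drop headings: 'h1' is not in the list passed to find_all, and the length filter (> 100 characters) would discard most short subheadings even if it were. The sketch assumes that passing 'h1' alongside 'p' and 'li' in a single find_all call is enough, since BeautifulSoup returns matches in document order. SAMPLE_HTML and extract_in_order are hypothetical names used only for illustration, and the sample markup is a stand-in; the real Medium page may use different tags (for example h2 or h3) for subheadings.

```python
import bs4

# Hypothetical stand-in for a downloaded article page (not real Medium markup).
SAMPLE_HTML = """
<html><head><title>Demo</title></head><body>
<article>
<h1>Build your first app</h1>
<p>Intro paragraph.</p>
<h1>Setting up</h1>
<p>Install the package.</p>
<ul><li>Step one</li></ul>
<p>Closing words.</p>
</article>
</body></html>
"""


def extract_in_order(html, tags=("h1", "p", "li")):
    """Return (tag_name, text) pairs for the given tags, in document order."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    items = []
    # A single find_all call with a list of names keeps document order,
    # so headings stay interleaved with the paragraphs around them.
    for tag in soup.find_all(list(tags)):
        text = tag.get_text(" ", strip=True)
        if text:  # skip empty tags; no length filter, so short headings survive
            items.append((tag.name, text))
    return items


content = extract_in_order(SAMPLE_HTML)
for name, text in content:
    print(f"[{name}] {text}")
```

With a real page, response.text from requests.get would take the place of SAMPLE_HTML. If a length filter is still wanted to drop navigation text, it could be applied to 'p' and 'li' only, so that headings are kept regardless of length. Note that nested matches (for example a p inside an li) would appear twice with this approach.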
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
