How can I write the href links from multiple .html files into a single .txt output file, along with each file's name?
import glob
from bs4 import BeautifulSoup

links_with_text = []
textfile = open("a_file.txt", "w")

for filename in glob.iglob('*.html'):
    with open(filename) as f:
        soup = BeautifulSoup(f, 'html.parser')
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
    print(links_with_text)
    for element in links_with_text:
        textfile.write(element + "\n")

textfile.close()
Sample output:
file name:
- link1
- link2
- link3
file name2:
- link1
- link2
- link3
file name3:
- link1
- link2
- link3
I found a somewhat related post, but there the output is printed to multiple text files; here I would like to have the file names with their links in one text file:
BeautifulSoup on multiple .html files
Please suggest. Thank you in advance.
Solution 1:[1]
To have the filename at the top of each block, just add another .write() line as follows:
import glob
from bs4 import BeautifulSoup

with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        # Write the source filename as a header before its links
        textfile.write(f"{filename}:\n")
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
        for element in links_with_text:
            textfile.write(f"    {element}\n")
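If you also want the dash-prefixed format shown in the sample output above, only the final write changes. A minimal sketch (the "- " prefix is taken from the question's sample output, and 'html.parser' is an assumed parser choice):

import glob
from bs4 import BeautifulSoup

with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        textfile.write(f"{filename}:\n")
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')
        for a in soup.find_all('a', href=True):
            if a.text:
                # "- " matches the bullet style in the question's sample output
                textfile.write(f"- {a['href']}\n")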
Solution 2:[2]
I made a similar thing, but with img tags; maybe it will help you:
import re
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

link = input('Url is: ')
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
# Find every <img> whose src contains ".jpg"
images = bs.find_all('img', {'src': re.compile('.jpg')})

# First pass: write each image URL to a cache file
# (prepend the scheme, since the src values here are protocol-relative)
with open("cache.txt", "w") as f:
    for image in images:
        f.write('https:' + image['src'] + '\n')

# Second pass: read the URLs back and download each image
with open('cache.txt') as f:
    for line in f:
        url = line.rstrip('\n')
        path = 'image' + url.split('/')[-1]
        urllib.request.urlretrieve(url, path)
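The same attribute/regex filter also works on a tags, which ties this back to the original question. A minimal sketch, assuming you only want hrefs that match a pattern (the 'https' pattern here is an invented example):

import re
import glob
from bs4 import BeautifulSoup

with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        textfile.write(f"{filename}:\n")
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')
        # A compiled regex in the attrs dict filters by attribute value,
        # just like the img/src lookup above
        for a in soup.find_all('a', {'href': re.compile('https')}):
            textfile.write(f"- {a['href']}\n")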
Solution 3:[3]
Try this:
with open("a_file.txt", "a") as textfile: # "a" to append string
for filename in glob.iglob('*.html'):
with open(filename) as f:
soup = BeautifulSoup(f)
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
links_with_text = "\n".join(links_with_text)
textfile.write(f"{filename}\n{links_with_text}\n")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Martin Evans |
| Solution 2 | Marius Gabriel |
| Solution 3 | uingtea |
