'how can I pass the multiple .html file names in to a single txt output file that outputs all the href links in html along with their file names?

import pandas as pd
import glob
import csv
import re
from bs4 import BeautifulSoup
links_with_text = []
textfile = open("a_file.txt", "w")
 for filename in glob.iglob('*.html'):
  with open(filename) as f:
    soup = BeautifulSoup(f)


    links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

    print(links_with_text)
    
    for element in links_with_text:
      textfile.write(element + "\n")
    

sample Output:

file name:

  • link1
  • link2
  • link3

file name2:

  • link1
  • link2
  • link3

file name3:

  • link1
  • link2
  • link3

I found a post some what related to mine but there it prints the output in multiple text files but here I would like to have those file names with their links in one textfile.

BeautifulSoup on multiple .html files

Please suggest. Thank you in advance



Solution 1:[1]

To have the filename at the top of each block, just add another .write() line as follows:

from bs4 import BeautifulSoup
import glob
import csv

links_with_text = []

with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        textfile.write(f"{filename}:\n")
        
        with open(filename) as f:
            soup = BeautifulSoup(f)
            links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
            
            for element in links_with_text:
                textfile.write(f"  {element}\n")        

Solution 2:[2]

I made a similar thing but with img maybe it will help you:

link = input('Url is: ')
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src':re.compile('.jpg')})
f= open("cache.txt","w+")
for image in images: 
    url = ('https:' + image['src']+'\n')
    f.write(url)

with open('cache.txt') as f:
   for line in f:
      url = line
      path = 'image'+url.split('/', -1)[-1]
      urllib.request.urlretrieve(url, path.rstrip('\n'))

Solution 3:[3]

try this

with open("a_file.txt", "a") as textfile: # "a" to append string
    for filename in glob.iglob('*.html'):
        with open(filename) as f:
            soup = BeautifulSoup(f)
            links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
            links_with_text = "\n".join(links_with_text)
            textfile.write(f"{filename}\n{links_with_text}\n")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Martin Evans
Solution 2 Marius Gabriel
Solution 3 uingtea