When I scrape multiple pages in a loop I get all the links, but with a list comprehension I only get some of them

I am using requests and BeautifulSoup to scrape a press release website as a way to learn different scraping methods. The goal is a multi-page scrape: first collect the links to all the articles from each listing page, then loop through those links and scrape the content of each article.

I am having trouble with the first part: scraping all the links and saving them to a variable that I can then use for the next step, scraping the content from each link.

I was able to get each link with this code:

import re

import requests
from bs4 import BeautifulSoup

URL = 'https://www...page='

for page in range(1, 32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')

    for link in soup.find_all('a', attrs={'href': re.compile("^https://www...")}):
        # print(link.get('href'))
        soup_link = link.get('href') + '\n'
        print(soup_link)

The output is every link from each page in the specified range (pages 1 to 31, since range(1, 32) stops before 32). Exactly what I want!

However, I want to save the output to a variable so I can pass it to my next function, which scrapes the content of each link, and I also want to save the links to a .txt file.
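For the .txt part, here is a minimal sketch, assuming the links are already in a Python list. The helper names `save_links` and `load_links`, the `example.com` links, and the output path are all hypothetical and only there for illustration.

```python
import tempfile
from pathlib import Path


def save_links(links, path):
    """Write one link per line to a text file."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")


def load_links(path):
    """Read the links back as a list, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Placeholder links; in the scraper this would be the list of scraped hrefs.
example_links = ["https://example.com/a", "https://example.com/b"]

# Hypothetical output location in the system temp directory.
out_file = Path(tempfile.gettempdir()) / "links.txt"

save_links(example_links, out_file)
print(load_links(out_file))
```

Writing one link per line keeps the file easy to read back later, so the content-scraping step could start from the file instead of re-scraping the listing pages.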

When I change the above code to save the output to a variable, I only get a limited number of seemingly random links, not all the links I was able to scrape with the code above.


URL = 'https://www....page='

for page in range(1, 32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')

    links = [link['href'] for link in soup.find_all('a', attrs={'href': re.compile("^https://...")})]

The output is only a few seemingly random links, not the full list I get from the first snippet.

What am I doing wrong?
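One likely explanation, assuming the second snippet is complete as posted: `links` is assigned inside the `for page` loop, so each page's list replaces the previous one. After the loop, only the links from the last page remain. The sketch below reproduces that with stand-in data instead of real requests. The `fake_pages` dictionary and the `example.com` links are hypothetical, and the inner comprehension stands in for the `soup.find_all(...)` comprehension.

```python
# Stand-in for the scraped pages: page number -> links found on that page.
# In the real scraper these would come from soup.find_all(...) on each page.
fake_pages = {
    1: ["https://example.com/a", "https://example.com/b"],
    2: ["https://example.com/c"],
    3: ["https://example.com/d", "https://example.com/e"],
}

# Pattern from the question: `links` is rebound on every pass of the loop.
for page in range(1, 4):
    links = [href for href in fake_pages[page]]

print(len(links))  # 2: only the links from page 3 survive

# Accumulating version: create the list once, then extend it per page.
all_links = []
for page in range(1, 4):
    all_links.extend(href for href in fake_pages[page])

print(len(all_links))  # 5: links from every page
```

Applied to the scraper, that would mean creating `all_links = []` before the loop and calling `all_links.extend(...)` with the existing comprehension on each page, rather than assigning to `links`.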



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow