When I scrape multiple pages in a loop I get all the links, but when I use a list comprehension I only get some of the links
I am using requests and BeautifulSoup to scrape a website. I am trying to learn how to scrape with different methods for different purposes, and I am using a press release website to do that. I am trying to scrape each article from each link on each page. This is a multi-page scrape: I first collect the links to all the articles from each page, then loop through those links and scrape the content of each one.
I am having trouble with the first part where I scrape all the links and save them to a variable so I can then use it for the next step of scraping content from each link.
I was able to get each link with this code:
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://www...page='
for page in range(1,32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    for link in soup.find_all('a',
                              attrs={'href': re.compile("^https://www...")}):
        # print(link.get('href'))
        soup_link = link.get('href') +'\n'
        print(soup_link)
The output is all the links from each of the pages in the specified range (pages 1 to 31, since range(1,32) stops before 32). Exactly what I want!
However, I want to save the output to a variable so I can use it in my next function to scrape the content of each link as well as to save the links to a .txt file.
When I change the code above to save the output to a variable, I only get a small number of seemingly random links, not all the links the first version printed.
URL = 'https://www....page='
for page in range(1,32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', attrs={'href':
             re.compile("^https://...")})]
The output is a few random links, not the full list I get from the first snippet.
What am I doing wrong?
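One likely explanation, offered as a sketch rather than a confirmed answer: in the second snippet, `links = [...]` runs once per page, so each iteration replaces the previous page's list and only the last page's links remain after the loop. The example below creates the list once, before the loop, and extends it on every page. The names `collect_links`, `save_links`, and `sample_pages`, as well as the `example.com` URLs, are hypothetical stand-ins so the code can run without network access.

```python
import re
from bs4 import BeautifulSoup


def collect_links(html_pages, href_pattern):
    """Return every matching href found across all pages, in page order."""
    all_links = []  # created once, before the loop, so nothing is overwritten
    for html_document in html_pages:
        soup = BeautifulSoup(html_document, 'html.parser')
        page_links = [a['href'] for a in soup.find_all('a', href=href_pattern)]
        all_links.extend(page_links)  # add this page's links to the running list
    return all_links


def save_links(links, path):
    """Write one link per line to a text file."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(links))


# Stand-in pages for illustration; example.com is a placeholder domain.
sample_pages = [
    '<a href="https://example.com/a">A</a><a href="/about">About</a>',
    '<a href="https://example.com/b">B</a>',
]
links = collect_links(sample_pages, re.compile(r'^https://example\.com/'))
print(links)
```

With the real site, the pages could be supplied as `(requests.get(URL + str(page)).text for page in range(1, 32))`, and the resulting list passed to `save_links` to produce the .txt file.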
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
