Unable to scrape URL from page using Python and BeautifulSoup. Any ideas?
As the title suggests, I'm playing around with a Twitter bot that scrapes RSS feeds and tweets each article's title and a link to it.
For some reason, the code below runs without errors but doesn't retrieve the URL. Any suggestions are gratefully received.
from bs4 import BeautifulSoup
import requests
url = "https://www.kdnuggets.com/feed"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, "html.parser")
items = soup.findAll('item')
item = items[1]
print(item.title.text)
print(item.link.text)
The title prints fine, but the link is nowhere to be found. For reference, below is a copy of the markup that is returned for this item.
<item>
<title>An Overview of Logistic Regression</title>
<link/>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html
<comments>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html#disqus_thread</comments>
<dc:creator><![CDATA[Matt Mayo Editor]]></dc:creator>
<pubdate>Fri, 04 Feb 2022 13:00:11 +0000</pubdate>
<category><![CDATA[2022 Feb Tutorials, Overviews]]></category>
<category><![CDATA[Machine Learning]]></category>
<guid ispermalink="false">https://www.kdnuggets.com/?p=137943</guid>
<description><![CDATA[Logistic regression is an extension of linear regression to solve classification problems. Read more on the specifics of this algorithm here.]]></description>
<wfw:commentrss>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html/feed</wfw:commentrss>
<slash:comments>0</slash:comments>
</item>
Thanks in advance.
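Edit: here is a minimal reproduction of what seems to be happening, using a hypothetical one-item feed snippet rather than the live KDnuggets feed. HTML parsers treat `<link>` as a void (self-closing) element, so the URL ends up as a text node *next to* the empty tag, which matches the `<link/>` in the dump above:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal feed fragment with the same <item>/<link> shape
# as the real feed.
rss = "<item><title>T</title><link>https://example.com/post</link></item>"

# html.parser treats <link> as a void element, so its contents become
# a sibling text node instead of ending up inside the tag.
soup = BeautifulSoup(rss, "html.parser")
item = soup.find("item")
print(repr(item.link.text))    # '' - the tag itself is empty
print(item.link.next_sibling)  # https://example.com/post - the URL is a sibling
```

So `item.link.text` is empty by construction; the URL is reachable as the tag's next sibling instead.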
Solution 1:[1]
Try this?
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags and print each href attribute.
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
From the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Solution 2:[2]
Try this one. I think you need to loop through the children of the "item" to collect all the links.
from bs4 import BeautifulSoup
import requests

url = "https://www.kdnuggets.com/feed"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, "html.parser")

items = soup.find_all('item')
item = items[1]

# The URL is not inside the <link/> tag, so collect the text of every
# child of the item instead; the link shows up as one of the strings.
links = []
for link in item:
    links.append(link.text)

print(item.title.text)
print(links)
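Note: if `lxml` is installed, another option is to parse the feed as XML rather than HTML. XML has no notion of void elements, so `<link>` keeps its contents and `item.link.text` works as the question expected. A minimal sketch, using the same hypothetical feed fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal feed fragment; the real feed at
# https://www.kdnuggets.com/feed has the same <item>/<link> shape.
rss = "<item><title>T</title><link>https://example.com/post</link></item>"

# features="xml" uses lxml's XML parser (assumes lxml is installed);
# in XML, <link> is an ordinary element and keeps its contents.
soup = BeautifulSoup(rss, "xml")
item = soup.find("item")
print(item.link.text)  # https://example.com/post
```

This avoids the sibling-text-node workaround entirely.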
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | x0xRumbleLorex0x |
| Solution 2 | Munir |
