Unable To Scrape url from page using Python and BeautifulSoup. Any ideas?

As the title suggests. I'm playing around with a Twitter bot that scrapes rss feeds and tweets the title of the article and a link.

For some reason, when I run the code below it runs without errors but doesn't retrieve the url link. Any suggestions are gratefully received.

from bs4 import BeautifulSoup
import requests

url = "https://www.kdnuggets.com/feed"
resp = requests.get(url)
soup = BeautifulSoup(resp.content)
items = soup.findAll('item')
item = items[1]

print(item.title.text)
print(item.link.text)

The title prints fine but the link is nowhere to be found. For reference, below is a copy of the XML that is returned for this item.

<item>
<title>An Overview of Logistic Regression</title>
<link/>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html
                                        <comments>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html#disqus_thread</comments>
<dc:creator>&lt;![CDATA[Matt Mayo Editor]]&gt;</dc:creator>
<pubdate>Fri, 04 Feb 2022 13:00:11 +0000</pubdate>
<category>&lt;![CDATA[2022 Feb Tutorials, Overviews]]&gt;</category>
<category>&lt;![CDATA[Machine Learning]]&gt;</category>
<guid ispermalink="false">https://www.kdnuggets.com/?p=137943</guid>
<description>&lt;![CDATA[Logistic regression is an extension of linear regression to solve classification problems. Read more on the specifics of this algorithm here.]]&gt;</description>
<wfw:commentrss>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html/feed</wfw:commentrss>
<slash:comments>0</slash:comments>
</item>

Thanks in advance.



Solution 1:[1]

Try this?

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

From the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Solution 2:[2]

Try this one. I think you need to loop through the children of "item" to get all the links.

from bs4 import BeautifulSoup
import requests

url = "https://www.kdnuggets.com/feed"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, "html.parser")
items = soup.find_all('item')
item = items[1]

# html.parser leaves the URL as a bare text node next to the empty
# <link/> tag, so collecting the text of every child picks it up
links = []
for link in item:
    links.append(link.text)
print(item.title.text)
print(links)
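A likely explanation for the original problem (a sketch, not from either answer): `html.parser` treats `<link>` as an HTML void element, so it closes the tag immediately and the URL ends up as a sibling text node outside it — which matches the `<link/>` in the dump above. Parsing the feed as XML instead (BeautifulSoup's `"xml"` parser, which requires the `lxml` package) keeps the URL inside the tag. The example below uses a small inline XML string rather than the live feed to show the difference:

```python
from bs4 import BeautifulSoup

rss = "<item><title>T</title><link>https://example.com/post</link></item>"

# html.parser treats <link> as a void element: the URL is pushed
# outside the tag as a sibling text node, so .text comes back empty
html_soup = BeautifulSoup(rss, "html.parser")
print(repr(html_soup.find("link").text))  # ''

# The XML parser keeps the URL inside <link> (requires lxml)
xml_soup = BeautifulSoup(rss, "xml")
print(xml_soup.find("link").text)  # https://example.com/post
```

With `BeautifulSoup(resp.content, "xml")` in the original question code, `item.link.text` should then return the URL directly.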

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 x0xRumbleLorex0x
Solution 2 Munir