How to iterate through all tags of a website in Python with BeautifulSoup?
I'm new to web scraping. The page I need to crawl is http://py4e-data.dr-chuck.net/comments_1430669.html and its source is at view-source:http://py4e-data.dr-chuck.net/comments_1430669.html. It's a simple page for practice. The HTML looks something like this:
<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
I need to get the number inside each <span class="comments"> tag (100, 100, 99). Below is my code:
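import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
tag = soup.span  # soup.span returns only the first <span> in the document
print(tag)  # <span class="comments">100</span>
print(tag.string)  # 100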
This gives me the number 100, but only the first one. I want to get all of them by iterating over a list or something similar. What is the BeautifulSoup method for this?
Solution 1:[1]
Try the following approach:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
data = []
for tr in soup.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    data.append(row[1])  # or data.append(row) for both
print(data)
This gives you data, a list holding just the second column:
['Comments', '100', '100', '99', '96', '93', '93', '89', '88', '85', '84', '84', '81', '79', '76', '74', '73', '71', '70', '67', '61', '60', '60', '59', '54', '53', '53', '52', '50', '46', '46', '45', '41', '38', '37', '37', '36', '34', '26', '24', '24', '23', '23', '21', '17', '17', '16', '14', '12', '11', '7']
First locate all of the table <tr> rows. Then extract all of the <td> values for each row. As you only want the second one, append row[1] to a data list holding your values.
You can skip the first entry (the 'Comments' header) if needed with data[1:].
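As a minimal sketch of the row-based steps and the data[1:] slice, the snippet below runs the same loop on a small inline sample instead of the live page. The sample_html string and the counts name are assumptions made for illustration; only the first three rows shown in the question are included.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample mirroring the page structure (no network needed).
sample_html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

data = []
for tr in soup.find_all('tr'):
    # Collect the text of every cell in this row.
    row = [td.text for td in tr.find_all('td')]
    data.append(row[1])  # keep only the second column

# data[0] is the header text 'Comments', so slice it off before converting.
counts = [int(value) for value in data[1:]]
print(counts)  # [100, 100, 99]
```

Converting to int after slicing avoids a ValueError on the header text, and the resulting list can be passed straight to sum() if a total is needed.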
Iterating over rows also lets you save the name at the same time: append the whole row with data.append(row) instead of data.append(row[1]).
You could then display the entries using:
for name, comment in data[1:]:
    print(name, comment)
The output starts with:
Melodie 100
Machaela 100
Rhoan 99
Murrough 96
Lilygrace 93
Ellenor 93
Verity 89
Karlie 88
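If only the numbers are needed, a more direct option may be to ask BeautifulSoup for every span tag rather than walking the table rows. The sketch below is an alternative to the row-based code, not part of the original answer; it assumes every count sits in a <span class="comments"> tag as in the question's HTML, and it uses a hypothetical inline sample_html string in place of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample mirroring the page structure (no network needed).
sample_html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# soup.span returns only the first match; find_all returns every match.
spans = soup.find_all('span', class_='comments')

# Each span holds one number as text, so convert it with int().
counts = [int(span.get_text()) for span in spans]

print(counts)       # [100, 100, 99]
print(sum(counts))  # 299
```

Because the header row contains no span, this version has no header entry to skip.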
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
