Beautiful Soup and Requests - add missing </td></tr> to one line of HTML code
I am currently writing Python code using Beautiful Soup. The website I am trying to extract data from is http://xml.coverpages.org/country3166.html
On the whole I can get everything working that I want. I am extracting country code and country from the HTML using the <tr> tag. This is for a project I am setting myself.
The problem is that the source HTML is missing some closing tags on one of the countries (Moldova). See below. This means when I loop through my code it stops doing what I need at Moldova.
<tr valign=top><td>MA</td><td>Morocco</td></tr>
<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>
I know I could just create a new text file and amend it manually, but is there anything I can do with Beautiful Soup to fix this? My plan was to iterate through each line until Moldova is found and then append </td></tr> to the end. Is there a more efficient way?
Thanks
Solution 1:[1]
If I inspect the source you've linked, the HTML seems fine; there's probably a mistake in the way you are scraping the data.
A small example where we search for each tr, get its children (2x td), and parse those as code and country to show a list:
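import urllib3
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser.
http = urllib3.PoolManager()
response = http.request('GET', 'http://xml.coverpages.org/country3166.html')
soup = BeautifulSoup(response.data, 'html.parser')

for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    # Skip any row that does not have both a code cell and a country cell.
    if len(cells) < 2:
        continue
    code = cells[0].get_text(strip=True)
    country = cells[1].get_text(strip=True)
    print(code, country)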
This will output:
AD Andorra
AE United Arab Emirates
AF Afghanistan
AG Antigua & Barbuda
AI Anguilla
AL Albania
AM Armenia
AN Netherlands Antilles
AO Angola
AQ Antarctica
AR Argentina
AS American Samoa
... and many more, including Moldova and beyond
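If the copy of the page you download does contain the unclosed Moldova row shown in the question, one option is to repair the text before handing it to Beautiful Soup. The following is only a minimal sketch of the line-by-line plan from the question, using nothing but the standard library. The function name `close_unterminated_rows` is made up for this example, and it assumes that every row sits on a single line, as in the question's snippet; it is not a general HTML repair tool.

```python
import re

# Matches a line that opens a table row, e.g. "<tr valign=top>".
ROW_START = re.compile(r"^\s*<tr\b", re.IGNORECASE)


def close_unterminated_rows(html: str) -> str:
    """Append the missing closing tags to row lines that lack </tr>.

    Hypothetical helper: assumes one table row per line.
    """
    fixed = []
    for line in html.splitlines():
        stripped = line.rstrip()
        lowered = stripped.lower()
        if ROW_START.match(stripped) and not lowered.endswith("</tr>"):
            # Only add </td> when the last cell was left open as well.
            if lowered.endswith("</td>"):
                line = stripped + "</tr>"
            else:
                line = stripped + "</td></tr>"
        fixed.append(line)
    return "\n".join(fixed)


# The rows from the question, including the unclosed Moldova row.
sample = """<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>"""

repaired = close_unterminated_rows(sample)
print(repaired)
```

The repaired string can then be passed to BeautifulSoup in place of response.data. Rows that are already closed are left untouched, so running the function on clean HTML should change nothing.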
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | 0stone0 |
