BeautifulSoup and pd.read_html - how to save the links into a separate column in the final dataframe?
My question is somewhat similar to this one: How to save out in a new column the url which is reading pandas read_html() function?
I have a set of links whose pages each contain four tables, of which I need only the first three. The goal is to store the link each table came from in a separate 'address' column.
links = ['www.link1.com', 'www.link2.com', ... , 'www.linkx.com']
details = []
for link in tqdm(links):
    page = requests.get(link)
    sauce = BeautifulSoup(page.content, 'lxml')
    table = sauce.find_all('table')
    # Only first 3 tables include data
    for i in range(3):
        details.append(pd.read_html(str(table))[i])
    final_df = pd.concat(details, ignore_index=True)
    final_df['address'] = link
    time.sleep(2)
However, when I use this code, only the last link is assigned to every row in the 'address' column.
I'm probably missing a detail, but I've spent the last two hours trying to figure it out and can't make any progress. I would really appreciate some help.
Solution 1:[1]
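You are close to your goal. In your code, final_df['address'] = link overwrites the whole column on every pass, so only the last link survives. Instead, set df['address'] on each DataFrame in every iteration, before appending it to your list: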
for i in table[:3]:
    df = pd.read_html(str(i))[0]
    df['address'] = link
    details.append(df)
Note: You can slice the ResultSet of tables with table[:3], so you do not need range.
Move the concatenation outside of your loop and call it once, after all iterations are finished:
final_df = pd.concat(details, ignore_index=True)
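To see why the original loop ends up with a single address, here is a minimal sketch that compares both patterns side by side. It uses small hand-built frames instead of scraped tables; the `fake_tables` helper and the `value` column are assumptions made for illustration and are not part of the original code.

```python
import pandas as pd

links = ["www.link1.com", "www.link2.com"]


def fake_tables(link):
    """Hypothetical stand-in for the three tables scraped from one page."""
    return [pd.DataFrame({"value": [f"table {n}"]}) for n in (1, 2, 3)]


# Original pattern: the whole column is reassigned on every pass.
details = []
for link in links:
    details.extend(fake_tables(link))
    overwritten = pd.concat(details, ignore_index=True)
    overwritten["address"] = link  # replaces the address of ALL rows so far

# Fixed pattern: tag each frame first, concatenate once at the end.
details = []
for link in links:
    for df in fake_tables(link):
        df["address"] = link
        details.append(df)
tagged = pd.concat(details, ignore_index=True)

print(overwritten["address"].unique())
print(tagged["address"].unique())
```

In the first pattern every row carries the last link, because the assignment applies to the full concatenated frame. In the second, each frame keeps the link it was created with.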
Example
from io import StringIO

import pandas as pd

links = ['www.link1.com', 'www.link2.com','www.linkx.com']
details = []
for link in links:
    # page = requests.get(link)
    # sauce = BeautifulSoup(page.content, 'lxml')
    # table = sauce.find_all('table')
    # Stand-in for the scraped tables so the example runs offline
    table = ['<table><tr><td>table 1</td></tr></table>',
             '<table><tr><td>table 2</td></tr></table>',
             '<table><tr><td>table 3</td></tr></table>']
    # Only first 3 tables include data
    for i in table[:3]:
        # Wrapped in StringIO because newer pandas versions deprecate literal HTML strings
        df = pd.read_html(StringIO(str(i)))[0]
        df['address'] = link
        details.append(df)
# Concatenate once, after the loop has finished
final_df = pd.concat(details, ignore_index=True)
Output
| 0 | address |
|---|---|
| table 1 | www.link1.com |
| table 2 | www.link1.com |
| table 3 | www.link1.com |
| table 1 | www.link2.com |
| table 2 | www.link2.com |
| table 3 | www.link2.com |
| table 1 | www.linkx.com |
| table 2 | www.linkx.com |
| table 3 | www.linkx.com |
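As an alternative to adding the column by hand, pd.concat can record where each frame came from through its keys mechanism. This is a sketch of a different technique from the answer above, not a replacement for it. The `per_link` dictionary, its hand-built frames, and the `value` and `row` names are assumptions standing in for the parsed tables.

```python
import pandas as pd

# Hypothetical stand-in: one already-combined frame per link.
per_link = {
    "www.link1.com": pd.DataFrame({"value": ["table 1", "table 2", "table 3"]}),
    "www.link2.com": pd.DataFrame({"value": ["table 1", "table 2", "table 3"]}),
}

# Passing a dict makes its keys the outer level of a MultiIndex.
stacked = pd.concat(per_link, names=["address", "row"])

# Move the outer level into a regular column and discard the inner row index.
final_df = stacked.reset_index(level="address").reset_index(drop=True)

print(final_df)
```

This keeps the loop free of column assignments, at the cost of first collecting the frames per link. Which variant reads better probably depends on how the rest of the scraping code is organized.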
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
