'HTML table to pandas table: Info inside html tags

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:

<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>

When I convert this to pandas using pd.read_html(tbl) the output is like this:

    0    1          2
 0  265  JonesBlue  29
 1  266  Smith      34

I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:

    0    1        2
 0  265  jones03  29
 1  266  smith01  34

I'm fine with various other outputs (for example, jones03 Jones would be even more helpful) but the unique ID is critical.

Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.

Is there a simple way of accessing this information?



Solution 1:[1]

You could simply parse the table manually like this:

import BeautifulSoup
import pandas as pd

TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

table = BeautifulSoup.BeautifulSoup(TABLE)
records = []
for tr in table.findAll("tr"):
    trs = tr.findAll("td")
    record = []
    record.append(trs[0].text)
    record.append(trs[1].a["href"])
    record.append(trs[2].text)
    records.append(record)

df = pd.DataFrame(data=records)
df

which gives you

     0                 1   2
0  265  /j/jones03.shtml  29
1  266  /s/smith01.shtml  34

Solution 2:[2]

You could use regular expressions to modify the text first and remove the html tags:

import re, pandas as pd
tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
tbl = re.sub('<a.*?href="(.*?)">(.*?)</a>', '\\1 \\2', tbl)
pd.read_html(tbl)

which gives you

[     0                           1   2
 0  265  /j/jones03.shtml JonesBlue  29
 1  266      /s/smith01.shtml Smith  34]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 k-nut
Solution 2 freeseek