pandas scraping html tables
I have an HTML file containing about 100 tables, and most of them hold the same values. Only the values in the first and second columns vary from table to table (these are the columns I will work with later).
Every row in every table has the same HTML structure. For example, here is one row taken from the first table:
<tr>
<td class="randomclass">1project</td>
<td class="randomclass"> <font color="#00875a"><b>ID1</b></font></td>
<td class="randomclass"><font color="#00875a"> <b>EMPTY_ROW</b></font></td>
<td class="randomclass">EMPTY_ROW</td>
</tr>
I'm looking for a way to:
- pull only the green values from all the tables (ignoring ID_TEMP1, ID_TEMP2, ID_TEMP3)
- ignore the small tables that contain an "ID_Review" value.
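One way the first requirement could be approached is to read the colour directly from the markup, since `pd.read_html` keeps only the cell text and drops the `<font>` tags. The sketch below uses only the standard library's `html.parser`. The names `GreenCellParser` and `green_cells` are hypothetical, and the second sample row is an assumption: it supposes that values such as ID_TEMP1 sit outside the green `<font>` tag, which the question does not show.

```python
from html.parser import HTMLParser

GREEN = "#00875a"  # colour taken from the sample row in the question


class GreenCellParser(HTMLParser):
    """Collect, per <table>, the text of cells wrapped in a green <font>."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one list of rows per <table>
        self._row = None       # green texts found in the current <tr>
        self._cell = None      # green text fragments of the current <td>
        self._font_stack = []  # True for each open <font> that is green

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell = []
        elif tag == "font":
            colour = (dict(attrs).get("color") or "").lower()
            self._font_stack.append(colour == GREEN)

    def handle_endtag(self, tag):
        if tag == "font" and self._font_stack:
            self._font_stack.pop()
        elif tag == "td" and self._cell is not None:
            text = "".join(self._cell).strip()
            if text and self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self.tables and self._row:
                self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Keep text only while inside a <td> and inside a green <font>.
        if self._cell is not None and any(self._font_stack):
            self._cell.append(data)


def green_cells(html):
    """Return a list of tables; each table is a list of rows of green texts."""
    parser = GreenCellParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# First row copied from the question; second row is a hypothetical example.
SAMPLE_HTML = """
<table>
<tr>
<td class="randomclass">1project</td>
<td class="randomclass"> <font color="#00875a"><b>ID1</b></font></td>
<td class="randomclass"><font color="#00875a"> <b>EMPTY_ROW</b></font></td>
<td class="randomclass">EMPTY_ROW</td>
</tr>
<tr>
<td class="randomclass">3project</td>
<td class="randomclass"><font color="#00875a"><b>ID3</b></font> ID_TEMP1</td>
</tr>
</table>
"""

green_tables = green_cells(SAMPLE_HTML)
print(green_tables)
```

Each inner list can then be passed to `pd.DataFrame` if a DataFrame is needed. Nested tables are not handled by this sketch.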
Here is my code:
import pandas as pd
cons = pd.DataFrame()
all_HOs = pd.DataFrame()
ids_cons = pd.DataFrame()
df_list = pd.read_html('Production Issue Tracking.html', match='ID')
df_list2 = pd.read_html('Production Issue Tracking.html', match='Attachments')
df_list3 = pd.read_html('Production Issue Tracking.html', match='number')
df3 = pd.concat(df_list3, axis=1)
df3 = df3.iloc[:, ::-1]
df = pd.concat(df_list, axis=1)
df2 = pd.concat(df_list2, axis=1)
df_rev = df.iloc[:, ::-1]
df2 = df2.iloc[:, ::-1]
df_rev.columns = df_rev.iloc[0]
lc = df_rev[["STATIC"]]
lc = pd.DataFrame({"STATIC": df_rev["Review"].values.T.ravel(),})
lc = lc[lc['STATIC'] != 'STATIC']
lc = lc[lc['STATIC'] != 'ID_Review']
sup = df_rev[["IDs"]]
sup = pd.DataFrame({"IDs": df_rev["IDs"].values.T.ravel(),})
sup = sup[sup['IDs'] != 'IDs']
lc_sup = pd.concat([lc, sup], axis=1)
The lc_sup DataFrame exported to Excel:
EMPTY_ROW EMPTY_ROW
1project ID1
2project ID2
3project ID3 ID_TEMP1
4project ID4
5project ID5
6project ID6
7project ID7 ID_TEMP2
8project ID8
9project ID9
10project ID10 ID_TEMP3
Project1
Project2
Project3
Project4
Project5
Project6
Project7
Project8
Project9
Project10
In theory, this line:
lc = lc[lc['STATIC'] != 'ID_Review']
should remove the unwanted rows from the lc_sup DataFrame, but the values in those rows are almost always different, so I can't filter them all out by name. In the example above I managed to drop "ID_Review", but any other value will show up as a new row in the DataFrame.
I thought about collecting all the values from these small tables into a list and filtering on that list, but these tables usually contain new values, so I would have to keep adding them to the code in order to ignore them.
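Since the values in the small tables keep changing, one alternative worth considering is to filter on the shape of each table rather than on its contents. The sketch below assumes that the unwanted tables really are shorter than the main ones, which the question implies but does not state. The function name `drop_small_tables`, the `min_rows` threshold, and the sample data are all hypothetical.

```python
def drop_small_tables(tables, min_rows=5):
    """Keep only the tables that have at least `min_rows` rows.

    `tables` is a list of tables, each table being a list of rows.
    The threshold is an assumption and must be tuned to the real file.
    """
    return [table for table in tables if len(table) >= min_rows]


# Hypothetical parsed data: one main table and one small review table.
main_table = [[f"{i}project", f"ID{i}"] for i in range(1, 11)]
review_table = [["ID_Review"], ["some_new_value"]]

kept = drop_small_tables([main_table, review_table], min_rows=5)
print(len(kept))     # number of tables that survived the filter
print(kept[0][0])    # first row of the first surviving table
```

The same test works on the list returned by `pd.read_html`, because `len(df)` gives the row count of a DataFrame: filter the list with `len(df) >= min_rows` before calling `pd.concat`. This way no value from the small tables needs to be hard-coded.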
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow