pandas scraping html tables

I have an HTML file containing about 100 tables, and many of them hold the same values. Only the values in the first and second columns of each table vary (these are the columns I will work with later):

HTML tables

Here is the HTML for one row of a table. All rows have the same structure; this one is the first row of the first table in the screenshot above:

<tr>
<td class="randomclass">1project</td>
<td class="randomclass">&nbsp;<font color="#00875a"><b>ID1</b></font></td>
<td class="randomclass"><font color="#00875a">&nbsp;<b>EMPTY_ROW</b></font></td>
<td class="randomclass">EMPTY_ROW</td>
</tr>

I'm looking for a method that:

  1. pulls only the green values from all the tables (ignoring ID_TEMP1, ID_TEMP2, ID_TEMP3)
  2. ignores the small tables that contain an "ID_Review" value.
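For the first requirement, one possible direction is to read the colour from the markup itself rather than from the text that pd.read_html returns, since read_html drops the <font> tags. The following is only a minimal sketch using the standard library's html.parser. It assumes that the unwanted ID_TEMP values sit in the same cell as the green ID but outside the green <font> tag, which the question does not show. GreenCellParser and project_and_green_id are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser

GREEN = "#00875a"  # colour value taken from the sample row in the question


class GreenCellParser(HTMLParser):
    """For every <tr>, record each <td> as a pair (all_text, green_text)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &nbsp; into \xa0
        self.rows = []         # one list of (all_text, green_text) per row
        self._row = None       # cells of the row currently being parsed
        self._cell = None      # [all fragments, green fragments] of current cell
        self._font_stack = []  # one bool per open <font>: is it green?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = [[], []]
        elif tag == "font":
            color = (dict(attrs).get("color") or "").lower()
            self._font_stack.append(color == GREEN)

    def handle_endtag(self, tag):
        if tag == "font" and self._font_stack:
            self._font_stack.pop()
        elif tag == "td" and self._cell is not None:
            all_text = "".join(self._cell[0]).strip()    # strip() also removes \xa0
            green_text = "".join(self._cell[1]).strip()
            self._row.append((all_text, green_text))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is None:
            return  # text outside of any <td>
        self._cell[0].append(data)
        if any(self._font_stack):
            self._cell[1].append(data)


def project_and_green_id(html):
    """Return (first-column text, green text of second column) for each row."""
    parser = GreenCellParser()
    parser.feed(html)
    parser.close()
    return [(row[0][0], row[1][1]) for row in parser.rows if len(row) >= 2]


sample = """<tr>
<td class="randomclass">1project</td>
<td class="randomclass">&nbsp;<font color="#00875a"><b>ID1</b></font></td>
<td class="randomclass"><font color="#00875a">&nbsp;<b>EMPTY_ROW</b></font></td>
<td class="randomclass">EMPTY_ROW</td>
</tr>"""
print(project_and_green_id(sample))  # [('1project', 'ID1')]
```

The resulting list of tuples could then be passed to pd.DataFrame with two column names. Header rows built from <th> cells are skipped here because only <td> cells are collected.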

Here is my code:

import pandas as pd  

cons = pd.DataFrame() 
all_HOs = pd.DataFrame() 
ids_cons = pd.DataFrame() 
df_list = pd.read_html('Production Issue Tracking.html', match='ID') 
df_list2 = pd.read_html('Production Issue Tracking.html', match='Attachments') 
df_list3 = pd.read_html('Production Issue Tracking.html', match='number') 
df3 = pd.concat(df_list3, axis=1) 
df3 = df3.iloc[:, ::-1] 
df = pd.concat(df_list, axis=1) 
df2 = pd.concat(df_list2, axis=1) 
 
df_rev = df.iloc[:, ::-1] 
df2 = df2.iloc[:, ::-1] 
df_rev.columns = df_rev.iloc[0] 
 
lc = df_rev[["STATIC"]] 
lc = pd.DataFrame({"STATIC": df_rev["Review"].values.T.ravel(),})
lc = lc[lc['STATIC'] != 'STATIC'] 
lc = lc[lc['STATIC'] != 'ID_Review']  
 
 
sup = df_rev[["IDs"]] 
sup = pd.DataFrame({"IDs": df_rev["IDs"].values.T.ravel(),}) 
sup = sup[sup['IDs'] != 'IDs'] 

lc_sup = pd.concat([lc, sup], axis=1) 

The lc_sup DataFrame, exported to Excel:

EMPTY_ROW   EMPTY_ROW
1project    ID1
2project    ID2
3project    ID3 ID_TEMP1
4project    ID4
5project    ID5
6project    ID6
7project    ID7 ID_TEMP2
8project    ID8
9project    ID9
10project   ID10 ID_TEMP3
Project1    
Project2    
Project3    
Project4    
Project5    
Project6    
Project7    
Project8    
Project9    
Project10  

In theory, this line:

lc = lc[lc['STATIC'] != 'ID_Review']  

should remove the unwanted rows from the lc_sup DataFrame. The problem is that the values in these rows are almost always different, so I can't filter them all out by value. In the example above I managed to ignore "ID_Review", but any other value will show up as a new row in the DataFrame.

I thought about collecting all the values from these small tables into a list and ignoring everything in that list, but the tables usually contain new values, so I would have to keep adding them to the code in order to ignore them in the future.
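Since the values in the small tables keep changing, an alternative to a growing ignore list might be to filter on the shape of each table instead of its contents. The sketch below is only an illustration on plain lists of rows. The thresholds min_rows and min_cols are assumptions (the question does not say how small the small tables are), and keep_main_tables is a hypothetical helper name.

```python
def keep_main_tables(tables, min_rows=5, min_cols=4):
    """Keep tables with at least min_rows rows and min_cols columns.

    `tables` is a list of tables, each table a list of rows, each row a
    list of cell values. Both thresholds are guesses and must be tuned.
    """
    kept = []
    for rows in tables:
        widest = max((len(row) for row in rows), default=0)
        if len(rows) >= min_rows and widest >= min_cols:
            kept.append(rows)
    return kept


# Hypothetical data: one full-size table and one small review table.
main_table = [[f"{i}project", f"ID{i}", "EMPTY_ROW", "EMPTY_ROW"] for i in range(1, 6)]
small_table = [["ID_Review"], ["some value that changes every time"]]

filtered = keep_main_tables([main_table, small_table])
print(len(filtered))  # 1
```

With pandas, the same idea would be to test df.shape for each DataFrame in the list returned by pd.read_html before concatenating, so that no table is ever matched by its text.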



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
