Pulling company name from webpage within <a> tag

I am trying to streamline my data collection by using Python 3.7 and BeautifulSoup to pull the company name, whether that company is approved or in another status, and whether they market to residential and/or business customers, from this CT license list:

http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView

Any help finishing the code so that it at least builds a list of company names within Python would be a big help.

I am learning as I go, and I am able to connect to the site and pull in the source code.

So far, I know this part of the code works, but I am not quite sure where to go from here:

import requests
from bs4 import BeautifulSoup

# Fetch the license list page and parse the raw HTML
result = requests.get("http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView")
src = result.content
soup = BeautifulSoup(src, 'lxml')

I can see the company name within the source code, but I am not sure of the best way to extract it and the others into a list:

<a href="/electric.nsf/c39dc573ab1299538525743b004d4df6/719f088ca20a6a1f85257dfd00480e13?OpenDocument">3Degrees Group, Inc.</a>

I would love to be able to pull all this into a CSV file at some point that has the companies, the status of the license, and who they market to. But if someone can help me finish the code to put the companies in a list within Python, that would be much appreciated and would let me learn by example.



Solution 1:[1]

Edit (23/04/22): :contains() is being deprecated and replaced with :-soup-contains(). This was introduced with Soup Sieve 2.1.


With bs4 4.7.1+ you can use :contains and :has to filter for only the sections pertaining to Supplier. You can further subset the resulting DataFrame for the columns of interest.


tl;dr:

Make the request and read response into soup object:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView')
soup = bs(r.content, 'lxml')

Let's use some images to explain the next steps.

What we have, and this may be confusing, is a lot of nested tables, but we can start by thinking of the 'top'-level table of interest:

[screenshot: the 'top'-level table of interest in the page source]

We can take from this (to follow along, hover over the different trs in the actual html and observe what is highlighted on the page; you can enter p:nth-child(4) > table in the browser search box to isolate this table):

[screenshot: the sibling trs within that table highlighted in the element inspector]

True, the actual visible content is nested, but we know all content of interest sits within trs that, at a given level, are a series of sibling trs.

This bit

soup.select('tr:has(td:nth-of-type(1) font:contains(Supplier)) ~ tr:not(:has(td:nth-of-type(1) font:contains(Aggregator)), :has(td:nth-of-type(1) font:contains(Aggregator)) ~ tr)')

gathers the sibling trs of the 'top'-level tr that has a child font node containing Supplier, tr:has(td:nth-of-type(1) font:contains(Supplier)), then removes the 'top'-level tr containing Aggregator and its siblings. As what you have in the html is one long list of trs, you just want to filter out those that come after the section of interest.
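
To see that selector's shape in isolation, here is a minimal, standalone sketch on toy markup (not the live page), written with the newer :-soup-contains() spelling from the edit note above:

from bs4 import BeautifulSoup as bs

html = """
<table>
  <tr><td><font>Supplier</font></td></tr>
  <tr><td>Company A</td></tr>
  <tr><td>Company B</td></tr>
  <tr><td><font>Aggregator</font></td></tr>
  <tr><td>Company C</td></tr>
</table>
"""
toy = bs(html, 'lxml')

# Siblings after the Supplier row, minus the Aggregator row and its siblings
selector = ('tr:has(td:nth-of-type(1) font:-soup-contains("Supplier")) ~ '
            'tr:not(:has(td:nth-of-type(1) font:-soup-contains("Aggregator")), '
            ':has(td:nth-of-type(1) font:-soup-contains("Aggregator")) ~ tr)')

print([tr.text.strip() for tr in toy.select(selector)])
# ['Company A', 'Company B']

Only Company A and Company B survive: the Supplier header row is not one of its own following siblings, and the Aggregator row plus everything after it is excluded by the :not.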

Looking at the very first part (I'm using select_one, rather than select, to demonstrate the first node matched, and not the multiple siblings that are matched when we add in the general sibling combinator - more about that later):

soup.select_one('tr:has(td:nth-of-type(1) font:contains(Supplier))')

Running the above gives:

[screenshot: the single tr matched by select_one]

Compare this to the page:

[screenshot: the corresponding Supplier section on the rendered page]

You can see how we have found a way to start selecting trs at the top level, beginning with the one that (though nested) contains the Supplier label. However, as we know that top level is flat, we need to remove everything from (and including) the Aggregator label. So we add the :not pseudo-class to that general sibling combinator. In a nutshell, we say: get all the sibling trs except those matched within the :not.

[screenshot: the sibling trs that remain after excluding the Aggregator section]

Now we have our subset of rows of interest (green rectangle). We loop over these and, each time we find a node matched by 'td:nth-of-type(2) font', we set the status to the value found, e.g. Approved, Pending, and so on.

[screenshot: the rows of interest outlined in a green rectangle, with a status row highlighted]

This line:

if node is not None:

checks whether the current tr contains a 'status' child node, e.g. Approved/Pending (which I name Status in the output), and sets the identifier that the later rows in this category will reuse.

If the tr does not have this child node, then we know it is one of the other trs that house the tds with the additional columns of info for output, e.g.:

[screenshot: a data row containing the detail tds]
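
That if/else logic in isolation looks like this (a minimal sketch over toy (label, tds) pairs standing in for the parsed rows):

# Each tuple is either a status row (label, None) or a data row (None, tds)
rows = [('Approved', None),
        (None, ['', 'Yes', 'Company A']),
        (None, ['', 'No', 'Company B']),
        ('Pending', None),
        (None, ['', 'Yes', 'Company C'])]

final = []
status = None
for label, tds in rows:
    if label is not None:
        status = label      # a status row: remember the label
    else:
        row = [status]      # a data row: prepend the current status
        row.extend(tds)
        final.append(row)

print(final)
# [['Approved', '', 'Yes', 'Company A'],
#  ['Approved', '', 'No', 'Company B'],
#  ['Pending', '', 'Yes', 'Company C']]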

Due to the level of nesting, a few empty tds are included that we don't want. We remove these using pandas later:

df.drop([1,2,10], axis=1, inplace=True)

As we want the status label and all the tds in one list (one row), I use extend to extend the first list with the second. I believe this is quicker than insert, but would appreciate a yea/nay on that; one way to check is sketched after the example below.

So, we would go, for example, from:

['Approved'] 

and

['', '', 'Yes', 'Yes', '3Degrees Group, Inc.', 'http://www.3degreesinc.com', '235 Montgomery Street, Suite 320 San Francisco, CA 94104', '(866) 476-9378', '11-11-07 12/14/2011', ''] 

to

['Approved', '', '', 'Yes', 'Yes', '3Degrees Group, Inc.', 'http://www.3degreesinc.com', '235 Montgomery Street, Suite 320 San Francisco, CA 94104', '(866) 476-9378', '11-11-07 12/14/2011', ''] 
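
On the extend-versus-insert question above, here is an illustrative way to measure it with timeit; run it yourself, as the outcome depends on your interpreter and machine:

import timeit

tds = ['', '', 'Yes', 'Yes', '3Degrees Group, Inc.',
       'http://www.3degreesinc.com',
       '235 Montgomery Street, Suite 320 San Francisco, CA 94104',
       '(866) 476-9378', '11-11-07 12/14/2011', '']

def via_extend():
    row = ['Approved']
    row.extend(tds)              # append the tds after the status label
    return row

def via_insert():
    row = list(tds)
    row.insert(0, 'Approved')    # shift every element right by one
    return row

print('extend:', timeit.timeit(via_extend, number=1_000_000))
print('insert:', timeit.timeit(via_insert, number=1_000_000))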

These rows are added to a list called final. So, you have a list of lists (each row).

You can pass this list to pandas.DataFrame to generate a dataframe ready for CSV export with the to_csv method. The headers are then assigned to df.columns.

When generating the rows:

tds = [td.text for td in tr.select('td')]

We find occasional additional whitespace and \n (newline) characters, e.g.:

['', '', '', '', 'WFM Intermediary New England Energy, LLC', '', '125 Cambridgepark Dr, Cambridge, MA 02140', '', '07-10-08  \n11/28/2007\n9/8/2011', ''] 

I implement a simple regex to remove this (note the raw string; since \s also matches newlines, r'\s+' alone would suffice):

tds = [re.sub(r'\n+|\s+', ' ', td.text) for td in tr.select('td')]

Resulting in (perhaps not the best example, but it is illustrative):

['', '', '', '', 'WFM Intermediary New England Energy, LLC', '', '125 Cambridgepark Dr, Cambridge, MA 02140', '', '07-10-08 11/28/2007 9/8/2011', '']


Finally (did you make it this far?), we want to add some headers for the dataframe. We can use the ones from the page and again, via extend, ensure we include our custom Status header.

headers = ['Status']
headers.extend([th.text for th in soup.select('th[align]')])



Py:

from bs4 import BeautifulSoup as bs
import re
import requests
import pandas as pd

r = requests.get('http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView')
soup = bs(r.content, 'lxml')

final = []

# Page headers plus our custom Status column
headers = ['Status']
headers.extend([th.text for th in soup.select('th[align]')])

status = None  # populated by the first status row the loop meets

# Sibling trs after the Supplier row, minus the Aggregator row and everything after it
for tr in soup.select('tr:has(td:nth-of-type(1) font:contains(Supplier)) ~ tr:not(:has(td:nth-of-type(1) font:contains(Aggregator)), :has(td:nth-of-type(1) font:contains(Aggregator)) ~ tr)'):
    node = tr.select_one('td:nth-of-type(2) font')
    if node is not None:
        # A status row (Approved, Pending, ...): remember it for the rows that follow
        status = node.text
    else:
        # A data row: prepend the current status, then append the whitespace-cleaned tds
        row = [status]
        tds = [re.sub(r'\n+|\s+', ' ', td.text) for td in tr.select('td')]
        row.extend(tds)
        final.append(row)

df = pd.DataFrame(final)
df.drop([1, 2, 10], axis=1, inplace=True)  # drop the empty tds introduced by the nesting
df.columns = headers
df.to_csv(r'C:\Users\User\Desktop\Public Utilities.csv', sep=',', encoding='utf-8-sig', index=False)

Sample of output:

[screenshot: sample rows of the exported CSV]


Additional reading:

  1. CSS selectors

Solution 2:[2]

company_list = []
for company in soup.find_all('a'):
    company_list.append(company.string)

You can find more details in the BeautifulSoup documentation. You will likely have to modify this code to filter out specific tags via conditions inside the loop, but I believe that should be fairly simple.

More specifically, for your particular example, you can check whether company.get('href') matches ".nsf" via a regex or whatever you prefer.
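
A sketch of that idea; note the href pattern below is inferred from the sample <a> tag in the question, not from any documented URL scheme:

import re
import requests
from bs4 import BeautifulSoup

result = requests.get("http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView")
soup = BeautifulSoup(result.content, 'lxml')

# Keep only anchors whose href looks like a company detail link
detail_href = re.compile(r'/electric\.nsf/.+\?OpenDocument$')
company_list = [a.get_text(strip=True) for a in soup.find_all('a', href=detail_href)]
print(company_list[:5])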

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Stack Overflow
[2] Solution 2: Stack Overflow