BeautifulSoup: fill missing information with "NA" in CSV

I am working on a web scraper that creates a .csv file of all chemicals on the Sigma-Aldrich website. The .csv file would have the chemical name followed by fields such as product number, CAS number, molecular weight and chemical formula, with one chemical and its info per row.

The issue I'm having is that not all chemicals have all their fields; many only have product and CAS numbers. This leaves my .csv file offset, so rows end up with info that belongs to another chemical.

To right this wrong, I want to add 'N/A' if the field is empty.

Here is my scraping method:

def scraap(urlLi):
    for url in urlLi:
        content = requests.get(url).content
        soup = BeautifulSoup(content, 'lxml')
        containers = soup.find_all('div', {'class': 'productContainer-inner'})


        for c in containers:
            sub = c.find_all('div', {'class': 'productContainer-inner-content'})
            names = c.find_all('div', {'class': 'searchResultSubstanceBlock clearfix'})

            for n in names:
                hope = n.find("h2").text
                print(hope)
                nombres.append(hope.encode('utf-8'))

            for s in sub:
                info = s.find_all('ul', {'class': 'nonSynonymProperties'})
                proNum = s.find_all('div', {'class': 'product-listing-outer'})

                for p in proNum:
                    ping = p.find_all('div', {'class': 'row clearfix'})

                    for po in ping:
                        pro = p.find_all('li', {'class': 'productNumberValue'})
                        pnPp = []
                        for pri in pro:
                            potus = pri.get_text()
                            pnPp.append(potus.encode('utf-8'))

                    ProductNumber.append(pnPp)
                    print(pnPp)

                for i in info:
                    c = 1
                    for gling in i:
                        print(gling.get_text())
                        if c == 1:
                            formu.append(gling.get_text().encode('utf-8'))
                        elif c == 2:
                            molWei.append(gling.get_text().encode('utf-8'))
                        else:
                            casNum.append(gling.get_text().encode('utf-8'))

                        c += 1
                    c == 1
                    print("---")

Here is my writing method:

def pipeUp():

    with open('sigma_pipe_out.csv', mode='wb') as csv_file:
        fieldnames = ['chem_name', 'productNum', 'formula', 'molWei', 'casNum']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        # writer.writeheader()
        # csv_file.write(' '.join(fieldnames))
        for n, p, f, w, c in zip(nombres, ProductNumber, formu, molWei, casNum):
            # writer.writerow([n, p, f, w, c])
            writer.writerow({'chem_name': n, 'productNum': p, 'formula': f, 'molWei': w, 'casNum': c})

The issue arises in the for i in info: loop. The formu, molWei and casNum lists are off.

How can I add "N/A" if formu and molWei are missing information?



Solution 1:[1]

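I'm assuming get_text() returns an empty string when there is no information on the formula, molecular weight, etc. In that case you can check the value just before appending it to its list:

if not molWei: molWei = "N/A"

This sets molWei to "N/A" if the string is empty. Note that molWei here stands for the scraped string, not the molWei list from the question.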
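As a rough sketch of the same idea, the check can be wrapped in a small helper that is applied to every scraped value before it is appended. The name or_na and the sample values are hypothetical and not part of the original code; the helper also treats None and whitespace-only strings as missing, which is an assumption about what the page might return.

```python
def or_na(text, placeholder="N/A"):
    """Return the stripped text, or the placeholder if it is missing or blank."""
    if text is None:
        return placeholder
    cleaned = text.strip()
    return cleaned if cleaned else placeholder


# Hypothetical values standing in for what get_text() might return.
scraped = ["C2H6O", "", "   ", None]
print([or_na(value) for value in scraped])
# -> ['C2H6O', 'N/A', 'N/A', 'N/A']
```

In the scraper this would be used as, for example, molWei.append(or_na(gling.get_text())). It only helps when the element exists but is empty; if the element is absent from the page altogether, nothing is appended and the lists still drift apart.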

Solution 2:[2]

You cannot use the position of a value (if c == 1:) to decide which field it belongs to, because the positions shift when a field is missing. Check the text itself before adding it to a list.

replace:

for i in info:
    ....
    ....
print("---")

with:

rowNames = ['formu', 'molWei', 'casNum']

for li in info[0].find_all('li'):
    textVal = li.text.encode('utf-8')
    #print(textVal)
    if b'Formula' in textVal:
        formu.append(textVal)
        rowNames.remove('formu')
    elif b'Molecular' in textVal:
        molWei.append(textVal)
        rowNames.remove('molWei')
    else:
        casNum.append(textVal)
        rowNames.remove('casNum')

# add 'NA' for every field that was not found
# (the original test, len(rowNames) > 1, skipped the case of exactly one missing field)
for item in rowNames:
    globals()[item].append('NA')
print("---")
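An alternative that avoids parallel lists and globals() is to build one dict per chemical and let csv.DictWriter fill the gaps through its restval parameter. The following is only a sketch under assumptions: it targets Python 3, the helper names build_row and write_rows are hypothetical, the sample rows are made up, and the labels "Formula", "Molecular" and "CAS" are guesses about the text each li contains on the page.

```python
import csv
import io

FIELDNAMES = ["chem_name", "productNum", "formula", "molWei", "casNum"]


def build_row(chem_name, product_num, prop_texts):
    """Build one CSV row; properties that are absent are simply left out."""
    row = {"chem_name": chem_name, "productNum": product_num}
    for text in prop_texts:
        # Label matching is an assumption about the page's wording.
        if "Formula" in text:
            row["formula"] = text
        elif "Molecular" in text:
            row["molWei"] = text
        elif "CAS" in text:
            row["casNum"] = text
    return row


def write_rows(rows, csv_file):
    # restval is written for every fieldname missing from a row dict.
    writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES, restval="N/A")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data; the product numbers are placeholders.
rows = [
    build_row("Ethanol", "P-0001",
              ["Formula: C2H6O", "Molecular Weight: 46.07", "CAS Number: 64-17-5"]),
    build_row("Example B", "P-0002", ["CAS Number: 00-00-0"]),
]
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Because every row is assembled in one place, a missing field can no longer shift the values of later chemicals. To write a real file on Python 3, pass open('sigma_pipe_out.csv', 'w', newline='') instead of the StringIO buffer.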

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Mike Chen
Solution 2   ewwink