Get HTML contents of a div with a header/detail mapping using bs4

I have the following div, and for each URL in my list I would like to extract every heading and its corresponding detail to a CSV file. I need to iterate over all the URLs.

I posted the HTML as an image for better representation: [image of the div's HTML]

I tried the code below, but I can't seem to get it working.

urls = ["https://xx.com/xat-exam",
        "https://xx.com/wb-excise-constable",
       ]

all_exam = []
for index,url in enumerate(urls):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    examsinfo = soup.findAll('div', {"class": "banner-left"})
    all_exam.append(examsinfo)
    filename =  "Examdetails.csv"
resultset = []

for examsinfo in all_exam:
    for exams in examsinfo:
        exams_details = dict()
        try:
            exams_details['examinfocontent'] = exams.find('div', {'class': 'highlight__heading'})
            if exams.find('div', {'class': 'highlight__heading'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ') =='Registration Date':
                print("true")
                try:
                   exams_details['regdate'] = exams.find('div', {'class': 'highlight__detail'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
                except Exception as e:
                    exams_details['regdate'] = 'N/A'

            exams_details['examinfocontent'] = exams.find('div', {'class': 'highlight__heading'})
            if exams.find('div', {'class': 'highlight__heading'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ') =='Exam Date':
                try:
                   exams_details['examdate'] = exams.find('div', {'class': 'highlight__detail'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
                except Exception as e:
                    exams_details['examdate'] = 'N/A'

            
        except Exception as e:
            exams_details['examinfocontent'] = 'N/A'
        
        resultset.append(exams_details)
        print(exams_details)
        #print(filename)
        
        with open(filename+".html", 'w+',newline='', encoding='utf-8') as csvFile:
            writer = csv.writer(csvFile)
            #writer.writerow(['examinfocontent'])
            writer.writerow(['RegistrationDate', 'Exam Date', 'Eligibility', 'Salary', 'Application Link'])
            for exams in resultset:
                writer.writerow([exams['regdate'], exams['examdate'],exams['salary']...])
    

I am getting the following error:

writer.writerow([exams['regdate'], exams['examdate'],exams['salary'], exams['eligibility'], exams['applink']])
KeyError: 'examdate'

Expected Outcome

Registration Date                   Exam Date        Eligibility     Salary
10 Aug 2021 - 30 Nov 2021           2 Jan 2022       Graduation 


Solution 1:[1]

Your code has numerous points where it can branch and end up with keys not being set in the dictionary. First, the outer try/except assigns only one key in its except branch. Second, several if blocks have no else branch to assign their keys. Given your current error, the latter is the immediate stumbling block: when the heading is not 'Exam Date', exams_details['examdate'] is never assigned, so the later lookup raises KeyError.

I would revisit the logic throughout the code block shown and decide whether you need all that nested logic, and how you might better ensure that all expected keys are assigned or tested for.

Below I show two points to reconsider, via comments, and give an example of added code to handle one of them.
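
One way to rule out the missing-key case, sketched here under assumptions, is to start every record with a default for each expected key, so that a branch which never runs leaves 'N/A' behind instead of nothing. The key names come from the question's code; FIELDS and new_exam_record are hypothetical names introduced only for this illustration.

```python
# Key names taken from the question's code; the tuple itself is an assumption.
FIELDS = ("regdate", "examdate", "eligibility", "salary", "applink")


def new_exam_record():
    """Return a record in which every expected key already holds 'N/A'."""
    return dict.fromkeys(FIELDS, "N/A")


record = new_exam_record()
# Pretend only the registration date was found on the page.
record["regdate"] = "10 Aug 2021 - 30 Nov 2021"

# No KeyError: keys that were never assigned still hold the default.
row = [record[name] for name in FIELDS]
print(row)
```

With this in place, the if blocks only need to overwrite a value when a heading is actually found.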
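
As a rough illustration of what flatter logic could look like, the sketch below builds a heading-to-detail mapping in one pass and then looks fields up with a default. The markup is an assumption: the real HTML was only posted as an image, so the wrapping "highlight" div and the sibling layout are guesses, and only the class names banner-left, highlight__heading and highlight__detail come from the question. The helper names are hypothetical.

```python
from bs4 import BeautifulSoup

# ASSUMED markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="banner-left">
  <div class="highlight">
    <div class="highlight__heading">Registration Date</div>
    <div class="highlight__detail">10 Aug 2021 - 30 Nov 2021</div>
  </div>
  <div class="highlight">
    <div class="highlight__heading">Exam Date</div>
    <div class="highlight__detail">2 Jan 2022</div>
  </div>
</div>
"""

STRIP_CHARS = '\n\r\t": '  # same characters the question strips


def clean(tag):
    """Return the tag's text with surrounding noise characters removed."""
    return tag.get_text().strip(STRIP_CHARS)


def heading_detail_map(html):
    """Map each heading's text to the text of the detail div that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    mapping = {}
    for heading in soup.find_all("div", class_="highlight__heading"):
        # Assumes the detail div is a later sibling of its heading.
        detail = heading.find_next_sibling("div", class_="highlight__detail")
        mapping[clean(heading)] = clean(detail) if detail is not None else "N/A"
    return mapping


details = heading_detail_map(SAMPLE_HTML)
print(details.get("Exam Date", "N/A"))
print(details.get("Salary", "N/A"))  # heading absent from the sample
```

Because every lookup goes through dict.get with a default, a heading that is absent from a page yields 'N/A' rather than a KeyError.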

for examsinfo in all_exam:

    for exams in examsinfo:

        exams_details = dict()

        try:
            # do stuff
            if exams.find('div', {'class': 'highlight__heading'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ') == 'Exam Date':
                try:
                    exams_details['examdate'] = exams.find('div', {'class': 'highlight__detail'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
                except Exception as e:
                    exams_details['examdate'] = 'N/A'
            # What happens here? The key won't be present when the if block is not entered. You need a key assignment.

        except Exception as e:
            exams_details['examinfocontent'] = 'N/A'
            # What happens here? The other keys won't be present when the except is entered.
        resultset.append(exams_details)

        with open(filename+".html", 'w+', newline='', encoding='utf-8') as csvFile:
            writer = csv.writer(csvFile)

            writer.writerow(['RegistrationDate', 'Exam Date', 'Eligibility', 'Salary', 'Application Link'])

            # New logic for the outer try/except, where only one key (examinfocontent) is in the dict - you will need to decide what to do
            for exams in resultset:
                if exams.get('examinfocontent') == 'N/A':
                    writer.writerow(['N/A', 'N/A', 'N/A'...])
                else:
                    writer.writerow([exams['regdate'], exams['examdate'], exams['salary']...])
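
An alternative for the writing step, offered as a sketch rather than as part of the original answer, is csv.DictWriter, which can fill in missing keys itself through its restval argument. The column list and the sample records are assumptions built from the key names and the expected outcome in the question, and the output goes to an in-memory buffer instead of a file so the example is self-contained.

```python
import csv
import io

# Column keys as used in the question's code (assumed to be the full set).
COLUMNS = ["regdate", "examdate", "eligibility", "salary", "applink"]

# Sample records: one partly filled, one from the failure path.
resultset = [
    {
        "regdate": "10 Aug 2021 - 30 Nov 2021",
        "examdate": "2 Jan 2022",
        "eligibility": "Graduation",
    },
    {"examinfocontent": "N/A"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=COLUMNS,
    restval="N/A",            # value written for any missing key
    extrasaction="ignore",    # skip keys that are not columns
)
writer.writeheader()
writer.writerows(resultset)

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, the same writer can be given a handle opened with newline='' as in the question, ideally once after the loops rather than on every iteration.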

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 QHarr