Get html contents of div with a header/detail mapping using bs4
I have the following div and would like to extract each heading and its corresponding detail to a CSV file, for every URL in the URL list. I need to iterate over all the URLs.
The HTML was posted as an image for better representation.

I tried the code below, but I can't get it to work.
    urls = ["https://xx.com/xat-exam",
            "https://xx.com/wb-excise-constable",
            ]
    all_exam = []
    for index,url in enumerate(urls):
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        examsinfo = soup.findAll('div', {"class": "banner-left"})
        all_exam.append(examsinfo)
    filename = "Examdetails.csv"
    resultset = []
    for examsinfo in all_exam:
        for exams in examsinfo:
            exams_details = dict()
            try:
                exams_details['examinfocontent'] = exams.find('div', {'class': 'highlight__heading'})
                if exams.find('div', {'class': 'highlight__heading'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ') =='Registration Date':
                    print("true")
                    try:
                        exams_details['regdate'] = exams.find('div', {'class': 'highlight__detail'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
                    except Exception as e:
                        exams_details['regdate'] = 'N/A'
                exams_details['examinfocontent'] = exams.find('div', {'class': 'highlight__heading'})
                if exams.find('div', {'class': 'highlight__heading'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ') =='Exam Date':
                    try:
                        exams_details['examdate'] = exams.find('div', {'class': 'highlight__detail'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
                    except Exception as e:
                        exams_details['examdate'] = 'N/A'
            except Exception as e:
                exams_details['examinfocontent'] = 'N/A'
            resultset.append(exams_details)
            print(exams_details)
    #print(filename)
    with open(filename+".html", 'w+',newline='', encoding='utf-8') as csvFile:
        writer = csv.writer(csvFile)
        #writer.writerow(['examinfocontent'])
        writer.writerow(['RegistrationDate', 'Exam Date', 'Eligibility', 'Salary', 'Application Link'])
        for exams in resultset:
            writer.writerow([exams['regdate'], exams['examdate'],exams['salary']...])
I am getting the following error:
    writer.writerow([exams['regdate'], exams['examdate'],exams['salary'], exams['eligibility'], exams['applink']])
    KeyError: 'examdate'
Expected Outcome
| Registration Date | Exam Date | Eligibility | Salary |
|---|---|---|---|
| 10 Aug 2021 - 30 Nov 2021 | 2 Jan 2022 | Graduation | |
Solution 1:[1]
There are numerous points where your code can branch and leave keys unset in the dictionary. Firstly, the outer try/except assigns only one key in its except clause. Secondly, several if blocks have no else branch to assign their keys when the condition is false. Given your current error, it is the latter that is the stumbling block: when the heading is not 'Exam Date', the 'examdate' key is never created, so reading it later raises KeyError.
I would revisit the logic throughout the code block shown and decide whether you need all that nesting, and how you might better ensure that all expected keys are either assigned or tested for.
Below I mark 2 points to reconsider, via comments, and give an example of added code to handle one of them.
    for examsinfo in all_exam:
        for exams in examsinfo:
            exams_details = dict()
            try:
                # do stuff
                if exams.find('div', {'class': 'highlight__heading'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ') =='Exam Date':
                    try:
                        exams_details['examdate'] = exams.find('div', {'class': 'highlight__detail'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
                    except Exception as e:
                        exams_details['examdate'] = 'N/A'
                # What happens here? The key won't be present when the if block is not entered. You need a key assignment
            except Exception as e:
                exams_details['examinfocontent'] = 'N/A'
                # What happens here? The other keys won't be present when the except is entered.
            resultset.append(exams_details)
    with open(filename+".html", 'w+',newline='', encoding='utf-8') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow(['RegistrationDate', 'Exam Date', 'Eligibility', 'Salary', 'Application Link'])
        # New logic for outer try except where only one key in dict (examinfocontent) - you will need to decide what to do
        for exams in resultset:
            # Test the row being written (exams), not the last exams_details from the loop above
            if exams.get('examinfocontent') == 'N/A':
                writer.writerow(['N/A', 'N/A', 'N/A'...])
            else:
                writer.writerow([exams['regdate'], exams['examdate'],exams['salary']...])
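One way to guarantee that every expected key is assigned is to fill the dictionary with 'N/A' defaults first and then overwrite whichever headings are actually found. The following is only a minimal sketch of that idea, not a drop-in fix: the HEADING_TO_KEY mapping, the clean and build_row helpers, and the sample pairs are all hypothetical, and the 'Eligibility' and 'Salary' heading texts are assumed because the real HTML was only posted as an image. It writes to an in-memory buffer rather than to a file.

```python
import csv
import io

# Hypothetical mapping from page headings to the dict keys used in the question.
# 'Registration Date' and 'Exam Date' come from the question; the rest are assumed.
HEADING_TO_KEY = {
    "Registration Date": "regdate",
    "Exam Date": "examdate",
    "Eligibility": "eligibility",
    "Salary": "salary",
}


def clean(text):
    """Strip the same whitespace, quote and colon characters as the question's code."""
    return text.strip('\n\r\t": ')


def build_row(pairs):
    """Turn (heading, detail) text pairs into a dict that always has every key."""
    # Start with every expected key set to 'N/A' so none can be missing later.
    row = dict.fromkeys(HEADING_TO_KEY.values(), "N/A")
    for heading, detail in pairs:
        key = HEADING_TO_KEY.get(clean(heading))
        if key is not None:  # ignore headings we do not export
            row[key] = clean(detail) or "N/A"
    return row


# Stand-in for the text of each highlight__heading / highlight__detail pair.
pairs = [
    ("Registration Date", " 10 Aug 2021 - 30 Nov 2021\n"),
    ("Exam Date", "2 Jan 2022"),
    ("Eligibility", "Graduation"),
]
row = build_row(pairs)

buffer = io.StringIO()  # in-memory file, so the sketch has no side effects
writer = csv.DictWriter(buffer, fieldnames=list(HEADING_TO_KEY.values()))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

In the scraping loop, the pairs could be built by zipping the text of the highlight__heading divs with the text of the highlight__detail divs inside each banner-left div, assuming the two lists appear in matching order on the page. Because build_row never leaves a key out, the nested try/except and if blocks are no longer needed to avoid the KeyError.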
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | QHarr |
