Scraping <dt> and <dd> tags with bs4 and Python
How should I extract only the info I need from <dt> and <dd> tags? P.S. There are a lot of pages like this, hundreds of them.
Here is the link to the main page:
https://www.aruodas.lt/butai/vilniuje/
and a link to one of its child pages:
https://www.aruodas.lt/butai-vilniuje-santariskese-dangerucio-g-parduodamas-7385-kv-m-triju-kambariu-butas-1-3172400/
My desired output should look like this:
Plotas: 22 m2
Kambariu_skaicius: 4
Metai: 2022
The code I am using is:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Raw string so the backslashes in the Windows path are not read as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)

for puslapis in range(2, 3):
    driver.get(f'https://www.aruodas.lt/butai/vilniuje/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_='list-row')
    stored_urls = []
    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except (TypeError, KeyError):
            # Row has no <a> tag, or the <a> tag has no href
            pass
    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')
        try:
            # The address needs to be cleaned up with RegEx
            adress = soup.find('h1', 'obj-header-text').text.strip()
            # print(adress)
        except AttributeError:
            adress = 'n/a'

        def get_dl(soup):
            keys, values = [], []
            for dl in soup.findAll("dl", {"class": "obj-details"}):
                for dt in dl.findAll("dt"):
                    keys.append(dt.text.strip())
                for dd in dl.findAll("dd"):
                    values.append(dd.text.strip())
            return dict(zip(keys, values))

        dl_dict = get_dl(soup)
        print(dl_dict)
So in this case I can get all the info that is in the dd and dt tags, but I only need the information shown in the picture below.
This is the HTML source:
Solution 1:[1]
Pull the <dt> texts and the <dd> texts into two lists, then pair them up with zip.
from bs4 import BeautifulSoup

html = '''<dl class="obj-details ">
<dt> Namo numeris: </dt>
<dd> 27 </dd>
<hr class="clear">
<dt> Buto numeris: </dt>
<dd> 6 </dd>
<hr class="clear">
<dt> Other: </dt>
<dd> 42 </dd>
<hr class="clear">
</dl>'''

soup = BeautifulSoup(html, 'html.parser')

# Labels and values, in document order
dt = [x.text.strip() for x in soup.find_all('dt')]
dd = [x.text.strip() for x in soup.find_all('dd')]

# Pair each label with the value at the same position
myList = list(zip(dt, dd))
for each in myList:
    print(each[0], each[-1])
Output:
Namo numeris: 27
Buto numeris: 6
Other: 42
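To keep only the fields the question asks for, one option is to filter by label while pairing. The following is a minimal sketch, not a definitive implementation: the sample markup, the label names in WANTED, and the helper name get_details are all assumptions made for illustration, and the real labels on the page may be spelled differently. It pairs each <dt> with the element that directly follows it instead of zipping two whole lists, so a <dt> without a <dd> does not shift every later value by one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real labels on the page may differ.
html = '''<dl class="obj-details ">
<dt> Plotas: </dt>
<dd> 22 m2 </dd>
<hr class="clear">
<dt> Kambariu skaicius: </dt>
<dd> 4 </dd>
<hr class="clear">
<dt> Metai: </dt>
<dd> 2022 </dd>
<hr class="clear">
<dt> Namo numeris: </dt>
<dd> 27 </dd>
<hr class="clear">
</dl>'''

# Labels to keep (assumed names, written without the trailing colon).
WANTED = {'Plotas', 'Kambariu skaicius', 'Metai'}


def get_details(soup, wanted=WANTED):
    """Return {label: value} for the wanted <dt>/<dd> pairs only."""
    details = {}
    for dt in soup.select('dl.obj-details dt'):
        key = dt.get_text(strip=True).rstrip(':')
        # Next sibling tag; only accept it if it really is a <dd>.
        nxt = dt.find_next_sibling()
        if key in wanted and nxt is not None and nxt.name == 'dd':
            details[key] = nxt.get_text(strip=True)
    return details


soup = BeautifulSoup(html, 'html.parser')
details = get_details(soup)
for key, value in details.items():
    print(f'{key}: {value}')
```

In the scraping loop from the question, the same helper could be called on each child page's soup in place of get_dl, and the resulting dictionaries collected in a list.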
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | chitown88 |

