Creating a dictionary from two lists of web-scraped data
I have data that I scraped from a product website using BeautifulSoup, covering multiple product pages. The scraper gives me two lists: one holds the specification names and the other holds the corresponding specification values. Here is an example:
Blade length : 2.97/1.97"
Blade Thickness : 0.090/2.54"
Open Length : 7.05/6.05"
Closed Length: 4.08" | 9.78cm
Handle Thickness: 0.40" | 10.16mm
Weight: 2.28oz | 64.64g
I want the left-hand side to be the dictionary keys and the right-hand side to be the values. The ultimate goal is to write a CSV in which the left-hand side supplies the column headers and the right-hand side supplies the data beneath them. Since I am scraping multiple pages, the left-hand side repeats from page to page, and each name ends up with multiple values from the right-hand side.
So the desired output should be something like this:
| Blade Length | Blade Thickness | Open Length | ... |
|---|---|---|---|
| 2.97/1.97" | 4.34/12.54 | 1.23/5.65 | ... |
| 4.24/2.23" | 2.34/5.63 | 5.43/2.90 | ... |
| 3.54/2.65 | 2.57/6.54 | 6.90/4.20 | ... |
| 7.65/5/43 | 4.65/3.56 | 3.32/4.54 | ... |
If there is a better way to do this than dictionaries, please let me know!
The HTML is something like this:
<table class="specifications-table">
<tbody>
<tr>
<th class="col label">Blade Length:</th>
<td class="col value">2.97/1.97"</td>
</tr>
<tr>
<th class="col label">Blade Thickness:</th>
<td class="col value">0.090"</td>
</tr>
<tr>
<th class="col label">Open Length: </th>
<td class="col value">7.05/6.05"</td>
</tr>
<tr>
<th class="col label">Closed Length: </th>
<td class="col value">4.08"</td>
</tr>
<tr>
<th class="col label">Handle Thickness:</th>
<td class="col value">0.40" </td>
</tr>
<tr>
<th class="col label">Weight:</th>
<td class="col value">2.28oz</td>
</tr>
</tbody>
</table>
Here is my attempt to get this data:
Specs = []
Specs_Datas = defaultdict(list)
Specs2 = []
for links in product_links:
    HTML2 = requests.get(links, HEADER)
    Booti2 = soup(HTML2.content,"html.parser")
    table_feature = Booti2.select_one('#product-attribute-specs-table')
    #find all rows
    try:
        for S in Booti2.find_all('th', attrs ={'class': 'col label'}):
            Specs.append(S.text.replace('\n', '').strip())
        unique_specs = np.unique(Specs).tolist()
        while unique_specs in Specs:
            for SD in Booti2.find_all('td', attrs ={'class': 'col value'}):
                Specs2.append(SD.text.replace('\n', '').strip())
                Specs_Datas[unique_specs] = []
                Specs_Datas[unique_specs].update(SD.text.replace('\n', '').strip())
                #Specs.append(S.text.replace('\n', '').strip())
    except:
        continue
Any help would be appreciated!! Thank you so much!!!
Solution 1:[1]
Assuming you actually have two lists as:
specs = [
"Blade length",
"Blade Thickness",
"Open Length",
"Closed Length",
"Handle Thickness",
"Weight"
]
and
spec_data = [
"2.97/1.97\"",
"0.090/2.54\"",
"7.05/6.05\"",
"4.08\" | 9.78cm",
"0.40\" | 10.16mm",
"2.28oz | 64.64g"
]
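If the two lists still have to be produced, one possible way is to read each table row as a unit, so that every label stays paired with its own value. This is only a sketch based on the HTML shown in the question: the `extract_specs` helper is a hypothetical name, and the `table.specifications-table` selector is taken from that snippet and may differ on the real pages.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, used as sample input.
html = """
<table class="specifications-table">
  <tbody>
    <tr>
      <th class="col label">Blade Length:</th>
      <td class="col value">2.97/1.97"</td>
    </tr>
    <tr>
      <th class="col label">Open Length: </th>
      <td class="col value">7.05/6.05"</td>
    </tr>
    <tr>
      <th class="col label">Weight:</th>
      <td class="col value">2.28oz</td>
    </tr>
  </tbody>
</table>
"""


def extract_specs(page_html):
    """Return (labels, values), read row by row from the specifications table.

    Hypothetical helper: the selector is an assumption based on the question.
    """
    page = BeautifulSoup(page_html, "html.parser")
    labels, values = [], []
    for row in page.select("table.specifications-table tr"):
        label = row.find("th")
        value = row.find("td")
        if label is None or value is None:
            continue  # skip rows that lack either cell
        # Drop surrounding whitespace and the trailing colon from the label.
        labels.append(label.get_text(strip=True).rstrip(":").strip())
        values.append(value.get_text(strip=True))
    return labels, values


labels, values = extract_specs(html)
print(labels)
print(values)
```

Reading the `th` and `td` from the same `tr` avoids the two lists drifting out of step when a row is missing one of its cells.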
The easiest path might be to zip() them together:
specs_reshaped = [
    {key: value for key, value in zip(specs, spec_data)}
]
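Since the question mentions scraping several pages, the same idea could be extended to one dictionary per page. The sketch below is an assumption about how that might look, not part of the original answer: `pages` holds made-up sample values, `write_rows` is a hypothetical helper, and the column list is built in order of first appearance so that a spec missing from one page is left blank through `restval`.

```python
import csv
import io

# Made-up sample input: one (specs, spec_data) pair per scraped page.
pages = [
    (["Blade Length", "Blade Thickness", "Weight"], ['2.97/1.97"', '0.090"', "2.28oz"]),
    (["Blade Length", "Weight", "Closed Length"], ['3.54/2.65"', "3.10oz", '4.08"']),
]

# One dictionary per page, built the same way as for a single page.
rows = [dict(zip(specs, spec_data)) for specs, spec_data in pages]

# Collect column names in order of first appearance across all pages.
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)


def write_rows(file_out, fieldnames, rows):
    """Write the rows as CSV; specs a page lacks become empty cells."""
    writer = csv.DictWriter(file_out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
write_rows(buffer, fieldnames, rows)
print(buffer.getvalue())
```

With a single page, this reduces to the one-row case described in this answer.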
Then write it out with csv.DictWriter():
import csv

with open("output.csv", "w", newline="", encoding="utf-8") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=specs)
    writer.writeheader()
    writer.writerows(specs_reshaped)
This produces a file like:
Blade length,Blade Thickness,Open Length,Closed Length,Handle Thickness,Weight
"2.97/1.97""","0.090/2.54""","7.05/6.05""","4.08"" | 9.78cm","0.40"" | 10.16mm",2.28oz | 64.64g
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | JonSG |
