How to Extract JSON From HTML Source Code Using Regex
Python Script
import requests
import json
from bs4 import BeautifulSoup
import re
url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
# Save source code to file for testing
with open("sourcecode.html", "w", encoding='utf-8') as file:
    file.write(str(soup))
# Regex pattern to capture JSON data within webpage source code
regex_pattern = r"{\"delivery\"*.*false*}}}"
I'm trying to use a regex to pull the JSON data embedded in the source code of the URL listed above.
I manually copied the source code from that URL into regex101.com and tested the following regex pattern:
{\"delivery\"*.*false*}}}
The pattern appears to capture the JSON data I need.
Issue
When I view the contents of the soup variable or the saved file, it appears to contain the full HTML source code.
However, I do not know how to apply the regex so that it captures only the JSON string I need to build my desired Python dictionary.
Any help would be greatly appreciated.
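One possible approach, sketched below, is to use the regex only to find where the JSON object starts and then let json.JSONDecoder.raw_decode parse a single object from that position. This is a swapped-in technique, not the pattern from the question: it avoids having to guess where the object ends. The sample_html string, the extract_json helper, and the window.__DATA__ name are hypothetical and only illustrate the idea; the real page's structure is an assumption.

```python
import json
import re

# Hypothetical stand-in for the page source; the real page layout is an assumption.
sample_html = (
    '<html><body><script>window.__DATA__ = '
    '{"delivery": {"standard": {"available": true, "express": false}}};'
    '</script></body></html>'
)


def extract_json(html, start_pattern=r'\{"delivery"'):
    """Return the first JSON object in html that begins at start_pattern, or None."""
    match = re.search(start_pattern, html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it, so no end-of-object regex is needed.
        data, _end = json.JSONDecoder().raw_decode(html, match.start())
    except json.JSONDecodeError:
        return None
    return data


data = extract_json(sample_html)
print(data["delivery"]["standard"]["express"])
```

With the script from the question, the same helper could be called as extract_json(r.text). If the page stores the JSON inside an escaped string rather than as a plain object literal, it would need to be unescaped first, and this sketch does not handle that case.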
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
