Web scraping a script with Beautiful Soup

I'm a Python newbie building a web scraper to get data from a site so I can buy electricity when it's cheapest. The problem is that the data I need is inside a script. Can I use Beautiful Soup to get it? I have tried a lot of different ways and could really use some help. The page I want to scrape is https://www.elbruk.se/timpriser-se3-stockholm and the information I need is in the data list below.

const labels = [
'00:00','01:00','02:00','03:00','04:00','05:00','06:00','07:00','08:00','09:00','10:00','11:00','12:00','13:00','14:00','15:00','16:00','17:00','18:00','19:00','20:00','21:00','22:00','23:00','24:00',];
const data = {
    labels: labels,
    datasets: [{
        stepped:true,
        label: 'Idag',
        backgroundColor: '#357DA7',
        borderColor: '#357DA7',
        data: [94.24,91.59,93.52,97.70,103.23,155.15,233.20,269.03,279.92,255.87,231.30,226.70,209.64,174.65,164.84,154.16,134.04,199.48,205.03,204.88,192.49,154.16,74.40,19.47,19.47]
    },

(This is at row 494 in the page source.) Is it possible to extract it with Beautiful Soup, or am I at a dead end here? Should I parse it as JSON, maybe? My first hope was an API, but I could not find any site that offers one for this information.



Solution 1:[1]

An easy (but not perfect) solution is to iterate over all the script elements and find the one that contains "const labels =". After that, you just have to trim off the text you don't want and parse the list.
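
The steps above could be sketched roughly as follows. This is a minimal, offline illustration and not the answerer's code: SAMPLE_HTML is a made-up, shortened stand-in for the downloaded page, and ScriptCollector and extract_labels are hypothetical helper names. It uses the standard library's html.parser so it runs without extra packages; with Beautiful Soup the same loop would run over soup.find_all("script").

```python
import ast
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for the downloaded page source.
SAMPLE_HTML = """
<html><head><script>var unrelated = 1;</script></head>
<body><script>
const labels = [
'00:00','01:00','02:00',];
const data = {
    labels: labels,
    datasets: [{
        label: 'Idag',
        data: [94.24,91.59,93.52]
    },
</script></body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_labels(html):
    """Return the labels list from the first script containing the marker."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    marker = "const labels ="
    for script in collector.scripts:
        start = script.find(marker)
        if start == -1:
            continue
        # Trim to the array literal: first "[" after the marker up to the next "]".
        open_idx = script.index("[", start)
        close_idx = script.index("]", open_idx)
        # A Python list literal tolerates the trailing comma in the JS array.
        return ast.literal_eval(script[open_idx:close_idx + 1])
    return None


labels = extract_labels(SAMPLE_HTML)
print(labels)
```

The same trimming could be repeated for the "data: [" array to get the prices. It is "not perfect" in the sense that it depends on the exact marker text staying the same on the page.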

Solution 2:[2]

BeautifulSoup is not required, because in the end you will need a lot of regex replacements anyway: the script content is not valid JSON.

import requests
import re
import json

theURL = 'https://www.elbruk.se/timpriser-se3-stockholm'  # page from the question
response = requests.get(theURL)
data = re.search(r'data\s=\s(\{[^;]+)', response.text)
data = data[1].replace("'", '"') # 'Idag' -> "Idag"
data = data.replace(",]", ']') # ,] -> ]
data = re.sub(r"(\w+):", r'"\1":', data) # labels: labels -> "labels": labels
data = re.sub(r":\s?(\w+)", r':"\1"', data) # "labels": "labels"
data = json.loads(data)

print(data['datasets'][0]['backgroundColor'])

# print(json.dumps(data, indent=2))
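
To try the same replacement steps without hitting the network, they can be wrapped in a function and run on a sample string. This is only a sketch under assumptions: SAMPLE_JS is a made-up, shortened version of the script text, and parse_chart_data is a hypothetical name. Note that the last substitution turns bare words into strings, so true comes back as "true" and the labels reference as "labels".

```python
import json
import re

# Hypothetical, shortened stand-in for the script text on the page.
SAMPLE_JS = """
const data = {
    labels: labels,
    datasets: [{
        stepped:true,
        label: 'Idag',
        backgroundColor: '#357DA7',
        data: [94.24,91.59,93.52,]
    }]
};
"""


def parse_chart_data(js_text):
    """Turn the `const data = {...}` object literal into a Python dict."""
    match = re.search(r"data\s=\s(\{[^;]+)", js_text)
    if match is None:
        raise ValueError("no 'data = {...}' object found")
    text = match[1].replace("'", '"')            # 'Idag' -> "Idag"
    text = text.replace(",]", "]")               # drop trailing commas in arrays
    text = re.sub(r"(\w+):", r'"\1":', text)     # quote the keys
    text = re.sub(r":\s?(\w+)", r':"\1"', text)  # quote bare words (true, labels)
    return json.loads(text)


chart = parse_chart_data(SAMPLE_JS)
prices = chart["datasets"][0]["data"]
print(prices)
```

On the real page the prices would then be in the "data" list of the first dataset, in the same order as the hour labels.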

Solution 3:[3]

Use Python to download the page source, then parse it with the regex below and take the first match it finds.

/^const labels(.*)const config = {type: 'line',data: data,options: {}};/gmis
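
The pattern above is written in JavaScript notation. A rough Python equivalent might look like the sketch below, assuming the page contains the terminating text exactly as written in the pattern. SAMPLE_SOURCE is a made-up stand-in for the downloaded source. The m, i and s flags map to re.MULTILINE, re.IGNORECASE and re.DOTALL; the g flag has no counterpart here because re.search already returns only the first match.

```python
import re

# Hypothetical, shortened stand-in for the downloaded page source.
SAMPLE_SOURCE = """<script>
const labels = [
'00:00','01:00',];
const data = {
    labels: labels,
    datasets: [{
        label: 'Idag',
        data: [94.24,91.59]
    }]
};
const config = {type: 'line',data: data,options: {}};
</script>"""

# Terminating text taken from the pattern; escaped so braces are literal.
TERMINATOR = "const config = {type: 'line',data: data,options: {}};"

pattern = re.compile(
    r"^const labels(.*)" + re.escape(TERMINATOR),
    re.MULTILINE | re.IGNORECASE | re.DOTALL,
)

match = pattern.search(SAMPLE_SOURCE)
# Group 1 is everything between "const labels" and the config line.
chart_source = match.group(1) if match else None
print(chart_source)
```

The captured text is still JavaScript, so the labels and prices have to be pulled out of it in a further step.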

Solution 4:[4]

Assuming you want just the labels and values for that chart, you could regex them out as two lists and turn them into a dict.

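import ast
import re

import requests

r = requests.get('https://www.elbruk.se/timpriser-se3-stockholm')

# Quoted 24 hour clock times such as '00:00'; the first group of each match is the full time
times = [i[0] for i in re.findall(r"'((2[0-4]|[01]?[0-9]):([0-5]?[0-9]))'", r.text)]
# The array after "data: " that appears before the text 'Idag snitt'
prices = ast.literal_eval(re.search(r"data: .*(\[.*?\])[\s\S]+(?='Idag snitt')", r.text).group(1))
idag = dict(zip(times, prices))

print(idag)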

I adapted the regex for 24 hour times from O’Reilly Regular Expressions Cookbook, 2nd Edition by Jan Goyvaerts, Steven Levithan.

The first regex grabs the 24 hour clock times. The second regex uses data: as the starting text, then pulls out the array of numbers that appears before the text Idag snitt (using a positive lookahead). ast.literal_eval is used to convert the string form of the array into a Python list.
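
As an offline illustration of the two regexes, here is a small sketch that runs them on a sample string and then picks the cheapest hour, which is what the question is ultimately after. SAMPLE_TEXT is made up: the values are invented, and placing 'Idag snitt' as the label of a second dataset is an assumption about the page layout.

```python
import ast
import re

# Hypothetical, shortened stand-in for the page text (values are invented).
SAMPLE_TEXT = """
const labels = [
'00:00','01:00','02:00',];
const data = {
    labels: labels,
    datasets: [{
        label: 'Idag',
        data: [94.24,19.47,93.52]
    },
    {
        label: 'Idag snitt',
        data: [69.08,69.08,69.08]
    }]
};
"""

# Same patterns as in the answer above.
times = [m[0] for m in re.findall(r"'((2[0-4]|[01]?[0-9]):([0-5]?[0-9]))'", SAMPLE_TEXT)]
found = re.search(r"data: .*(\[.*?\])[\s\S]+(?='Idag snitt')", SAMPLE_TEXT)
prices = ast.literal_eval(found.group(1))

idag = dict(zip(times, prices))
# The key with the lowest price is the cheapest hour.
cheapest = min(idag, key=idag.get)
print(idag)
print(cheapest)
```

Sorting the dict items by value, for example with sorted(idag.items(), key=lambda kv: kv[1]), would give the hours ranked from cheapest to most expensive.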

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nacho R
Solution 2 uingtea
Solution 3 Dean Van Greunen
Solution 4