'Correct scraping of a CSV file from government website
I am very new to python - and I am trying to scrape this website https://itc.aeso.ca/itc/public/atc/historic/. I have found the target link (I see how the string url is manipulated to something like this "https://itc.aeso.ca/itc/public/api/v2/interchange?startDate=20220517&endDate=20220519&pageNo=1&pageSize=3"
I have tried to parse it two different ways, using requests and pandas, and both have given me dataframes that are really hard to use / diagnose.
import requests
import pandas as pd
import json
payload = {'startDate': '20220517', 'endDate': '20220519'}
r = requests.get("https://itc.aeso.ca/itc/public/atc/historic/", params=payload)
csv_url = "https://itc.aeso.ca/itc/public/api/v2/interchange?startDate=20220517&endDate=20220519&pageNo=1&pageSize=3"
req = requests.get(csv_url)
url_content = req.content
ss = url_content
dd=json.loads(req.text)
This gave me data that looks like this
{'message': 'Report successfully retrieved.',
'responseCode': '200',
'localTimestamp': '2022-05-19 10:17:59 MDT',
'return': {'BcIntertie': {'Allocations': [{'flowgate': False,
'date': '2022-05-17',
'he': '1',
'import': {'transferType': 'BC_IMPORT',
'reason': '2L113 outage',
'atc': 670,
'trmTotal': 130,
'ttc': 800,
'trmSystem': 130,
'trmAllocation': 0,
'grossOffer': 50,
'effectiveLocalTime': '2022-05-17 00:00:00 MDT',
'updatedLocalTime': '2022-05-16 19:35:40 MDT'},
'export': {'transferType': 'BC_EXPORT',
'reason': '',
'atc': 950,
'trmTotal': 50,
'ttc': 1000,
'trmSystem': 50,
'trmAllocation': 0,
'grossOffer': 0,
'effectiveLocalTime': '2022-05-17 00:00:00 MDT',
'updatedLocalTime': '2021-04-17 09:13:00 MDT'}},
{'flowgate': False,
'date': '2022-05-17',
'he': '2',
'import': {'transferType': 'BC_IMPORT',
'reason': '2L113 outage',
'atc': 670,
'trmTotal': 130,
'ttc': 800,
'trmSystem': 130,
'trmAllocation': 0,
'grossOffer': 50,
'effectiveLocalTime': '2022-05-17 01:00:00 MDT',
'updatedLocalTime': '2022-05-16 22:35:40 MDT'},
'export': {'transferType': 'BC_EXPORT',
'reason': '',
'atc': 950,
The second way I tried was using pandas
import pandas as pd
df_names = pd.read_csv("https://itc.aeso.ca/itc/public/api/v2/interchange?
startDate=20220517&endDate=20220519&pageNo=1&pageSize=3")`
This gave me data that was all columns and only one row.
{"message":"Report successfully retrieved." responseCode:"200" localTimestamp:"2022-05-19 11:52:53 MDT" return:{"BcIntertie":{"Allocations":[{"flowgate":false date:"2022-05-17" he:"1" import:{"transferType":"BC_IMPORT" reason:"2L113 outage" atc:670 trmTotal:130 ... updatedLocalTime:"2022-05-19 07:59:40 MDT"}.3 export:{"transferType":"SYSTEM_EXPORT".71 reason:"".551 atc:1088.71 trmTotal:65.224 ttc:1153.71 trmSystem:65.225 grossOffer:0.387 effectiveLocalTime:"2022-05-19 23:00:00 MDT".9 updatedLocalTime:"2021-04-19 09:12:59 MDT"}}]}}}
Everything I find online has it working so cleanly, I have tried numerous solutions using stuff like GZip and etc. Not sure where I am going wrong.
Solution 1:[1]
It is returning a json string. You can convert it using the json library.
import json
json_resp = json.loads(req.text)
Or probably better yet, requests has json built into it:
json_resp = req.json()
A real example of how to access data in your returned dataset would then be:
print(json_resp['return']['BcIntertie']['Allocations'][0]['date'])
returns:
2022-05-17
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mr_mooo_cow |
