'Correct scraping of a CSV file from government website

I am very new to python - and I am trying to scrape this website https://itc.aeso.ca/itc/public/atc/historic/. I have found the target link (I see how the string url is manipulated to something like this "https://itc.aeso.ca/itc/public/api/v2/interchange?startDate=20220517&endDate=20220519&pageNo=1&pageSize=3"

I have tried to parse it two different ways, using requests and pandas, and both have given me dataframes that are really hard to use / diagnose.

import requests
import pandas as pd
import json

payload = {'startDate': '20220517', 'endDate': '20220519'}
r = requests.get("https://itc.aeso.ca/itc/public/atc/historic/", params=payload)
csv_url = "https://itc.aeso.ca/itc/public/api/v2/interchange?startDate=20220517&endDate=20220519&pageNo=1&pageSize=3"
req = requests.get(csv_url)
url_content = req.content
ss = url_content
dd=json.loads(req.text)

This gave me data that looks like this

{'message': 'Report successfully retrieved.',
 'responseCode': '200',
 'localTimestamp': '2022-05-19 10:17:59 MDT',
 'return': {'BcIntertie': {'Allocations': [{'flowgate': False,
     'date': '2022-05-17',
     'he': '1',
     'import': {'transferType': 'BC_IMPORT',
      'reason': '2L113 outage',
      'atc': 670,
      'trmTotal': 130,
      'ttc': 800,
      'trmSystem': 130,
      'trmAllocation': 0,
      'grossOffer': 50,
      'effectiveLocalTime': '2022-05-17 00:00:00 MDT',
      'updatedLocalTime': '2022-05-16 19:35:40 MDT'},
     'export': {'transferType': 'BC_EXPORT',
      'reason': '',
      'atc': 950,
      'trmTotal': 50,
      'ttc': 1000,
      'trmSystem': 50,
      'trmAllocation': 0,
      'grossOffer': 0,
      'effectiveLocalTime': '2022-05-17 00:00:00 MDT',
      'updatedLocalTime': '2021-04-17 09:13:00 MDT'}},
    {'flowgate': False,
     'date': '2022-05-17',
     'he': '2',
     'import': {'transferType': 'BC_IMPORT',
      'reason': '2L113 outage',
      'atc': 670,
      'trmTotal': 130,
      'ttc': 800,
      'trmSystem': 130,
      'trmAllocation': 0,
      'grossOffer': 50,
      'effectiveLocalTime': '2022-05-17 01:00:00 MDT',
      'updatedLocalTime': '2022-05-16 22:35:40 MDT'},
     'export': {'transferType': 'BC_EXPORT',
      'reason': '',
      'atc': 950,

The second way I tried was using pandas

import pandas as pd    

df_names = pd.read_csv("https://itc.aeso.ca/itc/public/api/v2/interchange? 
    startDate=20220517&endDate=20220519&pageNo=1&pageSize=3")`

This gave me data that was all columns and only one row.

{"message":"Report successfully retrieved." responseCode:"200"  localTimestamp:"2022-05-19 11:52:53 MDT"    return:{"BcIntertie":{"Allocations":[{"flowgate":false  date:"2022-05-17"   he:"1"  import:{"transferType":"BC_IMPORT"  reason:"2L113 outage"   atc:670 trmTotal:130    ... updatedLocalTime:"2022-05-19 07:59:40 MDT"}.3   export:{"transferType":"SYSTEM_EXPORT".71   reason:"".551   atc:1088.71 trmTotal:65.224 ttc:1153.71 trmSystem:65.225    grossOffer:0.387    effectiveLocalTime:"2022-05-19 23:00:00 MDT".9  updatedLocalTime:"2021-04-19 09:12:59 MDT"}}]}}}

Everything I find online has it working so cleanly, I have tried numerous solutions using stuff like GZip and etc. Not sure where I am going wrong.



Solution 1:[1]

It is returning a json string. You can convert it using the json library.

import json

json_resp = json.loads(req.text)

Or probably better yet, requests has json built into it:

json_resp = req.json()

A real example of how to access data in your returned dataset would then be:

    print(json_resp['return']['BcIntertie']['Allocations'][0]['date'])
returns:
    2022-05-17

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mr_mooo_cow