Downloading large files (10GB+) from S3 via Python/Athena fails (token expired)
I am still learning Python (3.6) and am now working on AWS. I am trying to automate a process in which a user runs a query in Athena, the query results are written to an S3 bucket, and I then pull the file from S3 to my local machine to run further analysis with legacy tools. Currently all of this is done manually, step by step, starting with a query in the Athena Query Editor.
The problem I am facing is that the file(s) will be larger than 10GB, and the SAML profile token expires after 1 hour. I have read some documentation about auto-refreshing credentials, but I don't see how to implement a solution like that while the file is being downloaded. My code is below (the closest I have gotten to a successful run, with about 10,000 records).
Any suggestions/help is appreciated.
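One idea I have been experimenting with, sketched below: instead of a single long-running `download_file` call, fetch the object in byte ranges and build a fresh session/client before each range, so every request resolves credentials anew. This only helps if something (e.g. a background SAML login tool) rewrites the cached credentials before the old ones expire; the bucket, key, and profile names are placeholders from my setup, not anything verified.

```python
def byte_ranges(total_size, chunk_size):
    """Yield (start, end) inclusive byte offsets covering total_size bytes."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1

def download_in_ranges(bucket, key, dest, profile='saml',
                       chunk_size=256 * 1024 * 1024):
    # boto3 imported lazily; the range math above works without it installed
    import boto3
    size = boto3.Session(profile_name=profile).client('s3').head_object(
        Bucket=bucket, Key=key)['ContentLength']
    with open(dest, 'wb') as f:
        for start, end in byte_ranges(size, chunk_size):
            # New session/client per chunk: if the SAML profile's cached
            # credentials were refreshed on disk, this request picks them up.
            s3 = boto3.Session(profile_name=profile).client('s3')
            body = s3.get_object(Bucket=bucket, Key=key,
                                 Range=f'bytes={start}-{end}')['Body']
            f.write(body.read())
```

A nice side effect is that a failed chunk can be retried individually instead of restarting a 10GB transfer from zero.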
import boto3
from boto3.s3.transfer import TransferConfig
import pandas as pd
import time

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

session = boto3.Session(profile_name='saml')
athena_client = session.client("athena")

query_response = athena_client.start_query_execution(
    QueryString="SELECT * FROM TABLENAME WHERE=<condition>",
    QueryExecutionContext={"Database": 'some_db'},
    ResultConfiguration={
        "OutputLocation": 's3://131653427868-heor-epi-workbench-results',
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
    WorkGroup='myworkgroup'
)
print(query_response)

iteration = 30
temp_file_location: str = "C:\\Users\\<user>\\Desktop\\Python Projects\\tablename.csv"

while iteration > 0:
    iteration -= 1
    print(iteration)
    query_response_id = athena_client.get_query_execution(
        QueryExecutionId=query_response['QueryExecutionId'])
    state = query_response_id['QueryExecution']['Status']['State']
    print(query_response_id)
    if state in ('FAILED', 'CANCELLED'):
        print("The query failed:", state)
        break
    elif state == 'SUCCEEDED':
        print("Query completed. Proceeding to download file...")
        config = TransferConfig(max_concurrency=5)
        s3_client = session.client("s3")
        s3_client.download_file('131653427868-heor-epi-workbench-results',
                                f"{query_response['QueryExecutionId']}.csv",
                                temp_file_location,
                                Config=config)
        print("Download complete.")
        break
    else:
        print("Query state:", state)
        time.sleep(10)

pandasDF = pd.read_csv(temp_file_location)
print(pandasDF)
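Related concern with the last step: a single `pd.read_csv` on a 10GB+ file will likely exhaust memory. A sketch of streaming the CSV in chunks instead, with `count_rows_in_chunks` as a hypothetical stand-in for whatever per-chunk analysis is actually needed:

```python
import io
import pandas as pd

def count_rows_in_chunks(path_or_buf, chunksize=1_000_000):
    """Stream the CSV in chunks so the whole file never sits in memory."""
    total = 0
    for chunk in pd.read_csv(path_or_buf, chunksize=chunksize):
        total += len(chunk)   # replace with real per-chunk analysis
    return total

# tiny in-memory demo
csv_text = "a,b\n1,2\n3,4\n5,6\n"
print(count_rows_in_chunks(io.StringIO(csv_text), chunksize=2))  # → 3
```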
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow