Best way to automate AWS EMR creation, termination, and PySpark jobs
I have PySpark code in S3. I will execute it, write test cases in PySpark, and load the results into Snowflake. The job runs for about one minute on a daily basis, and I also need a log if it has failed.
I am new to AWS. How do I spin up a new EMR cluster and execute a PySpark job on it? I know we can do it through Airflow; can we also do it through AWS itself, or by some other means?
Which would be better in terms of cost and ease of use: Airflow or AWS Data Pipeline?
Thanks, Xi
Solution 1: [1]
Airflow has operators to interact with EMR: EmrCreateJobFlowOperator, EmrTerminateJobFlowOperator, EmrAddStepsOperator, and so on.
Information about these operators and how to use them can be found in this doc.
A skeleton DAG could look like this:
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

with DAG(
    dag_id='emr_job_flow_manual_steps_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='0 3 * * *',
) as dag:
    # Create machine (cluster config lives in JOB_FLOW_OVERRIDES, see below)
    cluster_creator = EmrCreateJobFlowOperator(
        task_id='create_job_flow',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    # Submit steps to the running cluster (step definitions in SPARK_STEPS, see below)
    step_adder = EmrAddStepsOperator(
        task_id='add_steps',
        job_flow_id=cluster_creator.output,
        steps=SPARK_STEPS,
    )
    # Verify the step completed; the Airflow task fails if the step fails
    step_checker = EmrStepSensor(
        task_id='watch_step',
        job_flow_id=cluster_creator.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
    )
    # Terminate machine so the cluster only costs money while the job runs
    cluster_remover = EmrTerminateJobFlowOperator(
        task_id='remove_cluster',
        job_flow_id=cluster_creator.output,
    )

    cluster_creator >> step_adder >> step_checker >> cluster_remover
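The skeleton leaves the cluster configuration and the Spark step definition open. Below is a minimal sketch of what the JOB_FLOW_OVERRIDES and SPARK_STEPS values it references could look like; the bucket names, script path, instance type, and EMR release label are placeholder assumptions to replace with your own. spark-submit runs the PySpark script straight from S3, and LogUri tells EMR to copy cluster and step logs to S3, which covers the need to inspect a failed run.

SPARK_STEPS = [
    {
        'Name': 'run_pyspark_job',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                's3://my-bucket/scripts/my_job.py',  # placeholder: your PySpark code in S3
            ],
        },
    },
]

JOB_FLOW_OVERRIDES = {
    'Name': 'pyspark_daily_cluster',
    'ReleaseLabel': 'emr-6.7.0',                 # placeholder EMR release
    'Applications': [{'Name': 'Spark'}],
    'Instances': {
        'InstanceGroups': [
            {
                'Name': 'Primary node',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge',     # placeholder instance type
                'InstanceCount': 1,
            },
        ],
        # Keep the cluster alive after creation so the DAG can add steps,
        # then shut it down explicitly via EmrTerminateJobFlowOperator.
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
    },
    'JobFlowRole': 'EMR_EC2_DefaultRole',
    'ServiceRole': 'EMR_DefaultRole',
    'LogUri': 's3://my-bucket/emr-logs/',        # placeholder: step/cluster logs land here
}

Because the EmrStepSensor fails its Airflow task when the EMR step fails, you get failure visibility in the Airflow UI on top of the logs EMR writes to the LogUri bucket.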
A full example DAG for manual steps is available here; a full example DAG for automatic steps is available here.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Elad Kalif |
