'Apache Airflow - split task in multiple parallel tasks where each task takes portion of the list as input argument

Let's say I have list with 100 items called mylist.

I have function that performs certain operation with each element of the list. I order to speed things up I want define n parallel tasks. Each task should take 100/n list items and process them.

I understand all about executors and core settings which I need to change to enable parallelism, I need some basic pointer on how to set tasks right.

My idea is to build it like this:

imports...

mylist = [item0, item2,...,item99]
n=5


def myfunction(sub_list):
    """This is a function that will run within the DAG execution"""
    """Procesing list elements"""


# Generate 5 tasks
for i in range(1,len(lst), n):
    task = PythonOperator(
        task_id='myfunction_' + str(i),
        python_callable=myfunction,
        op_kwargs={'sub_list': mylist[i:i + n]},
        dag=dag,
    )

 task

I've assebmled this algorithm according to documentation. Is this proper way of doing this?



Solution 1:[1]

My initial proposition is correct, so you can run tasks in parallel while all tasks are using certain portion of the list as input like this:

imports...

mylist = [item0, item2,...,item99]
n=5 # Number of parallel tasks, this is also number of splits that we are doing on list. 


def myfunction(sub_list):
    """This is a function that will run within the DAG execution"""
    """Procesing list elements"""


# Generate some dummy initial task
d1 = DummyOperator(task_id='kick_off_dag')

# Generate 5 tasks
for i in range(1,len(mylist), n):
    five_parallelTasks = PythonOperator(
        task_id='myfunction_' + str(i),
        python_callable=myfunction,
        op_kwargs={'sub_list': mylist[i:i + n]},
        dag=dag,
    )

 
d1 >> five_parallelTasks

Where range(1,len(lst), n) is iterating over list by every n-th element. This allows us to define input in each task as mylist[i:i + n]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Calder White