'Is there a way to force a component to execute last?

My current pipeline runs a training process across multiple users in a ParallelFor operation, eg:

def pipeline(run_id):
    setup_step = create_setup_step(run_id)
    
    with dsl.ParallelFor(setup_step.outputs['users']) as user:
        preprocess = create_preprocess_step(run_id, user.user_id)
        train = create_training_step(run_id, 
                                     user.user_id, 
                                     preprocess.outputs['user_data'])

    summary = create_summary_step(run_id)  # this is the component that needs to execute last

My goal is to add the "summarize" step which executes after all of the above components have finished running. This component will compile a report across all users, so it should not exist inside the ParallelFor

Each component's results are being logged to a database, so the summary component gets its data by querying the db rather than trying to "fan in" the ParallelFor operator.

I have tried specifying to run after the train step, as in create_summary_step(run_id).after(train) but that spins up one summary per branch of the ParallelFor.

I have had some success by manually running the summary component after the run completes, as in client.wait_for_run_completion(...), but this restricts me from compiling and uploading a pipeline to EKS, which is the end goal.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source