'Allow failure in Kubeflow Pipelines
Context
I have a kubeflow pipeline running multiple stages with python scripts. In one of the inner stages, I use kfp.dsl.ParallelFor to run 5-6 deep learning models, and in the next stage, I choose the best one with respect to a validation metric.
Problem
The issue is if one of the models fail, the whole pipeline fails. It'll complain that the dependencies of the next stage is not satisfied. However, if model A fails and model B is still running at that time, the pipeline state will continue to be running till the time model B is running, and it'll change only at end of all models in that stage.
Question
How can I allow partial failures in a stage? As long as at least one of the model is working, the next stage can work. How do I make it happen in kubeflow? For example, I have setup CI in Gitlab, which supports this.
If it is not possible to have this, I want the pipeline to fail immediately as soon as one model fails, and not wait for others only to fail later, which possibly can be way later based on configurations.
Obviously, a way to avoid failure will be to include a top level try - except in the python script, and it'll always return exit code as 0. However, in this way there shall be no visual indication that one (or more) models failed. It can be recovered from the logs, but it's rarely monitored in a scheduled pipeline when the entire run status is successful.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
