Fault Tolerance in Flink

How can we configure a Flink application to start/restart only the pods/(sub)tasks that crashed, instead of restarting the whole job, i.e. all the tasks/subtasks in the pipeline, including the healthy ones? Restarting healthy tasks along with the crashed ones seems unnecessary. The application processes messages from Kafka and writes its output back to Kafka; it runs on Flink 1.13.5 on Kubernetes, using Lyft's Kubernetes operator to schedule and run the Flink job. We tried setting the property **jobmanager.execution.failover-strategy** to **region** (a sketch of where this property goes is shown below), but it did not help.
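For context, here is a minimal sketch of where that property can be set when deploying through Lyft's flinkk8soperator, assuming its FlinkApplication custom resource and its flinkConfig passthrough; the manifest is trimmed, and the name and image are placeholders rather than details from the original question.

```yaml
# Trimmed, hypothetical FlinkApplication manifest (Lyft flinkk8soperator);
# name and image are placeholders.
apiVersion: flink.k8s.io/v1beta1
kind: FlinkApplication
metadata:
  name: kafka-pipeline
spec:
  image: registry.example.com/kafka-pipeline:latest  # placeholder
  flinkConfig:
    # Restart only the smallest set of pipelined regions on failure
    # (this is already the default in Flink 1.13).
    jobmanager.execution.failover-strategy: region
```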



Solution 1:[1]

Flink supports partial restarts only to the extent that this is possible without sacrificing correct, exactly-once results.

After recovery, failed tasks are restarted from the latest checkpoint. Their inputs are rewound, and they will reproduce previously emitted results. If healthy downstream consumers of those failed tasks aren't also reset and restarted from that same checkpoint, then they will end up producing duplicate/inflated results.

With streaming jobs, only embarrassingly parallel pipelines have disjoint pipelined regions. Any use of keyBy or rebalancing (e.g., to change the parallelism) produces a job with a single failure region. The sketch below illustrates the difference.
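To make this concrete, here is a minimal sketch of the two topologies in the Flink 1.13 DataStream API; it is not from the original answer, and the job name, parallelism, and numbers are illustrative.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RegionDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // Embarrassingly parallel: source -> map -> sink with forward
        // connections only. Each of the 4 parallel slices forms its own
        // pipelined region, so region failover restarts only the failed slice.
        env.fromSequence(0, 1_000_000)
           .map(n -> n * 2).returns(Types.LONG)
           .print();

        // keyBy introduces an all-to-all hash exchange: every upstream
        // subtask feeds every downstream subtask, fusing the whole job into
        // a single region. Any task failure then restarts all tasks.
        env.fromSequence(0, 1_000_000)
           .keyBy(n -> n % 10, Types.LONG)
           .reduce(Long::sum)
           .print();

        env.execute("region-demo");
    }
}
```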

Solution 2:[2]

Restart Pipelined Region Failover Strategy.

This strategy groups tasks into disjoint regions. When a task failure is detected, this strategy computes the smallest set of regions that must be restarted to recover from the failure. For some jobs this can result in fewer tasks being restarted compared to the Restart All Failover Strategy.

Refer to https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
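Note that the failover strategy only determines which tasks are restarted; a restart strategy must also allow the restart to happen. A minimal flink-conf.yaml sketch, with illustrative values that are not from the original answer:

```yaml
# Which tasks to restart on failure (already the default in Flink 1.13).
jobmanager.execution.failover-strategy: region

# Whether and when restarts happen at all; the values below are illustrative.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```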

Another failover strategy, approximate task-local recovery, is in progress; see https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery

Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: David Anderson
Solution 2: ChangLi