Ansible stops the whole playbook if all hosts in a single play fail
I'm struggling to understand what the intended behavior of Ansible is when all hosts fail in a single play but the playbook contains other plays on other hosts.
For example, consider the following playbook:
```yaml
---
- name: P1
  hosts: a,b
  tasks:
    - name: Assert 1
      ansible.builtin.assert:
        that: 1==2
      when: inventory_hostname != "c"

- name: P2
  hosts: y,z
  tasks:
    - name: Debug 2
      ansible.builtin.debug:
        msg: 'YZ'
```
All 4 hosts a,b,y,z point to localhost for simplicity.
The assert fails and the whole playbook stops. This seems to contradict the documentation (see Error handling), which says that when an error occurs, Ansible stops executing on the failed host but continues on the other hosts.
If I change the condition to `when: inventory_hostname != 'b'`, so that b does not fail, the playbook continues and executes the second play on hosts y,z.
The initial failure does not seem reasonable to me. Hosts y,z have not experienced any errors, so execution on them should not be prevented by errors on other hosts.
Is this a bug, or am I missing something?
Solution 1:[1]
It's not a bug. It's by design (see the quotes from #37309 and from the source in the Notes below). As discussed in the comments on the other answer, deciding whether to terminate the whole playbook when all hosts in a play fail seems to be a trade-off. Either the user has to handle how to proceed to the next play when necessary, or how to stop the whole playbook when necessary. The examples below show that both options require roughly the same amount of error handling in a block.
- Ansible implements the first case: a playbook terminates when all hosts in a play fail. For example,
```yaml
- hosts: host01,host02
  tasks:
    - assert:
        that: false

- hosts: host03
  tasks:
    - debug:
        msg: Hello
```

```text
PLAY [host01,host02] *************************************************************************

TASK [assert] ********************************************************************************
fatal: [host01]: FAILED! => changed=false
  assertion: false
  evaluated_to: false
  msg: Assertion failed
fatal: [host02]: FAILED! => changed=false
  assertion: false
  evaluated_to: false
  msg: Assertion failed

PLAY RECAP ***********************************************************************************
host01 : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
host02 : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
```
- The playbook will proceed to the next play when not all hosts in a play fail. For example,
```yaml
- hosts: host01,host02
  tasks:
    - assert:
        that: false
      when: inventory_hostname == 'host01'

- hosts: host03
  tasks:
    - debug:
        msg: Hello
```

```text
PLAY [host01,host02] *************************************************************************

TASK [assert] ********************************************************************************
fatal: [host01]: FAILED! => changed=false
  assertion: false
  evaluated_to: false
  msg: Assertion failed
skipping: [host02]

PLAY [host03] ********************************************************************************

TASK [debug] *********************************************************************************
ok: [host03] =>
  msg: Hello

PLAY RECAP ***********************************************************************************
host01 : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
host02 : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
host03 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
```
- To proceed to the next play when all hosts in a play fail, a user has to clear the errors in a `rescue` section and, optionally, also end the play for each host. For example,
```yaml
- hosts: host01,host02
  tasks:
    - block:
        - assert:
            that: false
      rescue:
        - meta: clear_host_errors
        - meta: end_host

- hosts: host03
  tasks:
    - debug:
        msg: Hello
```

```text
PLAY [host01,host02] *************************************************************************

TASK [assert] ********************************************************************************
fatal: [host01]: FAILED! => changed=false
  assertion: false
  evaluated_to: false
  msg: Assertion failed
fatal: [host02]: FAILED! => changed=false
  assertion: false
  evaluated_to: false
  msg: Assertion failed

TASK [meta] **********************************************************************************

TASK [meta] **********************************************************************************

TASK [meta] **********************************************************************************

PLAY [host03] ********************************************************************************

TASK [debug] *********************************************************************************
ok: [host03] =>
  msg: Hello

PLAY RECAP ***********************************************************************************
host01 : ok=0 changed=0 unreachable=0 failed=0 skipped=0 rescued=1 ignored=0
host02 : ok=0 changed=0 unreachable=0 failed=0 skipped=0 rescued=1 ignored=0
host03 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
```
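- Update: since this was 'fixed' in 2.12.2, the playbook can no longer be stopped by `meta: end_play`. The example below clears the errors and then uses `meta: end_play` to stop the whole playbook only if all hosts failed. It shows the behavior before that change.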
```yaml
- hosts: host01,host02
  tasks:
    - block:
        - assert:
            that: false
      rescue:
        - meta: clear_host_errors
        - set_fact:
            host_failed: true
        - meta: end_play
          when: ansible_play_hosts_all|map('extract', hostvars, 'host_failed') is all
          run_once: true

- hosts: host03
  tasks:
    - debug:
        msg: Hello
```

```text
PLAY [host01,host02] *************************************************************************

TASK [assert] ********************************************************************************
fatal: [host01]: FAILED! => changed=false
  assertion: false
  evaluated_to: false
  msg: Assertion failed
fatal: [host02]: FAILED! => changed=false
  assertion: false
  evaluated_to: false
  msg: Assertion failed

TASK [meta] **********************************************************************************

TASK [set_fact] ******************************************************************************
ok: [host01]
ok: [host02]

TASK [meta] **********************************************************************************

PLAY RECAP ***********************************************************************************
host01 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=1 ignored=0
host02 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=1 ignored=0
```
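The `when` condition on `meta: end_play` above is true only if every host of the play has set `host_failed`. The Python sketch below mirrors that Jinja expression so its logic is easier to follow. It is an illustration, not Ansible code. It assumes `hostvars` can be modelled as a plain dict of per-host variable dicts. As a simplification chosen here, a host that never set `host_failed` counts as not failed.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for Ansible's `hostvars`: host name -> that host's variables.
HostVars = Dict[str, Dict[str, Any]]


def all_hosts_failed(play_hosts_all: List[str], hostvars: HostVars) -> bool:
    """Mirror of: ansible_play_hosts_all | map('extract', hostvars, 'host_failed') is all

    map('extract', ...) looks up hostvars[host]['host_failed'] for each host,
    and `is all` is true only if every extracted value is truthy.
    Simplification: a host without host_failed counts as False.
    """
    extracted = [hostvars[host].get("host_failed", False) for host in play_hosts_all]
    return all(extracted)


# Both hosts went through `rescue` and set host_failed, so end_play would run.
both_failed = {"host01": {"host_failed": True}, "host02": {"host_failed": True}}
print(all_hosts_failed(["host01", "host02"], both_failed))  # True

# Only host01 failed; host02 never entered `rescue`, so end_play would be skipped.
one_failed = {"host01": {"host_failed": True}, "host02": {}}
print(all_hosts_failed(["host01", "host02"], one_failed))  # False
```

The expression does not depend on the current host, so evaluating it only once under `run_once: true` gives the same answer for the whole play.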
Notes
- `meta: end_host` means 'end the play for this host'. The host still runs in later plays:
```yaml
- hosts: host01
  tasks:
    - meta: end_host

- hosts: host01,host02
  tasks:
    - debug:
        msg: Hello
```

```text
PLAY [host01] ********************************************************************************

TASK [meta] **********************************************************************************

PLAY [host01,host02] *************************************************************************

TASK [debug] *********************************************************************************
ok: [host01] =>
  msg: Hello
ok: [host02] =>
  msg: Hello

PLAY RECAP ***********************************************************************************
host01 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
host02 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
```
- `meta: end_play`, by contrast, ended the whole playbook before the 2.12.2 'fix' mentioned in the Update above. The second play never runs:

```yaml
- hosts: host01
  tasks:
    - meta: end_play

- hosts: host01,host02
  tasks:
    - debug:
        msg: Hello
```

```text
PLAY [host01] ********************************************************************************

TASK [meta] **********************************************************************************

PLAY RECAP ***********************************************************************************
```
- Quoting from #37309
If all hosts in the current play batch (fail) the play ends, this is 'as designed' behavior ... 'play batch' is 'serial size' or all hosts in play if serial is not set.
- Quoting from the source
```python
# check the number of failures here, to see if they're above the maximum
# failure percentage allowed, or if any errors are fatal. If either of those
# conditions are met, we break out, otherwise, we only break out if the entire
# batch failed
failed_hosts_count = len(self._tqm._failed_hosts) + len(self._tqm._unreachable_hosts) - \
    (previously_failed + previously_unreachable)

if len(batch) == failed_hosts_count:
    break_play = True
    break
```
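The two quotes above can be combined into a small model. Hosts are split into batches of `serial` size, or into one batch with every host when `serial` is unset. The play breaks when the number of hosts that newly failed in a batch equals the batch size.

The Python sketch below is a simplified, hypothetical model, not Ansible's code:

- It takes the set of failing hosts as a given input.
- It supports only an integer `serial`.
- It ignores unreachable hosts, as well as `max_fail_percentage` and `any_errors_fatal`, which the quoted comment also alludes to.

```python
from typing import List, Optional, Set, Tuple


def split_batches(hosts: List[str], serial: Optional[int]) -> List[List[str]]:
    """All hosts form one batch when serial is unset; otherwise chunks of `serial` hosts."""
    if not serial:
        return [hosts] if hosts else []
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


def run_play(hosts: List[str], serial: Optional[int],
             failing: Set[str]) -> Tuple[List[List[str]], bool]:
    """Return (batches that ran, break_play).

    `failing` is a hypothetical input: the hosts whose tasks fail in this play.
    """
    failed_so_far: Set[str] = set()  # accumulates across batches
    previously_failed = 0
    ran: List[List[str]] = []
    for batch in split_batches(hosts, serial):
        ran.append(batch)
        failed_so_far |= {h for h in batch if h in failing}
        # Subtract failures from earlier batches to count only this batch's failures.
        failed_hosts_count = len(failed_so_far) - previously_failed
        if len(batch) == failed_hosts_count:
            return ran, True  # the entire batch failed: break out of the play
        previously_failed = len(failed_so_far)
    return ran, False


hosts = ["a", "b", "c", "d"]
# No serial: one batch of 4, only 2 fail, so the play does not break.
print(run_play(hosts, None, {"c", "d"}))  # ([['a', 'b', 'c', 'd']], False)
# serial: 2, the second batch fails entirely, so the play breaks.
print(run_play(hosts, 2, {"c", "d"}))     # ([['a', 'b'], ['c', 'd']], True)
```

The usage shows why the quote speaks of a 'play batch'. With `serial` set, one fully failed batch is enough to break the play, even though other hosts succeeded.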
Solution 2:[2]
A playbook with multiple plays is executed sequentially. It cannot know in advance that you are going to have other hosts in a later play.
Because the assert task in the first play failed on all hosts of the play, it makes sense that the playbook stops there. It won't have any host left to run further tasks of P1 on, and, remember, it doesn't know anything about P2 yet, so it just ends there.
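The sequential behavior described above can be sketched as a loop over plays. The decision to stop uses only the results of the current play, and later plays are never consulted. This is a hypothetical Python model: the `Play` type and its `failing` field are illustrative assumptions, not Ansible objects. It treats each play as a single batch (no `serial`) and ignores unreachable hosts. It replays the question's playbook, with P1 on a,b and P2 on y,z.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Play:
    name: str
    hosts: List[str]
    failing: Set[str] = field(default_factory=set)  # hosts whose tasks fail (hypothetical input)


def run_playbook(plays: List[Play]) -> List[str]:
    """Run plays in order and return the names of the plays that were started."""
    started: List[str] = []
    for play in plays:
        started.append(play.name)
        failed = play.failing & set(play.hosts)
        # Only this play's hosts are considered; nothing looks ahead at later plays.
        # A play with no hosts is simply passed over in this model.
        if play.hosts and failed == set(play.hosts):
            break  # every host in the play failed: the playbook stops here
    return started


# The question's playbook: the assert fails on both a and b.
print(run_playbook([Play("P1", ["a", "b"], {"a", "b"}), Play("P2", ["y", "z"])]))  # ['P1']
# With `when: inventory_hostname != 'b'`, only a fails, so P2 runs.
print(run_playbook([Play("P1", ["a", "b"], {"a"}), Play("P2", ["y", "z"])]))  # ['P1', 'P2']
```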
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
