Error enabling msgr2 messenger in Ceph during Ansible playbook execution #11

Open
reinaldosaraiva opened this issue Jul 11, 2024 · 17 comments

Comments

@reinaldosaraiva

Description: When running the Ansible playbook deploy.yaml from the incus-deploy project, an error occurs while attempting to enable the msgr2 messenger in Ceph. The ceph mon enable-msgr2 command fails with a timeout, indicating that it could not connect to the RADOS cluster.

Error Message:
fatal: [server01]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.070563", "end": "2024-07-11 13:43:18.284315", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.213752", "stderr": "2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server03]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.109144", "end": "2024-07-11 13:43:18.320621", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.211477", "stderr": "2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server02]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.093801", "end": "2024-07-11 13:43:18.316757", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.222956", "stderr": "2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}

Steps to Reproduce:

Execute the Ansible playbook deploy.yaml in the directory ~/incus-deploy/ansible.
Observe the error during the task to enable the msgr2 messenger in Ceph.

Expected Behavior:

The ceph mon enable-msgr2 command should execute without errors, enabling the msgr2 messenger in the Ceph cluster.

Actual Behavior:

The ceph mon enable-msgr2 command fails with a timeout, indicating it could not connect to the RADOS cluster.

Additional Details:

The error occurs on multiple servers (server01, server02, server03).
Specific error message: RADOS timed out (error connecting to the cluster).
The playbook was executed as root.

Environment:

Ansible version: 2.17.1
Ubuntu: 22.04


Execute:
root@haruunkal:~/incus-deploy/terraform# cd ../ansible/
root@haruunkal:~/incus-deploy/ansible# ansible-playbook deploy.yaml

PLAY [Ceph - Generate cluster keys and maps] ********************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
[WARNING]: Platform linux on host server03 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server03]
[WARNING]: Platform linux on host server04 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server04]
[WARNING]: Platform linux on host server02 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server02]
[WARNING]: Platform linux on host server05 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server05]
[WARNING]: Platform linux on host server01 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server01]

TASK [Generate mon keyring] *************************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

TASK [Generate client.admin keyring] ****************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

TASK [Generate bootstrap-osd keyring] ***************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

TASK [Generate mon map] *****************************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

RUNNING HANDLER [Add key to client.admin keyring] ***************************************************************************************
changed: [server03 -> 127.0.0.1]

RUNNING HANDLER [Add key to bootstrap-osd keyring] **************************************************************************************
changed: [server03 -> 127.0.0.1]

RUNNING HANDLER [Add nodes to mon map] **************************************************************************************************
changed: [server03 -> 127.0.0.1] => (item={'name': 'server01', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe2d:4c57'})
changed: [server03 -> 127.0.0.1] => (item={'name': 'server02', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe05:31f6'})
changed: [server03 -> 127.0.0.1] => (item={'name': 'server03', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe01:1c21'})

PLAY [Ceph - Add package repository] ****************************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
ok: [server04]
ok: [server05]
ok: [server03]
ok: [server01]
ok: [server02]

TASK [Create apt keyring path] **********************************************************************************************************
ok: [server03]
ok: [server01]
ok: [server05]
ok: [server04]
ok: [server02]

TASK [Add ceph GPG key] *****************************************************************************************************************
changed: [server04]
changed: [server03]
changed: [server05]
changed: [server01]
changed: [server02]

TASK [Get DPKG architecture] ************************************************************************************************************
ok: [server04]
ok: [server03]
ok: [server05]
ok: [server01]
ok: [server02]

TASK [Add ceph package sources] *********************************************************************************************************
changed: [server03]
changed: [server05]
changed: [server04]
changed: [server02]
changed: [server01]

RUNNING HANDLER [Update apt] ************************************************************************************************************
changed: [server01]
changed: [server04]
changed: [server05]
changed: [server03]
changed: [server02]

PLAY [Ceph - Install packages] **********************************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
ok: [server01]
ok: [server04]
ok: [server05]
ok: [server03]
ok: [server02]

TASK [Install ceph-common] **************************************************************************************************************
changed: [server02]
changed: [server03]
changed: [server05]
changed: [server04]
changed: [server01]

TASK [Install ceph-mon] *****************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server03]
changed: [server01]
changed: [server02]

TASK [Install ceph-mgr] *****************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server03]
changed: [server02]
changed: [server01]

TASK [Install ceph-mds] *****************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server01]
changed: [server02]
changed: [server03]

TASK [Install ceph-osd] *****************************************************************************************************************
changed: [server01]
changed: [server04]
changed: [server03]
changed: [server02]
changed: [server05]

TASK [Install ceph-rbd-mirror] **********************************************************************************************************
skipping: [server01]
skipping: [server02]
skipping: [server04]
skipping: [server05]
skipping: [server03]

TASK [Install radosgw] ******************************************************************************************************************
skipping: [server01]
skipping: [server02]
skipping: [server03]
changed: [server04]
changed: [server05]

PLAY [Ceph - Set up config and keyrings] ************************************************************************************************

TASK [Transfer the cluster configuration] ***********************************************************************************************
changed: [server01]
changed: [server04]
changed: [server03]
changed: [server05]
changed: [server02]

TASK [Create main storage directory] ****************************************************************************************************
ok: [server04]
ok: [server01]
ok: [server03]
ok: [server05]
ok: [server02]

TASK [Create monitor bootstrap path] ****************************************************************************************************
skipping: [server05]
skipping: [server04]
changed: [server01]
changed: [server03]
changed: [server02]

TASK [Create OSD bootstrap path] ********************************************************************************************************
changed: [server05]
changed: [server04]
changed: [server01]
changed: [server03]
changed: [server02]

TASK [Transfer main admin keyring] ******************************************************************************************************
changed: [server05]
changed: [server03]
changed: [server01]
changed: [server02]
changed: [server04]

TASK [Transfer additional client keyrings] **********************************************************************************************
skipping: [server05]
skipping: [server03]
skipping: [server04]
skipping: [server01]
skipping: [server02]

TASK [Transfer bootstrap mon keyring] ***************************************************************************************************
skipping: [server05]
skipping: [server04]
changed: [server03]
changed: [server02]
changed: [server01]

TASK [Transfer bootstrap mon map] *******************************************************************************************************
skipping: [server05]
skipping: [server04]
changed: [server03]
changed: [server02]
changed: [server01]

TASK [Transfer bootstrap OSD keyring] ***************************************************************************************************
changed: [server05]
changed: [server04]
changed: [server01]
changed: [server03]
changed: [server02]

RUNNING HANDLER [Restart Ceph] **********************************************************************************************************
changed: [server05]
changed: [server03]
changed: [server02]
changed: [server04]
changed: [server01]

PLAY [Ceph - Deploy mon] ****************************************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
ok: [server01]
ok: [server02]
ok: [server05]
ok: [server04]
ok: [server03]

TASK [Bootstrap Ceph mon] ***************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server02]
changed: [server03]
changed: [server01]

TASK [Enable and start Ceph mon] ********************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server02]
changed: [server03]
changed: [server01]

RUNNING HANDLER [Enable msgr2] **********************************************************************************************************
fatal: [server01]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.070563", "end": "2024-07-11 13:43:18.284315", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.213752", "stderr": "2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server03]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.109144", "end": "2024-07-11 13:43:18.320621", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.211477", "stderr": "2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server02]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.093801", "end": "2024-07-11 13:43:18.316757", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.222956", "stderr": "2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}

PLAY RECAP ******************************************************************************************************************************
server01 : ok=29 changed=18 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server02 : ok=29 changed=18 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server03 : ok=32 changed=25 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server04 : ok=22 changed=11 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0
server05 : ok=22 changed=11 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0

@stgraber
Member

That would happen if the Ceph cluster isn't functional.

This most commonly happens if you have fully redone your deployment without also wiping the data from the ansible/data directory.

In this scenario you end up with a freshly deployed cluster that is still expecting the servers from the previous deployment and so cannot achieve a quorum. The Ceph API then fails to come online, which results in the configuration failure you're getting.
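A quick way to rule this out before redeploying is to look for leftover cluster files. The following is only a minimal sketch: it assumes the generated state lives under ansible/data/ceph in files whose names start with cluster. (as the paths quoted elsewhere in this thread suggest), and the function name stale_ceph_files is hypothetical, not part of incus-deploy.

```shell
# List leftover Ceph cluster files from a previous deployment.
# Assumption: generated state is stored as <data_dir>/cluster.<FSID>.* files.
stale_ceph_files() {
    data_dir="${1:-ansible/data/ceph}"
    # Print nothing (and succeed) when the directory does not exist.
    [ -d "$data_dir" ] || return 0
    find "$data_dir" -maxdepth 1 -type f -name 'cluster.*' | sort
}

# Usage: warn before redeploying on top of old state.
if [ -n "$(stale_ceph_files ansible/data/ceph)" ]; then
    echo "Stale Ceph data found; remove it before redeploying:" >&2
    stale_ceph_files ansible/data/ceph >&2
fi
```

If the function prints anything, those files belong to an earlier deployment and should be removed before running the playbook again.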

@reinaldosaraiva
Author

Ceph monitor initialization issue: monmap min_mon_release older than installed version
ERROR:
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc systemd[1]: Created slice Slice /system/ceph-mon.
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc systemd[1]: Reached target System Time Synchronized.
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc systemd[1]: Started Ceph cluster monitor daemon.
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc ceph-mon[6467]: 2024-07-11T16:51:06.738+0000 7f0cf2c8cc40 -1 mon.server01@-1(probing) e0 current monmap has recorded min_mon_release 15 (octopus) is more than two releases older than installed 18 (reef); you can only upgrade 2 releases at a time
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc ceph-mon[6467]: you should first upgrade to 16 (pacific) or 17 (quincy)
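The arithmetic behind this message is simple: the monmap records release 15 and the installed release is 18, a gap of 3, while the monitor accepts a gap of at most 2. A minimal sketch of that check, using a hypothetical helper name (mon_release_gap_ok) and taking the "2 releases at a time" rule from the log line itself:

```shell
# Check whether a monitor can start given the monmap's min_mon_release
# and the installed Ceph major release (e.g. 15 = octopus, 18 = reef).
# The rule, taken from the log message: at most 2 releases per upgrade.
mon_release_gap_ok() {
    monmap_release="$1"
    installed_release="$2"
    gap=$((installed_release - monmap_release))
    [ "$gap" -le 2 ]
}

if mon_release_gap_ok 15 18; then
    echo "ok: monitor can start"
else
    echo "blocked: gap of $((18 - 15)) releases exceeds the limit of 2"
fi
```

With the values from the log (15 and 18) the check fails, which matches the monitor refusing to start.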

@stgraber
Member

Can you show monmaptool --show ansible/data/ceph/cluster.FSID.mon.map?

Normally the logic in the playbook is to set the min-mon-release in the mon map to the same release as ceph_release (reef by default).

@reinaldosaraiva
Author

That would happen if the Ceph cluster isn't functional.

This most commonly happens if you have fully redone your deployment without also wiping the data from the ansible/data directory.

In this scenario you end up with a freshly deployed cluster that is still expecting the servers from the previous deployment and so cannot achieve a quorum. The Ceph API then fails to come online, which results in the configuration failure you're getting.

I have already cleaned the data/ceph/ folder and others. I also used both Quincy and Reef versions. I am lost in this deployment.

@stgraber
Member

Also the output of git rev-parse HEAD would be useful

@reinaldosaraiva
Author

git rev-parse HEAD

root@haruunkal:~/incus-deploy# git rev-parse HEAD
f207054

@stgraber
Member

Okay, so it shouldn't be because of lack of support for calling monmaptool with the needed set-min-mon-release, but then it's pretty confusing as to why it would have set a release of 15 when it should have been passed 18.

The output of monmaptool --show ansible/data/ceph/cluster.FSID.mon.map may help figure it out

@reinaldosaraiva
Author

Thank you very much for your support. It seems that there was an issue with my lab workstation that was resolved only when I disabled the IPv6 network. After that, the entire process ran perfectly.

@reinaldosaraiva
Author

Okay, so it shouldn't be because of lack of support for calling monmaptool with the needed set-min-mon-release, but then it's pretty confusing as to why it would have set a release of 15 when it should have been passed 18.

The output of monmaptool --show ansible/data/ceph/cluster.FSID.mon.map may help figure it out

root@haruunkal:~/incus-deploy# monmaptool --print ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map
monmaptool: monmap file ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map
epoch 0
fsid e2850e1f-7aab-472e-b6b1-824e19a75071
last_changed 2024-07-11T15:15:56.636758-0300
created 2024-07-11T15:15:56.636758-0300
min_mon_release 15 (octopus)
election_strategy: 1
0: v1:10.177.121.10:6789/0 mon.server03
1: v1:10.177.121.13:6789/0 mon.server01
2: v1:10.177.121.242:6789/0 mon.server02
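The field that matters in this output is min_mon_release. As a small sketch for checking it from a script, the hypothetical helper below reads monmaptool --print output on stdin and prints the numeric release; it assumes the "min_mon_release <number> (<name>)" line format shown in this output.

```shell
# Extract the numeric min_mon_release from `monmaptool --print` output on stdin.
monmap_min_release() {
    awk '$1 == "min_mon_release" { print $2; exit }'
}

# Sample input copied from the output in this comment; with a real map you
# would pipe `monmaptool --print <map file>` into the function instead.
printf '%s\n' 'epoch 0' 'min_mon_release 15 (octopus)' 'election_strategy: 1' \
    | monmap_min_release
```

For the map shown here this prints 15, which is the value the monitors later reject.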

@reinaldosaraiva
Author

Haha. Another error:
TASK [Install the Incus package] ***********************************************************************************************************************************************************************************************************
task path: /root/incus-deploy/ansible/books/incus.yaml:60
ESTABLISH Incus CONNECTION FOR USER: root
EXEC /bin/sh -c 'echo ~root && sleep 0'
ESTABLISH Incus CONNECTION FOR USER: root
EXEC /bin/sh -c 'echo ~root && sleep 0'
ESTABLISH Incus CONNECTION FOR USER: root
EXEC /bin/sh -c 'echo ~root && sleep 0'
ESTABLISH Incus CONNECTION FOR USER: root
EXEC /bin/sh -c 'echo ~root && sleep 0'
ESTABLISH Incus CONNECTION FOR USER: root
EXEC /bin/sh -c 'echo ~root && sleep 0'
EXEC /bin/sh -c '( umask 77 && mkdir -p "echo /root/.ansible/tmp"&& mkdir "echo /root/.ansible/tmp/ansible-tmp-1720722234.6681595-434073-171543719892459" && echo ansible-tmp-1720722234.6681595-434073-171543719892459="echo /root/.ansible/tmp/ansible-tmp-1720722234.6681595-434073-171543719892459" ) && sleep 0'
EXEC /bin/sh -c '( umask 77 && mkdir -p "echo /root/.ansible/tmp"&& mkdir "echo /root/.ansible/tmp/ansible-tmp-1720722234.6905096-434074-237109845491203" && echo ansible-tmp-1720722234.6905096-434074-237109845491203="echo /root/.ansible/tmp/ansible-tmp-1720722234.6905096-434074-237109845491203" ) && sleep 0'
EXEC /bin/sh -c '( umask 77 && mkdir -p "echo /root/.ansible/tmp"&& mkdir "echo /root/.ansible/tmp/ansible-tmp-1720722234.7017157-434080-221203579380132" && echo ansible-tmp-1720722234.7017157-434080-221203579380132="echo /root/.ansible/tmp/ansible-tmp-1720722234.7017157-434080-221203579380132" ) && sleep 0'
EXEC /bin/sh -c '( umask 77 && mkdir -p "echo /root/.ansible/tmp"&& mkdir "echo /root/.ansible/tmp/ansible-tmp-1720722234.704671-434088-195068848106763" && echo ansible-tmp-1720722234.704671-434088-195068848106763="echo /root/.ansible/tmp/ansible-tmp-1720722234.704671-434088-195068848106763" ) && sleep 0'
EXEC /bin/sh -c '( umask 77 && mkdir -p "echo /root/.ansible/tmp"&& mkdir "echo /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624" && echo ansible-tmp-1720722234.7366605-434103-88856948673624="echo /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624" ) && sleep 0'
Using module file /usr/local/lib/python3.10/dist-packages/ansible/modules/apt.py
PUT /root/.ansible/tmp/ansible-local-401506wyvvy13h/tmpfw9gbsu5 TO /root/.ansible/tmp/ansible-tmp-1720722234.6681595-434073-171543719892459/AnsiballZ_apt.py
Using module file /usr/local/lib/python3.10/dist-packages/ansible/modules/apt.py
PUT /root/.ansible/tmp/ansible-local-401506wyvvy13h/tmpo15mkolf TO /root/.ansible/tmp/ansible-tmp-1720722234.6905096-434074-237109845491203/AnsiballZ_apt.py
EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1720722234.6681595-434073-171543719892459/ /root/.ansible/tmp/ansible-tmp-1720722234.6681595-434073-171543719892459/AnsiballZ_apt.py && sleep 0'
Using module file /usr/local/lib/python3.10/dist-packages/ansible/modules/apt.py
PUT /root/.ansible/tmp/ansible-local-401506wyvvy13h/tmp0yqjokw7 TO /root/.ansible/tmp/ansible-tmp-1720722234.704671-434088-195068848106763/AnsiballZ_apt.py
Using module file /usr/local/lib/python3.10/dist-packages/ansible/modules/apt.py
PUT /root/.ansible/tmp/ansible-local-401506wyvvy13h/tmpc2_lbb63 TO /root/.ansible/tmp/ansible-tmp-1720722234.7017157-434080-221203579380132/AnsiballZ_apt.py
EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1720722234.6905096-434074-237109845491203/ /root/.ansible/tmp/ansible-tmp-1720722234.6905096-434074-237109845491203/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1720722234.704671-434088-195068848106763/ /root/.ansible/tmp/ansible-tmp-1720722234.704671-434088-195068848106763/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1720722234.7017157-434080-221203579380132/ /root/.ansible/tmp/ansible-tmp-1720722234.7017157-434080-221203579380132/AnsiballZ_apt.py && sleep 0'
Using module file /usr/local/lib/python3.10/dist-packages/ansible/modules/apt.py
PUT /root/.ansible/tmp/ansible-local-401506wyvvy13h/tmpe4qdstqu TO /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624/AnsiballZ_apt.py
EXEC /bin/sh -c '/usr/bin/python3.10 /root/.ansible/tmp/ansible-tmp-1720722234.6681595-434073-171543719892459/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624/ /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c '/usr/bin/python3.10 /root/.ansible/tmp/ansible-tmp-1720722234.6905096-434074-237109845491203/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c '/usr/bin/python3.10 /root/.ansible/tmp/ansible-tmp-1720722234.704671-434088-195068848106763/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c '/usr/bin/python3.10 /root/.ansible/tmp/ansible-tmp-1720722234.7017157-434080-221203579380132/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c '/usr/bin/python3.10 /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624/AnsiballZ_apt.py && sleep 0'
EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624/ > /dev/null 2>&1 && sleep 0'
EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1720722234.6681595-434073-171543719892459/ > /dev/null 2>&1 && sleep 0'
EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1720722234.6905096-434074-237109845491203/ > /dev/null 2>&1 && sleep 0'
EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1720722234.7017157-434080-221203579380132/ > /dev/null 2>&1 && sleep 0'
EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1720722234.704671-434088-195068848106763/ > /dev/null 2>&1 && sleep 0'
[WARNING]: Error deleting remote temporary files (rc: 255, stderr: Error: dial unix /run/incus/dev-incus-deploy_server02/qemu.monitor: connect: connection refused })
EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1720722234.7366605-434103-88856948673624/ > /dev/null 2>&1 && sleep 0'
fatal: [server03]: FAILED! => {
"changed": false,
"module_stderr": "Error: websocket: close 1006 (abnormal closure): unexpected EOF\n",
"module_stdout": "",
"msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1
}
fatal: [server05]: FAILED! => {
"changed": false,
"module_stderr": "Error: websocket: close 1006 (abnormal closure): unexpected EOF\n",
"module_stdout": "",
"msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1
}
fatal: [server04]: FAILED! => {
"changed": false,
"module_stderr": "Error: websocket: close 1006 (abnormal closure): unexpected EOF\n",
"module_stdout": "",
"msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1
}
fatal: [server01]: FAILED! => {
"changed": false,
"module_stderr": "Error: websocket: close 1006 (abnormal closure): unexpected EOF\n",
"module_stdout": "",
"msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1
}
fatal: [server02]: FAILED! => {
"changed": false,
"module_stderr": "Error: websocket: close 1006 (abnormal closure): unexpected EOF\n",
"module_stdout": "",
"msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1
}

@stgraber
Member

Yeah, so the min_mon_release 15 (octopus) is obviously going to be a problem but I don't get why it would be set to that when we specifically call monmaptool with the argument to set it to 18...

Maybe that older version of monmaptool doesn't know how to handle that properly?

You could add the Ceph repository to your own machine and then update to a new version of monmaptool, that would certainly fix that issue, it just shouldn't be necessary...

@Sensei-CHO

Having the exact same issue here

@wdavidw

wdavidw commented Nov 18, 2024

Same here as well

@wdavidw

wdavidw commented Nov 18, 2024

OK, I got the full installation working. Here are my logs.

I had the same error: --set-min-mon-release resulted in octopus instead of the requested release, even though the argument was provided.

monmaptool -v
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

monmaptool --create --set-min-mon-release 18 --fsid e2850e1f-7aab-472e-b6b1-824e19a75071 data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map --clobber
monmaptool: monmap file data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map
setting min_mon_release = octopus
monmaptool: set fsid to e2850e1f-7aab-472e-b6b1-824e19a75071
monmaptool: writing epoch 0 to data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map (0 monitors)

On Ubuntu, I used cephadm to update the ceph-base and ceph-common packages on my local host.

CEPH_RELEASE=18.2.0
curl --silent --remote-name --location https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
chmod +x cephadm
sudo mv cephadm  /usr/local/bin/
sudo cephadm add-repo --release reef
sudo apt update
sudo apt upgrade -y

Once updated to version 18, the --set-min-mon-release argument is applied.

monmaptool -v
ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)

monmaptool --create --set-min-mon-release reef --fsid e2850e1f-7aab-472e-b6b1-824e19a75071 ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map --clobber
monmaptool: monmap file ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map
setting min_mon_release = reef
monmaptool: set fsid to e2850e1f-7aab-472e-b6b1-824e19a75071
monmaptool: writing epoch 0 to ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map (0 monitors)
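These two runs suggest a pre-flight check on the machine running Ansible: compare the local tool's major version with the target release before generating the map. The sketch below is an assumption-laden illustration, not part of incus-deploy. It assumes the "ceph version X.Y.Z ..." format printed by monmaptool -v above and a rule of thumb that the local major version should be at least the target release; both function names are hypothetical. Note also that the second run passed reef rather than 18, so the two invocations are not strictly identical.

```shell
# Print the major version from `monmaptool -v` style output on stdin,
# e.g. "ceph version 17.2.7 (...) quincy (stable)" gives 17.
ceph_major_version() {
    awk '$1 == "ceph" && $2 == "version" { split($3, v, "."); print v[1]; exit }'
}

# Succeed when the version on stdin is at least the target major release.
# Assumption: an older local tool may not know the newer release.
local_tool_new_enough() {
    target="$1"
    major="$(ceph_major_version)"
    [ -n "$major" ] && [ "$major" -ge "$target" ]
}

# Sample input copied from the 17.2.7 run in this comment.
echo 'ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)' \
    | local_tool_new_enough 18 || echo "local monmaptool is older than release 18"
```

In real use the input would come from monmaptool -v on the control machine rather than a pasted string.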

Finally, the installation is reset and re-applied.

rm -rf terraform/.terraform terraform/.terraform.lock.hcl terraform/terraform.tfstate terraform/terraform.tfstate.backup
rm -rf ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.*
cd terraform
tofu init
tofu apply
cd ../ansible
ansible-playbook deploy.yaml

It succeeded.

PLAY RECAP ***********************************************************************************************************************************************************************************
server01                   : ok=67   changed=0    unreachable=0    failed=0    skipped=40   rescued=0    ignored=0   
server02                   : ok=67   changed=0    unreachable=0    failed=0    skipped=40   rescued=0    ignored=0   
server03                   : ok=68   changed=0    unreachable=0    failed=0    skipped=41   rescued=0    ignored=0   
server04                   : ok=56   changed=0    unreachable=0    failed=0    skipped=51   rescued=0    ignored=0   
server05                   : ok=56   changed=0    unreachable=0    failed=0    skipped=51   rescued=0    ignored=0 

Note: I had to run ansible-playbook deploy.yaml a second time, because one of the nodes didn't complete the "Add storage pools" task.

TASK [Add storage pools]
changed: [server01] => (item={'key': 'local', 'value': {'driver': 'zfs', 'local_config': {'source': '/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_incus_disk3'}, 'description': 'Local storage pool'}})
changed: [server01] => (item={'key': 'remote', 'value': {'driver': 'ceph', 'local_config': {'source': 'incus_baremetal'}, 'description': 'Distributed storage pool (cluster-wide)'}})
failed: [server01] (item={'key': 'shared', 'value': {'driver': 'lvmcluster', 'local_config': {'lvm.vg.name': 'vg0', 'source': 'vg0'}, 'default': True, 'description': 'Shared storage pool (cluster-wide)'}}) => {"ansible_loop_var": "item", "changed": true, "cmd": "incus storage create shared lvmcluster lvm.vg.name=vg0 source=vg0", "delta": "0:00:00.011775", "end": "2024-11-18 20:35:58.613317", "item": {"key": "shared", "value": {"default": true, "description": "Shared storage pool (cluster-wide)", "driver": "lvmcluster", "local_config": {"lvm.vg.name": "vg0", "source": "vg0"}}}, "msg": "non-zero return code", "rc": 1, "start": "2024-11-18 20:35:58.601542", "stderr": "Error: Invalid option \"lvm.vg.name\"", "stderr_lines": ["Error: Invalid option \"lvm.vg.name\""], "stdout": "", "stdout_lines": []}

@stgraber
Member

Yeah, we need to reshuffle things a bit so that the monmap is generated on the target servers and pulled back onto the source. You'd think that monmaptool, having the argument, would either work or at least give an error, but it doesn't...
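Until that happens, the silent fallback can at least be caught by checking what the tool reports. A minimal sketch, assuming monmaptool prints a line of the form "setting min_mon_release = <name>" as in the runs quoted earlier in this thread; the helper name monmap_release_applied is hypothetical.

```shell
# Succeed only if monmaptool's output (on stdin) reports the requested release name.
# Assumption: the tool prints "setting min_mon_release = <name>".
monmap_release_applied() {
    wanted="$1"
    applied="$(sed -n 's/^setting min_mon_release = //p' | head -n 1)"
    [ "$applied" = "$wanted" ]
}

# Sample: the 17.2.7 run in this thread asked for release 18 (reef) but got octopus.
if ! printf '%s\n' 'setting min_mon_release = octopus' | monmap_release_applied reef; then
    echo "monmaptool did not apply the requested release" >&2
fi
```

A playbook task could pipe the real monmaptool --create output through a check like this and fail early instead of timing out later at the msgr2 step.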

@mttjohnson
Contributor

One of the things I worked on in my fork (https://github.com/mttjohnson/incus-deploy/tree/fixes-to-run-for-me) was getting the Ceph commands to run from the target host, because I'm initiating the incus-deploy actions from a Mac and couldn't get the Ceph tools installed on it. The fixes-to-run-for-me branch on my fork worked for me to spin up incus-deploy a few months back. Since then I've been working on implementing the same kind of thing on a small bare-metal cluster to learn more, so I haven't gotten back to the fork to update it or to start a discussion about which of my changes would be useful to contribute.

@stgraber
Member

Ah, it'd be great if you could extract that logic and send it as a PR!
We definitely want to move away from needing the Ceph tools installed on the machine running Ansible!
