
[cinder-csi-plugin] [Bug] Failed to GetOpenStackProvider i/o timeout #1874

Closed
modzilla99 opened this issue May 19, 2022 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@modzilla99

modzilla99 commented May 19, 2022

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
When the CSI plugin starts, it is not able to communicate with Keystone and gets stuck in an I/O timeout.

What you expected to happen:
The plugin should talk to the API and start.

How to reproduce it:
I am running a Kubernetes 1.24.0 cluster with cinder-csi-plugin 1.23.0 and CoreDNS 1.9.2.

Anything else we need to know?:
A tcpdump suggests that the pod tries to resolve the wrong URL: it tries to connect to ${URL}.kube-system.svc.cluster.local. The same version of the CSI driver works on Kubernetes 1.23 with CoreDNS 1.8.7.
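
For reference, that appended suffix matches the DNS search list kubelet writes into the pod's /etc/resolv.conf (with the default ndots:5, names with fewer than five dots are first tried with the cluster suffixes). A quick way to see the search list a kube-system pod gets (throwaway pod; the busybox image is just an example):

kubectl -n kube-system run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 -- cat /etc/resolv.conf
# typically shows something like:
#   search kube-system.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5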

Environment:

  • openstack-cloud-controller-manager (or other related binary) version: 1.23
  • OpenStack version: Victoria
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 19, 2022
@jichenjc
Contributor

The current info seems too generic

A tcpdump suggests that the pod tries to resolve the wrong URL: it tries to connect to ${URL}.kube-system.svc.cluster.local.

Looks like it's not the CPO CSI function itself that is failing; the pod is not able to connect to Keystone.
Some possible errors:

  1. the pod can't connect to the service because the network is not reachable
  2. as you described, it does not resolve the right DNS name (no detailed info yet, though)

So I think more info, like the real error you saw and the logs of the CSI pods, would be helpful.
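
For example (the pod name below is a placeholder; cinder-csi-plugin is the container name used in the controller plugin pod):

kubectl -n kube-system get pods | grep csi-cinder
kubectl -n kube-system logs <csi-cinder-controllerplugin-pod> -c cinder-csi-plugin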

@cheetahfox

I seem to be having a similar problem. So let me hopefully provide enough information to get somewhere with this.

Cluster info: Kubernetes 1.24.1, CoreDNS 1.8.6, csi-cinder-plugin 1.22.0 (and I tested with 1.24.2).

Cloud config for csi-cinder-plugin.

[Global]
auth-url="http://keystone.m6me.cheetahfox.com:80/v3"
username="k8s"
password="*********************"
region="RegionOne"
tenant-id="7d5e3725250c434cb935a43dc34865d9"
tenant-name="k8s"
domain-name="Default"
os-endpoint-type="internalURL"

[BlockStorage]
bs-version=v3
ignore-volume-az=False

Logs from the csi-cinder-controllerplugin pod, container cinder-csi-plugin:

I0626 04:40:07.793361       1 driver.go:74] Driver: cinder.csi.openstack.org
I0626 04:40:07.793489       1 driver.go:75] Driver version: [email protected]
I0626 04:40:07.793496       1 driver.go:76] CSI Spec version: 1.3.0
I0626 04:40:07.793530       1 driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0626 04:40:07.793538       1 driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0626 04:40:07.793544       1 driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0626 04:40:07.793549       1 driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0626 04:40:07.793554       1 driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0626 04:40:07.793564       1 driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0626 04:40:07.793569       1 driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0626 04:40:07.793574       1 driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0626 04:40:07.793578       1 driver.go:106] Enabling controller service capability: GET_VOLUME
I0626 04:40:07.793583       1 driver.go:118] Enabling volume access mode: SINGLE_NODE_WRITER
I0626 04:40:07.793589       1 driver.go:128] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0626 04:40:07.793595       1 driver.go:128] Enabling node service capability: EXPAND_VOLUME
I0626 04:40:07.793599       1 driver.go:128] Enabling node service capability: GET_VOLUME_STATS
I0626 04:40:07.794109       1 openstack.go:90] Block storage opts: {0 false false}
W0626 04:40:37.796236       1 main.go:108] Failed to GetOpenStackProvider: Post "http://keystone.m6me.cheetahfox.com:80/v3/auth/tokens": dial tcp: i/o timeout

When looking at the network traffic from the cinder-csi-plugin, I see only DNS requests for A and AAAA records for this name:

keystone.m6me.cheetahfox.com.kube-system.svc.cluster.local

So I see the same strange thing that was reported above: the container seems to be trying to resolve the address with ".kube-system.svc.cluster.local" appended to the valid auth URL.

The Keystone API is at that URL. I don't think this is a networking issue, since I can access the API from other pods in the cluster (it's hard to check from the container itself since it doesn't really have any tools and it restarts after about 20 seconds).
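
One way around the missing tools is an ephemeral debug container that shares the pod's network and DNS configuration (needs kubectl debug / ephemeral containers, available on 1.24; the busybox image is just an example):

kubectl -n kube-system debug -it csi-cinder-controllerplugin-6549b5d56-tgsfx \
  --image=busybox:1.36 -- sh
# then, inside the debug shell:
nslookup keystone.m6me.cheetahfox.com
wget -qO- http://keystone.m6me.cheetahfox.com/v3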

This configuration was working just fine on Kubernetes 1.22. I upgraded the cluster to 1.23.7 and then 1.24.1, and everything worked fine for about a week. Then, for unrelated reasons, I needed to restart the VMs in this cluster. After the restart I noticed this container wasn't ready, and all of my pods with Cinder-provided PVCs stopped working.

The other containers in the pod just have logs like the following, with "Still connecting" repeating every ten seconds.

josh@Cheetah:~/network-automation/services/openstack-deployment$ kubectl logs --namespace=kube-system csi-cinder-controllerplugin-6549b5d56-tgsfx csi-provisioner
I0626 20:34:49.018523       1 csi-provisioner.go:138] Version: v3.0.0
I0626 20:34:49.018642       1 csi-provisioner.go:161] Building kube configs for running in cluster...
W0626 20:34:59.021171       1 connection.go:173] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
W0626 20:35:09.021641       1 connection.go:173] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

I was looking at the driver code and I don't see how it could end up with a different URL in the driver itself. Could it be something from gophercloud? Tracing things back up the stack, it seems like that might be where this is happening, but I am not sure...

I also tried setting os-endpoint-type to "internalURL", since about the only thing I could figure was that gophercloud was changing something about the URL because of the endpoint type. This had no effect. I also tried removing the :80 from the URL, because why not; also no effect. I am going to try downgrading my cluster and hope this starts working again with Kubernetes 1.23.7.

@jichenjc
Contributor

From the context it looks like the URL is treated as a short service name and the local cluster domain is appended, so it's likely not a CSI issue but rather a Kubernetes or DNS server setting. For example, https://en.wikipedia.org/wiki/Fully_qualified_domain_name tells us that a name which is not an FQDN will have a domain name appended, and that seems to be what is happening here. So the workaround might be to use the IP of keystone.m6me.cheetahfox.com in your configuration, or to set up DNS correctly so that the service domain is not appended (how to do that, I don't know; still digging).
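
If the appended suffix really is the cluster DNS search list kicking in, another generic knob (standard pod DNS config, not specific to this plugin) is to lower ndots for the controller plugin pod so that external names are tried as-is first. A rough sketch, assuming the plugin runs as a Deployment named csi-cinder-controllerplugin in kube-system:

kubectl -n kube-system patch deployment csi-cinder-controllerplugin --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"2"}]}}]'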

@jfpucheu

jfpucheu commented Sep 6, 2022

Hello, I have exactly the same issue. Did you ever find a solution?

Thanks

Jeff

@jichenjc
Contributor

jichenjc commented Sep 6, 2022

Are you able to connect to the OpenStack endpoint from your local environment, e.g. resolve the DNS name from the cloud.conf you used?

@jfpucheu

jfpucheu commented Sep 6, 2022

Yes, from the node it works. I don't understand why the cinder-csi-plugin can't...

I0906 13:34:01.026232 1 openstack.go:89] Block storage opts: {0 false false}
W0906 13:34:31.026950 1 main.go:100] Failed to GetOpenStackProvider: Post "https://iam.eu-west-0.mycloudprovider.com/v3/auth/tokens": dial tcp: i/o timeout

The openstack-cloud-controller-manager-9kkt pod has no issue reaching the same endpoint...

@jichenjc
Contributor

jichenjc commented Sep 6, 2022

The original issue was: "So I see the same strange thing that was reported above: the container seems to be trying to resolve the address with ".kube-system.svc.cluster.local" appended to the valid auth URL."

That appended suffix is incorrect. I suggest you try using the IP instead of the hostname of the OpenStack service and check whether you see the same pattern. Basically, I think it's related to the DNS setup, but I am not sure why OCCM works while Cinder CSI does not.
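
i.e. in cloud.conf keep the same scheme, port, and path and only swap the hostname for the IP (192.0.2.10 below is just a placeholder for your Keystone address):

[Global]
auth-url="http://192.0.2.10/v3"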

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sqaisar

sqaisar commented Apr 28, 2024

[INFO] 192.168.145.12:44558 - 55791 "A IN <my openstack url>.kube-system.svc.cluster.local. udp 70 false 512" NXDOMAIN qr,aa,rd 163 0.000231634s
[INFO] 192.168.145.12:47886 - 48222 "AAAA IN <my openstack url>.kube-system.svc.cluster.local. udp 70 false 512" NXDOMAIN qr,aa,rd 163 0.000184359s
[INFO] 192.168.145.12:44347 - 13713 "AAAA IN <my openstack url>.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000144048s
[INFO] 192.168.145.12:43551 - 9405 "A IN openstack.im.pype.tech.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000242885s
[INFO] 192.168.145.12:45532 - 12467 "A IN <my openstack url>.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000225486s
[INFO] 192.168.145.12:46515 - 14766 "A IN <my openstack url>.openstacklocal. udp 55 false 512" NXDOMAIN qr,rd,ra 130 0.001658468s

These log entries are from the CoreDNS pods after enabling query logging.
But the cloud config that I've provided has the correct URL. I even tried to use an IP rather than DNS for OpenStack.

I'm not sure why it appends these local service domains.
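
For reference, the query log above comes from CoreDNS's log plugin; on a kubeadm-style cluster it can be enabled by editing the coredns ConfigMap (ConfigMap name and namespace below are the kubeadm defaults):

kubectl -n kube-system edit configmap coredns
# add "log" inside the ".:53 { ... }" server block, then let the
# coredns pods reload (or restart them) to pick up the change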

@timonegk

We had the same issue. In the end, the problem was that the CSI plugin could not reach the CoreDNS pod. Due to containernetworking/cni#878, early-scheduled pods such as CoreDNS were using a different subnet. Removing podman, removing /etc/cni/net.d/87-podman-bridge.conflist, or switching to podman 5 were possible solutions.
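
If you suspect the same conflict, a quick check on the affected node (path taken from the comment above; only remove the file if you don't need podman's default network):

# look for a leftover podman CNI config next to the cluster's CNI config
ls -l /etc/cni/net.d/
# if 87-podman-bridge.conflist is present and unneeded:
sudo rm /etc/cni/net.d/87-podman-bridge.conflist
# then delete the early-scheduled pods (e.g. coredns) so they are recreated
# on the correct network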
