
[cinder-csi-plugin] [Bug] Failed to GetOpenStackProvider i/o timeout #1874

Closed
modzilla99 opened this issue May 19, 2022 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@modzilla99

modzilla99 commented May 19, 2022

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
When the CSI plugin starts, it is not able to communicate with Keystone and gets stuck in an I/O timeout.

What you expected to happen:
The plugin should talk to the API and start.

How to reproduce it:
I am running a Kubernetes 1.24.0 cluster with cinder-csi-plugin 1.23.0 and CoreDNS 1.9.2.

Anything else we need to know?:
A tcpdump suggests that the pod tries to resolve the wrong URL: it tries to connect to ${URL}.kube-system.svc.cluster.local. The same version of the CSI driver works on Kubernetes 1.23 with CoreDNS 1.8.7.
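
For reference, that appended suffix matches the DNS search list kubelet writes into the pod's /etc/resolv.conf (with the default ndots:5, names with fewer than five dots are first tried with the cluster suffixes). A quick way to see the search list a kube-system pod gets (throwaway pod; the busybox image is just an example):

kubectl -n kube-system run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 -- cat /etc/resolv.conf
# typically shows something like:
#   search kube-system.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5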

Environment:

  • openstack-cloud-controller-manager (or other related binary) version: 1.23
  • OpenStack version: Victoria
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 19, 2022
@jichenjc
Contributor

The current info seems too generic

A tcpdump suggests that the pod tries to resolve the wrong URL: it tries to connect to ${URL}.kube-system.svc.cluster.local.

Looks like it's not the CPO CSI function itself that is failing; the pod is not able to connect to Keystone.
Some possible errors:

  1. the pod can't connect to the service because the network is not reachable
  2. as you described, it does not resolve the right DNS name (no detailed info yet, though)

So I think more info, like the real error you saw and the logs of the CSI pods, would be helpful.
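
For example (the pod name below is a placeholder; cinder-csi-plugin is the container name used in the controller plugin pod):

kubectl -n kube-system get pods | grep csi-cinder
kubectl -n kube-system logs <csi-cinder-controllerplugin-pod> -c cinder-csi-plugin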

@cheetahfox

I seem to be having a similar problem. So let me hopefully provide enough information to get somewhere with this.

Cluster info: Kubernetes 1.24.1, CoreDNS 1.8.6, csi-cinder-plugin 1.22.0 (and I tested with 1.24.2).

Cloud config for csi-cinder-plugin.

[Global]
auth-url="http://keystone.m6me.cheetahfox.com:80/v3"
username="k8s"
password="*********************"
region="RegionOne"
tenant-id="7d5e3725250c434cb935a43dc34865d9"
tenant-name="k8s"
domain-name="Default"
os-endpoint-type="internalURL"

[BlockStorage]
bs-version=v3
ignore-volume-az=False

Logs from the csi-cinder-controllerplugin pod, container cinder-csi-plugin:

I0626 04:40:07.793361       1 driver.go:74] Driver: cinder.csi.openstack.org
I0626 04:40:07.793489       1 driver.go:75] Driver version: [email protected]
I0626 04:40:07.793496       1 driver.go:76] CSI Spec version: 1.3.0
I0626 04:40:07.793530       1 driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0626 04:40:07.793538       1 driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0626 04:40:07.793544       1 driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0626 04:40:07.793549       1 driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0626 04:40:07.793554       1 driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0626 04:40:07.793564       1 driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0626 04:40:07.793569       1 driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0626 04:40:07.793574       1 driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0626 04:40:07.793578       1 driver.go:106] Enabling controller service capability: GET_VOLUME
I0626 04:40:07.793583       1 driver.go:118] Enabling volume access mode: SINGLE_NODE_WRITER
I0626 04:40:07.793589       1 driver.go:128] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0626 04:40:07.793595       1 driver.go:128] Enabling node service capability: EXPAND_VOLUME
I0626 04:40:07.793599       1 driver.go:128] Enabling node service capability: GET_VOLUME_STATS
I0626 04:40:07.794109       1 openstack.go:90] Block storage opts: {0 false false}
W0626 04:40:37.796236       1 main.go:108] Failed to GetOpenStackProvider: Post "http://keystone.m6me.cheetahfox.com:80/v3/auth/tokens": dial tcp: i/o timeout

When looking at the network traffic from the cinder-csi-plugin, I see only DNS requests for A and AAAA records for this name:

keystone.m6me.cheetahfox.com.kube-system.svc.cluster.local

So I see the same strange thing that was reported above: the container seems to be trying to resolve the address with ".kube-system.svc.cluster.local" appended to the valid auth URL.

The Keystone API is at that URL. I don't think this is a networking issue, since I can access the API from other pods in the cluster (it's hard to check from the container itself since it doesn't really have any tools and it restarts after about 20 seconds).
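
One way around the missing tools is an ephemeral debug container that shares the pod's network and DNS configuration (needs kubectl debug / ephemeral containers, available on 1.24; the busybox image is just an example):

kubectl -n kube-system debug -it csi-cinder-controllerplugin-6549b5d56-tgsfx \
  --image=busybox:1.36 -- sh
# then, inside the debug shell:
nslookup keystone.m6me.cheetahfox.com
wget -qO- http://keystone.m6me.cheetahfox.com/v3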

This configuration was working just fine on Kubernetes 1.22. I upgraded the cluster to 1.23.7 and then 1.24.1, and everything worked fine for about a week. Then, for unrelated reasons, I needed to restart the VMs in this cluster. After the restart I noticed this container wasn't ready, and all of my pods with Cinder-provided PVCs stopped working.

The other containers in the pod just have logs like the following, with "Still connecting" repeating every ten seconds.

josh@Cheetah:~/network-automation/services/openstack-deployment$ kubectl logs --namespace=kube-system csi-cinder-controllerplugin-6549b5d56-tgsfx csi-provisioner
I0626 20:34:49.018523       1 csi-provisioner.go:138] Version: v3.0.0
I0626 20:34:49.018642       1 csi-provisioner.go:161] Building kube configs for running in cluster...
W0626 20:34:59.021171       1 connection.go:173] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
W0626 20:35:09.021641       1 connection.go:173] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

I was looking at the driver code and I don't see how it could end up with a different URL in the driver itself. Could it be something from gophercloud? Tracing things back up the stack, it seems like that might be where this is happening, but I am not sure...

I also tried setting os-endpoint-type to "internalURL", since about the only thing I could figure was that gophercloud was changing something about the URL because of the endpoint type. This had no effect. I also tried removing the :80 from the URL, because why not; also no effect. I am going to try downgrading my cluster and hope this starts working again with Kubernetes 1.23.7.

@jichenjc
Contributor

From the context it looks like the URL is treated as a short service name and the local cluster domain is appended, so it's likely not a CSI issue but rather a Kubernetes or DNS server setting. For example, https://en.wikipedia.org/wiki/Fully_qualified_domain_name tells us that a name which is not an FQDN will have a domain name appended, and that seems to be what is happening here. So the workaround might be to use the IP of keystone.m6me.cheetahfox.com in your configuration, or to set up DNS correctly so that the service domain is not appended (how to do that, I don't know; still digging).
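
If the appended suffix really is the cluster DNS search list kicking in, another generic knob (standard pod DNS config, not specific to this plugin) is to lower ndots for the controller plugin pod so that external names are tried as-is first. A rough sketch, assuming the plugin runs as a Deployment named csi-cinder-controllerplugin in kube-system:

kubectl -n kube-system patch deployment csi-cinder-controllerplugin --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"2"}]}}]'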

@jfpucheu

jfpucheu commented Sep 6, 2022

Hello, I have exactly the same issue. Did you ever find a solution?

Thanks

Jeff

@jichenjc
Contributor

jichenjc commented Sep 6, 2022

Are you able to connect to the OpenStack endpoint from your local environment, e.g. resolve the DNS name from the cloud.conf you used?

@jfpucheu

jfpucheu commented Sep 6, 2022

Yes, from the node it works. I don't understand why the cinder-csi-plugin can't...

I0906 13:34:01.026232 1 openstack.go:89] Block storage opts: {0 false false}
W0906 13:34:31.026950 1 main.go:100] Failed to GetOpenStackProvider: Post "https://iam.eu-west-0.mycloudprovider.com/v3/auth/tokens": dial tcp: i/o timeout

The openstack-cloud-controller-manager-9kkt pod has no issue reaching the same endpoint...

@jichenjc
Contributor

jichenjc commented Sep 6, 2022

The original issue was: "So I see the same strange thing that was reported above: the container seems to be trying to resolve the address with ".kube-system.svc.cluster.local" appended to the valid auth URL."

That appended suffix is incorrect. I suggest you try using the IP instead of the hostname of the OpenStack service and check whether you see the same pattern. Basically, I think it's related to the DNS setup, but I am not sure why OCCM works while Cinder CSI does not.
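
i.e. in cloud.conf keep the same scheme, port, and path and only swap the hostname for the IP (192.0.2.10 below is just a placeholder for your Keystone address):

[Global]
auth-url="http://192.0.2.10/v3"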

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sqaisar

sqaisar commented Apr 28, 2024

[INFO] 192.168.145.12:44558 - 55791 "A IN <my openstack url>.kube-system.svc.cluster.local. udp 70 false 512" NXDOMAIN qr,aa,rd 163 0.000231634s
[INFO] 192.168.145.12:47886 - 48222 "AAAA IN <my openstack url>.kube-system.svc.cluster.local. udp 70 false 512" NXDOMAIN qr,aa,rd 163 0.000184359s
[INFO] 192.168.145.12:44347 - 13713 "AAAA IN <my openstack url>.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000144048s
[INFO] 192.168.145.12:43551 - 9405 "A IN openstack.im.pype.tech.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000242885s
[INFO] 192.168.145.12:45532 - 12467 "A IN <my openstack url>.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000225486s
[INFO] 192.168.145.12:46515 - 14766 "A IN <my openstack url>.openstacklocal. udp 55 false 512" NXDOMAIN qr,rd,ra 130 0.001658468s

These log entries are from the CoreDNS pods after enabling query logging.
But the cloud config that I've provided has the correct URL. I even tried to use an IP rather than DNS for OpenStack.

I'm not sure why it appends these local service domains.
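
For reference, the query log above comes from CoreDNS's log plugin; on a kubeadm-style cluster it can be enabled by editing the coredns ConfigMap (ConfigMap name and namespace below are the kubeadm defaults):

kubectl -n kube-system edit configmap coredns
# add "log" inside the ".:53 { ... }" server block, then let the
# coredns pods reload (or restart them) to pick up the change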

@timonegk

We had the same issue. In the end, the problem was that the CSI plugin could not reach the CoreDNS pod. Due to containernetworking/cni#878, early-scheduled pods such as CoreDNS were using a different subnet. Removing podman, removing /etc/cni/net.d/87-podman-bridge.conflist, or switching to podman 5 were possible solutions.
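
If you suspect the same conflict, a quick check on the affected node (path taken from the comment above; only remove the file if you don't need podman's default network):

# look for a leftover podman CNI config next to the cluster's CNI config
ls -l /etc/cni/net.d/
# if 87-podman-bridge.conflist is present and unneeded:
sudo rm /etc/cni/net.d/87-podman-bridge.conflist
# then delete the early-scheduled pods (e.g. coredns) so they are recreated
# on the correct network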
