Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom Modules are getting disconnected from edgeHub frequently #7372

Closed
Pulavarthykarthik-nagarro opened this issue Sep 18, 2024 · 7 comments

Comments

@Pulavarthykarthik-nagarro

Expected Behavior

Modules should not loose their connection to edgeHub

Current Behavior

When we upgraded edgeHub and edgeAgent from version 1.1.19 to 1.5.3, we noticed that all the custom modules connect when edgeHub starts. However, after about an hour, the modules disconnect and reconnect after 20 minutes. This process keeps happening repeatedly.

EdgeHub using Microsoft.Azure.Devices.Edge.Hub.Core.Device.DeviceMessageHandler.SendMessageAsync causes TimeoutException when trying to send messages between modules.

Below are the logs of the edgeHub module

<6> 2024-09-18 05:27:41.289 +00:00 [INF] - "Closing connection for device: <deviceid>/<custommodule>, , "
<6> 2024-09-18 05:27:41.289 +00:00 [INF] - Disposing MessagingServiceClient for device Id <deviceid>/<custommodule>because of exception - 
<6> 2024-09-18 05:27:41.289 +00:00 [INF] - Setting device proxy inactive for device Id <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.289 +00:00 [INF] - Removing device connection for device <deviceid>/<custommodule>with removeCloudConnection flag 'True'.
<6> 2024-09-18 05:27:41.344 +00:00 [INF] - Client <deviceid>/<custommodule>in device scope authenticated locally.
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Successfully generated identity for clientId <deviceid>/<custommodule>and username iothub-eus2-dev-001.azure-devices.net/<deviceid>/<custommodule>/?api-version=2019-10-01&DeviceClientType=azure-iot-device-iothub-py%2F2.8.0%283.8.17%3BLinux%20%23163-Ubuntu%20SMP%20Fri%20Mar%2017%2018%3A26%3A02%20UTC%202023%3Bx86_64%29
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - "ClientAuthenticated, <deviceid>/<custommodule>, 25ded387"
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - New device connection for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Client <deviceid>/<custommodule>r connected to edgeHub, processing existing subscriptions.
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Bind device proxy for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Binding message channel for device Id <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Processing pending subscriptions for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Attempting to connect to IoT Hub for client <deviceid>/<custommodule>via AMQP...
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Closing receiver in cloud proxy bc05704b-2a16-40c8-bc3e-fae4db184ebd for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Closed cloud proxy bc05704b-2a16-40c8-bc3e-fae4db184ebd for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Device connection removed for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Remove device connection for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Cloud connection for <deviceid>/<custommodule>is True
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Connection status for <deviceid>/<custommodule>changed to ConnectionEstablished
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Client <deviceid>/<custommodule>connected to cloud, processing existing subscriptions.
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Skipping <deviceid>/<custommodule>for subscription processing, as it is currently being processed.
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Created cloud proxy for client <deviceid>/<custommodule>via AMQP, with client operation timeout 20 seconds.
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Initialized cloud proxy 7b9ee43a-e16f-45cd-9285-3962f5ccbf5c for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Created cloud connection for client <deviceid>/<custommodule>
<6> 2024-09-18 05:27:44.020 +00:00 [INF] - Updated reported properties for <deviceid>/$edgeHub
<4> 2024-09-18 05:28:11.289 +00:00 [WRN] - Did not receive ack for message d89939d2-32a4-4764-afc3-fa0ef5d84ffe from device/module <deviceid>/<custommodule>
<4> 2024-09-18 05:28:11.289 +00:00 [WRN] - **Error sending messages to module <deviceid>/<custommodule>
System.TimeoutException: Message completion response not received
   at Microsoft.Azure.Devices.Edge.Hub.Core.Device.DeviceMessageHandler.SendMessageAsync(IMessage message, String input) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Core/device/DeviceMessageHandler.cs:line 498
   at Microsoft.Azure.Devices.Edge.Hub.Core.Routing.ModuleEndpoint.ModuleMessageProcessor.ProcessAsync(ICollection`1 routingMessages, IDeviceProxy dp, CancellationToken token) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Core/routing/ModuleEndpoint.cs:line 166**

<6> 2024-09-18 05:31:43.236 +00:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2024-09-18 05:36:43.236 +00:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2024-09-18 05:36:43.236 +00:00 [INF] - Error authenticating token for <deviceid>/blobstorageiotedge because the token is expired or could not be parsed
<6> 2024-09-18 05:36:43.236 +00:00 [INF] - Credentials for client <deviceid>/blobstorageiotedge are not valid.
<4> 2024-09-18 05:36:43.236 +00:00 [WRN] - Reauthenticating client <deviceid>/blobstorageiotedge failed, removing client connection
<6> 2024-09-18 05:36:43.236 +00:00 [INF] - Unable to re-authenticate <deviceid>/blobstorageiotedge, dropping client connection.
<6> 2024-09-18 05:36:43.236 +00:00 [INF] - Removing device connection for device <deviceid>/blobstorageiotedge with removeCloudConnection flag 'True'.
<4> 2024-09-18 05:36:43.237 +00:00 [WRN] - "Closing connection for device: <deviceid>/blobstorageiotedge, Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device <deviceid>/blobstorageiotedge., "
<6> 2024-09-18 05:36:43.237 +00:00 [INF] - **Disposing MessagingServiceClient for device Id <deviceid>/blobstorageiotedge because of exception - Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device <deviceid>/blobstorageiotedge.**
<6> 2024-09-18 05:36:43.237 +00:00 [INF] - Closing device proxy for device Id <deviceid>/blobstorageiotedge
<6> 2024-09-18 05:36:43.297 +00:00 [INF] - Closing receiver in cloud proxy 641d6314-9200-4797-9e5d-cdc1c5fceed6 for <deviceid>/blobstorageiotedge
<6> 2024-09-18 05:36:43.297 +00:00 [INF] - Closed cloud proxy 641d6314-9200-4797-9e5d-cdc1c5fceed6 for <deviceid>/blobstorageiotedge
<6> 2024-09-18 05:36:43.297 +00:00 [INF] - Device connection removed for device <deviceid>/blobstorageiotedge

Context (Environment)

Host OS: VERSION= "Ubuntu 20.04.6 LTS"
Architecture: amd64
Container OS: Linux containers

Output of iotedge check


<Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
× aziot-identity-service package is up-to-date - Error
    could not query https://aka.ms/azure-iotedge-latest-versions for latest available version
        caused by: could not query https://aka.ms/azure-iotedge-latest-versions for latest available version
        caused by: error trying to connect: tcp connect error: Connection timed out (os error 110)
        caused by: tcp connect error: Connection timed out (os error 110)
        caused by: Connection timed out (os error 110)
‼ host time is close to reference time - Warning
    Could not query NTP server
        caused by: Could not query NTP server
        caused by: could not receive NTP server response: Resource temporarily unavailable (os error 11)
        caused by: Resource temporarily unavailable (os error 11)
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
× aziot-edge package is up-to-date - Error
    Error while fetching latest versions of edge components: could not send HTTP request
        caused by: Error while fetching latest versions of edge components: could not send HTTP request
        caused by: error trying to connect: tcp connect error: Connection timed out (os error 110)
        caused by: tcp connect error: Connection timed out (os error 110)
        caused by: Connection timed out (os error 110)
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the default network can connect to upstream MQTT port - OK
    skipping because of not required in this configuration
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream MQTT port - OK
    skipping because of not required in this configuration
32 check(s) succeeded.
1 check(s) raised warnings.
2 check(s) raised errors.
2 check(s) were skipped due to errors from other checks.>

Device Information

  • Host OS [e.g. Ubuntu 22.04, Windows Server IoT 2019]: Ubuntu 20.04.6 LTS
  • Architecture [e.g. amd64, arm32, arm64]: amd64
  • Container OS [e.g. Linux containers, Windows containers]: Linux containers

Runtime Versions

  • aziot-edged [run iotedge version]: 1.5.7
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.5.3
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.5.3
  • Docker/Moby [run docker version]: 23.0.7+
@huguesBouvier
Copy link
Contributor

It is expected that the module would be disconnected every 60-70 mn for re-authenticating, however, it shouldn't take 20mn to reconnect.
Do you see the issue for all modules?

This issue happens when the system is strapped for ressources, can you check the CPU/RAM usage on the machine?

@Pulavarthykarthik-nagarro
Copy link
Author

@huguesBouvier In the EdgeHub module logs, we noticed it reconnects immediately after disconnecting. However, in the IoT Edge Hub Connections section, the modules still show as disconnected. So, we’re unsure if the modules are actually connected or disconnected to EdgeHub or if this is just how IoT Edge version 1.5 behaves.

The same issue is happening with all the custom modules—they show as disconnected even after reconnecting. However, one Microsoft module is working fine, and its connection remains stable. I am attaching the screeshot of the IoT Edge Hub Connections Section. In the screenshot I shared, I've hidden the names of the custom modules.

Below are the logs which are getting generated.

<6> 2024-09-18 05:27:41.289 +00:00 [INF] - "Closing connection for device: <deviceid>/<custommodule>, , "
<6> 2024-09-18 05:27:41.289 +00:00 [INF] - Disposing MessagingServiceClient for device Id <deviceid>/<custommodule>because of exception - 
<6> 2024-09-18 05:27:41.289 +00:00 [INF] - Setting device proxy inactive for device Id <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.289 +00:00 [INF] - Removing device connection for device <deviceid>/<custommodule>with removeCloudConnection flag 'True'.
<6> 2024-09-18 05:27:41.344 +00:00 [INF] - Client <deviceid>/<custommodule>in device scope authenticated locally.
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Successfully generated identity for clientId <deviceid>/<custommodule>and username iothub-eus2-dev-001.azure-devices.net/<deviceid>/<custommodule>/?api-version=2019-10-01&DeviceClientType=azure-iot-device-iothub-py%2F2.8.0%283.8.17%3BLinux%20%23163-Ubuntu%20SMP%20Fri%20Mar%2017%2018%3A26%3A02%20UTC%202023%3Bx86_64%29
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - "ClientAuthenticated, <deviceid>/<custommodule>, 25ded387"
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - New device connection for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Client <deviceid>/<custommodule>r connected to edgeHub, processing existing subscriptions.
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Bind device proxy for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Binding message channel for device Id <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Processing pending subscriptions for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.345 +00:00 [INF] - Attempting to connect to IoT Hub for client <deviceid>/<custommodule>via AMQP...
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Closing receiver in cloud proxy bc05704b-2a16-40c8-bc3e-fae4db184ebd for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Closed cloud proxy bc05704b-2a16-40c8-bc3e-fae4db184ebd for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Device connection removed for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.358 +00:00 [INF] - Remove device connection for device <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Cloud connection for <deviceid>/<custommodule>is True
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Connection status for <deviceid>/<custommodule>changed to ConnectionEstablished
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Client <deviceid>/<custommodule>connected to cloud, processing existing subscriptions.
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Skipping <deviceid>/<custommodule>for subscription processing, as it is currently being processed.
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Created cloud proxy for client <deviceid>/<custommodule>via AMQP, with client operation timeout 20 seconds.
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Initialized cloud proxy 7b9ee43a-e16f-45cd-9285-3962f5ccbf5c for <deviceid>/<custommodule>
<6> 2024-09-18 05:27:41.510 +00:00 [INF] - Created cloud connection for client <deviceid>/<custommodule>
<6> 2024-09-18 05:27:44.020 +00:00 [INF] - Updated reported properties for <deviceid>/$edgeHub

Image

@ryanwinter
Copy link

Hi @Pulavarthykarthik-nagarro,

  1. What version of the SDK are you using in your custom module?
  2. Did you upgrade to the latest support SDK when you updated IoT Edge?
  3. Are you using MQTT or AMPQ for the custom module connection?

@jlian
Copy link
Member

jlian commented Nov 7, 2024

@Pulavarthykarthik-nagarro any update on the questions from @ryanwinter

@jlian
Copy link
Member

jlian commented Nov 12, 2024

@Pulavarthykarthik-nagarro any updates?

@Pulavarthykarthik-nagarro
Copy link
Author

@ryanwinter @jlian We raised a support ticket with Microsoft, and they suggested increasing the sastoken_ttl time in the create_from_edge_environment() method. The default was set to 1 hour, but we increased it to 1 day. We're also using a daily cron job to restart the modules, which renews the SAS token. With this setup, we’ve noticed that the connections are now staying stable.

@jlian jlian closed this as completed Nov 14, 2024
@jlian
Copy link
Member

jlian commented Nov 14, 2024

@Pulavarthykarthik-nagarro Can you answer the questions from @ryanwinter just so that we can help look further?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants