-
I believe this behavior is in line with the TCP protocol: the Linux kernel's keepalive defaults are tcp_keepalive_intvl=75 (seconds between probes) and tcp_keepalive_probes=9 (unanswered probes before the connection is dropped).
So the server gives up about 11 minutes (75 × 9 / 60 ≈ 11.25) after the client stops responding.
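For concreteness, here's a minimal Go sketch (Linux-only, since it reads the sysctls from /proc) that pulls those two values and computes how long the kernel keeps probing before giving up. The values in the comments are the usual kernel defaults, not anything Linkerd sets:

```go
// Minimal sketch: read the kernel's TCP keepalive sysctls and compute
// the worst-case time to detect a dead peer once probing has started.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func sysctlInt(path string) int {
	raw, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	intvl := sysctlInt("/proc/sys/net/ipv4/tcp_keepalive_intvl")   // usually 75
	probes := sysctlInt("/proc/sys/net/ipv4/tcp_keepalive_probes") // usually 9

	// Once probing starts, the kernel sends up to `probes` probes,
	// `intvl` seconds apart; if none are ACKed, it resets the connection.
	deadAfter := intvl * probes
	fmt.Printf("%d probes x %ds = %ds (~%.1f min) until the server gives up\n",
		probes, intvl, deadAfter, float64(deadAfter)/60)
}
```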
-
@bzlom After looking over this with the maintainers, we have questions. 🙂

Linkerd doesn't actually send keepalives in the proxy code. What Linkerd does is request that the underlying TCP stack send keepalives after 10 seconds of idle time -- so Linkerd forces the keepalives to start, but the kernel's own settings control the probe interval and count.

Your wireshark shows pretty much exactly what we'd expect for this: at line 16, we see the first keepalive being sent, after 10s of idle time. Lines 18-26 show 9 keepalives sent at roughly 75-second intervals, then at line 27 we see the TCP reset that closes the connection. (Line 17, where something ACKs the first keepalive on behalf of the client, is strange to me, and makes me wonder exactly what's between the client and this wireshark.) But, overall, what we see here is the TCP stack doing exactly what we'd expect Linkerd to ask it to do.

As I said, this leaves us with questions:
Of these, the most important is definitely the first -- is this actually causing you a problem, or is it simply that you're seeing metrics with values that are unexpected? Thanks!
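To make the mechanism concrete: Linkerd's proxy is written in Rust, so this is not the actual proxy code, but here's a minimal Go sketch (assuming Go 1.23's net.KeepAliveConfig) of the equivalent socket setup; the application sets only the idle threshold, and the kernel's tcp_keepalive_intvl/tcp_keepalive_probes govern everything after that.

```go
// Not Linkerd's code: a sketch of asking the TCP stack to send
// keepalives, overriding only the idle threshold.
package main

import (
	"net"
	"time"
)

func main() {
	d := net.Dialer{
		KeepAliveConfig: net.KeepAliveConfig{
			Enable: true,
			// Start probing after 10s of idle time (TCP_KEEPIDLE).
			Idle: 10 * time.Second,
			// Negative values leave the socket options untouched, so the
			// probe interval and count stay at the kernel defaults
			// (tcp_keepalive_intvl=75s, tcp_keepalive_probes=9 on Linux).
			Interval: -1,
			Count:    -1,
		},
	}
	conn, err := d.Dial("tcp", "example.com:80")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	// From here on, the kernel sends the probes; if all of them go
	// unanswered, it resets the connection (the RST seen at line 27).
}
```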
-
To reiterate my previous answer: Linkerd enables keepalives with the defaults set at the kernel level. To my knowledge, there are no reports that this is undesired behavior. Linkerd doesn't offer the ability to tweak these defaults, but as you point out, they're already exposed in the pod spec, so you could rely on those settings if you think your setup requires them.
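As an illustration of what "exposed in the pod spec" means here: Kubernetes lets a pod set the namespaced TCP keepalive sysctls through its securityContext (depending on cluster version and policy, the kubelet may need to allow them). The values and image name below are hypothetical examples, not recommendations; a sketch using the client-go types:

```go
// Sketch: a pod spec that tunes the namespaced TCP keepalive sysctls
// itself, independently of Linkerd.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	spec := corev1.PodSpec{
		SecurityContext: &corev1.PodSecurityContext{
			Sysctls: []corev1.Sysctl{
				// Example values only, not recommendations.
				{Name: "net.ipv4.tcp_keepalive_intvl", Value: "30"},
				{Name: "net.ipv4.tcp_keepalive_probes", Value: "4"},
			},
		},
		Containers: []corev1.Container{
			{Name: "app", Image: "example/app:latest"},
		},
	}
	out, err := yaml.Marshal(spec)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // prints the equivalent YAML pod spec
}
```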