Nginx Server Timeout Issue, automatically stopping every week #950

Open
haribachala opened this issue Dec 7, 2024 · 20 comments

@haribachala

haribachala commented Dec 7, 2024

Nginx Version: nginx:1.27.1-alpine3.20

Environment: ECS Fargate Container

Description: We are using Nginx as a reverse proxy in our production environment. Every Saturday around 00:00 UTC, the Nginx server stops responding to API calls, resulting in HTTP Status Code 499 errors. The API calls take more than 60 seconds to complete. Initially, we suspected the issue was with the upstream API taking too long to respond. However, during the latest occurrence, we observed that the upstream API responds within 5 seconds when called directly. When called via Nginx, the requests fail with a timeout error.

Observations:

Nginx health check (nginx_status) shows active, and waiting connections are less than 10.
The issue occurs consistently every Saturday around 00:00 UTC.
Logs show only timeout errors.
Restarting Nginx resolves the issue temporarily.

To reproduce

Use the nginx:1.27.1-alpine3.20 image in an ECS Fargate container.
Configure Nginx as a reverse proxy.
Observe the server behavior around 00:00 UTC on Saturdays.

Expected behavior

Nginx should keep proxying requests without any issues.

Your environment

Nginx image version: nginx:1.27.1-alpine3.20
Environments: ECS Fargate

Additional context

The issue does not occur every day; the pattern is every Saturday at 00:00 UTC. We don't have any scheduled jobs that stop the nginx server, and the nginx_status endpoint responds correctly.

@oxpa
Collaborator

oxpa commented Dec 7, 2024

Did you test the upstream from the nginx container or from other locations?
I'm not familiar with ECS Fargate, but I'm pretty sure nginx in a container can run flawlessly for more than a week.

The nginx error log may help, and so may the container logs.

@haribachala
Author

Thanks. Yes, I have verified the upstream from the hosted environment. I found similar issues reported on Stack Overflow, but I wonder whether this still applies to the latest versions. I checked the container's logs and metrics, and they look good.

https://stackoverflow.com/questions/51147952/nginx-stops-automatically-at-particular-time
https://stackoverflow.com/questions/42622986/nginx-server-stops-automatically-and-my-site-goes-down-and-i-need-to-restart-in

@oxpa
Collaborator

oxpa commented Dec 9, 2024

Well, I can assure you there are no weekly limits and no weekly tasks in nginx or nginx docker containers.
And without seeing logs there is nothing much more I can say to help you.

@haribachala
Author

haribachala commented Dec 11, 2024

Thanks. The logs don’t provide much information, except that the upstream fails with HTTP code 499. I have enabled debug mode (error_log /dev/stderr debug;) and expect to gather more details soon. One observation: when an upstream call timed out, I checked the active connections using the nginx_status endpoint. There were 4 active connections, with 1 waiting and 1 writing, which seems normal to me.

The health endpoints are working as expected; only the reverse proxy has issues. Apart from debug mode, how can I troubleshoot this?

Quick question: if an upstream server call times out for any reason, will NGINX stop forwarding subsequent calls to that upstream?
(we have only one upstream server)

@oxpa
Collaborator

oxpa commented Dec 11, 2024

@haribachala
HTTP 499 is not an upstream code. It indicates that the client closed the connection without getting a response. Basically, the client's timeout is lower than nginx's timeout towards the upstream.

The error log usually has some details on what's going on. Having upstream-related variables in the access log may also help (upstream_addr, status, etc).
Can you post your configuration? Do you have hostnames in your configuration with no dynamic resolution? Could it be that Amazon changes the IP of your upstream and nginx still tries to use the old IP? In that case your tests (say, with cURL) will succeed, but nginx may time out requests.

If you have access to the container, it's probably worth making requests both through nginx and directly, and comparing the tcpdump output for both. It should be pretty obvious, really.

@haribachala
Author

haribachala commented Dec 11, 2024

Yes, 499 is a client timeout error (this happened via the Datadog agent, which waits only 60 seconds).

custom template config:

server {
    listen 8080 default_server;
    listen [::]:8080;
    server_name ${server_name_env};

    http2 on;
    client_body_timeout 240;
    client_header_timeout 240;
    keepalive_timeout 240;
    proxy_connect_timeout 300;
    proxy_send_timeout 300;
    proxy_read_timeout 300;
    fastcgi_send_timeout 300;
    fastcgi_read_timeout 300;
    send_timeout 240;
    proxy_buffers 32 16k;
    gzip_comp_level 9;
    gzip_types text/css text/javascript application/javascript application/x-javascript;

    client_max_body_size 100M;

    location = /health {
        return 200 'OK';
        access_log off;
        add_header Content-Type text/plain;
    }

    location /route/nginx_status/ {
        stub_status;
        include /etc/nginx/conf.d/proxy_headers.conf;
        include /etc/nginx/conf.d/access/ips.conf;
        deny all;
    }

    location /route/dbservice/ {
        include /etc/nginx/conf.d/proxy_headers.conf;
        proxy_pass https://${kong_alb_url}/dbservice/;
    }

    location /route/exportservice/ {
        include /etc/nginx/conf.d/proxy_headers.conf;
        proxy_pass https://${kong_alb_url}/exportservice/;
    }

    error_page 404 /404.html;
    error_page 500 /500.html;
    error_page 502 /502.html;
    error_page 503 504 /5xx.html;

    location ~^/(404.html|500.html|502.html|5xx.html|scheduled-downtime.html) {
        root /etc/nginx/error;
    }

}

nginx.conf - template conf

load_module modules/ngx_http_geoip_module.so;

worker_processes auto;

error_log /dev/stderr debug;

pid /tmp/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    include /etc/nginx/conf.d/access/real_ip.conf;
    underscores_in_headers on;
    map $request_uri $loggable {
        /health/ 0;
        default 1;
    }
    map $http_trace_Id $trace_Id {
        default $http_trace_Id; # Use the incoming trace_id if it exists
        '' $request_id; # If trace_id is empty, use correlation_id
    }

    log_format upstream_time '$remote_addr - $remote_user [$time_local] "$request" '
                             '$status Trace-ID: $trace_Id "$http_referer" '
                             '"$http_user_agent" "$http_x_forwarded_for" '
                             'rt=$request_time uct=$upstream_connect_time uht=$upstream_header_time urt=$upstream_response_time';

    access_log /dev/stdout upstream_time if=$loggable;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 75; #default 65;

    gzip on;
    gzip_disable "msie6";

    include /etc/nginx/conf.d/*.conf;

    client_body_temp_path /tmp/client_temp;
    proxy_temp_path /tmp/proxy_temp_path;
    fastcgi_temp_path /tmp/fastcgi_temp;
    uwsgi_temp_path /tmp/uwsgi_temp;
    scgi_temp_path /tmp/scgi_temp;

}

@oxpa
Collaborator

oxpa commented Dec 11, 2024

You don't have upstream_addr in your log format. It would be curious to compare the IP from nginx and IP from, say, cURL

Is it ${kong_alb_url} that doesn't reply? I assume this variable is substituted at container configuration time and contains a hostname.
Does the IP of this service change weekly?
If yes, that's probably the reason. Maybe create an upstream block with "server $alb_name resolve;" and configure a resolver.
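
To make that comparison possible, the log format posted above could be extended with upstream variables. This is only a sketch, assuming the existing `upstream_time` format and the `$trace_Id` map stay as they are; `$upstream_addr` and `$upstream_status` are standard nginx variables, and the `upstream_addr=` / `upstream_status=` labels are arbitrary.

```nginx
# Inside http {}: the same format as before, plus the address and status
# of the upstream peer that actually served (or failed) the request.
log_format upstream_time '$remote_addr - $remote_user [$time_local] "$request" '
                         '$status Trace-ID: $trace_Id "$http_referer" '
                         '"$http_user_agent" "$http_x_forwarded_for" '
                         'rt=$request_time uct=$upstream_connect_time '
                         'uht=$upstream_header_time urt=$upstream_response_time '
                         'upstream_addr=$upstream_addr upstream_status=$upstream_status';
```

If the address logged here no longer matches what the hostname currently resolves to from inside the container, nginx is most likely still using an address it resolved at startup.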

@haribachala
Author

kong_alb_url: yes, it's a container environment variable. It's an ALB URL; the underlying target IP(s) change, not the ALB name. For example, for 'https://mykongHost.com', the underlying IP(s) of 'mykongHost' change when there is a deployment.
When the ALB name resolves to an IP, does nginx cache that IP? These are stateless requests, right?
I will add upstream_addr to the logs.

@oxpa
Collaborator

oxpa commented Dec 11, 2024

If the config has something like "proxy_pass https://mykongHost.com ....", the name is resolved once, at nginx startup, and never re-resolved.
What you should do is configure a resolver at the http{} level, then configure an upstream for this host and add the "resolve" parameter to its server. The resulting config should be roughly like this:

resolver $resolvers_from_host;
upstream kong {
    zone u_kong 128k;
    server myconghost:443 resolve;
    keepalive 4;
}
server {
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    location / { proxy_pass https://kong;}
}

This way "myconghost" will be properly re-resolved when needed
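
One caveat worth keeping in mind: according to the nginx documentation, the `resolve` parameter of `server` is available in open-source nginx only from version 1.27.3 (before that it was part of the commercial subscription). On older builds, a commonly used alternative is to put a variable into `proxy_pass`, which makes nginx resolve the name at request time through the configured `resolver`. The following is a rough sketch of that alternative, not what was used in this thread; the resolver IP and the hostname are placeholders.

```nginx
# Fragment for http {}. The resolver IP and the hostname are placeholders.
resolver 10.0.0.2 valid=30s;

server {
    listen 8080;

    location /route/dbservice/ {
        # A variable in proxy_pass makes nginx resolve the name at request
        # time through the resolver instead of once at startup.
        set $kong_host mykonghost.example.com;

        # With a variable and no URI part in proxy_pass, the request URI is
        # passed on as is, so the prefix is mapped explicitly here.
        rewrite ^/route/dbservice/(.*)$ /dbservice/$1 break;
        proxy_pass https://$kong_host;
    }
}
```

The trade-off is that this form bypasses upstream groups, so settings such as `keepalive` inside an `upstream {}` block do not apply to it.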

@haribachala
Author

Thank you. Regarding the suggestion for ‘myconghost’ in the upstream block, can I still use the ALB/DNS placeholder here, considering that its environment variable will differ for each environment?

@oxpa
Collaborator

oxpa commented Dec 11, 2024

It should be a domain name that nginx can resolve. If your placeholder is substituted with a domain name - then yes, sure you can.

@haribachala
Author

Is resolvers_from_host the ALB/DNS URL?

@oxpa
Collaborator

oxpa commented Dec 11, 2024

nope: http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver
https://github.com/nginxinc/docker-nginx/blob/master/entrypoint/15-local-resolvers.envsh
It's the IPs of the nameservers from your resolv.conf (you can use the env var from the entrypoint script above).
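
As a sketch of what the second link enables: the official image's entrypoint exports `NGINX_LOCAL_RESOLVERS` (the nameservers found in `/etc/resolv.conf`) when the container is started with `NGINX_ENTRYPOINT_LOCAL_RESOLVERS` set, and files under `/etc/nginx/templates/` are rendered with `envsubst` into `/etc/nginx/conf.d/`. Assuming the config is rendered this way (the file name below is hypothetical, and a custom templating step may behave differently), the resolver line could look like this:

```nginx
# /etc/nginx/templates/resolver.conf.template  (hypothetical file name)
# Rendered at container start into /etc/nginx/conf.d/resolver.conf, which the
# nginx.conf above already includes inside http {} via conf.d/*.conf.
# "valid=30s" caches answers for 30 seconds regardless of the record's TTL.
resolver ${NGINX_LOCAL_RESOLVERS} valid=30s;
```

Hard-coding the nameserver IPs also works; the environment variable just avoids maintaining them per environment.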

@haribachala
Author

Thanks. I have enabled debug mode and also modified the config as follows:

resolver 10.X.X.X 10.X.X.X valid=30s;
upstream kong_url {
    zone u_kong 128k;
    server kong_alb_here:443 resolve max_fails=5 fail_timeout=360s;

location /route/dbservice/ {
    include /etc/nginx/conf.d/proxy_headers.conf;
    proxy_pass https://kong_url/dbservice/;
}

I will monitor this for some time.

@oxpa
Collaborator

oxpa commented Dec 12, 2024

@haribachala just two notes: if you don't have "keepalive" inside the upstream block - there will be no keepalive connections. Keepalive saves a lot of resources. And be careful with max_fails and fail_timeout: it's easy to misconfigure them. Many people prefer max_fails=0.
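
For reference, a sketch of how those two notes might look together. The nginx documentation gives the defaults as `max_fails=1` and `fail_timeout=10s`, so a single failed attempt (as defined by `proxy_next_upstream`) can mark a peer unavailable for 10 seconds, while `max_fails=0` disables that accounting. The resolver IP is a placeholder, and `kong_alb_here` stands for the real ALB hostname as in the config above.

```nginx
resolver 10.0.0.2 valid=30s;              # placeholder nameserver IP

upstream kong_url {
    zone u_kong 128k;                     # "resolve" needs a shared memory zone
    server kong_alb_here:443 resolve max_fails=0;  # never mark peers as failed
    keepalive 4;                          # idle upstream connections kept per worker
}

server {
    listen 8080;

    location /route/dbservice/ {
        proxy_http_version 1.1;           # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";   # and an empty Connection header
        proxy_pass https://kong_url/dbservice/;
    }
}
```

Note that with a named upstream, the `Host` header sent to the backend defaults to `$proxy_host`, which here is the upstream name `kong_url`, so it may need to be set explicitly (for example in the included proxy_headers.conf).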

@haribachala
Author

Sorry, keepalive 4; is there. While copying, I missed the last two lines of the upstream block. I will verify the max_fails and fail_timeout config.

After the changes, the example log is:
172.17.0.1 - - [12/Dec/2024:12:42:18 +0000] "GET '/route/dbservice/uri/ HTTP/1.1" 200 Trace-ID: aad20d6b74f18ef04289cec94351d99d "-" "PostmanRuntime/7.43.0" "-" rt=1.386 uct=1.006 uht=1.388 urt=1.388 upstream_addr=10.X.X.83:443

upstream_addr is just the Kong internal ALB IP (the internal ALB has more than one IP, 3 in my case, and each client request logs one of them in round-robin order). To test whether resolve works, we restarted the Kong server; the IPs of the Kong instances changed, but the ALB IP(s) did not.

@oxpa
Collaborator

oxpa commented Dec 12, 2024

The upstream connect time is a bit high, though (1 second?). You can probably increase the keepalive value to have a bigger pool of connections.

Otherwise, let's wait until Saturday; this time we'll have more data to work with.
If it stops working on Saturday again, try requests with curl from the nginx container, both directly and through nginx, and we'll figure out what's wrong.

@haribachala
Author

It may be high because I am testing from my local machine while the upstream is in another region; in prod, the upstream and the application are in the same region. Sure, next time the event happens, I will try calling the upstream from the nginx container.

@haribachala
Author

This time the upstream ALB IP(s) have not changed yet; I will update once they do.

@haribachala
Author

It’s been almost a month since we added the resolver to resolve ALB IP, and the issue hasn’t occurred again, even though the ALB IP(s) were changed. We will continue to observe for another couple of days.
