Nginx Server Timeout Issue, automatically stopping every week #950
Comments
Did you test the upstream from inside the Nginx container or from other locations? The nginx error log may help, and so may the container logs.
Thanks. Yes, I have verified the upstream from the hosted environment. I have found similar issues reported on Stack Overflow, but I wonder whether this issue still applies to the latest versions. I checked the container's logs and metrics, and they look good. https://stackoverflow.com/questions/51147952/nginx-stops-automatically-at-particular-time
Well, I can assure you there are no weekly limits and no weekly tasks in nginx or nginx docker containers.
Thanks. The logs don't provide much information, except that the upstream call fails with HTTP code 499. I have enabled debug mode (error_log /dev/stderr debug;) and expect to gather more details soon. One observation: when an upstream call timed out, I checked the active connections using the nginx_status endpoint. There were 4 active connections, with 1 waiting and 1 writing, which seems normal to me. The health endpoints work as expected, and only the reverse-proxied routes have issues. Apart from debug mode, how can I troubleshoot this? Quick question: if an upstream server call times out for any reason, will NGINX stop forwarding subsequent calls to that upstream?
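On the 499s: nginx logs status 499 when the client closes the connection before a response has been sent. With nginx's default `proxy_connect_timeout` and `proxy_read_timeout` of 60 seconds each and a client that gives up after 60 seconds, the client may disconnect before nginx reports an upstream timeout of its own, which would be consistent with the sparse logs. A minimal sketch, assuming shorter timeouts are acceptable for this route; the values are illustrative and the `proxy_pass` target reuses the `mykongHost.com` example hostname from later in this thread:

```nginx
location /route/dbservice/ {
    # Assumed values. Both directives default to 60s.
    proxy_connect_timeout 5s;    # time allowed to establish the upstream connection
    proxy_read_timeout    30s;   # max gap between two successive reads from upstream

    proxy_pass https://mykongHost.com;   # example hostname, not the real target
}
```

With timeouts below the client's 60 seconds, a stuck upstream connection should surface as a 504 and an "upstream timed out" entry in the error log rather than as a bare 499.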
@haribachala The error log usually has some details on what's going on. Having upstream-related variables in the access log may also help (upstream_addr, status, etc.). If you have access to the container, it's probably worth making requests both through nginx and directly, and comparing the tcpdump output for the two. The difference should be pretty obvious, really.
Yes, 499 is a client timeout error (this happened via the Datadog agent, which waits only 60 seconds).

custom template config:

server {
http2 on;
client_max_body_size 100M;
location = /health {
location /route/nginx_status/ {
location /route/dbservice/ {
location /route/exportservice/ {
error_page 404 /404.html;
location ~^/(404.html|500.html|502.html|5xx.html|scheduled-downtime.html) {
}

nginx.conf - template conf:

load_module modules/ngx_http_geoip_module.so;
worker_processes auto;
error_log /dev/stderr debug;
pid /tmp/nginx.pid;
events {
http {
log_format upstream_time '$remote_addr - $remote_user [$time_local] "$request" '
access_log /dev/stdout upstream_time if=$loggable;
sendfile on;
gzip on;
include /etc/nginx/conf.d/*.conf;
client_body_temp_path /tmp/client_temp;
}
You don't have upstream_addr in your log format. It would be interesting to compare the IP that nginx uses with the IP that, say, cURL resolves. Is it ${kong_alb_url} that doesn't reply? I assume this variable is substituted when the container is configured and contains a hostname.
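As an illustration of the suggestion above, a log format that records upstream details might look like the sketch below. Only the format name `upstream_time` and the first line come from the posted config; the remaining fields are an assumed selection of standard nginx variables, not the reporter's actual format.

```nginx
# Inside the http {} block. Fields after the first line are assumptions.
log_format upstream_time '$remote_addr - $remote_user [$time_local] "$request" '
                         '$status $body_bytes_sent '
                         'upstream_addr=$upstream_addr '
                         'upstream_status=$upstream_status '
                         'request_time=$request_time '
                         'upstream_connect_time=$upstream_connect_time '
                         'upstream_header_time=$upstream_header_time '
                         'upstream_response_time=$upstream_response_time';
```

Logging `$upstream_addr` makes it possible to see whether nginx keeps connecting to an IP address that the hostname no longer resolves to.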
kong_alb_url: yes, it's a container environment variable. It's an ALB URL; the underlying target IP(s) change, but the ALB name does not. For example, for 'https://mykongHost.com', the IP(s) behind 'mykongHost' change when there is a deployment.
If the container has something like "proxy_pass https://mykongHost.com ....", the hostname is resolved once, at nginx startup, and then never re-resolved.
This way "myconghost" will be properly re-resolved when needed.
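A rough sketch of the general technique described here (periodic re-resolution of an upstream hostname), not the commenter's exact snippet. Open-source nginx 1.27.3 and later accept a `resolve` parameter on `server` inside an `upstream` block; it requires a `resolver` and a shared-memory `zone`. The upstream name `kong_backend`, the zone size, and the resolver address (the Amazon-provided VPC DNS address) are illustrative assumptions.

```nginx
# Sketch only. Assumes nginx 1.27.3+ (or another build that supports "resolve").
# Inside the http {} block.

# Assumed DNS server address; use whichever resolver the tasks can reach.
resolver 169.254.169.253 valid=30s;

upstream kong_backend {                  # hypothetical upstream name
    zone kong_backend 64k;               # shared memory zone, required for "resolve"
    server mykongHost.com:443 resolve;   # addresses are refreshed as DNS changes
}

server {
    location /route/dbservice/ {
        proxy_pass https://kong_backend;
        # proxy_pass now names the upstream group, so pass the real hostname
        # in the Host header and in TLS SNI.
        proxy_set_header Host mykongHost.com;
        proxy_ssl_server_name on;
        proxy_ssl_name mykongHost.com;
    }
}
```

Note that the image named in this issue, nginx:1.27.1, predates 1.27.3, so this form would presumably need a newer image.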
Thank you. Regarding the suggestion for 'myconghost' in the upstream block, can I still use the ALB/DNS placeholder there, given that its environment variable changes for each environment?
It should be a domain name that nginx can resolve. If your placeholder is substituted with a domain name, then yes, sure you can.
resolvers_from_host is the ALB/DNS URL, right?
Nope: http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver
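For context, `resolver` takes the address of one or more DNS servers that nginx queries at run time, not the hostname being proxied to. A minimal sketch, assuming the Amazon-provided VPC DNS address `169.254.169.253` is reachable from the task (an assumption, not stated in the thread), and showing the variable-based form of `proxy_pass`, which also works on versions without upstream `resolve` support:

```nginx
# Inside the http {} or server {} block.
# 169.254.169.253 is an assumed DNS server address; substitute your own.
# valid=30s overrides the record TTL for nginx's DNS cache.
resolver 169.254.169.253 valid=30s;

location /route/dbservice/ {
    # A variable in proxy_pass makes nginx resolve the name at request time
    # through the resolver above, instead of once at startup.
    set $kong_backend "mykongHost.com";   # hypothetical variable name
    proxy_pass https://$kong_backend;
}
```

With no URI part after the variable, the original request URI is passed to the upstream unchanged; if the real `proxy_pass` includes a path, the rewriting behaviour differs and should be checked against the proxy module documentation.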
Thanks. I have enabled debug mode and also modified the config as follows:
location /route/dbservice/ {
I will monitor this for some time.
@haribachala just two notes: if you don't have "keepalive" inside the upstream block, there will be no keepalive connections to the upstream. Keepalive saves a lot of resources. And be careful with max_fails and fail_timeout: they are easy to misconfigure. Many people prefer max_fails=0.
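A minimal sketch of the two notes above, assuming a hypothetical upstream named `kong_backend` and an illustrative pool size. By default, `max_fails=1` and `fail_timeout=10s`, so a single failed attempt (a timeout counts) can take a server out of rotation for 10 seconds when the group contains more than one server; `max_fails=0` turns that accounting off.

```nginx
# Inside the http {} block. Names and values are assumptions.
upstream kong_backend {
    server mykongHost.com:443 max_fails=0;   # never mark the server as unavailable
    keepalive 16;                            # idle upstream connections kept per worker
}

server {
    location /route/dbservice/ {
        proxy_pass https://kong_backend;
        # Upstream keepalive needs HTTP/1.1 and a cleared Connection header.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```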
Sorry, keepalive 4; is there. While copying, I missed the last two lines of the upstream block. I will verify the max_fails and fail_timeout config. After the changes, the example log is:
upstream_addr is simply the Kong internal ALB IP (the internal ALB has more than one IP, three in my case, and each client request logs one of them in round-robin order). To test whether re-resolution works, we restarted the Kong server; the IPs of the Kong instances changed, but the ALB IP(s) did not.
The upstream connect time is a bit high, though (1 second?). You can probably increase the keepalive value to get a bigger pool of connections. Otherwise, let's wait till Saturday, when we'll have more data to work with.
It may be high because I am testing from my local machine and the upstream is in another region; in prod, the upstream and the application are in the same region. Sure, next time the event happens, I will try calling the upstream from the Nginx container.
This time the upstream ALB IP(s) have not changed yet; I will update once they do.
It's been almost a month since we added the resolver to re-resolve the ALB IP(s), and the issue hasn't occurred again, even though the ALB IP(s) have changed. We will continue to observe for another couple of days.
Nginx Version: nginx:1.27.1-alpine3.20
Environment: ECS Fargate Container
Description: We are using Nginx as a reverse proxy in our production environment. Every Saturday around 00:00 UTC, the Nginx server stops responding to API calls, resulting in HTTP Status Code 499 errors. The API calls take more than 60 seconds to complete. Initially, we suspected the issue was with the upstream API taking too long to respond. However, during the latest occurrence, we observed that the upstream API responds within 5 seconds when called directly. When called via Nginx, the requests fail with a timeout error.
Observations:
The Nginx status endpoint (nginx_status) responds, and it shows fewer than 10 active and waiting connections.
The issue occurs consistently every Saturday around 00:00 UTC.
Logs show only timeout errors.
Restarting Nginx resolves the issue temporarily.
To reproduce
Use the nginx:1.27.1-alpine3.20 image in an ECS Fargate container.
Configure Nginx as a reverse proxy.
Observe the server behavior around 00:00 UTC on Saturdays.
Expected behavior
Nginx should work without any issues.
Your environment
Nginx image version: nginx:1.27.1-alpine3.20
Environments: ECS Fargate
Additional context
The issue does not occur every day; the pattern is every Saturday at 00:00 UTC. We don't have any scheduled jobs that stop the nginx server, and the nginx_status endpoint responds correctly.