Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Timeout handling and improves obscuring of DNS failure messages being swallowed by the base timeout (circuit breaker) #123

Merged
merged 5 commits into from
Feb 17, 2019

Conversation

johanek
Copy link
Contributor

@johanek johanek commented Jan 2, 2019

…ture Net::HTTP errors

Pull Request Checklist

Related to comments in #82

General

  • Update Changelog following the conventions laid out here

- [ ] Update README with any necessary configuration snippets

- [ ] Binstubs are created if needed

  • RuboCop passes

  • Existing tests pass

Purpose

Improve error reporting related to Net::HTTP timeouts

@majormoses
Copy link
Member

Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly.

Copy link
Member

@majormoses majormoses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good but currently this is a breaking as it removes a public interface which we can add backwards compatibility pretty easily. Also can you please add a changelog?

Thanks for your awesome work.

@@ -251,9 +251,11 @@ def run
end

begin
Timeout.timeout(config[:timeout]) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside acquire_resource below we should use this to still provide a way for people to account for slow overloaded web servers.

Copy link
Contributor Author

@johanek johanek Jan 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout options to Net::HTTP are already being set to the timeout value passed as config:

http.read_timeout = config[:timeout]
http.open_timeout = config[:timeout]
http.ssl_timeout = config[:timeout]
http.continue_timeout = config[:timeout]
http.keep_alive_timeout = config[:timeout]

@johanek
Copy link
Contributor Author

johanek commented Jan 3, 2019

Hi, thanks for looking at this so quickly.

I don't think this is a breaking change, the timeout command line option is still being used to set the Net::HTTP timeout options.

Updated the changelog as requested

@majormoses
Copy link
Member

I need to sit down and look at it again to eval it for breaking changes. Will try to get to it tonight.

@majormoses
Copy link
Member

First off I want to thank you for your time and apologies it took me a couple of days to get back to this.

It's definitely a breaking change but one I agree that we should make, now the real question is how to get there. In it's current form it would timeout at the timeout threshold (defaults to 15) with the proposed changes it could take much longer to fail as each of http.read_timeout, http.open_timeout, http.ssl_timeout, http.continue_timeout, and http.keep_alive_timeout are all individually being set to the same value. As it could easily take 30 seconds overall to fail. I know I have personally configured http checks to run at orgs every 15 seconds, changing this could lead to multiple instances of these checks piling up which could exacerbate the situation. Perhaps the best path forward is to deprecate the --timeout option in favor of unique arguments for each of those various http timeouts so that they can be controlled independently. Alternatively we could keep the main timeout one as a heavy handed circuit breaker and introduce options for each of the timeouts with much lower values.

@johanek
Copy link
Contributor Author

johanek commented Jan 10, 2019

OK, I agree with your point about the changed behaviour.

Keeping Timeout.timeout() will always mask the DNS failures, as these are not caught by the Net::HTTP open_timeout. I tested setting the open timeout to 1 second and confirmed this.

So the only way forward I see is to set explicit open and read timeout values as unique CLI options. Let me know what you think and I'll implement it.

@majormoses
Copy link
Member

majormoses commented Jan 11, 2019

Ya I think we should expose those options, we could either do this as one or more changes

Keeping Timeout.timeout() will always mask the DNS failures

We could use the ruby resolver rather than relying on the OS resolver so we can rescue those. We could expose the DNS server to use so that it could be set to still use the local resolver that may be in place (such as dnsmasq running as a cache and forward lookup in front of your service discovery such as consul) which would allow us to keep the high level circuit breaker, I am not opposed to removing it entirely I just want to find a path forward that serves everyone's needs. Documentation: https://ruby-doc.org/stdlib-2.4.0/libdoc/resolv/rdoc/Resolv/DNS.html But yes I agree that anytime you use Timeout.timout() it will obscure the real error maybe we could add a --debug flag that turns off that timeout 🤔 actually once they are separate flags we should be good because you just need to set the timeout to 0 and it will sit around forever until wither an exception is thrown or it completes successfully.

@johanek johanek force-pushed the feature/remove-timeout branch from 5068346 to 9c1dce1 Compare January 14, 2019 12:46
@johanek
Copy link
Contributor Author

johanek commented Jan 14, 2019

Hi,

That's really interesting that you can use the ruby resolver instead, and set the timeouts via that. I didn't know you could do that.

So I've updated the PR accordingly to use that method, and to add the open and read timeout values via CLI options. If the DNS request times out, we can't catch that error directly as it's caught by Net::HTTP. But the resulting error is much more useful for debugging:

CheckHttp CRITICAL: Request error: Failed to open TCP connection to testing.com:443 (Hostname not known: testing.com)

I set the DNS timeout to 800ms, so that it's always lower than the lowest value for --timeout. But perhaps this too should be configurable?

@johanek johanek force-pushed the feature/remove-timeout branch from 9c1dce1 to e56904e Compare January 14, 2019 12:51
@johanek
Copy link
Contributor Author

johanek commented Jan 14, 2019

The changed resolver behaviour has broken some tests, I'll look into that

@majormoses
Copy link
Member

You can rescue the ruby DNS timeout via https://ruby-doc.org/stdlib-2.4.0/libdoc/resolv/rdoc/Resolv/ResolvTimeout.html additionally you can use https://ruby-doc.org/stdlib-2.4.0/libdoc/resolv/rdoc/Resolv/ResolvError.html as a more generic DNS resolution exception class.

Copy link
Member

@majormoses majormoses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better this now no longer is really a breaking change, I do have some additional suggestions/improvements to make. Would you be willing to add a test for a DNs timeout? We should be able to trigger it to setting a ridiculously low timeout (something like 0.0001 seconds) here: https://github.com/sensu-plugins/sensu-plugins-http/blob/master/test/integration/helpers/serverspec/check-http-shared_spec.rb if not I can look at writing it once you have everything else updated.

@@ -250,10 +263,19 @@ def run
config[:port] ||= config[:ssl] ? 443 : 80
end

# Use Ruby DNS Resolver and set DNS resolution timeout to 800ms
dns_resolver = Resolv::DNS.new
dns_resolver.timeouts = 0.8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya let's expose that as an option so its configurable and make sure to cast it as a float.

CHANGELOG.md Outdated
@@ -4,6 +4,9 @@ This project adheres to [Semantic Versioning](http://semver.org/).
This CHANGELOG follows the format listed [here](https://github.com/sensu-plugins/community/blob/master/HOW_WE_CHANGELOG.md)

## [Unreleased]
### Changed
- `check-http.rb`: Add options to set open-timeout and read-timeout for Net:HTTP. Improve output on what Net::HTTP timeout was encountered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally I'd like to see the option names being called out here so someone reading this knows exactly what to update to leverage.

For example:

## [Unreleased]
### Added
- `check-http.rb`: Add options to set `--open-timeout` and `--read-timeout` for Net:HTTP. Additionally rescue `Net::OpenTimeout` and `Net::ReadTimeout` exception classes (@mygithubuser)
- `check-http.rb`: exposed `--dns-timeout` and rescue on `Resolv::ResolvTimeout` exception class (@mygithubuser)

### Changed
- `check-http.rb`: switched to using rubies DNS resolver (which looks at `/etc/resolv.conf`) and rescue `Resolv::ResolvError` exception class for generic DNS failures (@mygithubuser)

@@ -250,10 +263,19 @@ def run
config[:port] ||= config[:ssl] ? 443 : 80
end

# Use Ruby DNS Resolver and set DNS resolution timeout to 800ms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment when you expose it as a var.

@@ -20,3 +20,5 @@ mkmf.log
#Intellij files
*.iml
*.idea

vendor/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@majormoses
Copy link
Member

It also appears that this breaks DNS resolution according to the tests which is curious because it mostly uses localhost, I have not used the DefaultResolver before it feels like it should work from a quick test locally:

$ irb
irb(main):001:0> require 'resolv-replace'
=> true
irb(main):002:0> dns_resolver = Resolv::DNS.new
=> #<Resolv::DNS:0x0055aa6d6800e8 @mutex=#<Thread::Mutex:0x0055aa6d6800c0>, @config=#<Resolv::DNS::Config:0x0055aa6d680098 @mutex=#<Thread::Mutex:0x0055aa6d680070>, @config_info=nil, @initialized=nil, @timeouts=nil>, @initialized=nil>
irb(main):003:0> dns_resolver.timeouts = 0.001
=> 0.001
irb(main):004:0> Resolv::DefaultResolver.replace_resolvers([dns_resolver])
=> [#<Resolv::DNS:0x0055aa6d6800e8 @mutex=#<Thread::Mutex:0x0055aa6d6800c0>, @config=#<Resolv::DNS::Config:0x0055aa6d680098 @mutex=#<Thread::Mutex:0x0055aa6d680070>, @config_info=nil, @initialized=nil, @timeouts=[0.001]>, @initialized=nil>]
irb(main):006:0> Resolv.getaddress('localhost')
=> "127.0.0.1"

I will need to take a closer look at why its failing in the tests.

@majormoses
Copy link
Member

I was not able to replicate this behavior in a container but I did notice that in the container it returned an ipv6 instead of an ipv4 address:

$ docker run -it ruby:2.4.1-slim
irb(main):001:0> require 'resolv-replace'
=> true
irb(main):002:0> dns_resolver = Resolv::DNS.new
=> #<Resolv::DNS:0x00000001cc7698 @mutex=#<Thread::Mutex:0x00000001cc7670>, @config=#<Resolv::DNS::Config:0x00000001cc7648 @mutex=#<Thread::Mutex:0x00000001cc7620>, @config_info=nil, @initialized=nil, @timeouts=nil>, @initialized=nil>
irb(main):003:0> dns_resolver.timeouts = 0.001
=> 0.001
irb(main):004:0> Resolv.getaddress('localhost')
=> "::1"

I don't think this is an ipv4 vs 6 issue but I will need to do some more testing to figure out whats going on.

@majormoses
Copy link
Member

majormoses commented Jan 15, 2019

Interestingly enough it seems to fail with the same error about localhost not being resolved even when specifying the ipv4 loopback address:

kitchen@27364eb91750:~$ /usr/local/bin/check-http.rb --url http://127.0.0.1/okay
Check failed to run: no address for localhost, ["/usr/local/lib/ruby/2.4.0/resolv.rb:95:in `getaddress'", "/usr/local/lib/ruby/2.4.0/resolv.rb:45:in `getaddress'", "/usr/local/lib/ruby/gems/2.4.0/gems/sensu-plugins-http-4.0.0/bin/check-http.rb:270:in `run'", "/usr/local/lib/ruby/gems/2.4.0/gems/sensu-plugin-3.0.1/lib/sensu-plugin/cli.rb:57:in `block in <class:CLI>'"]

Outside of the code it works as expected:

kitchen@27364eb91750:~$ curl -I localhost/okay
HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Tue, 15 Jan 2019 05:18:57 GMT
Content-Type: application/octet-stream
Content-Length: 0
Connection: keep-alive

@majormoses
Copy link
Member

majormoses commented Jan 15, 2019

I need to step away to regain some sanity but I am not 100% sure if its caused by the tests being run in docker with a broken ipv6 or if it breaks with all ipv6 setups. I assume its the former, I don't really do anything with ipv6 so I am not sure where to go from here. If you have a hostname that resolves to ipv6 and can validate that it works that would be appreciated and then I can figure out how to disable ipv6 in the container (probably easiest would be to just comment it out like I did while messing around).

kitchen@27364eb91750:~$ ping6 localhost
PING localhost (::1): 56 data bytes
ping6: sending packet: Cannot assign requested address
kitchen@27364eb91750:~$ ping localhost
PING localhost (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.037 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.060 ms
^C--- localhost ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.037/0.050/0.060/0.000 ms
# I commented the ipv6 addresses and it worked
kitchen@27364eb91750:~$ grep -rn '::1' /etc/hosts
2:#::1	localhost ip6-localhost ip6-loopback
5:ff02::1	ip6-allnodes

kitchen@27364eb91750:~$ /usr/local/bin/check-http.rb --url http://localhost/okay
"127.0.0.1"
"127.0.0.1"
CheckHttp OK: 200, 0 bytes

@majormoses
Copy link
Member

Here is the current diff of my changes/debug code:

$ git diff
diff --git a/bin/check-http.rb b/bin/check-http.rb
index cda472b..5d0fd0b 100755
--- a/bin/check-http.rb
+++ b/bin/check-http.rb
@@ -264,14 +264,26 @@ class CheckHttp < Sensu::Plugin::Check::CLI
     end
 
     # Use Ruby DNS Resolver and set DNS resolution timeout to 800ms
+    hosts_resolver = Resolv::Hosts.new
     dns_resolver = Resolv::DNS.new
     dns_resolver.timeouts = 0.8
-    Resolv::DefaultResolver.replace_resolvers([dns_resolver])
+    Resolv::DefaultResolver.replace_resolvers(
+      [
+        hosts_resolver,
+        dns_resolver,
+      ]
+    )
+    resolved_hostname = Resolv.getaddress(config[:host])
+    p resolved_hostname
 
     begin
       Timeout.timeout(config[:timeout]) do
-        acquire_resource
+        acquire_resource(resolved_hostname)
       end
+    rescue Resolv::ResolvTimeout => e
+      critical "Unable to resolve DNS in a timely manner: #{e}"
+    rescue Resolv::ResolvError => e
+      critical "Unable to resolve DNS: #{e}"
     rescue Net::OpenTimeout
       critical 'Request timed out opening connection'
     rescue Net::ReadTimeout
@@ -283,23 +295,26 @@ class CheckHttp < Sensu::Plugin::Check::CLI
     end
   end
 
-  def acquire_resource
+  def acquire_resource(endpoint)
     http = nil
-
+    # if endpoint.start_with?('::')
+    #   endpoint = "\'[#{endpoint}]\'"
+    # end
+    p endpoint
     if config[:no_proxy]
-      http = Net::HTTP.new(config[:host], config[:port], nil, nil)
+      http = Net::HTTP.new(endpoint, config[:port], nil, nil)
     elsif config[:proxy_url]
       proxy_uri = URI.parse(config[:proxy_url])
       if proxy_uri.host.nil?
         unknown 'Invalid proxy url specified'
       end
       http = if proxy_uri.user && proxy_uri.password
-               Net::HTTP.new(config[:host], config[:port], proxy_uri.host, proxy_uri.port, proxy_uri.user, proxy_uri.password)
+               Net::HTTP.new(endpoint, config[:port], proxy_uri.host, proxy_uri.port, proxy_uri.user, proxy_uri.password)
              else
-               Net::HTTP.new(config[:host], config[:port], proxy_uri.host, proxy_uri.port)
+               Net::HTTP.new(endpoint, config[:port], proxy_uri.host, proxy_uri.port)
              end
     else
-      http = Net::HTTP.new(config[:host], config[:port])
+      http = Net::HTTP.new(endpoint, config[:port])
     end
     http.read_timeout = config[:read_timeout]
     http.open_timeout = config[:open_timeout]
diff --git a/test/fixtures/bootstrap.sh b/test/fixtures/bootstrap.sh
index 8816812..048dd01 100755
--- a/test/fixtures/bootstrap.sh
+++ b/test/fixtures/bootstrap.sh
@@ -37,6 +37,7 @@ fi
 echo "
   server {
     listen 80;
+    listen [::]:80 ipv6only=on;
 
     location /okay {
       limit_except GET {

@johanek
Copy link
Contributor Author

johanek commented Jan 15, 2019

I started working on this but ran out of time, I won't have a chance to pick it up again for a couple of days. But I commited and pushed the current WIP

  • Added your diff
  • Set timeout for dns resolution
  • Updated comments
  • Added serverspec test for DNS timeout

I've also added a 'Host' header to Net::HTTP requests as otherwise the request may not succeed for a virtual host, given we are now making the HTTP request via IP address.

I also had problems with localhost resolving to ::1 and the rests failing - I didn't get to the bottom of this, but I suspect this may be a docker for mac issue. I wanted to see what happened when the tests were run by travis. I "fixed" them by updating /etc/hosts in the docker container, and made sure that the hostname localhost is only used for 127.0.0.1 and not also for ::1.

The test I added to validate DNS timeouts isn't working, the resolution doesn't timeout because the lookup for localhost comes from the host resolver. I tried using google.com and it succeeded too. How can we simulate a DNS timeout in these conditions? I was looking at using tc to simulate a network delay but this requires more config to work successfully within docker containers.

@majormoses
Copy link
Member

Probably need to use some kind of unresponsive server for example blackhole.webpagetest.org see: https://serverfault.com/questions/776049/how-to-simulate-dns-server-response-timeout

@johanek johanek force-pushed the feature/remove-timeout branch from e075be6 to 36dd335 Compare January 17, 2019 13:00
@johanek
Copy link
Contributor Author

johanek commented Jan 17, 2019

Hi,

I've made some more updates:

Updated tests to use 127.0.0.1 instead of localhost. This is to get around the ruby resolver (correctly) preferring the ipv6 localhost address over ipv4. This could also be solved by updating /etc/hosts within the container to only use localhost for the ipv4 address.

I removed the suggested step to manually resolve the hostname and pass the IP address to acquire_resource. This didn't work with https connections - establishing the SSL connection requires the hostname to be used so it can be compared against the certificate. Otherwise the check would fail with:

CheckHttp CRITICAL: Request error: SSL_connect returned=1 errno=0 state=error: certificate verify failed

So I don't think we can explicitly capture the DNS lookup failures. However, the error returned from the check is still useful and incidicative of the problem, i.e.

CheckHttp CRITICAL: Request error: Failed to open TCP connection to blackhole.webpagetest.org

The DNS timeout test is working when setting a suitably low resolution timeout. I've also updated the changelog as requested.

I think this is the best we can do with this, but perhaps there's a better solution for the localhost issue with the tests.

Thanks!

@majormoses
Copy link
Member

@johanek awesome I will take a closer look this evening, the other option I was thinking about was updating the bootstrap script to run a sed to comment out that line from /etc/hosts but I think I like your solution better. I think we probably want to put a comment there to indicate this as a year from now there is about a 3% chance I will remember why we needed to do that. I can add that comment for you if we are good to go.

@johanek
Copy link
Contributor Author

johanek commented Jan 18, 2019

I'm happy with where this is now, thanks for your feedback and suggestions to improve my original request 👍

@majormoses majormoses merged commit b7ed5e9 into sensu-plugins:master Feb 17, 2019
@majormoses
Copy link
Member

@johanek thanks for all your awesome work and being willing to explore the better vs the easier path with me. Apologies on the delay for getting this merged and released, I had a lot going on during that time frame and things should be back to whatever "normal" means. I have cut a release and when it makes it through the CI/CD pipeline and creates an artifact I will update here.

@majormoses majormoses changed the title Remove use of Timeout.timeout() as it obscures DNS lookup errors. Cap… Improve Timeout handling and improves obscuring of DNS failure messages being swallowed by the base timeout (circuit breaker) Feb 17, 2019
@johanek
Copy link
Contributor Author

johanek commented Feb 17, 2019

Thanks for your assistance @majormoses, it was a pleasure working through it with you 😄

@majormoses
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants