Error handling; don't panic. #61

Merged
merged 15 commits into from
Jan 16, 2024
Collaborator

@puffitos puffitos commented Jan 5, 2024

Motivation

Addresses #53

Still a WIP. I just wanted to put everything together as documentation so we can discuss this together next week.

Changes

A friendly guide for the changes:

  • No panics, only errors
  • Two new channels for sparrow: one to handle errors and one to signal end of life
  • All sparrow components start as goroutines in Run and write their errors into the error channel (when something irrecoverable happens)
  • The errors are handled by a separate goroutine (handleErrors), which may be extended to handle errors in other ways (currently only fatal errors are expected in the error channel)
  • handleErrors gracefully shuts down all sparrow components (if possible) and returns the error(s) that led to the shutdown, along with any other errors that happened during shutdown
  • Each component should (ideally must) have a shutdown function so that it can be terminated gracefully

Additionally, I couldn't resist the urge and did the following (sorry!):

  • Added the source to the logger, so we know where the log is produced
  • Removed the bulky and unhelpful withGroup from all loggers

Tests done

  • Normal run without interruptions with local config
  • Normal run without interruptions with remote config
  • Running with a context that lasts only 30 seconds → routines shut down and sparrow exits with 1
  • Running with a misconfigured sparrow config (wrong filepath) → api starts and shuts down immediately → sparrow exits with 1

Normal run, metrics from remote config

# HELP sparrow_health_up Health of targets
# TYPE sparrow_health_up gauge
sparrow_health_up{target="https://caas-max-sparrow.caas-t02.telekom.de"} 1
sparrow_health_up{target="https://caas-max-sparrow.caas-t02.telekom.de/checks/health"} 1
sparrow_health_up{target="https://caas-max-sparrow.caas-t21.telekom.de"} 1
sparrow_health_up{target="https://gitlab.devops.telekom.de"} 1
sparrow_health_up{target="https://www.google.com/"} 1
# HELP sparrow_latency_count Count of latency checks done
# TYPE sparrow_latency_count counter
sparrow_latency_count{target="https://caas-max-sparrow.caas-t02.telekom.de"} 15
sparrow_latency_count{target="https://caas-max-sparrow.caas-t21.telekom.de"} 15
sparrow_latency_count{target="https://example.com/"} 14
sparrow_latency_count{target="https://gitlab.devops.telekom.de"} 14
sparrow_latency_count{target="https://google.com/"} 14
sparrow_latency_count{target="https://yam.telekom.de"} 14
# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.005"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.01"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.025"} 11
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.05"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.1"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.25"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="1"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="2.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="10"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="+Inf"} 15
sparrow_latency_duration_sum{target="https://caas-max-sparrow.caas-t02.telekom.de"} 0.5367250109999999
sparrow_latency_duration_count{target="https://caas-max-sparrow.caas-t02.telekom.de"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.005"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.01"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.05"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.1"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.25"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="1"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="2.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="10"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="+Inf"} 15
sparrow_latency_duration_sum{target="https://caas-max-sparrow.caas-t21.telekom.de"} 0.54715294
sparrow_latency_duration_count{target="https://caas-max-sparrow.caas-t21.telekom.de"} 15
sparrow_latency_duration_bucket{target="https://example.com/",le="0.005"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.01"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.025"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.05"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.1"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.25"} 12
sparrow_latency_duration_bucket{target="https://example.com/",le="0.5"} 12
sparrow_latency_duration_bucket{target="https://example.com/",le="1"} 13
sparrow_latency_duration_bucket{target="https://example.com/",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://example.com/",le="5"} 14
sparrow_latency_duration_bucket{target="https://example.com/",le="10"} 14
sparrow_latency_duration_bucket{target="https://example.com/",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://example.com/"} 2.9955032790000002
sparrow_latency_duration_count{target="https://example.com/"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.005"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.01"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.025"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.05"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.1"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.25"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.5"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="1"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="5"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="10"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://gitlab.devops.telekom.de"} 3.6277407059999995
sparrow_latency_duration_count{target="https://gitlab.devops.telekom.de"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.25"} 13
sparrow_latency_duration_bucket{target="https://google.com/",le="0.5"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="1"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="5"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="10"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://google.com/"} 2.066476341
sparrow_latency_duration_count{target="https://google.com/"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.25"} 11
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.5"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="1"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="5"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="10"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://yam.telekom.de"} 2.8014992370000003
sparrow_latency_duration_count{target="https://yam.telekom.de"} 14
# HELP sparrow_latency_duration_seconds Latency with status information of targets
# TYPE sparrow_latency_duration_seconds gauge
sparrow_latency_duration_seconds{status="0",target="https://caas-max-sparrow.caas-t02.telekom.de"} 0
sparrow_latency_duration_seconds{status="0",target="https://caas-max-sparrow.caas-t21.telekom.de"} 0
sparrow_latency_duration_seconds{status="0",target="https://example.com/"} 0
sparrow_latency_duration_seconds{status="0",target="https://gitlab.devops.telekom.de"} 0
sparrow_latency_duration_seconds{status="0",target="https://google.com/"} 0
sparrow_latency_duration_seconds{status="0",target="https://yam.telekom.de"} 0
sparrow_latency_duration_seconds{status="200",target="https://caas-max-sparrow.caas-t02.telekom.de"} 0.0246382
sparrow_latency_duration_seconds{status="200",target="https://caas-max-sparrow.caas-t21.telekom.de"} 0.031552157
sparrow_latency_duration_seconds{status="200",target="https://example.com/"} 0.12981956
sparrow_latency_duration_seconds{status="200",target="https://gitlab.devops.telekom.de"} 0.319594274
sparrow_latency_duration_seconds{status="200",target="https://google.com/"} 0.153979783
sparrow_latency_duration_seconds{status="418",target="https://yam.telekom.de"} 0.212991898

Cancel context after 30 secs

A timeout context was set in the top-level sparrow.Run(ctx) call. After running for a bit, we get the following logs. Note the many errors in the various checks, the api shutdown and the target manager shutdown. Finally, sparrow exits with 1 as it should :)

{"time":"2024-01-05T18:54:15.235186838+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":147},"msg":"Successfully fetched all target files","files":8}
{"time":"2024-01-05T18:54:16.622747226+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run","file":"/home/bbressi/dev/repos/sparrow/pkg/config/http.go","line":71},"msg":"Successfully got remote runtime configuration"}
{"time":"2024-01-05T18:54:20.957935964+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).handleErrors","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":330},"msg":"Context done, shutting down sparrow"}
{"time":"2024-01-05T18:54:20.957941044+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://google.com/","error":"Get \"https://www.google.com/\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959002788+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":257},"msg":"Failed to run check","name":"health","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959042803+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://example.com/","error":"Get \"https://example.com/\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959082609+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).fetchFile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":171},"msg":"Failed to fetch file","file":"sparrow-dev-cool.de.json","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959101665+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":142},"msg":"Failed fetching files","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959088671+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://yam.telekom.de","error":"Get \"https://yam-united.telekom.com/pages\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959117124+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).refreshTargets","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":203},"msg":"Failed to update global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959044988+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://gitlab.devops.telekom.de","error":"Get \"https://gitlab.devops.telekom.de/users/sign_in\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959173922+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Shutdown","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":149},"msg":"Stopping gitlab reconciliation routine"}
{"time":"2024-01-05T18:54:20.959185023+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.(*Latency).Run","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":90},"msg":"Context canceled","err":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959235329+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":257},"msg":"Failed to run check","name":"latency","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959233375+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).api.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/api.go","line":86},"msg":"Failed to serve api","error":"http: Server closed"}
{"time":"2024-01-05T18:54:20.959139457+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":99},"msg":"Failed to get global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959275775+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).handleErrors","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":335},"msg":"Error in sparrow component, shutting down","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959285975+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":94},"msg":"Gitlab target reconciliation ended"}
{"time":"2024-01-05T18:54:20.95933082+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/bbressi/dev/repos/sparrow/cmd/run.go","line":123},"msg":"Error while running sparrow","error":"sparrow was shut down"}

TODO

  • I've assigned this PR to myself
  • I've labeled this PR correctly

- feat: added done and error channels in sparrow
- feat: added global shutdown function for main components
- chore: edited most functions to return errors and not panic
Signed-off-by: Bruno Bressi <[email protected]>
Signed-off-by: Bruno Bressi <[email protected]>
@puffitos puffitos self-assigned this Jan 5, 2024
@puffitos puffitos added feature Introduces a new feature refactoring Refactoring of existing code labels Jan 5, 2024
@puffitos puffitos added this to the 0.3.0 milestone Jan 12, 2024
@puffitos puffitos changed the title [WIP] error handling Error handling; don't panic. Jan 12, 2024
@puffitos
Collaborator Author

Ready for review.

Signed-off-by: Bruno Bressi <[email protected]>
Collaborator

@lvlcn-t lvlcn-t left a comment

LGTM, thanks for the call

@puffitos puffitos merged commit d2de615 into main Jan 16, 2024
9 checks passed
@lvlcn-t lvlcn-t deleted the refactor/sparrow-shutdowns branch January 16, 2024 16:26
@lvlcn-t lvlcn-t mentioned this pull request Jan 30, 2024
4 tasks
Labels
feature Introduces a new feature refactoring Refactoring of existing code
4 participants