Monitoring - Select a Monitoring Platform #4124

Closed
acozine opened this issue Aug 8, 2023 · 4 comments
Labels: Epic, Operations (pulls issues into the Operations ZenHub board), post-incident (created from a post-incident meeting)

Comments

acozine commented Aug 8, 2023

Develop requirements for monitoring.

Research product options. Our ideal product:

  • is affordable and/or open-source
  • is mature / has a track record
  • monitors Linux, macOS, and Windows operating systems
  • monitors for hardware issues (bad disks, power systems, network cards, etc.)
  • monitors on-prem and cloud resources
  • sends alerts via Slack, email, and SMS
  • sends timely alerts that reflect the progression of the problem (no alert fatigue)
  • is sensitive enough to provide good warnings
  • also sends alerts/notifications of recoveries
  • lets us automate installation of all elements of the system (server, agent if relevant, other elements)
  • lets us automate additions, deletions, and updates to the configuration (what gets monitored, at what level, etc.)
  • keeps all configuration under source control
  • incorporates role-based access controls, so everyone in the library gets appropriate views and alerts
  • uses SSO in some form, preferably Shibboleth, for the GUI
  • provides a good GUI for user views
  • provides a CLI for maintenance of the system itself - restarting, etc.
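
To make the alerting requirements above concrete (alerts that follow the progression of a problem, no alert fatigue, and notifications on recovery), here is a minimal sketch of notify-on-state-change logic. It is not how any of the candidate products implements notifications; the `AlertTracker` class, the state names, and the message formats are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical severity ordering for this sketch; real products define their own states.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


@dataclass
class AlertTracker:
    """Notify only when a check's state changes, including recoveries."""

    # Maps check name -> last observed state. Unseen checks are assumed OK.
    last_state: dict = field(default_factory=dict)

    def observe(self, check: str, state: str):
        """Return a notification string on a state change, else None."""
        if state not in SEVERITY:
            raise ValueError(f"unknown state: {state}")
        previous = self.last_state.get(check, "OK")
        self.last_state[check] = state
        if state == previous:
            return None  # unchanged state: stay quiet to avoid alert fatigue
        if state == "OK":
            return f"RECOVERY: {check} is OK (was {previous})"
        if SEVERITY[state] > SEVERITY[previous]:
            return f"ESCALATION: {check} is {state} (was {previous})"
        return f"IMPROVING: {check} is {state} (was {previous})"


if __name__ == "__main__":
    tracker = AlertTracker()
    for state in ["OK", "WARNING", "WARNING", "CRITICAL", "OK"]:
        message = tracker.observe("disk /var", state)
        if message:
            print(message)
```

Repeated observations of the same state produce no message, so a long-running problem generates one alert per change in severity plus one recovery notice.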

Develop list of services, etc., to monitor.

acozine added the Epic label Aug 8, 2023

acozine commented Aug 8, 2023

The current front-runner is Icinga. It's based on Nagios, so it's familiar, and it's open-source. We're not sure yet whether it can monitor cloud resources.

Other options:

  • CheckMK can monitor cloud resources with a paid account. We could do a free trial to test it out.
  • Centreon only runs on RHEL.
  • Prometheus is cloud-native.
  • Zabbix uses Java, and we have no local expertise there.

acozine commented Aug 10, 2023

Related documents and spreadsheets on Google Drive:

acozine added the Operations label (pulls issues into the Operations ZenHub board) Sep 11, 2023
acozine added the post-incident label (created from a post-incident meeting) Oct 19, 2023

acozine commented Oct 19, 2023

Related to the following post-incident reviews:

acozine changed the title from "Monitoring Epic" to "Monitoring - Select a Monitoring Platform" Jan 6, 2025

acozine commented Jan 21, 2025

We have selected CheckMK.

acozine closed this as completed Jan 21, 2025