Add downtime incident log #530

Closed
DilwoarH opened this issue Oct 9, 2024 · 1 comment

@DilwoarH
Contributor

DilwoarH commented Oct 9, 2024

Slack thread: https://tpximpact.slack.com/archives/C077QAHM8TB/p1728406581259469

DilwoarH self-assigned this Oct 9, 2024
DilwoarH converted this from a draft issue Oct 9, 2024
@DilwoarH
Contributor Author

digital-land/technical-documentation#53

Outage - Submit Service - 2024-10-08

In attendance

  • Providers team
  • SM

Description

The outage affected the Submit Service, causing users to experience 502 and 503 errors, with the landing page becoming unresponsive. The root cause was identified as a bug in handling undefined organisation IDs, which led to server crashes when users refreshed the error page. The issue persisted until a fix was developed and deployed.

Running log

  • 17:56 Action: Service down reported.
  • 17:57 Observation: DH posted a Sentry issue showing a "Cannot set headers after they are sent to the client" error.
  • 17:59 Observation: DH and GG reported 502 and 503 errors.
  • 18:02 Observation: DH confirmed that the last merged PR didn’t seem to be the cause of the issue (ref: PR Fix bug with issue count not showing on overview page #527).
  • 18:04 Observation: DH experienced the landing page being down.
  • 18:05 Observation: GG confirmed the server was crashing.
  • 18:06 Action: DH began checking out the code and started debugging locally.
  • 18:10 Observation: DH reproduced the error locally.
  • 18:11 Observation: DH and GG confirmed that the issue was caused by undefined organisation IDs.
  • 18:12 Observation: SM suggested setting up alerts for such incidents.
  • 18:12 Observation: DH noticed that after the error page loads once, refreshing it brings the server down.
  • 18:20 Action: DH started working on a fix.
  • 18:27 Action: DH raised a new PR with the fix (ref: PR Hotfix: Fixes an issue where missing org could down the server if it's run in parallel #529).
  • 18:28 Action: GG approved the PR.
  • 18:29 Observation: DH and GG identified the parallel middleware feature as the root cause of the issue.
  • 18:45 Action: Deployment started for the fix.
  • 18:47 Observation: Replacement service was running and accepting 20% of traffic.
  • 18:50 Observation: Replacement service accepting 100% of traffic.
  • 18:51 Observation: DH confirmed that the fix was deployed successfully, and the issue was resolved.

Postmortem

The outage was caused by a bug in handling undefined organisation IDs, leading to a "Cannot set headers after they are sent to the client" error. This was compounded by a flaw in the parallel middleware feature, which caused server crashes whenever users refreshed the error page.
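To illustrate the failure mode described above, here is a simplified, synchronous model of what can happen when several middleware steps run without stopping after the first response. It is a sketch only: `FakeResponse`, `loadOrganisation`, `loadDatasets` and `runAllNaive` are hypothetical names, not code from the Submit Service, and the real service may have failed along a different path.

```typescript
// Hypothetical stand-ins; none of these names come from the Submit Service code.
type Req = { orgId?: string };

class FakeResponse {
  headersSent = false;
  status = 0;
  body = "";

  // Mimics Node rejecting a second response on the same request.
  send(status: number, body: string): void {
    if (this.headersSent) {
      throw new Error("Cannot set headers after they are sent to the client");
    }
    this.headersSent = true;
    this.status = status;
    this.body = body;
  }
}

type Middleware = (req: Req, res: FakeResponse) => void;

// Two independent steps that each treat a missing organisation as fatal.
const loadOrganisation: Middleware = (req, res) => {
  if (req.orgId === undefined) res.send(404, "organisation not found");
};
const loadDatasets: Middleware = (req, res) => {
  if (req.orgId === undefined) res.send(404, "no datasets for organisation");
};

// Naive fan-out: every step runs even if an earlier one already responded.
function runAllNaive(steps: Middleware[], req: Req, res: FakeResponse): void {
  for (const step of steps) step(req, res);
}

// Returns the error message raised by the second response attempt, if any.
function demonstrateFailure(): string | null {
  const res = new FakeResponse();
  try {
    runAllNaive([loadOrganisation, loadDatasets], {}, res);
    return null;
  } catch (err) {
    return (err as Error).message;
  }
}
```

In this model the first step sends the error page and the second step then tries to respond again, which raises the same message that appeared in Sentry. If such an error is thrown where nothing catches it, the process can exit, which would be consistent with the crashes seen in the log.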

Once the root cause was identified, a fix was developed to properly handle undefined organisation IDs, preventing the crash from occurring. The fix was tested locally, reviewed, and deployed in a staggered manner (20% of traffic, then 100%) to restore the service without further disruption. The replacement service was accepting 100% of traffic at 18:50, 54 minutes after the outage was first reported at 17:56.
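One way to handle undefined organisation IDs is to validate the ID once, before any steps fan out, and to route every response through a helper that refuses to write twice. The sketch below assumes that approach; it is not the change made in PR #529, and `handle`, `sendOnce`, `newRes` and the `Step` type are hypothetical names.

```typescript
// Hypothetical names throughout; a model of the guard, not the deployed hotfix.
type Req = { orgId?: string };
type Res = { headersSent: boolean; status: number; body: string };
type Step = (req: Req & { orgId: string }) => string;

function newRes(): Res {
  return { headersSent: false, status: 0, body: "" };
}

// Write a response only if nothing has been sent yet; report whether it was written.
function sendOnce(res: Res, status: number, body: string): boolean {
  if (res.headersSent) return false;
  res.headersSent = true;
  res.status = status;
  res.body = body;
  return true;
}

// Validate the organisation ID once, before any step runs.
function handle(req: Req, res: Res, steps: Step[]): void {
  const orgId = req.orgId;
  if (orgId === undefined || orgId.trim() === "") {
    sendOnce(res, 404, "organisation not found");
    return; // no step runs, so no second response can be attempted
  }
  const validated = { ...req, orgId };
  try {
    const parts = steps.map((step) => step(validated));
    sendOnce(res, 200, parts.join("; "));
  } catch {
    // A failing step yields a single 500 instead of an uncaught error.
    sendOnce(res, 500, "internal error");
  }
}
```

The design choice here is that only `handle` decides what the response is; individual steps return data or throw, so a missing organisation cannot produce two competing error pages.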

Actions to Prevent Similar Incidents in the Future

  1. Improved error handling – We will implement more robust checks for undefined organisation IDs to prevent similar issues.
  2. Alerting system – An automated alert system will be introduced to notify the team when critical issues like 5xx errors occur.
  3. Middleware review – The parallel middleware feature will be thoroughly reviewed and tested to ensure stability under all conditions.
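As a starting point for the alerting action, the check below flags a window of recent requests when the share of 5xx responses passes a threshold. It is a minimal sketch under stated assumptions: the `Sample` and `AlertRule` shapes, the function names and any threshold values are hypothetical, and the issue does not say which monitoring tool the team will use.

```typescript
// Hypothetical shapes and thresholds; tune against real traffic before use.
type Sample = { timestampMs: number; status: number };

type AlertRule = {
  windowMs: number;      // how far back to look
  minRequests: number;   // ignore windows with too little traffic
  maxErrorRatio: number; // alert when the 5xx share exceeds this
};

function isServerError(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Returns true when the share of 5xx responses inside the window exceeds the rule.
function shouldAlert(samples: Sample[], nowMs: number, rule: AlertRule): boolean {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - rule.windowMs && s.timestampMs <= nowMs,
  );
  if (recent.length < rule.minRequests) return false;
  const errors = recent.filter((s) => isServerError(s.status)).length;
  return errors / recent.length > rule.maxErrorRatio;
}
```

The `minRequests` floor is there so that a single failed request during a quiet period does not page anyone; whether that trade-off suits this service is a judgment for the team.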

DilwoarH moved this from In Development to In Peer Review in Submit and update planning data service Oct 10, 2024
DilwoarH moved this from In Peer Review to Done in Submit and update planning data service Oct 21, 2024