-
Notifications
You must be signed in to change notification settings - Fork 28
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: add post-mortem for xmandump incident
- Loading branch information
1 parent
7d613af
commit 06f4648
Showing
1 changed file
with
97 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# Post Mortem 2024-07-27 | ||
|
||
## Incident summary | ||
|
||
For an unknown amount of time, the service extracting manual pages from | ||
packages has been encountering several errors, which caused the | ||
man.voidlinux.org interface to serve outdated manual pages. The root | ||
cause is suspected to be either the out-of-memory events during the | ||
builds of the llvm17 and llvm18 packages on 2024-07-02 to 2024-07-04 | ||
or the disk failure on a-fsn-de on 2024-06-18. | ||
|
||
## Leadup | ||
|
||
At this time we do not know what caused the these issues, but there were | ||
two recent incidents on the service host that are likely candidates: | ||
|
||
* the disk failure event on 2024-06-18 could have caused some data corruption | ||
if a write was happening when the disk failed. The host in question uses | ||
BTRFS in RAID1, which should have mitigated this, but it is possible | ||
some corruption slipped through. | ||
* the out-of-memory events on 2024-07-02 to 2024-07-04 could have caused | ||
some data corruption if a write was happening when the host was hard-rebooted. | ||
|
||
## Fault | ||
|
||
Several manual pages in the `openssl-doc` package had formed a symlink loop. | ||
`xmandump` could not handle this and panicked: | ||
|
||
``` | ||
2024-07-27T04:22:44.298Z ERROR xmandump/main.go:442 Error processing package file {"file": "/mirror/current/openssl-doc-3.3.1_1.x86_64.xbps", "pkgfile": "./usr/share/man/man3/OpenSSL_version.3ssl", "error": "open man3/OpenSSL_version.3ssl.gz: too many levels of symbolic links"} | ||
``` | ||
|
||
Additionally, several manual pages in other packages (`man-pages-posix` and | ||
either `clang-tools-extra17` or `clang-tools-extra18`) were corrupted and | ||
caused `makewhatis` to segfault: | ||
|
||
``` | ||
#0 __strcasecmp_l_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:283 | ||
#1 0x0000555e081f0bd3 in mlink_check (mlink=0x555e17eb79b0, mpage=0x555e15502070) at ./mandocdb.c:1141 | ||
#2 mpages_merge (dba=dba@entry=0x555e0bf427d0, mp=mp@entry=0x555e098142b0) at ./mandocdb.c:1296 | ||
#3 0x0000555e081f1d51 in mandocdb (argc=<optimized out>, argv=<optimized out>) at ./mandocdb.c:516 | ||
#4 0x00007fed5894cc4c in __libc_start_call_main (main=main@entry=0x555e081d3c50 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffef428cba8) | ||
at ../sysdeps/nptl/libc_start_call_main.h:58 | ||
#5 0x00007fed5894cd05 in __libc_start_main_impl (main=0x555e081d3c50 <main>, argc=4, argv=0x7ffef428cba8, init=<optimized out>, | ||
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffef428cb98) at ../csu/libc-start.c:360 | ||
#6 0x0000555e081d5cb1 in _start () at ../sysdeps/x86_64/start.S:115 | ||
``` | ||
|
||
## Impact | ||
|
||
Publicly Visible: | ||
|
||
* Manual pages served by man.voidlinux.org were outdated or missing | ||
|
||
Internally Visible: | ||
|
||
* The xmandump service was logging multiple failures | ||
|
||
## Detection | ||
|
||
When looking at the river(1) manpage on man.voidlinux.org, @classabbyamp | ||
noticed the page did not match the one in the upstream repository for the | ||
packaged version. Upon further investigation of the xmandump service logs, | ||
it was found that it was not completing successfully. | ||
|
||
## Response | ||
|
||
To investigate, local testing was initially performed, which could not | ||
reproduce the issue. To debug in-situ, a nomad task was added to run | ||
a container in a matching environment to the xmandump service. From there, | ||
the causes of the failure were discovered. | ||
|
||
## Recovery | ||
|
||
@classabbyamp removed the manual pages causing issues and re-ran the xmandump | ||
service, which allowed the system to recover. One package was rebuilt to | ||
regenerate possibly-corrupted manual pages. Several packages which had | ||
possibly-corrupted manual pages were removed from from the `cache.json` files | ||
so `xmandump` would extract those manual pages again. | ||
|
||
## What went well | ||
|
||
* Nomad allowed for easy reproduction of the production environment for debugging the service. | ||
|
||
## What could be done better | ||
|
||
* Error handling in xmandump could be more robust. | ||
|
||
## Lessons learned | ||
|
||
* Services like xmandump should have healthchecks and status indications so | ||
the monitoring infrastructure can notify when they are unhealthy. | ||
|
||
## Timeline | ||
|
||
No timeline is provided for this incident. | ||
|