diff --git a/docs/src/postmortem/2024-07-27.md b/docs/src/postmortem/2024-07-27.md new file mode 100644 index 00000000..7508a448 --- /dev/null +++ b/docs/src/postmortem/2024-07-27.md @@ -0,0 +1,97 @@ +# Post Mortem 2024-07-27 + +## Incident summary + +For an unknown amount of time, the service extracting manual pages from +packages has been encountering several errors, which caused the +man.voidlinux.org interface to serve outdated manual pages. The root +cause is suspected to be either the out-of-memory events during the +builds of the llvm17 and llvm18 packages on 2024-07-02 to 2024-07-04 +or the disk failure on a-fsn-de on 2024-06-18. + +## Leadup + +At this time we do not know what caused the these issues, but there were +two recent incidents on the service host that are likely candidates: + + * the disk failure event on 2024-06-18 could have caused some data corruption + if a write was happening when the disk failed. The host in question uses + BTRFS in RAID1, which should have mitigated this, but it is possible + some corruption slipped through. + * the out-of-memory events on 2024-07-02 to 2024-07-04 could have caused + some data corruption if a write was happening when the host was hard-rebooted. + +## Fault + +Several manual pages in the `openssl-doc` package had formed a symlink loop. +`xmandump` could not handle this and panicked: + +``` +2024-07-27T04:22:44.298Z ERROR xmandump/main.go:442 Error processing package file {"file": "/mirror/current/openssl-doc-3.3.1_1.x86_64.xbps", "pkgfile": "./usr/share/man/man3/OpenSSL_version.3ssl", "error": "open man3/OpenSSL_version.3ssl.gz: too many levels of symbolic links"} +``` + +Additionally, several manual pages in other packages (`man-pages-posix` and +either `clang-tools-extra17` or `clang-tools-extra18`) were corrupted and +caused `makewhatis` to segfault: + +``` +#0 __strcasecmp_l_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:283 +#1 0x0000555e081f0bd3 in mlink_check (mlink=0x555e17eb79b0, mpage=0x555e15502070) at ./mandocdb.c:1141 +#2 mpages_merge (dba=dba@entry=0x555e0bf427d0, mp=mp@entry=0x555e098142b0) at ./mandocdb.c:1296 +#3 0x0000555e081f1d51 in mandocdb (argc=, argv=) at ./mandocdb.c:516 +#4 0x00007fed5894cc4c in __libc_start_call_main (main=main@entry=0x555e081d3c50
, argc=argc@entry=4, argv=argv@entry=0x7ffef428cba8) + at ../sysdeps/nptl/libc_start_call_main.h:58 +#5 0x00007fed5894cd05 in __libc_start_main_impl (main=0x555e081d3c50
, argc=4, argv=0x7ffef428cba8, init=, + fini=, rtld_fini=, stack_end=0x7ffef428cb98) at ../csu/libc-start.c:360 +#6 0x0000555e081d5cb1 in _start () at ../sysdeps/x86_64/start.S:115 +``` + +## Impact + +Publicly Visible: + + * Manual pages served by man.voidlinux.org were outdated or missing + +Internally Visible: + + * The xmandump service was logging multiple failures + +## Detection + +When looking at the river(1) manpage on man.voidlinux.org, @classabbyamp +noticed the page did not match the one in the upstream repository for the +packaged version. Upon further investigation of the xmandump service logs, +it was found that it was not completing successfully. + +## Response + +To investigate, local testing was initially performed, which could not +reproduce the issue. To debug in-situ, a nomad task was added to run +a container in a matching environment to the xmandump service. From there, +the causes of the failure were discovered. + +## Recovery + +@classabbyamp removed the manual pages causing issues and re-ran the xmandump +service, which allowed the system to recover. One package was rebuilt to +regenerate possibly-corrupted manual pages. Several packages which had +possibly-corrupted manual pages were removed from from the `cache.json` files +so `xmandump` would extract those manual pages again. + +## What went well + + * Nomad allowed for easy reproduction of the production environment for debugging the service. + +## What could be done better + + * Error handling in xmandump could be more robust. + +## Lessons learned + + * Services like xmandump should have healthchecks and status indications so + the monitoring infrastructure can notify when they are unhealthy. + +## Timeline + +No timeline is provided for this incident. +