Skip to content

Commit

Permalink
docs: add post-mortem for xmandump incident
Browse files Browse the repository at this point in the history
  • Loading branch information
classabbyamp authored and the-maldridge committed Aug 14, 2024
1 parent 7d613af commit 06f4648
Showing 1 changed file with 97 additions and 0 deletions.
97 changes: 97 additions & 0 deletions docs/src/postmortem/2024-07-27.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
# Post Mortem 2024-07-27

## Incident summary

For an unknown amount of time, the service extracting manual pages from
packages has been encountering several errors, which caused the
man.voidlinux.org interface to serve outdated manual pages. The root
cause is suspected to be either the out-of-memory events during the
builds of the llvm17 and llvm18 packages on 2024-07-02 to 2024-07-04
or the disk failure on a-fsn-de on 2024-06-18.

## Leadup

At this time we do not know what caused the these issues, but there were
two recent incidents on the service host that are likely candidates:

* the disk failure event on 2024-06-18 could have caused some data corruption
if a write was happening when the disk failed. The host in question uses
BTRFS in RAID1, which should have mitigated this, but it is possible
some corruption slipped through.
* the out-of-memory events on 2024-07-02 to 2024-07-04 could have caused
some data corruption if a write was happening when the host was hard-rebooted.

## Fault

Several manual pages in the `openssl-doc` package had formed a symlink loop.
`xmandump` could not handle this and panicked:

```
2024-07-27T04:22:44.298Z ERROR xmandump/main.go:442 Error processing package file {"file": "/mirror/current/openssl-doc-3.3.1_1.x86_64.xbps", "pkgfile": "./usr/share/man/man3/OpenSSL_version.3ssl", "error": "open man3/OpenSSL_version.3ssl.gz: too many levels of symbolic links"}
```

Additionally, several manual pages in other packages (`man-pages-posix` and
either `clang-tools-extra17` or `clang-tools-extra18`) were corrupted and
caused `makewhatis` to segfault:

```
#0 __strcasecmp_l_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:283
#1 0x0000555e081f0bd3 in mlink_check (mlink=0x555e17eb79b0, mpage=0x555e15502070) at ./mandocdb.c:1141
#2 mpages_merge (dba=dba@entry=0x555e0bf427d0, mp=mp@entry=0x555e098142b0) at ./mandocdb.c:1296
#3 0x0000555e081f1d51 in mandocdb (argc=<optimized out>, argv=<optimized out>) at ./mandocdb.c:516
#4 0x00007fed5894cc4c in __libc_start_call_main (main=main@entry=0x555e081d3c50 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffef428cba8)
at ../sysdeps/nptl/libc_start_call_main.h:58
#5 0x00007fed5894cd05 in __libc_start_main_impl (main=0x555e081d3c50 <main>, argc=4, argv=0x7ffef428cba8, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffef428cb98) at ../csu/libc-start.c:360
#6 0x0000555e081d5cb1 in _start () at ../sysdeps/x86_64/start.S:115
```

## Impact

Publicly Visible:

* Manual pages served by man.voidlinux.org were outdated or missing

Internally Visible:

* The xmandump service was logging multiple failures

## Detection

When looking at the river(1) manpage on man.voidlinux.org, @classabbyamp
noticed the page did not match the one in the upstream repository for the
packaged version. Upon further investigation of the xmandump service logs,
it was found that it was not completing successfully.

## Response

To investigate, local testing was initially performed, which could not
reproduce the issue. To debug in-situ, a nomad task was added to run
a container in a matching environment to the xmandump service. From there,
the causes of the failure were discovered.

## Recovery

@classabbyamp removed the manual pages causing issues and re-ran the xmandump
service, which allowed the system to recover. One package was rebuilt to
regenerate possibly-corrupted manual pages. Several packages which had
possibly-corrupted manual pages were removed from from the `cache.json` files
so `xmandump` would extract those manual pages again.

## What went well

* Nomad allowed for easy reproduction of the production environment for debugging the service.

## What could be done better

* Error handling in xmandump could be more robust.

## Lessons learned

* Services like xmandump should have healthchecks and status indications so
the monitoring infrastructure can notify when they are unhealthy.

## Timeline

No timeline is provided for this incident.

0 comments on commit 06f4648

Please sign in to comment.