Stopping a PostgreSQL container immediately after starting it kills PostgreSQL with SIGKILL, possibly corrupting the database #1207
Comments
You probably also want to set a much more generous
I'm pretty sure increasing the We've already decided to migrate our PostgreSQL instance out of Docker to ensure we won't get hit by this again.
Ah right -- I guess we could implement some There's not really something uncontroversially "sane" we could do if we get a request to stop/kill during that initialization process. 🤔
We are running a PostgreSQL docker container in production for our ERP, and have had multiple instances of corruption in the last few months.
The most recent corruption started with the following message "invalid record length" immediately after a restart:
followed by more serious errors:
and finally a PostgreSQL container unable to start due to a corrupted WAL:
We were able to recover from the corruption by using `pg_resetwal`, dumping the database to SQL and restoring into a clean database from the dump, but the multiple corruption incidents got me curious, and I started to investigate whether I could reproduce the error.

From the initial error it is obvious that PostgreSQL wasn't properly shut down, but probably killed with SIGKILL, because the log doesn't indicate that the instance running at 2024-02-29 12:44:37.264 was shut down before the instance at 2024-02-29 12:44:37.983 started:
Compare this with a clean restart, where the old instance reports that it is being shut down properly:
At the time the corruption occurred, a team member had issued a command to restart the Docker container. (We have since figured out that the team restarted PostgreSQL very often, due to a misunderstanding that this would be necessary when deploying schema changes.) So I went ahead to see whether I could reproduce the problem just by restarting the container. I created an empty PostgreSQL container as documented on https://hub.docker.com/_/postgres under "start a postgres instance":
And then I restarted the container in a loop:
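For readers who want to try something similar, here is a minimal Python sketch of such a restart loop. It is not the original command sequence: the container name `some-postgres`, the helper names, and the 0.5-second slack are assumptions made for illustration. It prints a timestamp between `docker start` and `docker stop` and times each `docker stop`, so that stalls close to the 10-second stop timeout stand out. Because the problem is timing-sensitive, a Python loop may well behave differently from a shell loop.

```python
import subprocess
import time
from datetime import datetime

CONTAINER = "some-postgres"  # hypothetical container name
STOP_TIMEOUT = 10.0          # Docker's default stop timeout, in seconds


def hit_stop_timeout(stop_seconds, timeout=STOP_TIMEOUT, slack=0.5):
    """True if `docker stop` took roughly the full timeout (a likely SIGKILL)."""
    return stop_seconds >= timeout - slack


def restart_once(container=CONTAINER, run=subprocess.run, clock=time.monotonic):
    """Start the container, print a timestamp, stop it; return the stop duration."""
    run(["docker", "start", container], check=True, capture_output=True)
    print(datetime.now().astimezone().isoformat(timespec="seconds"))
    begin = clock()
    run(["docker", "stop", container], check=True, capture_output=True)
    return clock() - begin


def main(iterations=20):
    """Restart repeatedly and flag every stop that ran into the timeout."""
    for _ in range(iterations):
        elapsed = restart_once()
        flag = "  <-- likely SIGKILL" if hit_stop_timeout(elapsed) else ""
        print(f"docker stop took {elapsed:.1f}s{flag}")
```

Calling `main()` runs the loop against a local Docker daemon. The `run` and `clock` parameters exist only so that the timing logic can be exercised without Docker.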
It's immediately obvious that something is going wrong here - sometimes `docker stop` hangs for 10 seconds, which is suspiciously similar to Docker's default `--stop-timeout`. From `man docker-stop`:
(Sidenote: the issue is very timing-sensitive. Without the `date` call, every `docker stop` after the first one was affected, and doing anything more complicated like calling `docker logs` between a `docker start` and `docker stop` makes the issue unreproducible.)

The log confirms that e.g. the `docker stop` run at 2024-03-05T15:44:45+01:00 did not shut down PostgreSQL cleanly:
Apparently the SIGTERM that `docker stop` sends initially sometimes goes unnoticed by the container, leading `docker stop` to send a SIGKILL 10 seconds later. Digging down with `execsnoop`, it seems that the underlying problem is that the `docker stop` happens during the container startup phase, before the actual `postgres` binary is even started:
I was unable to reproduce the issue with a non-containerized PostgreSQL installation on Debian 11 bullseye with a similar `systemctl stop [email protected]; systemctl start [email protected]` loop, so this issue seems to affect only the Docker version of PostgreSQL.
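The escalation described in this report, where an unnoticed SIGTERM is followed by SIGKILL once the timeout expires, can be illustrated outside Docker. The following Python sketch (POSIX only) is not Docker's implementation. The child process that ignores SIGTERM merely stands in for a container whose startup phase does not react to the signal, and `stop_with_timeout` and `demo` are hypothetical names.

```python
import signal
import subprocess
import sys

# Child that ignores SIGTERM: a stand-in (assumption) for a container that is
# still in its startup phase and does not react to the signal.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_timeout(proc, timeout):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: the process gets no chance to clean up
        proc.wait()
    return proc.returncode


def demo(timeout=1.0):
    """Stop a SIGTERM-ignoring child and return its exit status."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the child has installed SIG_IGN
    code = stop_with_timeout(proc, timeout)
    proc.stdout.close()
    return code
```

On POSIX systems, `subprocess` reports a child killed by a signal as the negative signal number, so `demo()` returning `-signal.SIGKILL` means the child never got to shut down on its own terms. That is the situation a database should not be put in.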