Postmortem of a failed in-cluster container registry

How Kubernetes partly killed itself

Introduction

Today I want to share a postmortem about an outage of many k8s applications across multiple clusters. It is a cleaned-up write-up of my Mastodon thread from earlier this year.

What happened

When I started my workstation I immediately received multiple chat messages about outages of various containerized applications. After a short investigation of all clusters I saw lots of ImagePullBackOff errors in kubectl get pods -A.
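For reference, this is roughly what that check looks like; the grep filter is just one way to narrow down the output:

```shell
# List pods across all namespaces and filter for image pull failures
kubectl get pods -A | grep -E 'ImagePullBackOff|ErrImagePull'
```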

So I took a look at the Harbor container registry, as this seemed to be the root cause.

The pods were all up and running, but there were multiple “invalid credentials” errors in the logs. I tried to log into the WebUI, but got a popup mentioning invalid credentials, too. So, had we been hacked? Possible, but very unlikely. The next step was to analyze the database pod, because that is where credentials and container image metadata are stored. I psql’d into the right database and realized that it was empty.
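A rough sketch of that investigation. The namespace, deployment and pod names are from my setup and may differ in yours; Harbor’s internal database is usually called registry and owned by the postgres user:

```shell
# Example resource names; adjust namespace/deployment/pod to your release
kubectl -n harbor logs deploy/harbor-core --tail=200 | grep -i 'invalid credentials'

# Open psql inside the database pod and list the tables of the registry DB
kubectl -n harbor exec -it harbor-database-0 -- psql -U postgres -d registry -c '\dt'
```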

Oh no… but lucky me, I back up the PVC containing pgdata every night :)

I restored the pgdata volume and thought I was done, but the database was still empty. ☹️

Now comes the curious part: I investigated the pgdata PVC and found two folders inside: pg13 and pg14. So, what happened? The official Harbor Helm chart uses a rolling release tag for its images (the default is :dev). Past me had removed the tag override in my GitOps workflow, because I wanted to stick to the image default of the current chart version. This is where it got tricky:
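You can see that state directly on the volume. The data path below is where the harbor-db image keeps pgdata in my case and may differ for other images or chart versions:

```shell
# Look at the pgdata directory on the PVC (path specific to the harbor-db image)
kubectl -n harbor exec -it harbor-database-0 -- ls /var/lib/postgresql/data
# -> pg13  pg14
```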

The pull policy of the Harbor database pod was “IfNotPresent”. The node that ran the database pod yesterday had a fairly recent “:dev” image, which had already upgraded PostgreSQL to version 14. I use kured to reboot every node bi-weekly, and tonight the node running the database pod was drained, so the database was rescheduled to another node that had an older harbor-db:dev image (but it was present!). So the data inside the PVC had been migrated to pg14 at some point before today, leaving pg13 empty. The older image still ran PostgreSQL 13 and therefore started against the empty pg13 directory, which is why the database appeared empty and all credentials were gone.
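A quick way to spot this kind of mismatch is to compare the pull policy with the image digest the rescheduled pod is actually running (resource names are examples again):

```shell
# Which node is the database pod running on now?
kubectl -n harbor get pod harbor-database-0 -o wide

# Which pull policy and which concrete image digest is it using?
kubectl -n harbor get pod harbor-database-0 \
  -o jsonpath='{.spec.containers[0].imagePullPolicy}{"\n"}{.status.containerStatuses[0].imageID}{"\n"}'
```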

The registry lost its mind, and the other nodes that rebooted tonight drained their pods too. Those pods could not be rescheduled because their images were neither cached on the nodes nor pullable from the broken registry.

In the end, I pulled the latest harbor-db:dev image on all (infra) nodes, so that every node runs harbor-db with PostgreSQL 14. Downside: I want to avoid further yak shaving, so I have to stick to the current dev image until PostgreSQL 14 becomes the default in a stable release.
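On containerd-based nodes the pre-pull can be done with crictl, roughly like this, assuming SSH access to the nodes (node names are hypothetical):

```shell
# Pre-pull the current dev image on every (infra) node so that
# IfNotPresent resolves to the same PostgreSQL 14 build everywhere
for node in infra-1 infra-2 infra-3; do   # hypothetical node names
  ssh "$node" sudo crictl pull goharbor/harbor-db:dev
done
```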

Conclusion

My lesson learned: always check Helm charts for rolling-release images and set image tags explicitly.
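In practice that means overriding the tag in the chart values instead of relying on the default. The exact values key depends on the chart version, so check helm show values first; the key used below is an assumption, not gospel:

```shell
# Find out which image tag keys the chart actually exposes
helm show values harbor/harbor | grep -B2 -A2 'tag:'

# Pin the database image explicitly (key and tag below are examples,
# verify them against your chart version before applying)
helm upgrade harbor harbor/harbor \
  --reuse-values \
  --set database.internal.image.tag=<pinned-tag>
```

In my case the override lives in the values file of my GitOps workflow rather than on the CLI, but the idea is the same.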

In a future post I will describe how to get rid of the bundled PostgreSQL in Harbor and deploy a highly available container image registry.
